Homework 8: Twitterbot

LinkedIn is for the people you know. Facebook is for the people you used to know. Twitter is for people you used to want to know. — Unknown
- Javadocs for provided code if you want to read it in a browser instead of IntelliJ
- CIS 1200 Java programming style guide
- Frequently Asked Questions
- The homework files are here.
Assignment Overview
In this assignment, you will learn how to read tweets from ‘CSV files’ into a program, build your own machine learning model, and create an AI that generates (somewhat) realistic tweets!
For example, when given the data in the captain_markov_tweets.csv file, your TwitterBot might (randomly) generate outputs like the following, among many other (even crazier) random outputs:
Captain's log supplemental. It's been destroyed to respond to a routine procedure
Captain's log stardate 5275. At Jason's request I have been attacked by an object on having it. 3. We have mystical healing powers. 1. Our ship is on it may be a new synthetic food which would not as planned. Now make that noise!
Captain's log. Our mission?
@baconshire I think that was a high warp core. 6. The Ferengi have mystical healing powers.
Captain's log stardate 45349. Unmanned probes have left with no consciousness. We can't do not as an astonishing rate. We have failed.
Before we jump into the code or the specifics of the project, there are a few concepts you should familiarize yourself with…
Task 0: Understanding the Problem
Concept: Training Data
In machine learning and artificial intelligence, training data is data you input into a model so it ‘learns’ how to do something. For example, to build an AI that can tell if an image contains a cat, you might have to first train it with thousands of images and tell it which ones contain cats and which ones don’t. That way, the model can extract patterns or features about images that contain cats and then use them in guessing about an image it has never seen before. The thousands of photos given as input to the model would be the ‘training data’ in this example. In this project, the training data will be collections of tweets obtained from real twitter data.
Concept: Data Cleaning
Generally speaking, higher quality training data produces higher quality results. Now, what exactly makes training data ‘high quality’ depends on the model you are using and the task at hand. For this tweet bot, the training data will be hundreds of tweets from a real twitter account. Using our understanding of what these tweets look like and how our TwitterBot will generate new tweets, we can improve the bot’s performance by filtering and cleaning the raw tweets, yielding ‘cleaned’ training data with which to build the model. For example, we don’t want any URLs to show up in our processed data, so we filter out all the tweets that contain URLs.
Concept: Regular Expression
To filter out undesirable tweets, we specify patterns of string data that are considered “bad”. A regular expression is a simple way to define such patterns. We can also use regular expressions to characterize the valid “tokens” (i.e. words or punctuation) that can be found in a tweet.
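In Java, such patterns can be expressed with the java.util.regex package or the String.matches method. The snippet below is a rough illustration only: the URL and token patterns here are placeholders, and the assignment files define the exact patterns you will use.

```java
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Illustrative pattern only -- the assignment files define the
        // exact patterns you should use for cleaning.
        Pattern url = Pattern.compile("https?://\\S+");

        String bad = "check this out https://example.com so cool";
        String good = "no links here, just words";

        // find() reports whether the pattern occurs anywhere in the string
        System.out.println(url.matcher(bad).find());   // true
        System.out.println(url.matcher(good).find());  // false

        // matches() checks whether a WHOLE string fits a pattern, e.g.
        // a "word" token of alphanumerics plus #, @, and '
        System.out.println("@baconshire".matches("[\\w#@']+")); // true
        System.out.println("...".matches("[\\w#@']+"));         // false
    }
}
```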
Concept: Markov Chain
The specific model you will use in this assignment is called a Markov Chain. To train your Markov Chain, you’ll pass in tweets as sequences of tokens, which will then be analyzed in pairs so that your model can figure out which tokens should follow which other tokens. (Don’t worry if this doesn’t make sense yet – we’ll go into a lot more detail about how it all works later in the assignment.) You can think of a Markov Chain as a baby version of today’s fancy LLMs like ChatGPT.
Concept: CSV File
CSV stands for Comma Separated Value. It’s a commonly used file format for storing data. A properly formatted CSV file will have the following shape:
line1_field1, line1_field2, ... line1_fieldN
line2_field1, line2_field2, ... line2_fieldN
...
lineK_field1, lineK_field2, ... lineK_fieldN
As you can see, each line of the file has different fields, separated by commas. CSVs are commonly used to represent tables. Each line in the file would be a row in a table, and each value in the line would represent a cell in the table. For example, if we wanted to use a CSV as a phonebook, we’d write something like:
Tupac Shakur, New York City, 212-779-1269, tupac@deathrowrecords.com
John Lennon, Liverpool, +44 151-496-0367, john@johnlennon.com
Prince Rogers Nelson, Minneapolis, 612-200-5655, contact@prince.com
...
This is great because, since each “column” contains the same type of data, it’s easy to write code to extract whatever you need from the file. In this particular assignment, we have provided several CSV files for you to try out your code on. You can open a CSV file using notepad or another text editor, where you will see the data in the CSV format shown above. However, we recommend NOT using Excel to edit these CSV files, to avoid possible formatting changes.
One tricky bit of the CSV format is that sometimes we want to include the special delimiting character ‘,’ as part of the data in a field. To do that, CSV allows for “quoting” commas using (double) quotation marks. For example, the second field of the following line of CSV data contains such commas:
field1,"this, lovely, field has a comma", no commas here
In this assignment, one of your tasks will involve figuring out how to extract just the raw tweets
from a CSV file. Then you will work with that raw data in various ways, including cleaning and
validating to improve the performance of the TwitterBot
(see the Concept: Training Data and
Concept: Data Cleaning sections above).
File Breakdown
The list below gives a brief description of the provided source files. The ones marked by * are the only ones you need to modify.
- *CSV.java - Provides static methods for working with CSV data.
- *FileUtilities.java - Provides static methods for reading from and writing to files.
- *LineIterator.java - This class is a wrapper around the Java I/O class BufferedReader (see lecture slides here) that provides a simple way to work with file data line-by-line. By writing this class, you are creating a simple, nifty file reading utility for yourself.
- TweetParser.java - This class reads tweet data from a file and then cleans, filters, and formats the data to improve its usefulness as training data. Its interface also makes it easy for you to add that cleaned data to the Markov Chain.
- *MarkovChain.java - You will implement a Markov Chain in this class. The MarkovChain will count all the times that certain words follow other words in the training data. In addition to storing that data, it also provides an iterator, using the MarkovChainIterator class.
- *MarkovChainIterator.java - This class works with the Markov Chain to provide a way of “walking” through the data to generate (random) similar outputs. It provides this functionality as an Iterator on which you can use next() and hasNext() to generate words for your new tweet!
- *TwitterBot.java - This is the class that holds a (trained) MarkovChain and uses it to generate tweets, mostly by inserting appropriate spacing to generate “realistic” sentences.
- TwitterBotMain.java - This is where everything is put together. Here, the processing pipeline of CSV -> TweetParser -> TwitterBot -> Output is made explicit in the code. You can change the provided inputs to generate tweets from different sources of CSV data.
- ProbabilityDistribution.java - This class is useful for counting occurrence frequencies of things. You will use it in building MarkovChain. It contains a Map<T, Integer>, where the keys are instances of objects of type T and the Integer represents the count for the number of times that object occurred. It also provides a convenient way to randomly pick one of the keys based on the frequency distribution.
- NumberGenerator.java - This is an interface that applies to any class that can generate numbers. A NumberGenerator provides the source of choices that determine a “walk” through the MarkovChain.
- RandomNumberGenerator.java - This class implements NumberGenerator and just generates numbers randomly – it is used to generate truly random outputs from the trained MarkovChain.
- ListNumberGenerator.java - This class also implements NumberGenerator. Instead of generating numbers randomly, however, it takes a list of numbers as an input, giving exact control over the sequence of numbers it outputs. That control makes it easy to test that the MarkovChain is working correctly.
The configuration of your project files should mirror that of the provided zip file. It has two
directories for code, named src/main/java/org/cis1200
and src/test/java/org/cis1200
, and one
more directory named files
that stores the CSV twitter data.
How Everything Works Together
The file TwitterBotMain.java
shows the sequence of steps needed to take a CSV file, train a Markov
Model, and use it to produce interesting tweets. Although you do not need to modify
TwitterBotMain
(except to play with different CSV input files), you should first read through it
to see how the components of the project fit together.
Here are some checkpoints for this assignment. This is a long assignment, so try to do a little each day.
- Checkpoint 1: FileUtilities.java and LineIterator.java use the BufferedReader and Iterator types. This part should be doable after seeing exceptions and I/O in lecture. You can first familiarize yourself with I/O by implementing the functions in FileUtilities.java, then proceed to complete LineIterator.java.
- Checkpoint 2: CSV.java provides a parseRecord method that turns a line of CSV data into its constituent fields. This is a slightly tricky algorithm to implement because of the “quotes” and “commas” conventions used in CSV, but it doesn’t depend on any new concepts.
- Checkpoint 3: MarkovChain.java and MarkovChainIterator.java. This part is about 50% of the homework assignment. It will take some time to think about, because it involves two overarching tasks: (1) implementing the training functionality and (2) implementing the iteration functionality.
- Checkpoint 4: TwitterBot.java. This is a fairly easy part of the homework assignment, because it mostly relies on using the getWalk method provided by MarkovChain and combining the results into a valid output.
Task 1: Testing (Overview)
As you work through the rest of the assignment, you should start each part by writing test cases to cement your knowledge. We have provided you with the following classes, several of which include many test cases that you can use to help understand the specification of the problem. You are expected to write additional test cases for the three classes marked with *, which you will submit with your assignment.
- CSVTest.java - provides comprehensive coverage of CSV corner cases, for your use
- FileUtilitiesTest.java
- *LineIteratorTest.java
- *MarkovChainTest.java - note that this file includes a partial test that you must complete for full credit
- ProbabilityDistributionTest.java
- TweetParserTest.java
- *TwitterBotTest.java
You should write tests before you implement each task.
Run the tests frequently as you develop your code.
You may find it useful to test that your code will throw an expected exception. To do so, check out the How to Write Failing Tests section in our Java testing guide here.
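For instance, assuming JUnit 5 (see the linked guide for the exact setup), such a test might look like the sketch below; the method call and exception type are placeholders for whatever the class you are testing actually specifies.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;
import org.junit.jupiter.api.Test;

public class ExceptionTestDemo {
    @Test
    public void testThrowsOnBadInput() {
        // Placeholder call: substitute the real method and the exception
        // type required by the specification you are testing.
        assertThrows(IllegalArgumentException.class, () -> {
            Integer.parseInt("not a number"); // throws NumberFormatException,
                                              // a subclass of IllegalArgumentException
        });
    }
}
```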
Task 2: FileUtilities and LineIterator
As you know, you need to read in tweet data from a file. So, you’re going to have to do some file
I/O. Java provides many classes for working with I/O, a list of which can be found
here. Out of all of these, BufferedReader is a good choice for simple, easy, line-by-line file reading. The class LineIterator asks you to implement an Iterator as a “wrapper” class around a BufferedReader. These two files will exercise your ability to properly handle exceptions.
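To give a feel for the “wrapper” idea before you read the real specification, here is a minimal, hypothetical sketch. It is not the assignment’s LineIterator: the exact behavior (especially around exceptions and closing the reader) is dictated by the comments in LineIterator.java, which take priority over this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// A minimal sketch of wrapping a BufferedReader in an Iterator.
public class LineIteratorSketch implements Iterator<String> {
    private final BufferedReader reader;
    private String nextLine; // invariant: next line to return, or null at end

    public LineIteratorSketch(BufferedReader reader) {
        this.reader = reader;
        advance();
    }

    private void advance() {
        try {
            nextLine = reader.readLine(); // null once the stream is exhausted
        } catch (IOException e) {
            nextLine = null; // this sketch treats I/O failure as end of data
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = nextLine;
        advance();
        return result;
    }
}
```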
Task 3: CSV
This class provides static methods for working with CSV data. CSV data is a line-oriented data
format, so these methods work with either BufferedReader
(for multiline CSV sources) or String
(for working with one CSV record at a time).
The core operation is called parseRecord
, which takes a CSV record and breaks it up into fields,
following the CSV format conventions for usage of commas and quotation marks. See the comments in
that file, and the extensive provided test cases in CSVTest.java
for more details about what is
expected here.
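As a concrete illustration, here is roughly what a correct parse of the quoted-comma record from the previous section should produce. The List<String> return type is an assumption for illustration; check CSV.java for the real signature and CSVTest.java for the authoritative examples, including the whitespace conventions.

```java
// Assumed signature for illustration only -- see CSV.java for the real one.
List<String> fields =
        CSV.parseRecord("field1,\"this, lovely, field has a comma\", no commas here");
// fields.get(0) -> "field1"
// fields.get(1) -> "this, lovely, field has a comma"  (commas kept, quotes removed)
// fields.get(2) -> " no commas here"                  (whitespace handling: see tests)
```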
Kudos: Parsing quoted quotes
The kudos problem focuses on parsing CSV data with nested quotes, where double quotes within
a field are denoted by two consecutive double quotes. Additional details and examples can be
found in CSV.java
, CSVKudosTest.java
, and quotes_and_commas.csv
.
Task 4: MarkovChain and MarkovChainIterator
Welcome to the Markov Chain, which is really the backbone of the whole TwitterBot. Here, we’re going
to take the sequences of tokens we created in TweetParser
and extract patterns from them so that
we can generate tweets that are similar to the ones in our training data. There are two steps to
working with the Markov Chain: training it and then generating sentences. (A Markov Chain is an
extremely simplified version of ChatGPT. GPT and related “large language models” use much more
sophisticated neural networks that are trained on vast amounts of data to predict the next word,
whereas a Markov Chain uses simple frequency data.)
Task 4.1: Training the MarkovChain
Let’s say you wanted to have a computer generate Shakespeare sonnets, so you downloaded a bunch of sonnets as training data. How would you make your output look like your training data? You might start by calculating the number of occurrences of each word, and then having your generator produce words based on how frequently they appear. For example, the word “and” would show up more often than “kangaroo”. But while you might end up with a lot of thou’s, thee’s, and thine’s, they are unlikely to be in the correct order, since you are essentially just grabbing common Shakespearian words at random.

English has a clear order and structure of words: adjectives come before nouns, adverbs before verbs, etc. It would be hard to program this relationship for all words by hand and to ensure that the word choice makes logical sense (e.g., while “a fast car is driving down the street” and “a fast rock is driving down the street” are both grammatically correct, the second doesn’t make much sense… unless you are talking about The Rock, of course). Instead, we can use machine learning to find these patterns and relationships without much work. Notice above how we put elements of English into certain pairs, like adjectives before nouns. This suggests that one potential option would be to track pairs of words. In fact, this is the approach we’ll use in this assignment.
One way to implement this would be to record a list of words that come after each word we find. We can do this by looking at individual pairs of adjacent words in our training text, which we’ll call bigrams. Bigrams are going to be the basis of how our Markov Chain works, so we’ll walk you through an example of how bigrams are created to make sure the concept is clear. Say we have the following sentence, and we want to generate a list of bigrams:
"an apple is grown on an apple tree."
To make the list of bigrams, we’d start by taking the first pair of words: “an” and “apple,” and putting them together in a bigram. After that, we’d move on to the next pair: “apple” and “is.” Notice that the first and second pairs overlap. This is intentional! It allows us to represent the sentence as a linked chain of words, where we can move from one word to the next. (That is why this approach is called a “Markov Chain”.)
Here’s the list of bigrams we get from continuing the process:
("an", "apple"),
("apple", "is"),
("is", "grown"),
("grown", "on"),
("on", an"),
("an", "apple"),
("apple", "tree"),
("tree", "."),
(".", <END>)
Note that we treat punctuation as its own “token”, so "tree"
is followed by "."
. In general, we
will train the Markov Chain on tokens that are either words (by which we mean strings consisting
of alphanumeric characters, plus the characters #
, @
, and '
) or punctuation (by which we
mean all other non-whitespace symbols). We treat different capitalizations of the same word as
different tokens, so “An” and “an” would be considered to be different – this distinction is useful
because the Markov Chain can learn that capitalized words are more likely to start tweets and follow
punctuation tokens.
Observe that we still create a bigram that starts with the last token, which in this case is the
token “.”, even though there’s no token after it. This helps us keep track of which tokens occur at
the end of a tweet. We keep track of this with an end marker, which is the constant END_TOKEN = "<END>"
.
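If it helps to see the pairing step as code, here is a small, self-contained sketch of bigram extraction, including the end marker. (In the assignment itself this logic will live inside the MarkovChain methods described below, not in a standalone helper like this.)

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BigramDemo {
    static final String END_TOKEN = "<END>";

    public static void main(String[] args) {
        List<String> tokens =
                List.of("an", "apple", "is", "grown", "on", "an", "apple", "tree", ".");

        List<String[]> bigrams = new ArrayList<>();
        Iterator<String> it = tokens.iterator();
        String previous = it.next(); // this demo assumes at least one token
        while (it.hasNext()) {
            String current = it.next();
            bigrams.add(new String[] { previous, current }); // overlapping pairs
            previous = current;
        }
        bigrams.add(new String[] { previous, END_TOKEN }); // last token ends the tweet

        for (String[] b : bigrams) {
            System.out.println("(\"" + b[0] + "\", \"" + b[1] + "\")");
        }
    }
}
```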
Now we can use this list of bigrams to generate what we call a probability distribution (also known as “frequency data”) for each word. This probability distribution tells us how likely any word is to follow after a given word. We’ll figure this out based on the number of times one word follows the previous one in our training data. To compute this distribution, we can start by listing out every unique word in the sentence and the list of words that we’ve seen after it in our set of bigrams.
Token | Following tokens |
---|---|
an | apple,apple |
apple | is,tree |
is | grown |
grown | on |
on | an |
tree | . |
. | <END> |
Let’s convert this table into a list of probability distributions. “an” is only ever followed by “apple” in our training data, so the probability distribution for “an” will specify that “apple” follows “an” 100% of the time. “apple” was followed by “tree” once and “is” once, so we give each of those words a value of 50% in our probability distribution. Continuing with this process gives us this table:
Token | Probability distribution | Frequency Data |
---|---|---|
an | apple (100%) | apple: 2 |
apple | is (50%), tree (50%) | is: 1, tree: 1 |
is | grown (100%) | grown: 1 |
grown | on (100%) | on: 1 |
on | an (100%) | an: 1 |
tree | . (100%) | .: 1 |
. | <END> (100%) | <END>: 1 |
Note that in our code for ProbabilityDistribution
we represent the distribution as its
corresponding frequency data, which maps strings to their frequencies. For instance, the probability
distribution associated with “apple” maps “is” to 1 and “tree” to 1. This information is stored in
the field called bigramFrequencies
of the MarkovChain
class.
There’s one more small wrinkle: The MarkovChain also has to keep track of which words might begin
a sentence. To do so, we keep track of another probability distribution called startTokens
, which
records the frequencies with which each word starts a sentence. In the simple example above, there
was just one sentence and it began with “an”, so the resulting distribution will just map “an” to 1.
Richer training sets will include more starting words.
Your first coding step for this task is to implement addBigram()
, which works by adding or
modifying records in the Markov Chain’s map of probability distributions to record a pair of
adjacent words. The record()
method from the ProbabilityDistribution
class can be useful here.
Note that the dictionary maps String
to ProbabilityDistribution
. This is
because, for every observed word, we want to keep track of a separate probability distribution (like
we did in the example above) of the words that follow it. Your next job is to implement addSequence()
,
which takes in an iterator of tokens that form a tweet and updates your model based on that
sequence by adding all bigrams that occur in that sentence. You will also need to update the
ProbabilityDistribution
of startTokens
so that your MarkovChain
knows realistic places to
start. For example, if 90% of sentences in your training data start with the word “the,” then using
a ProbabilityDistribution
to find the start word will ensure that 90% of the sentences generated
by your model start with “the.”
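Putting those pieces together, the training logic might look roughly like the sketch below. This assumes bigramFrequencies maps String to ProbabilityDistribution<String>, that ProbabilityDistribution has a no-argument constructor, and that record() adds one occurrence, as described above; the real methods in MarkovChain.java have precise specifications that take priority over this sketch.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical simplified sketch -- the real field and method
// declarations are in MarkovChain.java.
class MarkovChainSketch {
    static final String END_TOKEN = "<END>";
    final Map<String, ProbabilityDistribution<String>> bigramFrequencies = new HashMap<>();
    final ProbabilityDistribution<String> startTokens = new ProbabilityDistribution<>();

    void addBigram(String first, String second) {
        // Make sure `first` has a distribution, then count one more
        // occurrence of `second` following it.
        bigramFrequencies.putIfAbsent(first, new ProbabilityDistribution<>());
        bigramFrequencies.get(first).record(second);
    }

    void addSequence(Iterator<String> tokens) {
        if (!tokens.hasNext()) {
            return; // nothing to learn from an empty tweet
        }
        String previous = tokens.next();
        startTokens.record(previous);   // the first token can start a sentence
        while (tokens.hasNext()) {
            String current = tokens.next();
            addBigram(previous, current);
            previous = current;
        }
        addBigram(previous, END_TOKEN); // mark where the tweet ended
    }
}
```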
Finally, complete the (non-default) constructor for the MarkovChain
class, which takes a whole
collection of training data and builds a fully trained model. It should simply iterate through
all of the sequences and add each of them to the model.
And that’s it for training! Our Markov Chain is ready to go.
Task 4.2: Generating sentences
Walks through the Markov Chain
Once the Markov Chain has been trained, we can use it to generate some entirely new tweets. We first select a start token based on the tokens that showed up as start tokens in our training set. Since we only had one sentence, which started with “an,” we have no choice but to select “an” as our start token. After that, we look at “an” in our probability distribution and randomly choose the token that goes next based on this distribution. It turns out that “apple” is the following token 100% of the time, so we’ll choose “apple.” So far, our sentence is “an apple.” Once we get to “apple,” our Markov Chain has a choice: it can choose “is” 50% of the time and “tree” 50% of the time. Let’s say it chooses “tree.” We then go to the probability distribution for “tree” and see that it can only ever be followed by the token “.” at the end of the sentence. As a result, our Markov Chain terminates with the (incomplete) sentence “an apple tree.” As you can see, this process isn’t very exciting with such a small training set, but we can get some cool results once we start to play with real-world Twitter data! (Though even then, the model isn’t sophisticated enough to truly take grammar into account.)
This process of choosing a start token and then following the probability distribution until we find
the END token is called a walk through the Markov Chain. We can represent a walk as an
Iterator
that provides the next
and hasNext
operations, such that each call to next
yields
the next token encountered along the walk. The MarkovChain
class provides a method called
getWalk
that returns such a MarkovChainIterator.
For a truly random walk, as described above, the MarkovChainIterator
would choose the sequence of
tokens arbitrarily from the probability distribution of the underlying Markov Chain data set. However, such randomized algorithms are very difficult to test! Instead, the getWalk method takes
method takes
a parameter, NumberGenerator ng
, which acts as a “source of choices” that are used to guide the
walk along a path through the Markov Chain.
A NumberGenerator
itself is a bit like an Iterator
: it provides just one operation, int next(int bound)
, that returns an integer x
satisfying 0 <= x < bound
. Like an Iterator
, a
NumberGenerator
can be called multiple times to get a series of numbers.
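In code, the essential shape of the interface is roughly the following (NumberGenerator.java is the authoritative version). A random implementation just delegates to java.util.Random, while ListNumberGenerator instead replays a fixed list of numbers so tests can force a particular walk.

```java
import java.util.Random;

// The essential shape described above -- see NumberGenerator.java for
// the real interface.
interface NumberGeneratorSketch {
    int next(int bound); // returns some x with 0 <= x < bound
}

// Analogue of RandomNumberGenerator: choices are uniformly random.
class RandomNumberGeneratorSketch implements NumberGeneratorSketch {
    private final Random random = new Random();

    @Override
    public int next(int bound) {
        return random.nextInt(bound); // uniform in [0, bound)
    }
}
```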
To use the NumberGenerator
to produce a walk, getWalk
creates an instance of
MarkovChainIterator
that uses those numbers as the “choices” made at each step of the way. The
MarkovChainIterator
should maintain some state that records the next token along the chosen path
and update it at each call to next
. When selecting the starting token to begin the walk, the
MarkovChainIterator
uses the number generator to pick from among all of the possibilities in
startTokens
. Thereafter, it uses the number generator to pick from among the possible next
options, following the probability distributions according to the Markov Chain bigramFrequencies
.
The coding for this task is to implement the Iterator
interface as required by
MarkovChainIterator
, following the strategy above. First complete the constructor, which should
initialize the ng
field and pick the starting token of the path. You will need to add at least
one additional field to the MarkovChainIterator
class – think about what invariant might be
useful to maintain for that field. Next, implement the hasNext
method, which should be quite
straightforward if you have chosen a good invariant. Finally, implement the next()
method, which
should yield the next token along the path and set up subsequent calls to next
and hasNext
.
This should involve using the pick
operation of ProbabilityDistribution
.
See the detailed comments about the illustrative example in MarkovChain.java
for more details
about this task.
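As a sketch of how the pieces combine in a deterministic test, consider the fragment below. The constructor and method signatures here are assumptions, and which index selects which token depends on the distribution’s internal ordering; MarkovChainTest.java shows the real API in use.

```java
// Hypothetical test fragment -- signatures and choice indices are
// assumptions; consult MarkovChainTest.java for working examples.
MarkovChain mc = new MarkovChain(trainingData); // trained as in Task 4.1

// Each number drawn from the generator picks one option at one step of
// the walk, so a fixed list makes the output reproducible.
NumberGenerator ng = new ListNumberGenerator(new int[] { 0, 0, 1, 0 });

Iterator<String> walk = mc.getWalk(ng);
StringBuilder sb = new StringBuilder();
while (walk.hasNext()) {
    sb.append(walk.next()).append(" ");
}
System.out.println(sb.toString().trim()); // one deterministic walk
```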
And there you go! You implemented a basic Machine Learning algorithm! Go enjoy some Insomnia Cookies or something or just stare at The Rock some more.
A visual explanation of the Markov Chain structure
If you are more visually inclined (or you enjoy graph theory), you might appreciate the illustration
below for how Markov Chains work. Otherwise, feel free to skip right over this, or just spend a
minute dragging around the diagram below, as it’s kinda mesmerizing. Each lettered circle would
represent some word in our Markov Chain. Notice the arrows from one word to another. These arrows
would represent a pairing of the words in a bigram, and the size of the arrow in the diagram
illustrates the relative frequency at which that word occurred after its parent word in the training
data. Every time the highlighted circle moves from one word to another, the Markov Chain generates
that word in its output; each transition corresponds to one call to next. This diagram already
looks hard to follow, but imagine one of these diagrams being created with thousands of words (which
is more reasonable for a training dataset). Generating a sentence or tweet would still work the same way, however: you would just walk from one word to another based on the frequencies of occurrence in the training data.
Task 5: TwitterBot
Now, it’s time to actually generate a tweet! The TwitterBot
class provides a constructor that
instantiates a trained MarkovChain
from some training data. Your only part in this task is
implementing the generateTweet()
method, which simply uses the getWalk
functionality to produce
a sequence of tokens that should be concatenated, with appropriate spaces added, to form a complete
tweet. This is a fairly straightforward process: you should simply use a StringBuilder
along with a
bit of logic to determine whether to add a space or not. That’s it!
Hint: you can use the matches
method to determine whether a token is punctuation.
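A minimal sketch of that logic might look like the following. The punctuation pattern here is illustrative only (the assignment defines word and punctuation tokens precisely, as described in Task 4), and the walk is assumed to come from getWalk.

```java
import java.util.Iterator;

class TweetSpacingSketch {
    // Sketch only: the real punctuation definition comes from the assignment.
    static String buildTweet(Iterator<String> walk) {
        StringBuilder tweet = new StringBuilder();
        while (walk.hasNext()) {
            String token = walk.next();
            boolean isPunctuation = token.matches("\\p{Punct}+"); // illustrative
            if (tweet.length() > 0 && !isPunctuation) {
                tweet.append(" "); // space before words, none before punctuation
            }
            tweet.append(token);
        }
        return tweet.toString();
    }
}
```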
Running the TwitterBot
Now it’s time to run your TwitterBot! At the top of the TwitterBotMain
class, which provides the
whole TwitterBot pipeline, there is a class field called PATH_TO_TWEETS
. This is the path to the
CSV with training data that will be used to create your TwitterBot.
We have that field pre-filled to a CSV of tweets based on the illustrative example from
MarkovChain.java
, but you can try it out on other examples too.
One fun source is dog_feelings_tweets.csv, which contains cute things that dogs might say (source: the Dog Feelings Twitter <-- yep, this data is from a real twitter account). Feel free to change the path to other files (we provided some extra CSVs from other twitter accounts). Note: you will have to change the columnForPath
field if you use a
different file as it might have the tweets in a different column than the dog tweets.
Finally, to run the TwitterBot, just open up the TwitterBotMain
class and press run. You should see
some tweets being output to the console.
Submission and Grading
Submission Instructions
You will be uploading only the following files:
src/main/java/org/cis1200/CSV.java
src/main/java/org/cis1200/FileUtilities.java
src/main/java/org/cis1200/LineIterator.java
src/main/java/org/cis1200/MarkovChain.java
src/main/java/org/cis1200/MarkovChainIterator.java
src/main/java/org/cis1200/TwitterBot.java
src/test/java/org/cis1200/LineIteratorTest.java
src/test/java/org/cis1200/MarkovChainTest.java
src/test/java/org/cis1200/TwitterBotTest.java
If you are using Codio
These files should be organized similarly to the structure above (with a src and test directory). The easiest
option is to use the Zip Project
menu item in Codio. Do not include any of the other provided
files, since doing so may cause your submission to fail to compile. You can directly upload the zip
produced by Codio into Gradescope.
If you are using IntelliJ
Gradescope allows you to easily drag-and-drop files into it, or you can click “Browse” in Gradescope to open up a file browser on your computer to select files. Upload only the files listed above.
Grading Breakdown
- Automated testing (90%)
- Task 2: FileUtilities and LineIterator (16%)
- Task 3: CSV (18%)
- Task 4: MarkovChain and MarkovChainIterator (45%)
- Task 5: TwitterBot (6%)
- Style (5%)
- Manual grading (10%)
- Task 1: Quality of testing (10%)
You have three free submissions for this assignment. Each additional submission will cost you five points from your final score.
Attribution
This homework was created by the CIS 120 19fa Development Committee (Alex Seidel, Angela Xi, Daniel Like, Nico Melton, Nicolas Corona, and William Goeller).
It was significantly refactored in Fall 2023 / Spring 2024 by Steve Zdancewic.