Homework 8: Twitterbot

LinkedIn is for the people you know. Facebook is for the people you used to know. Twitter is for people you want to know.
— Unknown
- Javadocs for provided code if you want to read it in a browser instead of IntelliJ
- CIS 120 Java programming style guide
- Frequently Asked Questions
- The homework files are here.
Assignment Overview
In this assignment, you will learn how to read tweets from CSV files into a program, build your own machine learning model, and create an AI that generates (somewhat) realistic tweets! Before we jump into the code or the specifics of the project, there are a few concepts you should familiarize yourself with:
Task 0: Understanding the Problem
Concept: Training Data
In machine learning and artificial intelligence, training data is data you input into a model so it ‘learns’ how to do something. For example, to build an AI that can tell if an image contains a cat, you might have to first train it with thousands of images and tell it which ones contain cats and which ones don’t. That way, the model can extract patterns or features about images that contain cats, and then use them in guessing about an image it has never seen before. The thousands of photos input to the model would be the ‘training data’ in this example. In this project, the training data will be collections of tweets obtained from real Twitter data.
Concept: Data Cleaning
Generally speaking, higher quality training data produces higher quality results. Now, what exactly makes training data ‘high quality’ depends on the model you are using and the task at hand. For this tweet bot, the training data will be hundreds of tweets from a real Twitter account. Using our understanding of what these tweets look like and how our TwitterBot will generate new tweets, we can improve the bot’s performance by filtering and cleaning the raw tweets, yielding ‘cleaned’ training data with which to build the model. For example, we don’t want any URLs to show up in our processed data, so we filter out all the tweets that contain URLs.
Concept: Regular Expression
To filter out undesirable tweets, we specify patterns of string data that are considered “bad”. A regular expression is a simple way to define such patterns.
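For instance, here is a minimal Java sketch of using a regular expression to flag strings that contain a URL. The pattern shown is an illustration only; the homework provides its own regular expressions.

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // A deliberately simple URL pattern for illustration only; the
    // homework's provided regular expressions are more thorough.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    public static void main(String[] args) {
        String good = "walking the dog today";
        String bad = "check this out https://example.com/dog";
        System.out.println(URL.matcher(good).find()); // false
        System.out.println(URL.matcher(bad).find());  // true
    }
}
```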
Concept: Markov Chain
The specific model you will use in this assignment is called a Markov Chain. To train your Markov Chain, you’ll pass in tweets as sequences of words, which will then be analyzed in pairs so that your model can figure out which words should follow which other words. Don’t worry if this doesn’t make sense to you yet! We’ll go into a lot more detail about how this works later in the assignment.
Concept: CSV File
CSV stands for Comma Separated Value. It’s a commonly used file format for storing data. A properly formatted CSV file will have the following shape:
line1_value1, line1_value2, ... line1_valueN
line2_value1, line2_value2, ... line2_valueN
...
lineK_value1, lineK_value2, ... lineK_valueN
As you can see, each line of the file has different values that are, well, separated by commas. CSVs are commonly used to represent tables. Each line in the file would be a row in a table, and each value in the line would represent a cell in the table. For example, if I wanted to use a CSV as a phonebook, I’d write something like:
Tupac Shakur, New York City, 212-779-1269, tupac@deathrowrecords.com
John Lennon, Liverpool, +44 151-496-0367, john@johnlennon.com
Prince Rogers Nelson, Minneapolis, 612-200-5655, contact@prince.com
...
This is great because each “column” contains the same type of data, so it’s easy to write code to extract whatever you need from the file. In this particular assignment, we have provided several CSV files for you to try out your code on. You can open a CSV file in Notepad or another text editor to see the data laid out as above. We recommend not using Excel to avoid possible formatting changes.
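As a quick illustration, here is a minimal Java sketch of pulling one “column” out of a CSV line with String.split. Note the simplifying assumption: it only works if no field itself contains a comma.

```java
public class CsvColumnDemo {
    public static void main(String[] args) {
        String line = "Tupac Shakur, New York City, 212-779-1269, tupac@deathrowrecords.com";
        // Naive split: assumes no field contains a comma itself.
        String[] cells = line.split(",");
        String city = cells[1].trim(); // column indices start at 0
        System.out.println(city);      // New York City
    }
}
```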
In this assignment, one of your tasks will involve figuring out how to extract just the raw tweets from the file. Then, you will work with that raw data in various ways, including cleaning and validating it to improve the performance of the TwitterBot (see the Concept: Training Data and Concept: Data Cleaning sections above).
File Breakdown
The list below gives a brief description of the provided source files. The ones marked by * are the only ones you need to modify.
- *FileLineIterator.java - This class is a wrapper around the Java I/O class BufferedReader (see lecture slides here) that provides a simple way to work with file data line-by-line. By writing this class, you are creating a simple, nifty file-reading utility for yourself.
- *TweetParser.java - This class reads in tweet data from a file, and then cleans, filters, and formats the data to improve the quality of the training data. Its interface also makes it easy for you to add that cleaned data to the Markov Chain.
- *MarkovChain.java - You will implement a Markov Chain in this class. The MarkovChain will count all the times that certain words follow other words in the training data. In addition to storing that data, it implements Iterator, and does so to provide a convenient way to continuously pick successive random elements. This means a MarkovChain instance is an iterator, which you can call next() and hasNext() on to generate words for your new tweet!
- *TwitterBot.java - This is the class where you put everything together. Here, you will use TweetParser to clean and parse tweet data, write logic for adding that data to an instance of MarkovChain, and then use the populated MarkovChain to generate tweets.
- ProbabilityDistribution.java - This class is useful for counting occurrences of things. You will use it in building MarkovChain. It contains a Map<T, Integer>, where the keys are instances of objects of type T and the Integer represents the count for the number of times that object occurred. It also provides a convenient way to randomly pick one of its keys.
- NumberGenerator.java - This is an interface for any class that can generate numbers. It is used in ProbabilityDistribution for picking keys.
- RandomNumberGenerator.java - This class implements NumberGenerator and just generates numbers randomly.
- ListNumberGenerator.java - This class also implements NumberGenerator, but instead of being random, it takes a list of numbers as an input, so the user has control over what numbers it outputs. You don’t need to worry about using this class, but our fixDistribution methods use it to guarantee that our ProbabilityDistribution outputs what we want it to.
The configuration of your project files should mirror that of the provided zip file. It has two directories for code, named src/main/java/org/cis120 and src/test/java/org/cis120, and one directory called files that stores the CSV Twitter data.
How These Files Work Together
- We have a CSV file of tweets, and the first thing we need to do is pass the file path of this CSV file to FileLineIterator.java, which will allow us to iterate through each line of the file.
- Then TweetParser.java uses this FileLineIterator to extract the tweet from each line. TweetParser.java also cleans and formats each tweet, breaking each tweet down into sentences and then words. So after all the processing that TweetParser does, we end up with a list of sentences, where each sentence is a list of words.
- We then give this list to an instance of MarkovChain, which adds these words to a dictionary as bigrams. The dictionary maps from a word, w, to a ProbabilityDistribution, which stores the number of times another word, m, comes after w in our tweets.
- ProbabilityDistribution.java has a function pick(), which returns one of the words, m, that follow word w. This function uses a NumberGenerator to choose which word to return. MarkovChain.java uses this pick() function to generate the words in our new tweet.
- Finally, we have TwitterBot.java, the class that puts all of this together. TwitterBot has an instance of MarkovChain, which we train with the input reader of tweets. Since the instance of MarkovChain is an iterator, we can repeatedly call next() on the MarkovChain to keep generating words for our new tweet. Each call to next() returns a word that is likely to follow the previously generated word, based on the frequencies observed in the training data.
Here are some checkpoints for this assignment. This is a long assignment, so try to do a little each day.
- Checkpoint 1: FileLineIterator.java and TweetParser.java. These, together, are about 30% of the homework assignment. They should be doable after seeing exceptions and I/O in lecture.
- Checkpoint 2: MarkovChain.java. This is about 50% of the homework assignment. This will take some time to think about, because it involves two overarching tasks: (1) implementing the training functionality and (2) implementing the iteration functionality.
- Checkpoint 3: TwitterBot.java. This is about 20% of the homework assignment, but mostly relies on combining the things you’ve already produced.
Class and Task Diagram

Task 1: Testing
Now that you have a better understanding of the problem, you will want to write some tests to cement your knowledge. We have provided you with the following classes: FileLineIteratorTest.java, TweetParserTest.java, MarkovChainTest.java, and TwitterBotTest.java. These are in the test folder of the downloadable zip file. Make sure to write your tests in those files. We have provided you with example tests as a model for your own. Before tackling (the parts of) each of the next tasks, it is a good idea to write some test cases. Run the tests frequently as you develop your code.
You may find it useful to test that your code will throw an expected exception. To do so, check out the How to Write Failing Tests section in our Java testing guide here.
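For example, a test in this spirit checks that next() throws once the file is exhausted. This is a sketch assuming JUnit 5; the constructor call and the path files/simple.csv are placeholders that you would adapt to the provided stub’s actual signature and to a file that exists in your project.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.NoSuchElementException;
import org.junit.jupiter.api.Test;

public class FileLineIteratorExceptionTest {
    @Test
    public void testNextThrowsWhenFileExhausted() {
        // "files/simple.csv" is a placeholder path; point this at a small
        // test file that actually exists. The constructor shown is also an
        // assumption: match the provided stub's signature.
        FileLineIterator li = new FileLineIterator("files/simple.csv");
        while (li.hasNext()) {
            li.next();
        }
        assertThrows(NoSuchElementException.class, () -> li.next());
    }
}
```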
Task 2: FileLineIterator
As you know, you need to read in tweet data from a file. So, you’re going to have to do some file I/O. Java provides many classes for working with I/O, a list of which can be found here. From that list, BufferedReader is a good choice for simple, easy, line-by-line file reading.
The goal for this task is to create an Iterator over the lines of a file. In other words, the Iterator should read one line at a time. Every iterator must implement two methods: hasNext() and next(). The method hasNext() returns true if there is another element in the iterator and false otherwise. next() returns the next element of the Iterator, or throws a NoSuchElementException if none exists (i.e., hasNext() is false).

Implement these two methods and the constructor so that a FileLineIterator will return the next line of a file as a String every time the user calls next(). You may want to refer to the Iterator Javadocs for more information on iterators. Note: Remember to close your readers and writers in this class and throughout the homework.
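To make the pattern concrete, here is a minimal, self-contained sketch of a line iterator built on BufferedReader. It is not the required implementation: the real FileLineIterator’s constructor, fields, and error handling are specified in the provided stub and its Javadocs, so treat this as a shape to aim for.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of an iterator over the lines of a file, using a one-line
// look-ahead buffer so hasNext() can answer without consuming input.
public class LineIteratorSketch implements Iterator<String> {
    private BufferedReader reader;
    private String nextLine; // the line the next call to next() will return

    public LineIteratorSketch(String filePath) {
        try {
            reader = new BufferedReader(new FileReader(filePath));
            advance();
        } catch (IOException e) {
            throw new IllegalArgumentException("could not open: " + filePath);
        }
    }

    // Read one line ahead, closing the reader once the file is exhausted.
    private void advance() {
        try {
            nextLine = reader.readLine();
            if (nextLine == null) {
                reader.close();
            }
        } catch (IOException e) {
            nextLine = null;
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = nextLine;
        advance();
        return result;
    }
}
```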
Task 3: TweetParser
The purpose of this class is to take a reader containing tweet data as an input and transform that data into a form that (A) can be used to easily build a Markov Chain and (B) optimizes the performance of the Markov Chain. TweetParser provides crucial functionality through its main public method, csvDataToTrainingData(). We represent the training data contained in a tweet as objects of type List<List<String>>. Each inner list of Strings represents a sentence that has been cleaned (punctuation and other non-word data removed) and divided into words (each word is just a String). A whole tweet’s data is a list of such sentences. For example, a tweet represented by the String "CIS 120 is great! CIS 120 is fun." would be turned into training data shaped like the following (using OCaml list notation for readability): [ ["cis"; "120"; "is"; "great"] ; ["cis"; "120"; "is"; "fun"] ]

Creating the training data works in a couple of stages. First, each reader is converted into a list of tweets. Then each tweet is filtered to remove bad strings (like URLs) and broken up into a list of sentences. Each sentence is itself cleaned up: words are converted to lower case and non-word data (like symbols) is dropped. Each of these steps is represented by a different method of TweetParser.
We recommend starting at extractColumn and working your way down. Make sure to use the methods you’ve already written as you write new ones!
You should use FileLineIterator when writing csvDataToTweets(). TweetParser also contains a few other private methods, including some that you will write. They’ll be needed by parseAndCleanTweet(), which you should use as part of your implementation of csvDataToTrainingData(). You will see that the “cleaning” code uses a couple of regular expressions defined as constants in the file. You won’t need to create such complex regular expressions in your code, but you might want to use a (very simple!) regular expression and the String.split method in some places.
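As a rough illustration of the overall transformation (not the required method decomposition), the following self-contained sketch turns the example tweet into cleaned training data using String.split and simple regular expressions. The sentence-splitting and word-splitting patterns here are placeholders; the provided TweetParser defines its own.

```java
import java.util.ArrayList;
import java.util.List;

public class TweetCleaningDemo {
    public static void main(String[] args) {
        String tweet = "CIS 120 is great! CIS 120 is fun.";
        List<List<String>> training = new ArrayList<>();
        // Split into sentences on simple terminators (illustrative only).
        for (String sentence : tweet.split("[.!?]")) {
            List<String> words = new ArrayList<>();
            // Lower-case the sentence, then split on runs of whitespace.
            for (String w : sentence.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            if (!words.isEmpty()) {
                training.add(words);
            }
        }
        System.out.println(training);
        // [[cis, 120, is, great], [cis, 120, is, fun]]
    }
}
```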
Task 4: MarkovChain
Welcome to the Markov Chain, which is really the backbone of the whole TwitterBot. Here, we’re going to take the sequences of words we created in TweetParser and extract patterns from them so that we can generate tweets that are similar to the ones in our training data. There are two steps to working with the Markov Chain: training it and then generating sentences.
Task 4.1: Training the MarkovChain
Let’s say you wanted to have a computer generate Shakespearean sonnets, so you downloaded a bunch of sonnets as training data, which is essentially just the data you want your output to resemble. How would you make your output look like your training data? You might start by calculating the number of occurrences of each word, and then having your generator produce words based on how frequently they appear. For example, the word “and” would show up more often than “kangaroo”. While you might end up with a lot of thou’s, thee’s, and thine’s, they are unlikely to be in the correct order, as you are essentially just grabbing common Shakespearean words at random.

English has a clear order and structure of words: adjectives come before nouns, adverbs before verbs, etc. It would be hard to program these relationships for all words by hand and to ensure the word choice makes logical sense (e.g., while “a fast car is driving down the street” and “a fast rock is driving down the street” are both grammatically correct, the second doesn’t make much sense… unless you are talking about The Rock, that is). Instead, we can use machine learning to find these patterns and relationships without much work.

Notice above how we put elements of English into certain pairs, like adjectives before nouns. This suggests that one potential approach would be to track pairs of words. In fact, this is the approach we’ll use in this assignment.
One way to implement this would be to record a list of words that come after each word we find. We can do this by looking at individual pairs of adjacent words in our training text, which we’ll call bigrams. Bigrams are going to be the basis of how our Markov Chain works, so we’ll walk you through an example of how bigrams are created to make sure the concept is clear. Say we have the following sentence, and we want to generate a list of bigrams:
"An apple is grown on an apple tree."
To make the list of bigrams, we’d start by taking the first pair of words: “An” and “apple,” and putting them together in a bigram. After that, we’d move on to the next pair: “apple” and “is.” Notice that the first and second pairs overlap. This is intentional! It allows us to represent the sentence as a linked chain of words, where we can move from one word to the next.
Here’s the list of bigrams we get from continuing the process:
("An", "apple"),
("apple", "is"),
("is", "grown"),
("grown", "on"),
("on", an"),
("an", "apple"),
("apple", "tree"),
("tree", END)
Notice that we still create a bigram that starts with the last word, even though there’s no word after it. This helps us track which words appear at the end of a sentence, which we record with a special end marker. In your code, you’ll use null as the end marker.
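Here is a short, self-contained sketch that prints the bigrams for the example sentence, using null as the end marker:

```java
import java.util.Arrays;
import java.util.List;

public class BigramDemo {
    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
                "An", "apple", "is", "grown", "on", "an", "apple", "tree");
        for (int i = 0; i < sentence.size(); i++) {
            String first = sentence.get(i);
            // null marks the end of the sentence, as in the assignment.
            String second = (i + 1 < sentence.size()) ? sentence.get(i + 1) : null;
            System.out.println("(" + first + ", " + second + ")");
        }
    }
}
```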
Now we can use this list of bigrams to generate what we call a probability distribution (also known as “frequency data”) for each word. This probability distribution tells us how likely any word is to follow after a given word. We’ll figure this out based on the number of times one word follows the previous one in our training data. To compute this distribution, we can start by listing out every unique word in the sentence and the list of words that we’ve seen after it in our set of bigrams.
| Word  | Following words |
|-------|-----------------|
| an    | apple, apple    |
| apple | is, tree        |
| is    | grown           |
| grown | on              |
| on    | an              |
| tree  | END             |
Let’s convert this table into a list of probability distributions. “An” is only ever followed by “apple” in our training data, so the probability distribution for “an” will specify that “apple” follows “an” 100% of the time. “Apple” was followed by “tree” once and “is” once, so we give each of those words a value of 50% in our probability distribution. Continuing with this process gives us this table:
| Word  | Probability distribution | Frequency data |
|-------|---------------------------|----------------|
| an    | apple (100%)              | apple: 2       |
| apple | is (50%), tree (50%)      | is: 1, tree: 1 |
| is    | grown (100%)              | grown: 1       |
| grown | on (100%)                 | on: 1          |
| on    | an (100%)                 | an: 1          |
| tree  | END (100%)                | END: 1         |
Note that in our code for ProbabilityDistribution we represent the distribution as its corresponding frequency data, which maps strings to their frequencies. For instance, the probability distribution associated with “apple” maps “is” to 1 and “tree” to 1. This information is stored in the field called chain of the MarkovChain class.
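To make the stored shape concrete, here is a self-contained sketch that builds the same frequency data, with a plain Map<String, Map<String, Integer>> standing in for the provided ProbabilityDistribution (and, for brevity, omitting the end marker):

```java
import java.util.HashMap;
import java.util.Map;

public class FrequencyDataDemo {
    public static void main(String[] args) {
        String[] words = { "an", "apple", "is", "grown", "on", "an", "apple", "tree" };
        // Stand-in for the chain field: the real MarkovChain maps each
        // String to a ProbabilityDistribution rather than a raw map.
        Map<String, Map<String, Integer>> chain = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            chain.putIfAbsent(words[i], new HashMap<>());
            // Increment the count of words[i + 1] following words[i].
            chain.get(words[i]).merge(words[i + 1], 1, Integer::sum);
        }
        System.out.println(chain);
        // e.g. {an={apple=2}, apple={is=1, tree=1}, is={grown=1},
        //       grown={on=1}, on={an=1}} (map order may vary)
    }
}
```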
There’s one more wrinkle: the MarkovChain also has to keep track of which words might begin a sentence. To do so, we keep track of another probability distribution called startWords, which records the frequencies with which each word starts a sentence. In the simple example above, there was just one sentence and it began with “an”, so the resulting distribution will just map “an” to 1. Richer training sets will include more starting words.
And that’s it for training! Our Markov Chain is ready to go.
Your first coding step for this task is to implement addBigram(). addBigram() works by adding or modifying records in the Markov Chain’s map of probability distributions to record a pair of adjacent words. Note that the dictionary maps String to ProbabilityDistribution. This is because for every observed word we want to keep track of a separate probability distribution (like we did in the example above) of the words that follow it.

Your next job is to implement train(), which takes in an iterator of words that form a sentence and updates your model based on that sequence by adding all bigrams that occur in that sentence. You will also need to update the ProbabilityDistribution of startWords so that your MarkovChain knows realistic places to start. For example, if 90% of sentences in your training data start with the word “the,” then using a ProbabilityDistribution to find the start word will ensure that about 90% of the sentences generated by your model start with “the.”
To complete the TwitterBot, you’ll be training your Markov Chain with a separate iterator for each sentence. This is because each sentence can be treated as a completely separate sequence of words, as we only care about which words are ordered consecutively within the same sentence. You’ll need to implement this functionality in the last step of the assignment.
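The following self-contained sketch shows one way the training step can fit together. Raw count maps again stand in for ProbabilityDistribution, and the real class’s method for incrementing a count may be named differently, so treat this as a shape to aim for rather than code to copy:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of the training step, using raw maps in place of the provided
// ProbabilityDistribution; check its Javadocs for the real API.
public class TrainSketch {
    private final Map<String, Map<String, Integer>> chain = new HashMap<>();
    private final Map<String, Integer> startWords = new HashMap<>();

    void addBigram(String first, String second) {
        chain.putIfAbsent(first, new HashMap<>());
        chain.get(first).merge(second, 1, Integer::sum);
    }

    void train(Iterator<String> sentence) {
        if (!sentence.hasNext()) {
            return; // nothing to learn from an empty sentence
        }
        String current = sentence.next();
        startWords.merge(current, 1, Integer::sum); // sentence start
        while (sentence.hasNext()) {
            String next = sentence.next();
            addBigram(current, next);
            current = next;
        }
        addBigram(current, null); // null end marker closes the sentence
    }

    public static void main(String[] args) {
        TrainSketch m = new TrainSketch();
        m.train(Arrays.asList("an", "apple", "is", "grown", "on",
                "an", "apple", "tree").iterator());
        System.out.println(m.chain);
        System.out.println(m.startWords); // {an=1}
    }
}
```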
Task 4.2: Generating sentences
Once the Markov Chain has been trained, we can use it to generate some entirely new sentences. We first select a start word based on the words that showed up as start words in our training set. Since we only had one sentence, which started with “an,” we have no choice but to select “an” as our start word. After that, we look at “an” in our probability distribution and randomly choose the word that goes next based on this distribution. It turns out that “apple” is the following word 100% of the time, so we’ll choose “apple”. So far, our sentence is “An apple.” Once we get to “apple”, our Markov Chain has a choice: it can choose “is” 50% of the time, or “tree” 50% of the time. Let’s say it chooses “tree”. We then go to the probability distribution for “tree” and see that it can only ever be at the end of the sentence. As a result, our Markov Chain terminates with the (incomplete) sentence “An apple tree.”

As you can see, this process isn’t very exciting with such a small training set, but we can get some really cool results once we start to play with real-world Twitter data! (Though even then, the model isn’t sophisticated enough to truly take grammar into account.)
If you are more visually inclined (or you enjoy graph theory), you might appreciate the following illustration of how Markov Chains work; otherwise, feel free to skip right over it. Each lettered circle represents some word in our Markov Chain. The arrows from one word to another represent a pairing of the words in a bigram, and the size of an arrow illustrates the relative frequency at which that word occurred after its parent word in the training data. Every time the red circle moves from one word to another, the Markov Chain generates that word in its output. The diagram already looks hard to follow, but imagine one of these diagrams being created with thousands of words (which is more reasonable for a training dataset). Generating a sentence or tweet would still work the same way, however: you would just walk from one word to another based on the frequencies of occurrence in the training data.
If you understand the idea here, then you understand a Markov Chain, because it’s simply a data structure that tracks the number of times one thing follows another. You’ll write the code for one and then use it in your TwitterBot as we describe below.
The first coding step is to implement reset(), which sets the MarkovChain’s iterator to begin at a new start word.
Consider the following example: you have a Markov Chain called yourMarkovChain that creates sentences, and yourMarkovChain is about to return the word “apple”. You want to create a new sentence starting with the start word “an”, so you call yourMarkovChain.reset("an");. Before, yourMarkovChain was about to return “apple”. Now, it’s about to return “an”, and subsequent calls to yourMarkovChain.next() will continue along some random path started by “an” until you use reset() again.
Here are before and after snapshots of the diagram (images omitted): before the reset, the cursor is at E; after, it is at C.
Note that there are two versions of reset: one of them takes in a specific start word and the other one chooses a random start word. You only need to implement the method that takes in a specific start word. This method’s implementation will vary depending on how you store the MarkovChain iterator’s current element. Consider adding a field to your class for this purpose. Look at the lecture slides and recitation materials to get an idea of what fields iterators typically have.
When you get the iterator for a list of words and call next() on that iterator, you can get all the words in that list. Similarly, because the MarkovChain is an iterator, when you call next() on it, you can get the words in its dictionary. However, the order that these words appear in when you call next() isn’t the same as the order in which they were put into the MarkovChain. Instead, each word is chosen at random from the words observed to follow the previous one, weighted by their frequencies in the training data. Therefore, when we repeatedly call next() on a MarkovChain, we should end up with a sentence that (hopefully) makes sense. Because a MarkovChain is an iterator, it also has the function hasNext(), which allows you to check if there is a next element.
The second step is to implement the iterator’s hasNext() method, which should be very short.

Finally, you have to write next(). Just like the above two methods, your implementation depends on how you store the current element in the iterator. You’ll need to use the ProbabilityDistribution’s methods to help you. You’ll also need to use the NumberGenerator class field, ng.
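To see what a frequency-weighted pick looks like mechanically, here is a self-contained sketch. In the real assignment, ProbabilityDistribution’s pick() plays this role, driven by the NumberGenerator field ng, so check the provided Javadocs for the exact signatures rather than copying this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of one generation step: given the frequency data for the
// current word, pick a follower with probability proportional to its
// count. Assumes freq is non-empty.
public class PickSketch {
    static String pick(Map<String, Integer> freq, Random rng) {
        int total = freq.values().stream().mapToInt(Integer::intValue).sum();
        int roll = rng.nextInt(total); // index into the cumulative counts
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            roll -= e.getValue();
            if (roll < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable for non-empty freq");
    }

    public static void main(String[] args) {
        Map<String, Integer> afterApple = new HashMap<>();
        afterApple.put("is", 1);
        afterApple.put("tree", 1);
        // Roughly half the picks should be "is" and half "tree".
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(afterApple, rng));
        }
    }
}
```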
And there you go! You implemented a basic Machine Learning algorithm! Go enjoy some Insomnia Cookies or something or just stare at The Rock some more.
Task 5: TwitterBot
Now, it’s time to put everything together! In TwitterBot, you will use all of the classes you’ve written thus far to create a real working TwitterBot. The most important parts of this class are the constructor, which extracts sentences from the training tweets and uses them to train the Markov Chain, and the generateTweet() method, which uses the algorithm we explained above in the Concept: Markov Chain section to generate the sentences that form the new tweet.
The biggest piece of this task is implementing the generateTweet() method. You also must implement writeStringsToFile, a simple method that we can use in the main method to write the generated tweets to a text file.
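As a point of reference for the file-writing half, here is a minimal, self-contained sketch of writing strings to a file, one per line, with BufferedWriter. The real writeStringsToFile’s exact signature (and whether it appends or overwrites) is specified in the provided stub.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class WriteLinesDemo {
    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("an apple tree.", "an apple is grown.");
        // try-with-resources closes the writer even if a write fails;
        // "tweets.txt" is a placeholder output path.
        try (BufferedWriter bw = new BufferedWriter(new FileWriter("tweets.txt"))) {
            for (String t : tweets) {
                bw.write(t);
                bw.newLine();
            }
        } catch (IOException e) {
            System.out.println("could not write file: " + e.getMessage());
        }
    }
}
```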
Running the TwitterBot
Now it’s time to run your TwitterBot! At the top of the TwitterBot class, there is a class field called pathToTweets. This is the path to the CSV with training data that will be used to create your TwitterBot. We have that field pre-filled with a CSV of tweets of cute things that dogs might say (source: the Dog Feelings Twitter account <-- yep, this data is from a real Twitter account). Feel free to change the path to other files (we provided some extra CSVs from other Twitter accounts). Note: you will have to change the columnForPath field if you use a different file, as it might have the tweets in a different column than the dog tweets. Finally, to run the TwitterBot, just open up the TwitterBot class and press run. You should see some tweets being output to the console.
Submission and Grading
Submission Instructions
As with previous assignments, you will be uploading only the following files:
src/main/java/org/cis120/FileLineIterator.java
src/main/java/org/cis120/TweetParser.java
src/main/java/org/cis120/MarkovChain.java
src/main/java/org/cis120/TwitterBot.java
src/test/java/org/cis120/FileLineIteratorTest.java
src/test/java/org/cis120/MarkovChainTest.java
src/test/java/org/cis120/TweetParserTest.java
src/test/java/org/cis120/TwitterBotTest.java
If you are using Codio
These files should be organized similar to above (with a src and test
directory). The easiest option is to use the Zip Project
menu item in
Codio. Do not include any of the other provided files, since doing so may
cause your submission to fail to compile. You can directly upload the zip
produced by Codio into Gradescope.
If you are using IntelliJ
Gradescope allows you to easily drag-and-drop files into it, or you can click “Browse” in Gradescope to open up a file browser on your computer to select files. Upload only the files listed above.
Grading Breakdown
- Automated testing (90%)
- Task 2: FileLineIterator (10%)
- Task 3: TweetParser (25%)
- Task 4: MarkovChain (25%)
- Task 5: TwitterBot (25%)
- Style (5%)
- Manual grading (10%)
- Task 1: Quality of testing (10%)
You have three free submissions for this assignment. Each additional submission will cost you five points from your final score.
Attribution
This homework was created by the CIS 120 19fa Development Committee
(Alex Seidel, Angela Xi, Daniel Like, Nico Melton, Nicolas Corona, and William Goeller)