CIS 1200

Homework 8: Twitterbot

LinkedIn is for the people you know. Facebook is for the people you used to know. Twitter is for people you used to want to know. — Unknown

Assignment Overview

In this assignment, you will learn how to read tweets from ‘CSV files’ into a program, build your own machine learning model, and create an AI that generates (somewhat) realistic tweets!

For example, when given the data in the captain_markov_tweets.csv file, your TwitterBot might (randomly) generate outputs like the following, among many other (often even crazier) random outputs:

Captain's log supplemental. It's been destroyed to respond to a routine procedure
Captain's log stardate 5275. At Jason's request I have been attacked by an object on having it. 3. We have mystical healing powers. 1. Our ship is on it may be a new synthetic food which would not as planned. Now make that noise!
Captain's log. Our mission?
@baconshire I think that was a high warp core. 6. The Ferengi have mystical healing powers.
Captain's log stardate 45349. Unmanned probes have left with no consciousness. We can't do not as an astonishing rate. We have failed.

Before we jump into the code or the specifics of the project, there are a few concepts you should familiarize yourself with…

Task 0: Understanding the Problem

Concept: Training Data

In machine learning and artificial intelligence, training data is data you input into a model so it ‘learns’ how to do something. For example, to build an AI that can tell if an image contains a cat, you might first have to train it with thousands of images, telling it which ones contain cats and which ones don’t. That way, the model can extract patterns or features from images that contain cats and then use them to guess about an image it has never seen before. The thousands of photos given as input to the model would be the ‘training data’ in this example. In this project, the training data will be collections of tweets obtained from real Twitter data.

Concept: Data Cleaning

Generally speaking, higher quality training data produces higher quality results. What exactly makes training data ‘high quality’ depends on the model you are using and the task at hand. For this TwitterBot, the training data will be hundreds of tweets from a real Twitter account. Using our understanding of what these tweets look like and how our TwitterBot will generate new tweets, we can improve the bot’s performance by filtering and cleaning the raw tweets, yielding ‘cleaned’ training data with which to build the model. For example, we don’t want any URLs to show up in our processed data, so we filter out all the tweets that contain URLs.

Concept: Regular Expression

To filter out undesirable tweets, we specify patterns of string data that are considered “bad”. A regular expression is a simple way to define such patterns. We can also use regular expressions to characterize the valid “tokens” (i.e. words or punctuation) that can be found in a tweet.
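For a concrete taste of regular expressions in Java, here is a small, purely illustrative sketch; the patterns below are made up for this example, and the actual patterns used in this assignment are given in the provided code.

import java.util.List;
import java.util.stream.Collectors;

public class RegexDemo {
    public static void main(String[] args) {
        // Does the whole string consist of word characters, '#', '@', or apostrophes?
        System.out.println("@baconshire".matches("[\\w#@']+")); // true
        System.out.println("hello!".matches("[\\w#@']+"));      // false: '!' is punctuation

        // Data cleaning in miniature: drop any tweet that mentions a URL.
        List<String> tweets = List.of(
                "just setting up my twttr",
                "read this: http://example.com");
        List<String> cleaned = tweets.stream()
                .filter(t -> !t.contains("http"))
                .collect(Collectors.toList());
        System.out.println(cleaned); // [just setting up my twttr]
    }
}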

Concept: Markov Chain

The specific model you will use in this assignment is called a Markov Chain. To train your Markov Chain, you’ll pass in tweets as sequences of tokens, which will then be analyzed in pairs so that your model can figure out which tokens should follow which other tokens. (Don’t worry if this doesn’t make sense yet – we’ll go into a lot more detail about how it all works later in the assignment.) You can think of a Markov Chain as a baby version of today’s fancy large language models (LLMs) like ChatGPT.

Concept: CSV File

CSV stands for Comma Separated Value. It’s a commonly used file format for storing data. A properly formatted CSV file will have the following shape:

line1_field1, line1_field2, ... line1_fieldN
line2_field1, line2_field2, ... line2_fieldN
...
lineK_field1, lineK_field2, ... lineK_fieldN

As you can see, each line of the file has different fields, separated by commas. CSVs are commonly used to represent tables. Each line in the file would be a row in a table, and each value in the line would represent a cell in the table. For example, if we wanted to use a CSV as a phonebook, we’d write something like:

Tupac Shakur, New York City, 212-779-1269, tupac@deathrowrecords.com
John Lennon, Liverpool, +44 151-496-0367, john@johnlennon.com
Prince Rogers Nelson, Minneapolis, 612-200-5655, contact@prince.com
...

This is great because, since each “column” contains the same type of data, it’s easy to write code to extract whatever you need from the file. In this particular assignment, we have provided several CSV files for you to try out your code on. You can open a CSV file using Notepad or another text editor to see the data in the CSV format shown above. However, we recommend NOT using Excel to edit these CSV files, to avoid possible formatting changes.

One tricky bit of the CSV format is that sometimes we want to include the special delimiting character ‘,’ as part of the data in a field. To do that, CSV allows for “quoting” commas using (double) quotation marks. For example, the second field of the following line of CSV data contains such commas:

field1,"this, lovely, field has a comma", no commas here

In this assignment, one of your tasks will involve figuring out how to extract just the raw tweets from a CSV file. Then you will work with that raw data in various ways, including cleaning and validating to improve the performance of the TwitterBot (see the Concept: Training Data and Concept: Data Cleaning sections above).

File Breakdown

The list below gives a brief description of the provided source files. The ones marked by * are the only ones you need to modify.

  • *CSV.java - Provides static methods for working with CSV data.

  • *FileUtilities.java - Provides static methods for reading from and writing to files.

  • *LineIterator.java - This class is a wrapper around the Java I/O class BufferedReader (see lecture slides here) that provides a simple way to work with file data line-by-line. By writing this class, you are creating a simple, nifty file reading utility for yourself.

  • TweetParser.java - This class reads tweet data from a file and then cleans, filters, and formats the data to improve its usefulness as training data. Its interface also makes it easy for you to add that cleaned data to the Markov Chain.

  • *MarkovChain.java - You will implement a Markov Chain in this class. The MarkovChain will count all the times that certain words follow other words in the training data. In addition to storing that data, it also provides an iterator, using the MarkovChainIterator class.

  • *MarkovChainIterator.java - This class works with the Markov Chain to provide a way of “walking” through the data to generate (random) similar outputs. It provides this functionality as an Iterator on which you can use next() and hasNext() to generate words for your new tweet!

  • *TwitterBot.java - This is the class that holds a (trained) MarkovChain and uses it to generate tweets, mostly by inserting appropriate spacing to generate “realistic” sentences.

  • TwitterBotMain.java - This is where everything is put together. Here, the processing pipeline of CSV -> TweetParser -> TwitterBot -> Output is made explicit in the code. You can change the provided inputs to generate tweets from different sources of CSV data.

  • ProbabilityDistribution.java - This class is useful for counting occurrence frequencies of things. You will use it in building MarkovChain. It contains a Map<T, Integer>, where the keys are objects of type T and the Integer represents the count of the number of times that object occurred. It also provides a convenient way to randomly pick one of the keys based on the frequency distribution.

  • NumberGenerator.java - This is an interface that applies to any class that can generate numbers. A NumberGenerator provides the source of choices that determine a “walk” through the MarkovChain.

  • RandomNumberGenerator.java - This class implements NumberGenerator and just generates numbers randomly – it is used to generate truly random outputs from the trained MarkovChain.

  • ListNumberGenerator.java - This class also implements NumberGenerator. Instead of generating numbers randomly, however, it takes a list of numbers as an input, giving exact control over the sequence of numbers it outputs. That control makes it easy to test that the MarkovChain is working correctly.

The configuration of your project files should mirror that of the provided zip file. It has two directories for code, named src/main/java/org/cis1200 and src/test/java/org/cis1200, and one more directory named files that stores the CSV twitter data.

How Everything Works Together

The file TwitterBotMain.java shows the sequence of steps needed to take a CSV file, train a Markov Model, and use it to produce interesting tweets. Although you do not need to modify TwitterBotMain (except to play with different CSV input files), you should first read through it to see how the components of the project fit together.

Here are some checkpoints for this assignment. This is a long assignment, so try to do a little each day.

  • Checkpoint 1: FileUtilities.java and LineIterator.java use the BufferedReader and Iterator types. This part should be doable after seeing exceptions and I/O in lecture. You can first familiarize yourself with I/O by implementing the functions in FileUtilities.java, then proceed to complete LineIterator.java.

  • Checkpoint 2: CSV.java provides a parseRecord method that turns a line of CSV data into its constituent fields. This is a slightly tricky algorithm to implement because of the “quotes” and “commas” conventions used in CSV, but it doesn’t depend on any new concepts.

  • Checkpoint 3: MarkovChain.java and MarkovChainIterator.java. This part is about 50% of the homework assignment. It will take some time to think about, because it involves two overarching tasks: (1) implementing the training functionality and (2) implementing the iteration functionality.

  • Checkpoint 4: TwitterBot.java. This is a fairly easy part of the homework assignment, because it mostly relies on using the getWalk method provided by MarkovChain and combining the results into a valid output.

Task 1: Testing (Overview)

As you work through the rest of the assignment, you should start each part by writing test cases to cement your knowledge. We have provided you with the following classes, several of which include many test cases that you can use to help understand the specification of the problem. You are expected to write additional test cases for the three classes marked with *, which you will submit with your assignment.

  • CSVTest.java - provides comprehensive coverage of CSV corner cases, for your use

  • FileUtilitiesTest.java

  • *LineIteratorTest.java

  • *MarkovChainTest.java - note that this file includes a partial test that you must complete for full credit

  • ProbabilityDistributionTest.java

  • TweetParserTest.java

  • *TwitterBotTest.java

You should write tests before you implement each task.

Run the tests frequently as you develop your code.

You may find it useful to test that your code will throw an expected exception. To do so, check out the How to Write Failing Tests section in our Java testing guide here.

Task 2: FileUtilities and LineIterator

As you know, you need to read in tweet data from a file. So, you’re going to have to do some file I/O. Java provides many classes for working with I/O, a list of which can be found here. Out of all of these, BufferedReader is a good choice for simple, easy, line-by-line file reading. The class LineIterator asks you to implement an Iterator as a “wrapper” class around a BufferedReader. These two files will exercise your ability to properly handle exceptions.
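To see the basic pattern in action, here is a minimal, self-contained sketch of line-by-line reading with BufferedReader (the file path here is made up for illustration). Your LineIterator will wrap this same readLine pattern inside hasNext and next, handling exceptions according to the specifications in the provided comments.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLinesDemo {
    public static void main(String[] args) {
        // Try-with-resources closes the reader even if an exception occurs.
        try (BufferedReader br = new BufferedReader(new FileReader("files/example.csv"))) {
            String line = br.readLine(); // readLine returns null at end of file
            while (line != null) {
                System.out.println(line);
                line = br.readLine();
            }
        } catch (IOException e) {
            System.out.println("Could not read file: " + e.getMessage());
        }
    }
}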

Task 3: CSV

This class provides static methods for working with CSV data. CSV data is a line-oriented data format, so these methods work with either BufferedReader (for multiline CSV sources) or String (for working with one CSV record at a time).

The core operation is called parseRecord, which takes a CSV record and breaks it up into fields, following the CSV format conventions for usage of commas and quotation marks. See the comments in that file, and the extensive provided test cases in CSVTest.java for more details about what is expected here.
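As a rough illustration of the expected behavior (assuming here that parseRecord returns the fields as a List<String>; check CSV.java for the actual signature and CSVTest.java for the precise conventions, such as how whitespace is handled):

// Hypothetical usage of parseRecord on a record with a quoted field.
List<String> fields = CSV.parseRecord("field1,\"this, lovely, field has a comma\",last");
// Expected fields: ["field1", "this, lovely, field has a comma", "last"]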

Kudos: Parsing quoted quotes

The kudos problem focuses on parsing CSV data with nested quotes, where double quotes within a field are denoted by two consecutive double quotes. Additional details and examples can be found in CSV.java, CSVKudosTest.java, and quotes_and_commas.csv.

Task 4: MarkovChain and MarkovChainIterator

Welcome to the Markov Chain, which is really the backbone of the whole TwitterBot. Here, we’re going to take the sequences of tokens we created in TweetParser and extract patterns from them so that we can generate tweets that are similar to the ones in our training data. There are two steps to working with the Markov Chain: training it and then generating sentences. (A Markov Chain is an extremely simplified version of ChatGPT. GPT and related “large language models” use much more sophisticated neural networks that are trained on vast amounts of data to predict the next word, whereas a Markov Chain uses simple frequency data.)

Task 4.1: Training the MarkovChain

Let’s say you wanted to have a computer generate Shakespearean sonnets, so you downloaded a bunch of sonnets as training data. How would you make your output look like your training data? You might start by calculating the number of occurrences of each word, and then having your generator produce words based on how frequently they appear. For example, the word “and” would show up more often than “kangaroo”. While you might end up with a lot of thou’s, thee’s, and thine’s, they are unlikely to be in the correct order, as you are essentially just grabbing common Shakespearean words at random.

English has a clear order and structure of words: adjectives come before nouns, adverbs before verbs, etc. It would be hard to program this relationship for all words by hand and ensure the word choice makes logical sense (e.g., while “a fast car is driving down the street” and “a fast rock is driving down the street” are both grammatically correct, the second doesn’t make much sense… unless you are talking about The Rock, of course). Instead, we can use machine learning to find these patterns and relationships without much work. Notice above how we put elements of English into certain pairs, like adjectives before nouns. This suggests that one potential option would be to track pairs of words. In fact, this is the approach we’ll use in this assignment.

One way to implement this would be to record a list of words that come after each word we find. We can do this by looking at individual pairs of adjacent words in our training text, which we’ll call bigrams. Bigrams are going to be the basis of how our Markov Chain works, so we’ll walk you through an example of how bigrams are created to make sure the concept is clear. Say we have the following sentence, and we want to generate a list of bigrams:

"an apple is grown on an apple tree."

To make the list of bigrams, we’d start by taking the first pair of words: “an” and “apple,” and putting them together in a bigram. After that, we’d move on to the next pair: “apple” and “is.” Notice that the first and second pairs overlap. This is intentional! It allows us to represent the sentence as a linked chain of words, where we can move from one word to the next. (That is why this approach is called a “Markov Chain”.)

Here’s the list of bigrams we get from continuing the process:

("an", "apple"),
("apple", "is"),
("is", "grown"),
("grown", "on"),
("on", an"),
("an", "apple"),
("apple", "tree"),
("tree", "."),
(".", <END>)

Note that we treat punctuation as its own “token”, so "tree" is followed by ".". In general, we will train the Markov Chain on tokens that are either words (by which we mean strings consisting of alphanumeric characters, plus the characters #, @, and ') or punctuation (by which we mean all other non-whitespace symbols). We treat different capitalizations of the same word as different tokens, so “An” and “an” would be considered to be different – this distinction is useful because the Markov Chain can learn that capitalized words are more likely to start tweets and follow punctuation tokens.

Observe that we still create a bigram that starts with the last token, which in this case is the token “.”, even though there’s no token after it. This helps us keep track of which tokens occur at the end of a tweet. We keep track of this with an end marker, which is the constant END_TOKEN = "<END>".

Now we can use this list of bigrams to generate what we call a probability distribution (also known as “frequency data”) for each word. This probability distribution tells us how likely any word is to follow after a given word. We’ll figure this out based on the number of times one word follows the previous one in our training data. To compute this distribution, we can start by listing out every unique word in the sentence and the list of words that we’ve seen after it in our set of bigrams.

Token    Following tokens
an       apple, apple
apple    is, tree
is       grown
grown    on
on       an
tree     .
.        <END>

Let’s convert this table into a list of probability distributions. “an” is only ever followed by “apple” in our training data, so the probability distribution for “an” will specify that “apple” follows “an” 100% of the time. “apple” was followed by “tree” once and “is” once, so we give each of those words a value of 50% in our probability distribution. Continuing with this process gives us this table:

Token    Probability distribution     Frequency data
an       apple (100%)                 apple: 2
apple    is (50%), tree (50%)         is: 1, tree: 1
is       grown (100%)                 grown: 1
grown    on (100%)                    on: 1
on       an (100%)                    an: 1
tree     . (100%)                     .: 1
.        <END> (100%)                 <END>: 1

Note that in our code for ProbabilityDistribution we represent the distribution as its corresponding frequency data, which maps strings to their frequencies. For instance, the probability distribution associated with “apple” maps “is” to 1 and “tree” to 1. This information is stored in the field called bigramFrequencies of the MarkovChain class.
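To make the counting concrete, here is a self-contained sketch that tallies bigram frequencies using plain nested maps. This is only an illustration of the idea: in the actual assignment, each row of the table above is stored as a ProbabilityDistribution rather than a raw HashMap.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramDemo {
    public static void main(String[] args) {
        List<String> tokens = List.of("an", "apple", "is", "grown", "on",
                "an", "apple", "tree", ".", "<END>");
        // Maps each token to the frequency counts of its successor tokens,
        // playing the role that bigramFrequencies plays in MarkovChain.
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (int i = 0; i < tokens.size() - 1; i++) {
            freq.computeIfAbsent(tokens.get(i), k -> new HashMap<>())
                    .merge(tokens.get(i + 1), 1, Integer::sum);
        }
        System.out.println(freq.get("an"));    // {apple=2}
        System.out.println(freq.get("apple")); // prints is=1 and tree=1
    }
}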

There’s one more small wrinkle: The MarkovChain also has to keep track of which words might begin a sentence. To do so, we keep track of another probability distribution called startTokens, which records the frequencies with which each word starts a sentence. In the simple example above, there was just one sentence and it began with “an”, so the resulting distribution will just map “an” to 1. Richer training sets will include more starting words.

Your first coding step for this task is to implement addBigram(), which adds or modifies records in the Markov Chain’s map of probability distributions to record a pair of adjacent words. The record() method from the ProbabilityDistribution class can be useful here. Note that bigramFrequencies maps String to ProbabilityDistribution: for every observed word, we want to keep track of a separate probability distribution (like we did in the example above) of the words that follow it.

Your next job is to implement addSequence(), which takes in an iterator of tokens that form a tweet and updates your model by adding all bigrams that occur in that sequence. You will also need to update the startTokens ProbabilityDistribution so that your MarkovChain knows realistic places to start. For example, if 90% of the sentences in your training data start with the word “the,” then using a ProbabilityDistribution to find the start word will ensure that 90% of the sentences generated by your model start with “the.”

Finally, complete the (non-default) constructor for the MarkovChain class, which takes a whole collection of training data and builds a fully trained model. It should simply iterate through all of the sequences and add each of them to the model.

And that’s it for training! Our Markov Chain is ready to go.

Task 4.2: Generating sentences

Walks through the Markov Chain

Once the Markov Chain has been trained, we can use it to generate some entirely new tweets. We first select a start token based on the tokens that showed up as start tokens in our training set. Since we only had one sentence, which started with “an,” we have no choice but to select “an” as our start token. After that, we look at “an” in our probability distribution and randomly choose the token that goes next based on this distribution. It turns out that “apple” is the following token 100% of the time, so we’ll choose “apple.” So far, our sentence is “an apple.” Once we get to “apple,” our Markov Chain has a choice: it can choose “is” 50% of the time and “tree” 50% of the time. Let’s say it chooses “tree”. We then go to the probability distribution for “tree” and see that it can only ever be followed by the token “.” at the end of the sentence. As a result, our Markov Chain terminates with the (incomplete) sentence “an apple tree.” As you can see, this process isn’t very exciting with such a small training set, but we can get some cool results once we start to play with real-world Twitter data! (Though even then, the model isn’t sophisticated enough to truly take grammar into account.)

This process of choosing a start token and then following the probability distribution until we find the END token is called a walk through the Markov Chain. We can represent a walk as an Iterator that provides the next and hasNext operations, such that each call to next yields the next token encountered along the walk. The MarkovChain class provides a method called getWalk that returns such a MarkovChainIterator.

For a truly random walk, as described above, the MarkovChainIterator would choose the sequence of tokens arbitrarily from the probability distribution of the underlying Markov Chain data set. However, such randomized algorithms are very difficult to test! Instead, the getWalk method takes a parameter, NumberGenerator ng, which acts as a “source of choices” that are used to guide the walk along a path through the Markov Chain.

A NumberGenerator itself is a bit like an Iterator: it provides just one operation, int next(int bound), that returns an integer x satisfying 0 <= x < bound. Like an Iterator, a NumberGenerator can be called multiple times to get a series of numbers.
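For intuition, a deterministic NumberGenerator could be as simple as the sketch below (the provided ListNumberGenerator may handle details such as out-of-range numbers differently, so consult its source rather than relying on this):

import java.util.Iterator;
import java.util.List;

// Replays a fixed list of non-negative choices; illustrative only.
public class FixedChoices implements NumberGenerator {
    private final Iterator<Integer> choices;

    public FixedChoices(List<Integer> numbers) {
        this.choices = numbers.iterator();
    }

    @Override
    public int next(int bound) {
        // Reduce the supplied number into the legal range [0, bound).
        return choices.next() % bound;
    }
}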

To use the NumberGenerator to produce a walk, getWalk creates an instance of MarkovChainIterator that uses those numbers as the “choices” made at each step of the way. The MarkovChainIterator should maintain some state that records the next token along the chosen path and update it at each call to next. When selecting the starting token to begin the walk, the MarkovChainIterator uses the number generator to pick from among all of the possibilities in startTokens. Thereafter, it uses the number generator to pick from among the possible next options, following the probability distributions according to the Markov Chain bigramFrequencies.

The coding for this task is to implement the Iterator interface as required by MarkovChainIterator, following the strategy above. First complete the constructor, which should initialize the ng field and pick the starting token of the path. You will need to add at least one additional field to the MarkovChainIterator class – think about what invariant might be useful to maintain for that field. Next, implement the hasNext method, which should be quite straightforward if you have chosen a good invariant. Finally, implement the next() method, which should yield the next token along the path and set up subsequent calls to next and hasNext. This should involve using the pick operation of ProbabilityDistribution.
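If it helps, here is a standalone toy that illustrates the “lookahead” invariant, using a deliberately simplified chain in which every token has exactly one fixed successor. It is not the required implementation: the real MarkovChainIterator must consult the chain’s probability distributions and the NumberGenerator at every step.

import java.util.Iterator;
import java.util.Map;

public class WalkDemo implements Iterator<String> {
    private static final String END = "<END>";
    private final Map<String, String> successor;
    // Invariant: curr is the token the next call to next() should return,
    // or END once the walk is finished.
    private String curr;

    public WalkDemo(Map<String, String> successor, String start) {
        this.successor = successor;
        this.curr = start;
    }

    @Override
    public boolean hasNext() {
        return !END.equals(curr);
    }

    @Override
    public String next() {
        String result = curr;
        curr = successor.get(curr); // advance the walk by one step
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> chain = Map.of(
                "an", "apple", "apple", "tree", "tree", ".", ".", END);
        WalkDemo walk = new WalkDemo(chain, "an");
        while (walk.hasNext()) {
            System.out.print(walk.next() + " "); // prints: an apple tree .
        }
    }
}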

See the detailed comments about the illustrative example in MarkovChain.java for more details about this task.

And there you go! You implemented a basic Machine Learning algorithm! Go enjoy some Insomnia Cookies or something or just stare at The Rock some more.

A visual explanation of the Markov Chain structure

If you are more visually inclined (or you enjoy graph theory), you might appreciate picturing how Markov Chains work as a diagram; otherwise, feel free to skip right over this. Each lettered circle represents some word in our Markov Chain. Arrows from one word to another represent the pairing of the words in a bigram, and the thickness of an arrow illustrates the relative frequency with which that word occurred after its parent word in the training data. Every time the highlighted circle moves from one word to another, the Markov Chain generates that word in its output; each transition corresponds to one call to next. Such a diagram is already hard to follow, but imagine one created from thousands of words (which is more realistic for a training dataset). Generating a sentence or tweet would still work the same way, however: you walk from one word to another based on the frequencies of occurrence in the training data.

Task 5: TwitterBot

Now, it’s time to actually generate a tweet! The TwitterBot class provides a constructor that instantiates a trained MarkovChain from some training data. Your only part in this task is implementing the generateTweet() method, which simply uses the getWalk functionality to produce a sequence of tokens that should be concatenated, with appropriate spaces added, to form a complete tweet. This is a fairly straightforward process: you should simply use a StringBuilder along with a bit of logic to determine whether to add a space or not. That’s it!

Hint: you can use the matches method to determine whether a token is punctuation.
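For example, a sketch of the spacing logic might look like the following; the punctuation pattern here is just one plausible choice (mirroring the word-token definition given earlier), so pick whatever matches the provided tests.

import java.util.List;

public class SpacingDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (String token : List.of("an", "apple", "tree", ".")) {
            // Treat a token made entirely of non-word symbols as punctuation.
            boolean isPunct = token.matches("[^\\w#@']+");
            if (sb.length() > 0 && !isPunct) {
                sb.append(' '); // space before words, but not before punctuation
            }
            sb.append(token);
        }
        System.out.println(sb); // prints: an apple tree.
    }
}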

Running the TwitterBot

Now it’s time to run your TwitterBot! At the top of the TwitterBotMain class, which provides the whole TwitterBot pipeline, there is a class field called PATH_TO_TWEETS. This is the path to the CSV with training data that will be used to create your TwitterBot.

We have that field pre-filled to a CSV of tweets based on the illustrative example from MarkovChain.java, but you can try it out on other examples too.

One fun source is dog_feelings_tweets.csv, which contains cute things that dogs might say (source: the Dog Feelings Twitter account; yep, this data is from a real Twitter account). Feel free to change the path to other files (we provide some extra CSVs from other Twitter accounts). Note: you will have to change the columnForPath field if you use a different file, as it might have the tweets in a different column than the dog tweets.

Finally, to run the TwitterBot, just open up the TwitterBotMain class and press run. You should see some tweets being output to the console.

Submission and Grading

Submission Instructions

You will be uploading only the following files:

  • src/main/java/org/cis1200/CSV.java
  • src/main/java/org/cis1200/FileUtilities.java
  • src/main/java/org/cis1200/LineIterator.java
  • src/main/java/org/cis1200/MarkovChain.java
  • src/main/java/org/cis1200/MarkovChainIterator.java
  • src/main/java/org/cis1200/TwitterBot.java
  • src/test/java/org/cis1200/LineIteratorTest.java
  • src/test/java/org/cis1200/MarkovChainTest.java
  • src/test/java/org/cis1200/TwitterBotTest.java

If you are using Codio

These files should be organized similarly to the list above (with src and test directories). The easiest option is to use the Zip Project menu item in Codio. Do not include any of the other provided files, since doing so may cause your submission to fail to compile. You can directly upload the zip produced by Codio into Gradescope.

If you are using IntelliJ

Gradescope allows you to easily drag-and-drop files into it, or you can click “Browse” in Gradescope to open up a file browser on your computer to select files. Upload only the files listed above.

Grading Breakdown

  • Automated testing (90%)
    • Task 2: FileUtilities and LineIterator (16%)
    • Task 3: CSV (18%)
    • Task 4: MarkovChain and MarkovChainIterator (45%)
    • Task 5: TwitterBot (6%)
    • Style (5%)
  • Manual grading (10%)
    • Task 1: Quality of testing (10%)

You have three free submissions for this assignment. Each additional submission will cost you five points from your final score.


Attribution

This homework was created by the CIS 120 19fa Development Committee (Alex Seidel, Angela Xi, Daniel Like, Nico Melton, Nicolas Corona, and William Goeller).

It was significantly refactored in Fall 2023 / Spring 2024 by Steve Zdancewic.