
Homework 8: TwitterBot

LinkedIn is for the people you know. Facebook is for the people you used to know. Twitter is for people you want to know.
     — Unknown


Assignment Overview

In this assignment, you will learn how to read tweets from CSV files into a program, build your own machine learning model, and create an AI that generates (somewhat) realistic tweets! Before we jump into the code or the specifics of the project, there are a few concepts you should familiarize yourself with:

Task 0: Understanding the Problem

Concept: Training Data

In machine learning and artificial intelligence, training data is data you input into a model so that it ‘learns’ how to do something. For example, to build an AI that can tell if an image contains a cat, you might have to first train it with thousands of images and tell it which ones contain cats and which ones don’t. That way, the model can extract patterns or features from images that contain cats, and then use them to guess about an image it has never seen before. The thousands of photos input to the model would be the ‘training data’ in this example. In this project, the training data will be collections of tweets obtained from real Twitter data.

Concept: Data Cleaning

Generally speaking, higher quality training data produces higher quality results. Now, what exactly makes training data ‘high quality’ depends on the model you are using and the task at hand. For this tweet bot, the training data will be hundreds of tweets from a real Twitter account. Using our understanding of what these tweets look like and how our TwitterBot will generate new tweets, we can improve the bot's performance by filtering and cleaning the raw tweets, yielding 'cleaned' training data with which to build the model. For example, we don't want any URLs to show up in our processed data, so we filter out all the tweets that contain URLs.

Concept: Regular Expression

To filter out undesirable tweets, we specify patterns of string data that are considered "bad". A regular expression is a simple way to define such patterns.
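
For a concrete (and deliberately simplified) example, the snippet below uses a made-up URL pattern to flag strings that should be filtered out. The real patterns in this assignment are already defined for you in TweetParser and are more thorough than this one:

import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // A simplified URL pattern, for illustration only.
        Pattern badTweet = Pattern.compile("\\bhttps?://\\S+");

        System.out.println(badTweet.matcher("check out http://example.com").find()); // true
        System.out.println(badTweet.matcher("just words here").find());              // false
    }
}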

Concept: Markov Chain

The specific model you will use in this assignment is called a Markov Chain. To train your Markov Chain, you'll pass in tweets as sequences of words, which will then be analyzed in pairs so that your model can figure out which words should follow which other words. Don't worry if this doesn't make sense to you yet! We'll go into a lot more detail about how this works later in the assignment.

Concept: CSV File

CSV stands for Comma Separated Value. It’s a commonly used file format for storing data. A properly formatted CSV file will have the following shape:

line1_value1, line1_value2, … line1_valueN
line2_value1, line2_value2, … line2_valueN
…
lineK_value1, lineK_value2, … lineK_valueN

As you can see, each line of the file has different values that are, well, separated by commas. CSVs are commonly used to represent tables. Each line in the file would be a row in a table, and each value in the line would represent a cell in the table. For example, if I wanted to use a CSV as a phonebook, I’d write something like:

Tupac Shakur, New York City, 212-779-1269, tupac@deathrowrecords.com
John Lennon, Liverpool, +44 151-496-0367, john@johnlennon.com
Prince Rogers Nelson, Minneapolis, 612-200-5655, contact@prince.com

This is great: since each “column” contains the same type of data, it’s easy to write code to extract whatever you need from the file. In this particular assignment, we have provided several CSV files for you to try out your code on. You can open a CSV file in two ways: 1) using Notepad or another text editor, where you will see the data in its raw comma-separated form, like above, and 2) in Excel (or another spreadsheet program), where the CSV will be converted to a row-by-column layout.
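
To see how little code that extraction takes, here is a small, hedged example (the row and column index are made up) that pulls the "city" column out of one phonebook row. Splitting on a bare comma is naive, since real CSV values may themselves contain quoted commas, but it works for simple files like the ones in this assignment:

public class CsvDemo {
    public static void main(String[] args) {
        // One row from the phonebook example above.
        String line = "John Lennon, Liverpool, +44 151-496-0367, john@johnlennon.com";
        String[] cells = line.split(",");     // naive split on commas
        System.out.println(cells[1].trim());  // prints "Liverpool"
    }
}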

In this assignment, one of your tasks will involve figuring out how to extract just the raw tweets from the file. Then, you will work with that raw data in various ways, including cleaning and validating to improve the performance of the TwitterBot (see the Concept: Training Data and Data Cleaning sections).

File Breakdown

The list below gives a brief description of the provided source files. The ones marked by * are the only ones you need to modify.

  • *FileLineIterator.java - This class is a wrapper around the Java I/O class BufferedReader (see lecture slides here) that provides a simple way to work with file data line-by-line. By writing this class, you are creating a simple, nifty file reading utility for yourself.

  • *TweetParser.java - This class reads in tweet data from a file, and then cleans, filters and formats the data to improve the quality of the training data. Its interface also makes it easy for you to add that cleaned data to the Markov Chain.

  • *MarkovChain.java - This is the class that implements, well, the Markov Chain, which will count all the times that certain words follow other words in the training data. In addition to storing that data, it implements Iterator, which provides a convenient way to continuously pick successive random elements.

  • *TwitterBot.java - This is the class where you put everything together. Here, you will use TweetParser to clean and parse tweet data, write logic for adding that data to an instance of MarkovChain, and then use the populated MarkovChain to generate tweets.

  • ProbabilityDistribution.java - This class is useful for counting occurrences of things. You will use it in building MarkovChain. It contains a Map<T, Integer>, where the keys are instances of objects of type T and the Integer represents the count for the number of times that object occurred. It also provides a convenient way to randomly pick one of its keys.

  • NumberGenerator.java - This is an interface for any class that can generate numbers. It is used in ProbabilityDistribution for picking keys.

  • RandomNumberGenerator.java - This class implements NumberGenerator and just generates numbers randomly.

  • ListNumberGenerator.java - This class implements NumberGenerator and also just generates numbers. Instead of being random, though, it takes a list of numbers as an input, so the user has control over what numbers it outputs. It’s useful to swap this in place of a RandomNumberGenerator when you need to eliminate randomness in order to test your MarkovChain.

The configuration of your project files should mirror that of the provided zip file. It has two directories for code, named src and test, respectively, and one, called files, that stores the CSV Twitter data.

Task 1: Testing

Now that you have a better understanding of the problem, you will want to write some tests to cement your knowledge. We have provided you with the following classes: FileLineIteratorTest.java, TweetParserTest.java, MarkovChainTest.java, and TwitterBotTest.java. These are in the test folder of the downloadable zip file. Make sure to write your tests in those files. We have provided you with example tests as a model for your own. Before tackling each of the next tasks (or each of their parts), it is a good idea to write some test cases. Run the tests frequently as you develop your code.

These test cases (and the TwitterBot application) use CSV files located in the files directory. Note that the test cases we provide use only a few of the provided CSV files, but you may write your own test cases that use the other files. (You might want to look at those CSV files to see examples of the kinds of inputs your code will need to be able to process.)

Task 2: FileLineIterator

As you know, you need to read in tweet data from a file. So, you’re going to have to do some file I/O. Java provides many classes for working with I/O, a list of which can be found here. From that list, BufferedReader is a good choice for simple, easy, line-by-line file reading.

The goal for this task is to create an Iterator over the lines of a file. In other words, the Iterator should read one line at a time. Every iterator must implement two methods: hasNext() and next(). hasNext() returns true if there is another element in the iterator and false otherwise. next() returns the next element of the Iterator or throws a NoSuchElementException if none exists (i.e. hasNext() is false).

Implement these two methods and the constructor so that a FileLineIterator will return the next line of a file as a String every time the user calls next(). You may want to refer to the Iterator Javadocs for more information on iterators. Note: Remember to close your readers and writers in this class and throughout the homework.
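
To make that contract concrete, here is a minimal sketch of the general shape, assuming one line of lookahead is buffered in a field. The class name, field names, and constructor behavior here are illustrative only; your FileLineIterator must match the provided stub:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LineIteratorSketch implements Iterator<String> {
    private final BufferedReader reader;
    private String nextLine; // lookahead; null once the file is exhausted

    public LineIteratorSketch(String filePath) throws IOException {
        reader = new BufferedReader(new FileReader(filePath));
        advance();
    }

    // Read one line ahead, closing the reader at end of file.
    private void advance() throws IOException {
        nextLine = reader.readLine();
        if (nextLine == null) {
            reader.close();
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more lines");
        }
        String line = nextLine;
        try {
            advance();
        } catch (IOException e) {
            nextLine = null; // treat an I/O failure as end of file
        }
        return line;
    }
}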

Task 3: TweetParser

The purpose of this class is to take a CSV file containing tweet data as an input and transform that data into a form that A) can be used to easily build a Markov Chain and B) optimizes the performance of the Markov Chain. TweetParser provides crucial functionality through its main public method, csvFileToTrainingData(). We represent the training data contained in a tweet as an object of type List<List<String>>. Each inner list of Strings represents a sentence that has been cleaned (punctuation and other non-word data removed) and divided into words (each word is just a String). A whole tweet's data is a list of such sentences. For example, a tweet represented by the String "CIS 120 is great! CIS 120 is fun." would be turned into training data shaped like the following (where we use OCaml list notation, since Java doesn't have a similar notation): [["cis"; "120"; "is"; "great"]; ["cis"; "120"; "is"; "fun"]]

Creating the training data works in a couple of stages. First, each file is converted into a list of tweets. Then each tweet is filtered to remove bad strings (like URLs) and broken up into a list of sentences. Finally, each sentence is cleaned up: words are converted to lower case and non-word data (like symbols) is dropped. Each of these steps is represented by a different method of TweetParser.

You should use FileLineIterator when writing csvFileToTweets(). TweetParser also contains a few other private methods, including some that you will write. They'll be needed by parseAndCleanTweet(), which you should use as part of your implementation of csvFileToTrainingData(). You will see that the "cleaning" code uses a couple of regular expressions defined as constants in the file. You won't need to create such complex regular expressions in your code, but you might want to use a (very simple!) regular expression and the String.split method in some places.
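
For instance, a one-character class and String.split are enough to break a cleaned-up tweet into sentence chunks. This is only a hedged sketch of the idea, not the exact pattern your implementation must use:

public class SplitDemo {
    public static void main(String[] args) {
        String tweet = "cis 120 is great! cis 120 is fun.";
        // Split wherever a sentence-ending punctuation mark appears.
        String[] sentences = tweet.split("[.!?]");
        for (String s : sentences) {
            System.out.println(s.trim()); // "cis 120 is great", then "cis 120 is fun"
        }
    }
}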

Task 4: MarkovChain

Welcome to the Markov Chain, which is really the backbone of the whole TwitterBot. Here, we're going to take the sequences of words we created in TweetParser and extract patterns from them so that we can generate tweets that are similar to the ones in our training data. There are two steps to working with the Markov Chain: training it and then generating sentences.

Task 4.1: Training the MarkovChain

Let's say you wanted to have a computer generate Shakespearean sonnets, so you downloaded a bunch of sonnets as training data, which is essentially just the data you want your output to resemble. How would you make your output look like your training data? You might start by calculating the number of occurrences of each word, and then having your generator produce words based on how frequently they appear. For example, the word “and” would show up more often than “kangaroo”. While you might end up with a lot of thou's, thee's, and thine's, they are unlikely to be in the correct order, as you are essentially just grabbing common Shakespearean words at random. English has a clear order and structure of words: adjectives come before nouns, adverbs before verbs, etc. It would be hard to program this relationship for all words by hand and to ensure that word choices make logical sense (e.g., while "a fast car is driving down the street" and "a fast rock is driving down the street" are both grammatically correct, the second doesn't make much sense... unless you are talking about The Rock, that is). Instead, we can use machine learning to find these patterns and relationships without much work. Notice above how we put elements of English into certain pairs, like adjectives before nouns. This suggests that one option is to track pairs of words. In fact, this is the approach we'll use in this assignment.

One way to implement this would be to record a list of words that come after each word we find. We can do this by looking at pairs of adjacent words in our training text, which we'll call bigrams. Bigrams are going to be the basis of how our Markov Chain works, so we'll walk you through an example of how bigrams are created to make sure the concept is clear. Say we have the following sentence, and we want to generate its list of bigrams:

"An apple is grown on an apple tree."

To make the list of bigrams, we'd start by taking the first pair of words: "An" and "apple," and putting them together in a bigram. After that, we'd move on to the next pair: "apple" and "is." Notice that the first and second pairs overlap. This is intentional! It allows us to represent the sentence as a linked chain of words, where we can move from one word to the next.

Here's the list of bigrams we get from continuing the process:

("An", "apple"), ("apple", "is"), ("is", "grown"), ("grown", "on"), ("on", an"), ("an", "apple"), ("apple", "tree"), ("tree", END)

Notice that we still create a bigram that starts with the last word, even though there's no word after it. This helps us track which words occur at the end of a sentence, which we record with a special end marker. In your code, you'll use null as the end marker.
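
Here is a tiny standalone demo of that pairing process, printing the same bigram list as above (for illustration only; your MarkovChain will consume words through an iterator rather than an array, and will use null instead of the string "END"):

public class BigramDemo {
    public static void main(String[] args) {
        String[] words = {"An", "apple", "is", "grown", "on", "an", "apple", "tree"};
        // Pair each word with the word that follows it; the last word is
        // paired with the end marker.
        for (int i = 0; i < words.length; i++) {
            String second = (i + 1 < words.length) ? words[i + 1] : "END";
            System.out.println("(" + words[i] + ", " + second + ")");
        }
    }
}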

Now we can use this list of bigrams to generate what we call a probability distribution (also known as "frequency data") for each word. This probability distribution tells us how likely any word is to follow after a given word. We'll figure this out based on the number of times one word follows the previous one in our training data. To compute this distribution, we can start by listing out every unique word in the sentence and the list of words that we've seen after it in our set of bigrams.

Word     Following words
an       apple, apple
apple    is, tree
is       grown
grown    on
on       an
tree     END

Let's convert this table into a list of probability distributions. "An" is only ever followed by "apple" in our training data, so the probability distribution for "an" will specify that "apple" follows "an" 100% of the time. "Apple" was followed by "tree" once and "is" once, so we give each of those words a value of 50% in our probability distribution. Continuing with this process gives us this table:

Word     Probability distribution    Frequency data
an       apple (100%)                apple:2
apple    is (50%), tree (50%)        is:1, tree:1
is       grown (100%)                grown:1
grown    on (100%)                   on:1
on       an (100%)                   an:1
tree     END (100%)                  END:1

Note that in our code for ProbabilityDistribution we represent the distribution as its corresponding frequency data, which maps strings to their frequencies. For instance, the probability distribution associated with "apple" maps "is" to 1 and "tree" to 1. This information is stored in the field called chain of the MarkovChain class.

There's one more wrinkle: The MarkovChain also has to keep track of which words might begin a sentence. To do so, we keep track of another probability distribution called startWords, which records the frequencies with which each word starts a sentence. In the simple example above, there was just one sentence and it began with "an", so the resulting distribution will just map "an" to 1. Richer training sets will include more starting words.

And that's it for training! Our Markov Chain is ready to go.

Your first coding step for this task is to implement addBigram(). addBigram() works by adding or modifying records in the Markov Chain's map of probability distributions to record a pair of adjacent words. Note that this map sends each String to a ProbabilityDistribution. This is because for every observed word we want to keep track of a separate probability distribution (like we did in the example above) of the words that follow it. Your next job is to implement train(), which takes in an iterator of words that form a sentence and updates your model based on that sequence by adding all bigrams that occur in that sentence. You will also need to update the startWords ProbabilityDistribution so that your MarkovChain knows realistic places to start. For example, if 90% of sentences in your training data start with the word "the," then using a ProbabilityDistribution to find the start word will ensure that 90% of the sentences generated by your model start with "the."
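
To summarize the bookkeeping, here is a self-contained sketch of the counting logic using plain maps. Treat it only as a picture of the idea: the real MarkovChain stores a ProbabilityDistribution per word rather than a raw Map<String, Integer>, so your code will look different:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class TrainingSketch {
    // chain.get(w) maps each word observed after w to its frequency.
    private final Map<String, Map<String, Integer>> chain = new HashMap<>();
    // Frequencies of sentence-starting words.
    private final Map<String, Integer> startWords = new HashMap<>();

    void addBigram(String first, String second) {
        chain.computeIfAbsent(first, k -> new HashMap<>())
             .merge(second, 1, Integer::sum);
    }

    void train(Iterator<String> sentence) {
        if (!sentence.hasNext()) {
            return; // nothing to learn from an empty sentence
        }
        String current = sentence.next();
        startWords.merge(current, 1, Integer::sum); // record the start word
        while (sentence.hasNext()) {
            String next = sentence.next();
            addBigram(current, next);
            current = next;
        }
        addBigram(current, null); // null is the end-of-sentence marker
    }
}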

To complete the TwitterBot, you'll be training your Markov Chain with a separate iterator for each sentence. This is because each sentence can be treated as a completely separate sequence of words, as we only care about which words are ordered consecutively within the same sentence. You'll need to implement this functionality in the last step of the assignment.

Task 4.2: Generating sentences

Once the Markov Chain has been trained, we can use it to generate some entirely new sentences. We first select a start word based on the words that showed up as start words in our training set. Since we only had one sentence, which started with "an," we have no choice but to select "an" as our start word. After that, we look up "an" in our map of probability distributions and randomly choose the word that comes next based on its distribution. It turns out that "apple" is the following word 100% of the time, so we choose "apple." So far, our sentence is "An apple." Once we get to "apple," our Markov Chain has a choice: it can choose "is" 50% of the time, or "tree" 50% of the time. Let's say it chooses "tree." We then go to the probability distribution for "tree" and see that it can only ever be at the end of a sentence. As a result, our Markov Chain terminates with the (incomplete) sentence "An apple tree." As you can see, this process isn't very exciting with such a small training set, but we can get some really cool results once we start to play with real-world Twitter data! (Though even then, the model isn't sophisticated enough to truly take grammar into account.)

If you are more visually inclined (or you enjoy graph theory), you might appreciate picturing how Markov Chains work as a diagram; otherwise, feel free to skip right over this. Imagine each word in the Markov Chain drawn as a lettered circle, with arrows from one word to another. Each arrow represents a pairing of words in a bigram, and the size of an arrow illustrates the relative frequency with which that word occurred after its parent word in the training data. Every time a cursor moves along an arrow from one word to another, the Markov Chain generates that word in its output. Such a diagram is already hard to follow with a handful of words, but imagine one created with thousands of words (which is more realistic for a training dataset). Generating a sentence or tweet would still work the same way, however: you just walk from one word to another based on the frequencies of occurrence in the training data.

If you understand the idea here, then you understand a Markov Chain, because it’s simply a data structure that tracks the number of times one thing follows another. You’ll write the code for one and then use it in your TwitterBot as we describe below.

The first coding step is to implement reset(), which sets the MarkovChain's iterator to begin at a new start word.

Consider the following example: you have a Markov Chain called yourMarkovChain that creates sentences, and yourMarkovChain is about to return the word "apple". You want to create a new sentence starting with the start word "an", so you call yourMarkovChain.reset("an");. Before, yourMarkovChain was about to return "apple". Now, it's about to return "an", and subsequent calls to yourMarkovChain.next() will continue along some random path started by "an" until you use reset() again.

Here are before and after images:

[Diagrams: the iterator's cursor sits at one node (E) before the reset and at the new start node (C) afterward.]

Note that there are two versions of reset: one of them takes in a specific start word and the other one chooses a random start word. You only need to implement the method that takes in a specific start word. This method's implementation will vary depending on how you store the MarkovChain iterator's current element. Consider adding a field to your class for this purpose. Look at the lecture slides and recitation materials to get an idea of what fields iterators typically have.

The second step is to implement the iterator's hasNext() method, which should be very short.

Finally, you have to write next(). Just like the two methods above, your implementation depends on how you store the current element in the iterator. You'll need to use the ProbabilityDistribution's methods to help you, and you'll also need to use the NumberGenerator ng class field.
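
Putting the pieces together, the iterator has roughly the shape sketched below, reusing the plain-map representation from the training sketch above. Every name here (curr, sample, the constructor) is an assumption for illustration; in your code, ProbabilityDistribution and the NumberGenerator field ng do the actual sampling:

import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Random;

public class ChainIteratorSketch implements Iterator<String> {
    private final Map<String, Map<String, Integer>> chain;
    private final Random rng = new Random();
    private String curr; // the word next() will return; null = end of sentence

    ChainIteratorSketch(Map<String, Map<String, Integer>> chain, String start) {
        this.chain = chain;
        this.curr = start; // reset(start) would update this field the same way
    }

    @Override
    public boolean hasNext() {
        return curr != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String word = curr;
        curr = sample(chain.get(word)); // walk one random edge forward
        return word;
    }

    // Stand-in for ProbabilityDistribution's picking: choose a key with
    // probability proportional to its count. A null key means END.
    private String sample(Map<String, Integer> freq) {
        int total = 0;
        for (int count : freq.values()) {
            total += count;
        }
        int roll = rng.nextInt(total);
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            roll -= e.getValue();
            if (roll < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable");
    }
}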

And there you go! You implemented a basic Machine Learning algorithm! Go enjoy some Insomnia Cookies or something or just stare at The Rock some more.

Task 5: TwitterBot

Now, it’s time to put everything together! In TwitterBot, you will use all of the classes you’ve written thus far to create a real working TwitterBot. The most important parts of this class are the constructor, which extracts sentences from the training tweets and uses them to train the Markov Chain, and the generateTweet() method, which uses the algorithm we explained above in the Concept: Markov Chain section to generate the sentences that form the new tweet.

Running the TwitterBot

Now it's time to run your TwitterBot! At the top of the TwitterBot class, there is a class field called pathToTweets. This is the path to the CSV with the training data that will be used to create your TwitterBot. We have pre-filled that field with a CSV of tweets of cute things that dogs might say (source: Dog Feelings Twitter <---- yep, this data is from a real Twitter account). Feel free to change the path to other files (we provide some extra CSVs from other Twitter accounts). Note: you will have to change the columnForPath field if you use a different file, as it might have the tweets in a different column than the dog tweets. Finally, to run the TwitterBot, just open up the TwitterBot class and press run. You should see some tweets being output to the console.

The biggest piece of this task is implementing the generateTweet() method. You must also implement writeStringsToFile, a simple method that we use in the main method to write the generated tweets to a text file.
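
As a hedged sketch of the file-writing half (the real writeStringsToFile signature is given in the stub and may differ), a try-with-resources block around a BufferedWriter keeps the "remember to close your writers" rule from Task 2 satisfied automatically:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class WriterSketch {
    // Write each string on its own line; append controls whether an
    // existing file is extended or overwritten.
    static void writeLines(List<String> lines, String filePath, boolean append) {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(filePath, append))) {
            for (String line : lines) {
                bw.write(line);
                bw.newLine();
            }
        } catch (IOException e) {
            System.err.println("could not write file: " + e.getMessage());
        }
    }
}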

Submission and Grading

Submission Instructions

As with previous assignments, you will be uploading hw08-submit.zip, a compressed archive containing only the following files:

  • src/FileLineIterator.java
  • src/TweetParser.java
  • src/MarkovChain.java
  • src/TwitterBot.java
  • test/FileLineIteratorTest.java
  • test/MarkovChainTest.java
  • test/TweetParserTest.java
  • test/TwitterBotTest.java
Please upload your .java files (both testing and regular) to the CIS 120 submission site.

If you are using Codio

These files should be organized similarly to the above (with a src and a test directory). The easiest option is to use the Zip menu item in Codio. Do not include any of the other provided files, since doing so may cause your submission to fail to compile.

If you are using Eclipse

  • Alternative 1 - Zip your files from Eclipse using the instructions below.
  • Follow these instructions to create and upload hw08-submit.zip:

    • Right click on project, select Export...
    • Expand General, select Archive File
    • Only select the files listed above (click on arrow in window on left and select files in the window on the right)
    • Browse... -> Save as "hw08-submit" in one of Desktop/Documents/Downloads
    • Select "Save in zip format"
    • Finish
    • Go to submission site, find file in Desktop/Documents/Downloads and then upload
  • Alternative 2 - Copy-Paste your code in Codio and zip from there.

You have three free submissions for this assignment. Each additional submission will cost you five points from your final score.

Grading Breakdown

  • Automated testing (85%)
    • Task 2: FileLineIterator (10%)
    • Task 3: TweetParser (25%)
    • Task 4: MarkovChain (25%)
    • Task 5: TwitterBot (25%)
  • Manual grading (15%)
    • Style and design (5%)
    • Task 1: Quality of testing (10%)

Attribution

This homework was created by the CIS 120 19fa Development Committee
(Alex Seidel, Angela Xi, Daniel Like, Nico Melton, Nicolas Corona, and William Goeller)