Package org.cis1200

Class MarkovChain

java.lang.Object
org.cis1200.MarkovChain

public class MarkovChain extends Object
A Markov Chain is a data structure that tracks the frequency with which one token follows another token in a collection of sequences. (Recall that a "token" is a nonempty String produced by TweetParser containing either a word or punctuation.) This project uses a Markov Chain to model tweets by gathering frequency information from a Twitter feed. The MarkovChain generates "plausible" tweets by constructing a random walk through the chain according to those frequencies. Please see the homework instructions for more information on Markov Chains.

TRAINING:

An ILLUSTRATIVE EXAMPLE (see the corresponding test cases throughout the project files). Suppose we want to train the MarkovChain on these two Strings that represent tweets:

"a table and a chair" and

"a banana! and a banana?"

We first parse these two tweets into sequences of String tokens that represent the individual words and punctuation marks of the tweet. (See TweetParser.parseAndCleanTweet(String).) For these two tweets, we get the lists below. (Note how the punctuation is separated as individual tokens.)
["a", "table", "and", "a", "chair"]
["a", "banana", "!", "and", "a", "banana", "?"]

The MarkovChain that results from this training data maps each observed token to a ProbabilityDistribution that is based on the recorded occurrences of bigrams (adjacent tokens) in the data. The MarkovChain also computes a ProbabilityDistribution that contains the frequencies with which words start the tweets of the training data.

If we print the Markov Chain resulting from the training data above, we see:

  ILLUSTRATIVE EXAMPLE MARKOV CHAIN:
  startTokens: { "a":2 }
  bigramFrequencies:
  "!":    { "and":1 }
  "?":    { "<END>":1 }
  "a":    { "banana":2  "chair":1  "table":1 }
  "and":  { "a":2 }
  "banana":   { "!":1  "?":1 }
  "chair":    { "<END>":1 }
  "table":    { "and":1 }
 

For this training data, because both tweets start with "a", the startTokens distribution simply records that fact.

The bigramFrequencies data structure records the information about occurrences of adjacent tokens in the tweets. For instance, the token "a" is followed by "banana" twice, "chair" once, and "table" once. The token "!" is followed by "and" once, whereas "?" (like "chair") appears only at the end of a tweet.

NOTE: we use the END_TOKEN marker "<END>" to mark the end of a tweet.
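
The bigram counting described above can be sketched with plain java.util maps standing in for the real ProbabilityDistribution objects. (The class name TrainSketch and its train helper are illustrative only, not part of the assignment.)

```java
import java.util.*;

public class TrainSketch {
    static final String END_TOKEN = "<END>";

    // Count start tokens and bigrams for a list of tokenized tweets.
    // bigramFrequencies maps each token to (next token -> count).
    static Map<String, Map<String, Integer>> train(List<List<String>> tweets,
                                                   Map<String, Integer> startTokens) {
        Map<String, Map<String, Integer>> bigrams = new TreeMap<>();
        for (List<String> tweet : tweets) {
            if (tweet.isEmpty()) {
                continue;                                  // empty tweets contribute nothing
            }
            startTokens.merge(tweet.get(0), 1, Integer::sum);
            for (int i = 0; i < tweet.size(); i++) {
                String first = tweet.get(i);
                // the last token of a tweet transitions to the END_TOKEN marker
                String second = (i + 1 < tweet.size()) ? tweet.get(i + 1) : END_TOKEN;
                bigrams.computeIfAbsent(first, k -> new TreeMap<>())
                       .merge(second, 1, Integer::sum);
            }
        }
        return bigrams;
    }

    public static void main(String[] args) {
        Map<String, Integer> starts = new TreeMap<>();
        Map<String, Map<String, Integer>> bigrams = train(List.of(
                List.of("a", "table", "and", "a", "chair"),
                List.of("a", "banana", "!", "and", "a", "banana", "?")), starts);
        System.out.println("startTokens: " + starts);   // startTokens: {a=2}
        System.out.println(bigrams.get("a"));           // {banana=2, chair=1, table=1}
    }
}
```

Running this on the two example tweets reproduces the counts in the illustrative Markov Chain above.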

SAMPLING

Once we have trained the Markov Chain, we can use it to generate new sequences that mimic the training inputs. This is done by conducting a "random" walk through the chain.

The walk begins by choosing a token from the startTokens distribution. In the running example, since both tweets start with the token "a", the only possible starting token is "a".

Then, a sequence of tokens is generated by choosing the next token according to the bigram distributions until the END_TOKEN is encountered, at which point the walk is finished. For example, after the token "a", the walk might pick "chair", and then, because "chair" is always followed by the END_TOKEN, the walk is complete. A different walk might yield the sequence "a table and a banana?"

We model this random walk process as an Iterator that, given a source of (random) numbers, yields the sequence of tokens visited by choosing that "path" through the Markov Chain, as explained in more detail below.
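
To make the walk concrete, here is a self-contained sketch of the example chain driven by a fixed list of numbers standing in for a NumberGenerator. For simplicity, each number is treated as an index into an alphabetical list of followers, whereas the real ProbabilityDistribution chooses by cumulative frequency; WalkSketch and its methods are illustrative names only.

```java
import java.util.*;

public class WalkSketch {
    static final String END_TOKEN = "<END>";

    // The example chain, with each token's followers in alphabetical order.
    static Map<String, List<String>> chain() {
        Map<String, List<String>> m = new TreeMap<>();
        m.put("!", List.of("and"));
        m.put("?", List.of(END_TOKEN));
        m.put("a", List.of("banana", "chair", "table"));
        m.put("and", List.of("a"));
        m.put("banana", List.of("!", "?"));
        m.put("chair", List.of(END_TOKEN));
        m.put("table", List.of("and"));
        return m;
    }

    // Follow a path: each number picks an index into the current token's
    // follower list; the walk ends when END_TOKEN is reached.
    static List<String> walk(Map<String, List<String>> chain, List<String> startTokens,
                             Iterator<Integer> choices) {
        List<String> result = new ArrayList<>();
        String current = startTokens.get(choices.next());
        while (!current.equals(END_TOKEN)) {
            result.add(current);
            current = chain.get(current).get(choices.next());
        }
        return result;
    }

    public static void main(String[] args) {
        // Choices 0,2,0,0,0,1,0 walk: a -> table -> and -> a -> banana -> ? -> <END>
        List<String> tokens = walk(chain(), List.of("a"),
                List.of(0, 2, 0, 0, 0, 1, 0).iterator());
        System.out.println(String.join(" ", tokens));   // a table and a banana ?
    }
}
```

Note that the walk consumes one number to choose the start token and one number per transition, including the final transition to END_TOKEN.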

Your job is to complete the code in this class to provide the functionality described above.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    (package private) final Map<String,ProbabilityDistribution<String>>
    bigramFrequencies
    for each token, probability distribution of next token in a sentence
    (package private) static final String
    END_TOKEN
    end of sentence marker
    (package private) final ProbabilityDistribution<String>
    startTokens
    probability distribution of initial tokens in a sentence
  • Constructor Summary

    Constructors
    Constructor
    Description
    MarkovChain()
    Construct an empty MarkovChain that can later be trained.
    MarkovChain(List<List<String>> trainingData)
    Construct a trained MarkovChain from the given list of training data.
  • Method Summary

    Modifier and Type
    Method
    Description
    (package private) void
    addBigram(String first, String second)
    Adds a bigram to the Markov Chain information by recording it in the appropriate probability distribution of bigramFrequencies.
    void
    addSequence(Iterator<String> tweet)
    Adds a single tweet's training data to the Markov Chain frequency information by recording the first token in startTokens, recording each subsequent bigram of co-occurring tokens, and recording a final bigram of the last token of the tweet and END_TOKEN to mark the end of the sequence.
    List<Integer>
    findWalkChoices(List<String> tokens)
    Generates a list of numbers such that, if it is installed as the number generator for the MarkovChain and used as an iterator, the tokens returned in sequence will be the list of provided tokens.
    (package private) ProbabilityDistribution<String>
    get(String token)
    Returns the ProbabilityDistribution for a given token.
    Iterator<String>
    getRandomWalk()
    Gets a random walk through the Markov Chain.
    Iterator<String>
    getWalk(NumberGenerator ng)
    Gets a walk through the Markov Chain that follows the path given by the NumberGenerator.
    String
    toString()
    Use this method to print out Markov Chains with tokens and probability distributions.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

  • Constructor Details

  • Method Details

    • addBigram

      void addBigram(String first, String second)
      Adds a bigram to the Markov Chain information by recording it in the appropriate probability distribution of bigramFrequencies. (If this is the first time that first has appeared in a bigram, creates a new probability distribution first.) You may assume that the two Strings in the bigram are tokens that have been produced by TweetParser.
      Parameters:
      first - the first token of the bigram (should not be null)
      second - the second token of the bigram (should not be null)
      Throws:
      IllegalArgumentException - when either parameter is null
    • addSequence

      public void addSequence(Iterator<String> tweet)
      Adds a single tweet's training data to the Markov Chain frequency information, by:
      1. recording the first token in startTokens
      2. recording each subsequent bigram of co-occurring pairs of tokens
      3. recording a final bigram of the last token of the tweet and END_TOKEN to mark the end of the sequence

      Does nothing if the tweet is empty.

      Note that this is where you train the MarkovChain.

      Parameters:
      tweet - an iterator representing one tweet of training data
      Throws:
      IllegalArgumentException - when the tweet Iterator is null
    • get

      ProbabilityDistribution<String> get(String token)
      Returns the ProbabilityDistribution for a given token. Returns null if none exists. This function is implemented for you.
      Parameters:
      token - the token for which the ProbabilityDistribution is sought
      Returns:
      the ProbabilityDistribution for the token, or null if none exists
      Throws:
      IllegalArgumentException - when the parameter is null
    • getWalk

      public Iterator<String> getWalk(NumberGenerator ng)
      Gets a walk through the Markov Chain that follows the path given by the NumberGenerator. See MarkovChainIterator for the details. This function is implemented for you.
      Parameters:
      ng - the path to follow (represented as a NumberGenerator, assumed nonnull)
      Returns:
      an Iterator that yields the tokens on that path
    • getRandomWalk

      public Iterator<String> getRandomWalk()
      Gets a random walk through the Markov Chain. This function is implemented for you.
      Returns:
      an Iterator that yields the tokens on that path
    • findWalkChoices

      public List<Integer> findWalkChoices(List<String> tokens)
      Generates a list of numbers such that, if that list is installed as the number generator for the MarkovChain and used as an iterator, the tokens returned in sequence will be the list of provided tokens. Note that the length of the list of numbers is equal to the length of the list of tokens plus one (for the END_TOKEN). This function is implemented for you.
      Parameters:
      tokens - an ordered list of tokens that the MarkovChain should generate
      Returns:
      a list of integers representing a walk through the Markov Chain that produces the given sequence of tokens
      Throws:
      IllegalArgumentException - when any of the following are true:
      • tokens is null or empty
      • the first token in the list is not in startTokens
      • any of the tokens in the list is not found as a key in the chain
      • the last token of the list cannot transition to END_TOKEN
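
Under the same simplified representation used in the walk sketch above (alphabetically indexed follower lists; the class name ChoicesSketch is hypothetical), recovering the choices is just the inverse lookup: find the index of each successive token, finishing with the transition to END_TOKEN.

```java
import java.util.*;

public class ChoicesSketch {
    static final String END_TOKEN = "<END>";

    // Recover the index choices that would make a walk reproduce the given
    // tokens, assuming followers are indexed in alphabetical order.
    static List<Integer> findChoices(Map<String, List<String>> chain,
                                     List<String> startTokens, List<String> tokens) {
        if (tokens == null || tokens.isEmpty()) {
            throw new IllegalArgumentException("tokens is null or empty");
        }
        List<Integer> choices = new ArrayList<>();
        int start = startTokens.indexOf(tokens.get(0));
        if (start < 0) {
            throw new IllegalArgumentException("first token is not a start token");
        }
        choices.add(start);                                // choice of start token
        for (int i = 0; i < tokens.size(); i++) {
            String next = (i + 1 < tokens.size()) ? tokens.get(i + 1) : END_TOKEN;
            List<String> followers = chain.get(tokens.get(i));
            int idx = (followers == null) ? -1 : followers.indexOf(next);
            if (idx < 0) {
                throw new IllegalArgumentException("no transition to " + next);
            }
            choices.add(idx);                              // choice of next token
        }
        return choices;   // tokens.size() + 1 numbers, as the spec requires
    }

    public static void main(String[] args) {
        Map<String, List<String>> chain = new TreeMap<>();
        chain.put("a", List.of("banana", "chair", "table"));
        chain.put("table", List.of("and"));
        chain.put("and", List.of("a"));
        chain.put("banana", List.of("!", "?"));
        chain.put("?", List.of(END_TOKEN));
        System.out.println(findChoices(chain, List.of("a"),
                List.of("a", "table", "and", "a", "banana", "?")));
        // [0, 2, 0, 0, 0, 1, 0]
    }
}
```

Feeding the resulting list back into a walk reproduces exactly the token sequence it was derived from.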
    • toString

      public String toString()
      Use this method to print out Markov Chains with their tokens and probability distributions. This function is implemented for you.
      Overrides:
      toString in class Object