Package org.cis1200

Class MarkovChain

java.lang.Object
org.cis1200.MarkovChain

public class MarkovChain extends Object
A Markov Chain is a data structure that tracks the frequency with which one token follows another token in a collection of sequences. (Recall that a "token" is a nonempty String produced by TweetParser containing either a word or punctuation.) This project uses a Markov Chain to model tweets by gathering frequency information from a Twitter feed. The MarkovChain generates "plausible" tweets by constructing a random walk through the chain according to those frequencies. Please see the homework instructions for more information on Markov Chains.

TRAINING:

An ILLUSTRATIVE EXAMPLE (see the corresponding test cases throughout the project files). Suppose we want to train the MarkovChain on these two Strings that represent tweets:

"a table and a chair" and

"a banana! and a banana?"

We first parse these two tweets into sequences of String tokens that represent the individual words and punctuation marks of the tweet. (See TweetParser.parseAndCleanTweet(String).) For these two tweets, we get the lists below. (Note how the punctuation is separated as individual tokens.)
["a", "table", "and", "a", "chair"]
["a", "banana", "!", "and", "a", "banana", "?"]

The MarkovChain that results from this training data maps each observed token to a ProbabilityDistribution that is based on the recorded occurrences of bigrams (adjacent tokens) in the data. The MarkovChain also computes a ProbabilityDistribution that contains the frequencies with which words start the tweets of the training data.

If we print the Markov Chain resulting from the training data above, we see:

  ILLUSTRATIVE EXAMPLE MARKOV CHAIN:
  startTokens: { "a":2 }
  bigramFrequencies:
  "!":    { "and":1 }
  "?":    { "<END>":1 }
  "a":    { "banana":2  "chair":1  "table":1 }
  "and":  { "a":2 }
  "banana":   { "!":1  "?":1 }
  "chair":    { "<END>":1 }
  "table":    { "and":1 }
 

For this training data, because both tweets start with "a", the startTokens distribution simply records that fact.

The bigramFrequencies data structure records the information about occurrences of adjacent tokens in the tweets. For instance, the token "a" is followed by "banana" twice, "chair" once, and "table" once. The token "!" is followed by "and" once, whereas "?" (like "chair") appears only at the end of a tweet.

NOTE: we use the END_TOKEN marker "<END>" to mark the end of a tweet.
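
The bigram counting described above can be sketched with plain java.util maps standing in for the real ProbabilityDistribution objects. (The class name TrainSketch and its train helper are illustrative only, not part of the assignment.)

```java
import java.util.*;

public class TrainSketch {
    static final String END_TOKEN = "<END>";

    // Count start tokens and bigrams for a list of tokenized tweets.
    // bigramFrequencies maps each token to (next token -> count).
    static Map<String, Map<String, Integer>> train(List<List<String>> tweets,
                                                   Map<String, Integer> startTokens) {
        Map<String, Map<String, Integer>> bigrams = new TreeMap<>();
        for (List<String> tweet : tweets) {
            if (tweet.isEmpty()) {
                continue;                                  // empty tweets contribute nothing
            }
            startTokens.merge(tweet.get(0), 1, Integer::sum);
            for (int i = 0; i < tweet.size(); i++) {
                String first = tweet.get(i);
                // the last token of a tweet transitions to the END_TOKEN marker
                String second = (i + 1 < tweet.size()) ? tweet.get(i + 1) : END_TOKEN;
                bigrams.computeIfAbsent(first, k -> new TreeMap<>())
                       .merge(second, 1, Integer::sum);
            }
        }
        return bigrams;
    }

    public static void main(String[] args) {
        Map<String, Integer> starts = new TreeMap<>();
        Map<String, Map<String, Integer>> bigrams = train(List.of(
                List.of("a", "table", "and", "a", "chair"),
                List.of("a", "banana", "!", "and", "a", "banana", "?")), starts);
        System.out.println("startTokens: " + starts);   // startTokens: {a=2}
        System.out.println(bigrams.get("a"));           // {banana=2, chair=1, table=1}
    }
}
```

Running this on the two example tweets reproduces the counts in the illustrative Markov Chain above.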

SAMPLING

Once we have trained the Markov Chain, we can use it to generate new sequences that mimic the training inputs. This is done by conducting a "random" walk through the chain.

The walk begins by choosing a token from the startTokens distribution. In the running example, since both tweets start with the token "a", the only possible starting token is "a".

Then, a sequence of tokens is generated by choosing the next token according to the bigram distributions until the END_TOKEN is encountered, at which point the walk is finished. For example, after the token "a", the walk might pick "chair", and then, because "chair" is always followed by the END_TOKEN, the walk is complete. A different walk might yield the sequence "a table and a banana?"

We model this random walk process as an Iterator that, given a source of (random) numbers, yields the sequence of tokens visited by choosing that "path" through the Markov Chain, as explained in more detail below.
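
To make the walk concrete, here is a self-contained sketch of the example chain driven by a fixed list of numbers standing in for a NumberGenerator. For simplicity, each number is treated as an index into an alphabetical list of followers, whereas the real ProbabilityDistribution chooses by cumulative frequency; WalkSketch and its methods are illustrative names only.

```java
import java.util.*;

public class WalkSketch {
    static final String END_TOKEN = "<END>";

    // The example chain, with each token's followers in alphabetical order.
    static Map<String, List<String>> chain() {
        Map<String, List<String>> m = new TreeMap<>();
        m.put("!", List.of("and"));
        m.put("?", List.of(END_TOKEN));
        m.put("a", List.of("banana", "chair", "table"));
        m.put("and", List.of("a"));
        m.put("banana", List.of("!", "?"));
        m.put("chair", List.of(END_TOKEN));
        m.put("table", List.of("and"));
        return m;
    }

    // Follow a path: each number picks an index into the current token's
    // follower list; the walk ends when END_TOKEN is reached.
    static List<String> walk(Map<String, List<String>> chain, List<String> startTokens,
                             Iterator<Integer> choices) {
        List<String> result = new ArrayList<>();
        String current = startTokens.get(choices.next());
        while (!current.equals(END_TOKEN)) {
            result.add(current);
            current = chain.get(current).get(choices.next());
        }
        return result;
    }

    public static void main(String[] args) {
        // Choices 0,2,0,0,0,1,0 walk: a -> table -> and -> a -> banana -> ? -> <END>
        List<String> tokens = walk(chain(), List.of("a"),
                List.of(0, 2, 0, 0, 0, 1, 0).iterator());
        System.out.println(String.join(" ", tokens));   // a table and a banana ?
    }
}
```

Note that the walk consumes one number to choose the start token and one number per transition, including the final transition to END_TOKEN.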

Your job is to complete the code in this class to provide the functionality described above.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    (package private) final Map<String,ProbabilityDistribution<String>>
    bigramFrequencies
    for each token, probability distribution of next token in a sentence
    (package private) static final String
    END_TOKEN
    end of sentence marker
    (package private) final ProbabilityDistribution<String>
    startTokens
    probability distribution of initial tokens in a sentence
  • Constructor Summary

    Constructors
    Constructor
    Description
    MarkovChain()
    Construct an empty MarkovChain that can later be trained.
    MarkovChain(List<List<String>> trainingData)
    Construct a trained MarkovChain from the given list of training data.
  • Method Summary

    Modifier and Type
    Method
    Description
    (package private) void
    addBigram(String first, String second)
    Adds a bigram to the Markov Chain information by recording it in the appropriate probability distribution of bigramFrequencies.
    void
    addSequence(Iterator<String> tweet)
    Adds a single tweet's training data to the Markov Chain frequency information by recording the first token in startTokens, recording each subsequent bigram of co-occurring tokens, and recording a final bigram of the last token of the tweet and END_TOKEN to mark the end of the sequence.
    List<Integer>
    findWalkChoices(List<String> tokens)
    Generates a list of numbers such that, if it is installed as the number generator for the MarkovChain and used as an iterator, the tokens returned in sequence will be the list of provided tokens.
    (package private) ProbabilityDistribution<String>
    get(String token)
    Returns the ProbabilityDistribution for a given token.
    Iterator<String>
    getRandomWalk()
    Gets a random walk through the Markov Chain.
    Iterator<String>
    getWalk(NumberGenerator ng)
    Gets a walk through the Markov Chain that follows the path given by the NumberGenerator.
    String
    toString()
    Use this method to print out Markov Chains with tokens and probability distributions.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

  • Constructor Details

  • Method Details

    • addBigram

      void addBigram(String first, String second)
      Adds a bigram to the Markov Chain information by recording it in the appropriate probability distribution of bigramFrequencies. (If this is the first time that first has appeared in a bigram, creates a new probability distribution first.) You may assume that the two Strings in the bigram are tokens that have been produced by TweetParser.
      Parameters:
      first - the first token of the bigram (should not be null)
      second - the second token of the bigram (should not be null)
      Throws:
      IllegalArgumentException - when either parameter is null
    • addSequence

      public void addSequence(Iterator<String> tweet)
      Adds a single tweet's training data to the Markov Chain frequency information, by:
      1. recording the first token in startTokens
      2. recording each subsequent bigram of co-occurring pairs of tokens
      3. recording a final bigram of the last token of the tweet and END_TOKEN to mark the end of the sequence

      Does nothing if the tweet is empty.

      Note that this is where you train the MarkovChain.

      Parameters:
      tweet - an iterator representing one tweet of training data
      Throws:
      IllegalArgumentException - when the tweet Iterator is null
    • get

      ProbabilityDistribution<String> get(String token)
      Returns the ProbabilityDistribution for a given token. Returns null if none exists. This function is implemented for you.
      Parameters:
      token - the token for which the ProbabilityDistribution is sought
      Returns:
      the ProbabilityDistribution for the token, or null if none exists
      Throws:
      IllegalArgumentException - when the parameter is null
    • getWalk

      public Iterator<String> getWalk(NumberGenerator ng)
      Gets a walk through the Markov Chain that follows the path given by the NumberGenerator. See MarkovChainIterator for the details. This function is implemented for you.
      Parameters:
      ng - the path to follow (represented as a NumberGenerator, assumed nonnull)
      Returns:
      an Iterator that yields the tokens on that path
    • getRandomWalk

      public Iterator<String> getRandomWalk()
      Gets a random walk through the Markov Chain. This function is implemented for you.
      Returns:
      an Iterator that yields the tokens on that path
    • findWalkChoices

      public List<Integer> findWalkChoices(List<String> tokens)
      Generates a list of numbers such that, if that list is installed as the number generator for the MarkovChain and used as an iterator, the tokens returned in sequence will be the list of provided tokens. Note that the length of the list of numbers is equal to the length of the list of tokens plus one (for the END_TOKEN). This function is implemented for you.
      Parameters:
      tokens - an ordered list of tokens that the MarkovChain should generate
      Returns:
      a list of integers representing a walk through the Markov Chain that produces the given sequence of tokens
      Throws:
      IllegalArgumentException - when any of the following are true:
      • tokens is null or empty
      • the first token in the list is not in startTokens
      • any of the tokens in the list is not found as a key in the chain
      • the last token of the list cannot transition to END_TOKEN
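
Under the same simplified representation used in the walk sketch above (alphabetically indexed follower lists; the class name ChoicesSketch is hypothetical), recovering the choices is just the inverse lookup: find the index of each successive token, finishing with the transition to END_TOKEN.

```java
import java.util.*;

public class ChoicesSketch {
    static final String END_TOKEN = "<END>";

    // Recover the index choices that would make a walk reproduce the given
    // tokens, assuming followers are indexed in alphabetical order.
    static List<Integer> findChoices(Map<String, List<String>> chain,
                                     List<String> startTokens, List<String> tokens) {
        if (tokens == null || tokens.isEmpty()) {
            throw new IllegalArgumentException("tokens is null or empty");
        }
        List<Integer> choices = new ArrayList<>();
        int start = startTokens.indexOf(tokens.get(0));
        if (start < 0) {
            throw new IllegalArgumentException("first token is not a start token");
        }
        choices.add(start);                                // choice of start token
        for (int i = 0; i < tokens.size(); i++) {
            String next = (i + 1 < tokens.size()) ? tokens.get(i + 1) : END_TOKEN;
            List<String> followers = chain.get(tokens.get(i));
            int idx = (followers == null) ? -1 : followers.indexOf(next);
            if (idx < 0) {
                throw new IllegalArgumentException("no transition to " + next);
            }
            choices.add(idx);                              // choice of next token
        }
        return choices;   // tokens.size() + 1 numbers, as the spec requires
    }

    public static void main(String[] args) {
        Map<String, List<String>> chain = new TreeMap<>();
        chain.put("a", List.of("banana", "chair", "table"));
        chain.put("table", List.of("and"));
        chain.put("and", List.of("a"));
        chain.put("banana", List.of("!", "?"));
        chain.put("?", List.of(END_TOKEN));
        System.out.println(findChoices(chain, List.of("a"),
                List.of("a", "table", "and", "a", "banana", "?")));
        // [0, 2, 0, 0, 0, 1, 0]
    }
}
```

Feeding the resulting list back into a walk reproduces exactly the token sequence it was derived from.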
    • toString

      public String toString()
      Use this method to print out Markov Chains with their tokens and probability distributions. This function is implemented for you.
      Overrides:
      toString in class Object