Class MarkovChain
TRAINING:
An ILLUSTRATIVE EXAMPLE (see the corresponding test cases throughout the
project files). Suppose we want to train the MarkovChain on these two
Strings that represent tweets:
"a table and a chair"
and
"a banana! and a banana?"
We first parse these two tweets into sequences of String tokens that represent
the individual words and punctuation marks of the tweet.
(See TweetParser.parseAndCleanTweet(String).)
For these two tweets, we get the lists below. (Note how
the punctuation is separated as individual tokens.)
["a", "table", "and", "a", "chair"]
["a", "banana", "!", "and", "a", "banana", "?"]
The MarkovChain that results from this training data maps each observed token
to a ProbabilityDistribution that is based on the recorded occurrences of
bigrams (adjacent tokens) in the data. The MarkovChain also computes a
ProbabilityDistribution that contains the frequencies with which words start
the tweets of the training data.
If we print the Markov Chain resulting from the training data above, we see:
ILLUSTRATIVE EXAMPLE MARKOV CHAIN:
startTokens: { "a":2 }
bigramFrequencies:
  "!":      { "and":1 }
  "?":      { "<END>":1 }
  "a":      { "banana":2 "chair":1 "table":1 }
  "and":    { "a":2 }
  "banana": { "!":1 "?":1 }
  "chair":  { "<END>":1 }
  "table":  { "and":1 }
For this training data, because both tweets start with "a", the startTokens
distribution simply records that fact.
The bigramFrequencies
data structure records the information about
occurrences of adjacent tokens in the tweets. For instance,
the token "a" is followed by "banana" twice, "chair" once, and "table" once.
The token "!" is followed by "and" once, whereas "?" (like "chair") appears
only at the end of a tweet.
NOTE: we use the END_TOKEN marker "<END>" to mark the end of a tweet.
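The following snippet sketches how this training step might be driven from
client code. It assumes, as a hypothetical, that
TweetParser.parseAndCleanTweet(String) returns a List<String> of tokens; the
actual return type is defined in TweetParser.

    import java.util.List;

    // Build the illustrative example chain from the two tweets above.
    List<String> tweet1 = TweetParser.parseAndCleanTweet("a table and a chair");
    List<String> tweet2 = TweetParser.parseAndCleanTweet("a banana! and a banana?");

    // Training happens in the constructor, which calls addSequence on each tweet.
    MarkovChain chain = new MarkovChain(List.of(tweet1, tweet2));

    // Printing the chain should show the startTokens and bigramFrequencies
    // listed in the example output above.
    System.out.println(chain);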
SAMPLING
Once we have trained the Markov Chain, we can use it to generate new sequences that mimic the training inputs. This is done by conducting a "random" walk through the chain.
The walk begins by choosing a token from the startTokens distribution.
In the running example, since both tweets start with the token "a", the only
possible starting token is "a".
Then, a sequence of tokens is generated by choosing the next token according to
the bigram distributions until the END_TOKEN is encountered, at which point the
walk is finished. For example, after the token "a", the walk might pick "chair",
and then, because "chair" is always followed by the END_TOKEN, the walk is
complete. A different walk might yield the sequence "a table and a banana?".
We model this random walk process as an Iterator that, given a source of
(random) numbers, yields the sequence of tokens visited by choosing that "path"
through the Markov Chain, as explained in more detail below.
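As a small usage sketch, continuing from the training snippet above and assuming
that getRandomWalk() returns an Iterator<String> (consistent with the method
descriptions below):

    import java.util.Iterator;
    import java.util.StringJoiner;

    // Perform one random walk and join the visited tokens into a sentence.
    Iterator<String> walk = chain.getRandomWalk();
    StringJoiner sentence = new StringJoiner(" ");
    while (walk.hasNext()) {
        sentence.add(walk.next());
    }
    // Depending on the random choices, this might print "a chair" or
    // "a table and a banana ?".
    System.out.println(sentence);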
Your job is to complete the code in this class to provide the functionality described above.
-
Field Summary
Fields
- (package private) final Map<String, ProbabilityDistribution<String>> bigramFrequencies
  for each token, probability distribution of next token in a sentence
- (package private) static final String END_TOKEN
  end of sentence marker
- (package private) final ProbabilityDistribution<String> startTokens
  probability distribution of initial tokens in a sentence
-
Constructor Summary
Constructors
- MarkovChain()
  Construct an empty MarkovChain that can later be trained.
- MarkovChain(List<List<String>> trainingData)
  Construct a trained MarkovChain from the given list of training data.
-
Method Summary
- (package private) void addBigram(String first, String second)
  Adds a bigram to the Markov Chain information by recording it in the
  appropriate probability distribution of bigramFrequencies.
- void addSequence(Iterator<String> tweet)
  Adds a single tweet's training data to the Markov Chain frequency information,
  by: recording the first token in startTokens; recording each subsequent bigram
  of co-occurring pairs of tokens; recording a final bigram of the last token of
  the tweet and END_TOKEN to mark the end of the sequence.
- findWalkChoices(List<String> tokens)
  Generate a list of numbers such that, if it is installed as the number
  generator for the MarkovChain and used as an iterator, the tokens returned in
  sequence will be the list of provided tokens.
- (package private) ProbabilityDistribution<String> get(String token)
  Returns the ProbabilityDistribution for a given token.
- getRandomWalk()
  Gets a random walk through the Markov Chain.
- getWalk(NumberGenerator ng)
  Gets a walk through the Markov Chain that follows the path given by the
  NumberGenerator.
- toString()
  Use this method to print out Markov Chains with tokens and probability
  distributions.
-
Field Details
- startTokens
  probability distribution of initial tokens in a sentence
- bigramFrequencies
  for each token, probability distribution of next token in a sentence
- END_TOKEN
  end of sentence marker
-
Constructor Details
- MarkovChain
  public MarkovChain()
  Construct an empty MarkovChain that can later be trained. This constructor is
  implemented for you.
- MarkovChain
  MarkovChain(List<List<String>> trainingData)
  Construct a trained MarkovChain from the given list of training data. The
  training data is assumed to be non-null. Uses the addSequence(Iterator<String>)
  method on each of the provided sequences. (It is recommended that you implement
  addBigram(String, String) and addSequence(Iterator<String>) first.)
  - Parameters:
    trainingData - the input sequences of tokens from which to construct the
    MarkovChain
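  A minimal sketch of one way this constructor could be written, assuming the
  empty startTokens and bigramFrequencies structures are set up by the
  no-argument constructor (the provided skeleton may initialize the fields
  differently):

      public MarkovChain(List<List<String>> trainingData) {
          this();  // assumption: reuse the empty-chain setup
          for (List<String> tweet : trainingData) {
              // addSequence does nothing for empty tweets
              addSequence(tweet.iterator());
          }
      }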
-
-
Method Details
-
addBigram
Adds a bigram to the Markov Chain information by recording it in the appropriate
probability distribution of bigramFrequencies. (If this is the first time that
first has appeared in a bigram, creates a new probability distribution first.)
You may assume that the two Strings in the bigram are tokens that have been
produced by TweetParser.
- Parameters:
  first - the first token of the bigram (should not be null)
  second - the second token of the bigram (should not be null)
- Throws:
  IllegalArgumentException - when either parameter is null
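A sketch of one way addBigram might be implemented. The record(String) call and
the no-argument ProbabilityDistribution constructor are hypothetical names for
however that class counts an occurrence; check ProbabilityDistribution for its
actual API.

    void addBigram(String first, String second) {
        if (first == null || second == null) {
            throw new IllegalArgumentException("bigram tokens must not be null");
        }
        // Create a distribution for `first` the first time it appears,
        // then count `second` as a successor of `first`.
        bigramFrequencies
                .computeIfAbsent(first, k -> new ProbabilityDistribution<>())
                .record(second);  // hypothetical method name
    }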
-
addSequence
Adds a single tweet's training data to the Markov Chain frequency information, by:
- recording the first token in startTokens
- recording each subsequent bigram of co-occurring pairs of tokens
- recording a final bigram of the last token of the tweet and END_TOKEN to mark
  the end of the sequence
Does nothing if the tweet is empty.
Note that this is where you train the MarkovChain.
- Parameters:
  tweet - an iterator representing one tweet of training data
- Throws:
  IllegalArgumentException - when the tweet Iterator is null
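A sketch of one way addSequence might be implemented, again using a hypothetical
record(String) method on ProbabilityDistribution for counting occurrences:

    void addSequence(Iterator<String> tweet) {
        if (tweet == null) {
            throw new IllegalArgumentException("tweet iterator must not be null");
        }
        if (!tweet.hasNext()) {
            return;  // an empty tweet contributes nothing
        }
        String previous = tweet.next();
        startTokens.record(previous);      // hypothetical: count the first token
        while (tweet.hasNext()) {
            String current = tweet.next();
            addBigram(previous, current);  // record each adjacent pair
            previous = current;
        }
        addBigram(previous, END_TOKEN);    // mark the end of the tweet
    }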
-
get
Returns the ProbabilityDistribution for a given token. Returns null if none
exists. This function is implemented for you.
- Parameters:
  token - the token for which the ProbabilityDistribution is sought
- Returns:
  a ProbabilityDistribution
- Throws:
  IllegalArgumentException - when the parameter is null
-
getWalk
Gets a walk through the Markov Chain that follows the path given by theNumberGenerator
. SeeMarkovChainIterator
for the details. This function is implemented for you.- Parameters:
ng
- the path to follow (represented as aNumberGenerator
, assumed nonnull)- Returns:
- an
Iterator
that yields the tokens on that path
-
getRandomWalk
Gets a random walk through the Markov Chain. This function is implemented for you.
- Returns:
  an Iterator that yields the tokens on that path
-
findWalkChoices
Generate a list of numbers such that, if it is installed as the number generator
for the MarkovChain and used as an iterator, the tokens returned in sequence will
be the list of provided tokens. Note that the length of the list of numbers is
equal to the length of the list of tokens plus one (for the END_TOKEN). This
function is implemented for you.
- Parameters:
  tokens - an ordered list of tokens that the MarkovChain should generate
- Returns:
  a list of integers representing a walk through the Markov Chain that produces
  the given sequence of tokens
- Throws:
  IllegalArgumentException - when any of the following are true:
  - tokens is null or empty
  - the first token in the list is not in startTokens
  - any of the tokens in the list is not found as a key in the chain
  - the last token of the list cannot transition to END_TOKEN
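A usage sketch, assuming findWalkChoices returns a List<Integer> and that the
project provides a NumberGenerator implementation backed by a list of numbers
(the class name ListNumberGenerator below is a hypothetical placeholder):

    import java.util.Iterator;
    import java.util.List;

    // Ask the chain which numeric choices reproduce an exact tweet, then replay
    // them through getWalk to regenerate that tweet deterministically.
    List<String> target = List.of("a", "table", "and", "a", "chair");
    List<Integer> choices = chain.findWalkChoices(target);

    Iterator<String> walk = chain.getWalk(new ListNumberGenerator(choices));
    while (walk.hasNext()) {
        System.out.print(walk.next() + " ");  // should print: a table and a chair
    }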
-
-
toString
Use this method to print out markov chains with tokens and probability distributions. This function is implemented for you.
-