Class TweetParser
Converts a String into a sequence of tokens to be used for training the Markov Chain.
There is no code you need to write for this class, but it can be helpful to understand how the tokens are created. It is used in TwitterBotMain.main(String[]).
Field Summary
Fields: WORD_TOKEN, PUNCTUATION_TOKEN, TOKEN, URL_REGEX
Constructor Summary
Constructors: TweetParser()
Method Summary
parseAndCleanTweet(String tweet): Converts a string into a list of training tokens by first removing any URLs and then breaking up the string at any whitespace.
rawTweetsToTrainingData(List<String> rawTweets): Applies parseAndCleanTweet to a list of raw input tweets, returning each cleaned tweet as a list of tokens to be used as training data.
(package private) static String removeURLs(String s): Given a String, remove all substrings that look like a URL.
Field Details
WORD_TOKEN
Regular Expressions
For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9], apostrophes ['], hashes [#], and at signs [@]. (We use those symbols so that "don't", "#hashtag", and "@user" are parsed as single tokens.)
A token is either a WORD_TOKEN, which is a sequence of word characters, or a PUNCTUATION_TOKEN, like "!" or ".". Strings matching these constraints are described using regular expressions that the Pattern class uses to find matching substrings. See that documentation for more details.
The URL_REGEX matches any substring that starts a word with "http" or "https" and continues until some whitespace occurs. It is used in the removeURLs(String) static method.
PUNCTUATION_TOKEN
TOKEN
URL_REGEX
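The token rules described under Field Details can be sketched with java.util.regex as follows. The actual values of WORD_TOKEN, PUNCTUATION_TOKEN, and TOKEN are not shown on this page, so the pattern strings below are assumptions chosen to match the prose. The class name TokenRegexSketch and the helper findTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenRegexSketch {
    // Assumed values: the real constants are not listed on this page.
    // Word characters: alphanumerics plus apostrophe, hash, and at sign.
    static final String WORD_TOKEN = "[a-zA-Z0-9'#@]+";
    // A single punctuation mark; the exact set here is a guess.
    static final String PUNCTUATION_TOKEN = "[!?.,;:]";
    // A token is either a word token or a punctuation token.
    static final String TOKEN = WORD_TOKEN + "|" + PUNCTUATION_TOKEN;

    // Hypothetical helper: collects every substring that matches TOKEN,
    // in order of appearance. Whitespace never matches and is skipped.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(TOKEN).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("don't stop, #hashtag @user!"));
    }
}
```

Under these assumed patterns, "don't", "#hashtag", and "@user" each come out as a single token, while "," and "!" become separate punctuation tokens.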
Constructor Details
TweetParser
public TweetParser()
Method Details
removeURLs
Given a String, remove all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.
Parameters:
s - a String from which URL-like words should be removed
Returns:
s where each "URL-like" string has been deleted
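A minimal sketch of this behavior, assuming URL_REGEX is a word boundary followed by "http" and any run of non-whitespace characters. The real pattern string is not shown on this page, and the class name UrlRemovalSketch is hypothetical.

```java
final class UrlRemovalSketch {
    // Assumed pattern: a word starting with "http", extending to the next whitespace.
    static final String URL_REGEX = "\\bhttp\\S*";

    // Replaces every URL-like word with the empty string.
    static String removeURLs(String s) {
        return s.replaceAll(URL_REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(removeURLs("read https://example.com/a now"));
    }
}
```

In this sketch only the URL-like word itself is deleted, so the whitespace on either side of it stays in the string. A word such as "xhttp" is left alone because "http" does not start the word.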
parseAndCleanTweet
Converts a string into a list of training tokens by first removing any URLs and then breaking up the string at any whitespace.
Parameters:
tweet - a single String to be used as a source of training data tokens
Returns:
a list of tokens
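The two steps in this description (remove URLs, then split at whitespace) might look like the sketch below. It follows the method description literally and is not the class's actual code: the field documentation suggests the real method also separates punctuation into its own tokens, which this sketch does not do. The URL pattern, the return type, and the class name ParseSketch are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseSketch {
    // Assumed pattern: a word starting with "http", extending to the next whitespace.
    static final String URL_REGEX = "\\bhttp\\S*";

    static List<String> parseAndCleanTweet(String tweet) {
        // Step 1: delete URL-like words.
        String cleaned = tweet.replaceAll(URL_REGEX, "");
        // Step 2: break the remainder at runs of whitespace,
        // discarding the empty piece that split can leave at the front.
        List<String> tokens = new ArrayList<>();
        for (String piece : cleaned.split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCleanTweet("  hello   world http://t.co/x "));
    }
}
```

A tweet that consists only of a URL yields an empty list here, which is the case rawTweetsToTrainingData is documented to skip.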
rawTweetsToTrainingData
Applies parseAndCleanTweet to a list of raw input tweets, returning each cleaned tweet as a list of tokens to be used as training data.
If, after cleaning, a raw tweet has no tokens (i.e., is empty), it is ignored and does not contribute to the training data.
Parameters:
rawTweets - a list of Strings to be parsed and cleaned as tweets
Returns:
a list of training data examples
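The filtering rule above (tweets with no tokens after cleaning are dropped) can be illustrated with the following sketch. The return type List<List<String>> is inferred from the description rather than stated on this page, and standInParse is a hypothetical simplified substitute for parseAndCleanTweet so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TrainingDataSketch {
    // Hypothetical stand-in for parseAndCleanTweet: removes URL-like words
    // (assumed pattern) and splits the rest at whitespace.
    private static List<String> standInParse(String tweet) {
        List<String> tokens = new ArrayList<>();
        for (String piece : tweet.replaceAll("\\bhttp\\S*", "").split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    static List<List<String>> rawTweetsToTrainingData(List<String> rawTweets) {
        List<List<String>> trainingData = new ArrayList<>();
        for (String raw : rawTweets) {
            List<String> tokens = standInParse(raw);
            // A tweet with no tokens after cleaning contributes nothing.
            if (!tokens.isEmpty()) {
                trainingData.add(tokens);
            }
        }
        return trainingData;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("hi there", "http://x.y", "  ", "ok");
        System.out.println(rawTweetsToTrainingData(raw));
    }
}
```

In this sketch the URL-only tweet and the whitespace-only tweet both clean down to zero tokens, so only two training examples remain.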