Class PostParser
Parses a String into a sequence of tokens to be used for training the Markov Chain.
There is no code you need to write for this class, but it can be helpful to understand how the tokens are created. It is used in ChatterBotMain.main(String[]).
Field Summary
Fields
- WORD_TOKEN
- PUNCTUATION_TOKEN
- TOKEN
- URL_REGEX
Constructor Summary
Constructors
- PostParser()
Method Summary
- parseAndCleanPost(String post): Converts a string into a list of training tokens by first removing any URLs and then breaking up the string at any whitespace.
- rawPostsToTrainingData(List<String> rawPosts): Applies parseAndCleanPost to a list of raw input posts, returning each cleaned post as a list of tokens to be used as training data.
- (package private) static String removeURLs(String s): Given a String, remove all substrings that look like a URL.
Field Details
WORD_TOKEN
Regular Expressions
For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9], apostrophes ['], hashes [#], and at signs [@]. (We use those symbols so that "don't", "#hashtag", and "@user" are parsed as single tokens.)
A token is either a WORD_TOKEN, which is a sequence of word characters, or a PUNCTUATION_TOKEN, like "!" or ".". Strings matching these constraints are described using regular expressions that the Pattern class uses to find matching substrings. See that documentation for more details.
The URL_REGEX matches any substring that starts a word with "http" or "https" and continues until some whitespace occurs. It is used in the removeURLs(String) static method.
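The token rules described above can be sketched with java.util.regex. This page does not show the actual values of the constants, so every pattern string below is an assumption chosen to match the prose, and the class name TokenRegexSketch and the helper findTokens are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRegexSketch {
    // ASSUMED values: the real constants in PostParser may differ.
    // One or more word characters: letters, digits, apostrophe, hash, at sign.
    static final String WORD_TOKEN = "[a-zA-Z0-9'#@]+";
    // A single punctuation mark (the exact set is a guess).
    static final String PUNCTUATION_TOKEN = "[!?.,;:]";
    // A token is either a word token or a punctuation token.
    static final String TOKEN = WORD_TOKEN + "|" + PUNCTUATION_TOKEN;
    // A word starting with "http", continuing until whitespace.
    static final String URL_REGEX = "\\bhttp\\S*";

    // Hypothetical helper: collects every substring of text that matches TOKEN.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(TOKEN).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("don't stop, @user!"));
    }
}
```

Because the apostrophe, hash, and at sign are inside the word character class, "don't" and "@user" each come out as one token, while "," and "!" become separate punctuation tokens.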
PUNCTUATION_TOKEN
TOKEN
URL_REGEX
Constructor Details
PostParser
public PostParser()
Method Details
removeURLs
Given a String, remove all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.
Parameters:
s - a String from which URL-like words should be removed
Returns:
s where each "URL-like" string has been deleted
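A minimal sketch of the behavior described for removeURLs, assuming the URL pattern is a word boundary followed by "http" and any run of non-whitespace characters. The pattern value and the class name RemoveUrlsSketch are assumptions; the real implementation is not shown on this page.

```java
class RemoveUrlsSketch {
    // ASSUMED pattern: a word that begins with "http", up to the next whitespace.
    static final String URL_REGEX = "\\bhttp\\S*";

    // Replaces every URL-like word with the empty string.
    // Surrounding whitespace is left untouched in this sketch.
    static String removeURLs(String s) {
        return s.replaceAll(URL_REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(removeURLs("read http://example.com/page now"));
    }
}
```

Note that under this assumed pattern any word starting with "http" is removed, whether or not it is a well-formed URL, which matches the description above.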
parseAndCleanPost
Converts a string into a list of training tokens by first removing any URLs and then breaking up the string at any whitespace.
Parameters:
post - a single String to be used as a source of training data tokens
Returns:
a list of tokens
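One way the two steps of parseAndCleanPost could fit together is sketched below: strip URL-like words, split on whitespace, then pull word and punctuation tokens out of each chunk. The pattern values, the List<String> return type, and the class name ParsePostSketch are assumptions, since this page gives only the prose description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParsePostSketch {
    // ASSUMED patterns; the real constants may differ.
    static final String URL_REGEX = "\\bhttp\\S*";
    static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9'#@]+|[!?.,;:]");

    static List<String> parseAndCleanPost(String post) {
        // Step 1: remove URL-like words.
        String cleaned = post.replaceAll(URL_REGEX, "");
        List<String> tokens = new ArrayList<>();
        // Step 2: break the string at whitespace, then extract the tokens
        // (words and punctuation marks) found in each chunk.
        for (String chunk : cleaned.trim().split("\\s+")) {
            Matcher m = TOKEN.matcher(chunk);
            while (m.find()) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCleanPost("Hello, world! http://example.com/x"));
    }
}
```

A post that consists only of a URL yields an empty list in this sketch, which is the case that rawPostsToTrainingData is documented to skip.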
rawPostsToTrainingData
Applies parseAndCleanPost to a list of raw input posts, returning each cleaned post as a list of tokens to be used as training data.
If, after cleaning, a raw post has no tokens (i.e., is empty), it is ignored and does not contribute to the training data.
Parameters:
rawPosts - a list of Strings to be parsed and cleaned as posts
Returns:
a list of training data examples
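The filtering rule for empty posts can be sketched as follows. The List<List<String>> return type is an assumption based on the phrase "each cleaned post as a list of tokens", and the parseAndCleanPost shown here is a simplified stand-in with assumed patterns, not the class's actual method. The class name TrainingDataSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrainingDataSketch {
    // ASSUMED token pattern, used only by the stand-in parser below.
    static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9'#@]+|[!?.,;:]");

    // Simplified stand-in for parseAndCleanPost: drops URL-like words,
    // then collects every word or punctuation token that remains.
    static List<String> parseAndCleanPost(String post) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(post.replaceAll("\\bhttp\\S*", ""));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    static List<List<String>> rawPostsToTrainingData(List<String> rawPosts) {
        List<List<String>> trainingData = new ArrayList<>();
        for (String raw : rawPosts) {
            List<String> tokens = parseAndCleanPost(raw);
            // Posts with no tokens after cleaning do not contribute.
            if (!tokens.isEmpty()) {
                trainingData.add(tokens);
            }
        }
        return trainingData;
    }

    public static void main(String[] args) {
        System.out.println(rawPostsToTrainingData(List.of("hi there", "http://example.com", "ok!")));
    }
}
```

In this sketch the second post is nothing but a URL, so it cleans to an empty token list and is left out of the result.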