Package org.cis1200

Class TweetParser

java.lang.Object
org.cis1200.TweetParser

public class TweetParser extends Object
This class provides the method for turning a raw String into a sequence of tokens to be used for training the Markov Chain.

There is no code you need to write for this class, but it can be helpful to understand how the tokens are created. It is used in TwitterBotMain.main(String[]).

  • Field Details

    • WORD_TOKEN

      static final String WORD_TOKEN
      Regular Expressions

      For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9], apostrophes ['], hashes [#], and at signs [@]. (We include those symbols so that "don't", "#hashtag", and "@user" are each parsed as a single token.)

      A token is either a WORD_TOKEN, which is a sequence of word characters, or a PUNCTUATION_TOKEN, like "!" or ".". Strings matching these constraints are described by regular expressions, which the Pattern class uses to find matching substrings. See the Pattern documentation for more details.

      The URL_REGEX matches any word that begins with "http" or "https", continuing up to the next whitespace. It is used in the removeURLs(String) static method.
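
The token rules described above can be sketched with Pattern and Matcher. The regular expressions here (WORD, PUNCTUATION, ANY_TOKEN) and the class and method names are illustrative assumptions based on the description; the actual constant values in TweetParser may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenRegexSketch {
    // Hypothetical patterns; the real WORD_TOKEN, PUNCTUATION_TOKEN, and TOKEN may differ.
    static final String WORD = "[a-zA-Z0-9'#@]+";   // one or more word characters
    static final String PUNCTUATION = "[!?.,;]";    // a single punctuation mark
    static final String ANY_TOKEN = "(" + WORD + ")|(" + PUNCTUATION + ")";

    // Collects every substring of text that matches ANY_TOKEN, in order.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(ANY_TOKEN).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [@user, don't, miss, #hashtag, !]
        System.out.println(findTokens("@user don't miss #hashtag!"));
    }
}
```

Because the apostrophe, hash, and at sign belong to the word character class, "don't", "#hashtag", and "@user" each come out as one token, while the trailing "!" is matched separately as punctuation.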

    • PUNCTUATION_TOKEN

      static final String PUNCTUATION_TOKEN
    • TOKEN

      static final String TOKEN
    • URL_REGEX

      static final String URL_REGEX
  • Constructor Details

    • TweetParser

      public TweetParser()
  • Method Details

    • removeURLs

      static String removeURLs(String s)
      Given a String, remove all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.
      Parameters:
      s - a String from which URL-like words should be removed
      Returns:
      s where each "URL-like" string has been deleted
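
As a rough illustration of this behavior, the sketch below deletes URL-like words with String.replaceAll. The pattern URL_LIKE and the names RemoveUrlsSketch and removeUrls are assumptions made for the example; they are not the actual URL_REGEX or implementation.

```java
final class RemoveUrlsSketch {
    // Hypothetical pattern: "http" at a word boundary, then any run of non-whitespace.
    static final String URL_LIKE = "\\bhttp\\S*";

    // Replaces every URL-like word with the empty string.
    static String removeUrls(String s) {
        return s.replaceAll(URL_LIKE, "");
    }

    public static void main(String[] args) {
        // The whitespace around the deleted word is left in place.
        System.out.println(removeUrls("see https://example.com now"));
    }
}
```

Note that under this pattern only the word itself is deleted, so the spaces on either side of a removed URL remain in the result.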
    • parseAndCleanTweet

      static List<String> parseAndCleanTweet(String tweet)
      Converts a string into a list of training tokens by first removing any URLs and then breaking up the string at any whitespace.
      Parameters:
      tweet - a single String to be used as a source of training data tokens
      Returns:
      a list of tokens
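
The two steps in the description (remove URLs, then break at whitespace) might look like the sketch below. It follows only the wording above; the URL pattern and the names are assumptions, and the real method may additionally use the TOKEN regular expression to separate punctuation from words.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseTweetSketch {
    // Hypothetical URL pattern, assumed for this example only.
    static final String URL_LIKE = "\\bhttp\\S*";

    // Step 1: delete URL-like words. Step 2: split what remains at whitespace.
    static List<String> parseAndClean(String tweet) {
        String cleaned = tweet.replaceAll(URL_LIKE, "").trim();
        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens; // nothing left after cleaning
        }
        for (String piece : cleaned.split("\\s+")) {
            tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world]
        System.out.println(parseAndClean("hello world https://t.co/x"));
    }
}
```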
    • rawTweetsToTrainingData

      public static List<List<String>> rawTweetsToTrainingData(List<String> rawTweets)
      Applies parseAndCleanTweet(String) to each tweet in a list of raw input tweets, returning each cleaned tweet as a list of tokens to be used as training data.

      If, after cleaning, a raw tweet has no tokens (i.e., is empty), it is ignored and does not contribute to the training data.

      Parameters:
      rawTweets - a list of Strings to be parsed and cleaned as tweets
      Returns:
      a list of training data examples
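
The filtering rule for empty tweets can be sketched as follows. To keep the example self-contained, the per-tweet parser is passed in as a Function, and standInParser is a hypothetical whitespace splitter standing in for parseAndCleanTweet; neither reflects the real signature or implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TrainingDataSketch {
    // Hypothetical stand-in for parseAndCleanTweet: splits at whitespace only.
    static List<String> standInParser(String tweet) {
        String trimmed = tweet.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Parses each raw tweet and keeps only those that produced at least one token.
    static List<List<String>> toTrainingData(
            List<String> rawTweets, Function<String, List<String>> parser) {
        List<List<String>> trainingData = new ArrayList<>();
        for (String raw : rawTweets) {
            List<String> tokens = parser.apply(raw);
            if (!tokens.isEmpty()) {
                trainingData.add(tokens);
            }
        }
        return trainingData;
    }

    public static void main(String[] args) {
        // The blank tweet contributes nothing: prints [[a, b], [c]]
        System.out.println(toTrainingData(
                List.of("a b", "   ", "c"), TrainingDataSketch::standInParser));
    }
}
```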