Class TweetParser

java.lang.Object
TweetParser

public class TweetParser
extends Object
TweetParser.csvFileToTrainingData() takes in a CSV file that contains tweets and iterates through the file, one tweet at a time, removing parts of the tweets that would be bad inputs to MarkovChain (for example, a URL). It then parses tweets into sentences and returns those sentences as lists of cleaned-up words.

Note: TweetParser's public methods are csvFileToTrainingData() and getPunctuation(). These are the only methods that other classes should call. All of the other methods provided are helper methods that build up the code you'll need to write those public methods. They have "package" (default, no modifier) visibility, which lets us write test cases for them as long as those test cases are in the same package.

  • Field Details

    • BADWORD_REGEX

      private static final String BADWORD_REGEX
      Regular Expressions

      For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9] and apostrophes [']. A word is "bad" if it contains any other character. (In particular, Twitter mentions like "@user" are "bad".)

      The regular expression BADWORD_REGEX expresses those constraints -- any String that matches it is considered "bad" and will be removed from the training data.

      The regular expression "[\\W&&[^']]" matches a single non-word character other than an apostrophe. The regular expression ".*" matches _any_ sequence of characters. When concatenated into the full regular expression, they match any sequence of characters, followed by a non-word character, followed by any sequence of characters; in other words, any string containing a non-word character.

      Similarly, the URL_REGEX matches any substring that starts a word with "http" and continues until some whitespace occurs. See the removeURLs static method.

      See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more details about Java's regular expressions.

      TL;DR: use word.matches(BADWORD_REGEX) to determine whether word is a bad String.
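
      As an illustration of the check described above, here is a minimal sketch. The constant's value is an assumption reconstructed from the two pieces quoted in this section (the field's actual value is listed under Constant Field Values), and the class name BadWordDemo and helper isBad are hypothetical.

```java
class BadWordDemo {
    // Assumed value: ".*" + "[\\W&&[^']]" + ".*", as described in the text above.
    static final String BADWORD_REGEX = ".*[\\W&&[^']].*";

    // Hypothetical helper: true if the word contains a character that is
    // neither a regex word character nor an apostrophe.
    static boolean isBad(String word) {
        return word.matches(BADWORD_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isBad("don't"));  // false: the apostrophe is allowed
        System.out.println(isBad("@user"));  // true: '@' is not a word character
        System.out.println(isBad("hello!")); // true: '!' is not a word character
    }
}
```

      Note that String.matches tests the whole string, which is why the expression needs ".*" on both sides.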

      See Also:
      Constant Field Values
    • URL_REGEX

      private static final String URL_REGEX
      See Also:
      Constant Field Values
    • PUNCS

      private static final char[] PUNCS
      Valid punctuation marks.
  • Constructor Details

    • TweetParser

      public TweetParser()
  • Method Details

    • removeURLs

      static String removeURLs​(String s)
      Do not modify this method.

      Given a String, remove all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.

      Parameters:
      s - a String from which URL-like words should be removed
      Returns:
      s where each "URL-like" string has been deleted
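
      The behavior described above could look roughly like the sketch below. The pattern shown is an assumption based on the prose description of URL_REGEX (a word starting with "http", continuing until whitespace); the real constant may differ. The class name RemoveUrlsSketch is hypothetical.

```java
class RemoveUrlsSketch {
    // Assumed pattern: a word boundary, then "http", then any run of
    // non-whitespace characters. Not necessarily the project's URL_REGEX.
    static final String URL_REGEX = "\\bhttp\\S*";

    static String removeURLs(String s) {
        // Each URL-like word is replaced with the empty string; the
        // whitespace around it is left untouched.
        return s.replaceAll(URL_REGEX, "");
    }

    public static void main(String[] args) {
        // Prints "see  now" (two spaces, because only the URL is removed).
        System.out.println(removeURLs("see https://example.com now"));
    }
}
```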
    • cleanWord

      static String cleanWord​(String word)
      Do not modify this method.

      Cleans a word by removing leading and trailing whitespace and converting it to lower case. If the word matches the BADWORD_REGEX or is the empty String, returns null instead.

      Parameters:
      word - a (non-null) String to clean
      Returns:
      a trimmed, lowercase version of the word if it contains no illegal characters and is not empty, and null otherwise.
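
      A minimal sketch of this contract follows. It assumes the word is trimmed and lowercased before the BADWORD_REGEX check, and it uses a regex value reconstructed from the field description; both are assumptions, and the class name CleanWordSketch is hypothetical.

```java
class CleanWordSketch {
    // Assumed value, reconstructed from the BADWORD_REGEX description.
    static final String BADWORD_REGEX = ".*[\\W&&[^']].*";

    static String cleanWord(String word) {
        // Assumption: normalize first, then validate the normalized form.
        String cleaned = word.trim().toLowerCase();
        if (cleaned.isEmpty() || cleaned.matches(BADWORD_REGEX)) {
            return null;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanWord("  Hello ")); // hello
        System.out.println(cleanWord("@user"));    // null
        System.out.println(cleanWord("   "));      // null
    }
}
```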
    • getPunctuation

      public static char[] getPunctuation()
      Returns:
      an array containing the punctuation marks used by the parser.
    • replacePunctuation

      static String replacePunctuation​(String tweet)
      Do not modify this method.

      Given a string, replaces all of the punctuation with periods.

      Parameters:
      tweet - a String representing a tweet
      Returns:
      A String with all of the punctuation replaced with periods
    • sentenceSplit

      static List<String> sentenceSplit​(String tweet)
      Do not modify this method.

      Given a tweet, splits the tweet into sentences (without end punctuation) and inserts each sentence into a list.

      Use this as a helper function for parseAndCleanTweet().

      Parameters:
      tweet - a String representing a tweet
      Returns:
      A List of Strings where each String is a (non-empty) sentence from the tweet
    • extractColumn

      static String extractColumn​(String csvLine, int csvColumn)
      Given a String that represents a line extracted from a CSV file and an int that represents the column of the CSV file that we want to extract from, return the contents of that column from the String. Columns in the CSV file are zero-indexed.

      You may find the String.split() method useful here. Your solution should be relatively short.

      You may assume that the column contents themselves don't have any commas.

      Parameters:
      csvLine - a line extracted from a CSV file
      csvColumn - the column of the line whose contents ought to be returned
      Returns:
      the portion of csvLine corresponding to the column csvColumn. If csvLine is null or has no such column, returns null.
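
      One possible shape for this method, following the String.split() hint, is sketched below. It is only an illustration under the stated assumption that column contents contain no commas; the class name ExtractColumnSketch is hypothetical.

```java
class ExtractColumnSketch {
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null) {
            return null;
        }
        // Note: split(",") discards trailing empty strings, so "a,b," has
        // only two columns under this sketch.
        String[] columns = csvLine.split(",");
        if (csvColumn < 0 || csvColumn >= columns.length) {
            return null; // no such column on this line
        }
        return columns[csvColumn];
    }

    public static void main(String[] args) {
        System.out.println(extractColumn("a,b,c", 1)); // b
        System.out.println(extractColumn("a,b,c", 3)); // null
        System.out.println(extractColumn(null, 0));    // null
    }
}
```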
    • csvFileToTweets

      static List<String> csvFileToTweets​(String pathToCSVFile, int tweetColumn)
      Given the argument pathToCSVFile and the column that the tweets are in, use extractColumn and a FileLineIterator to extract every tweet from the CSV. (Recall that extractColumn returns null if there is no data at that column.) You should skip lines in the CSV for which the tweetColumn is out of bounds.
      Parameters:
      pathToCSVFile - a String representing a path to a CSV file containing tweets
      tweetColumn - the number of the column in the CSV file that contains the tweet
      Returns:
      a List of tweet Strings, none of which are null (but that are not yet cleaned)
      Throws:
      IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist
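
      The loop described above can be sketched as follows. FileLineIterator is a project class whose interface is not shown here, so this sketch substitutes a standard BufferedReader; that substitution, the class name CsvTweetsSketch, and the inlined extractColumn stand-in are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvTweetsSketch {
    // Stand-in for extractColumn: null when the line or column is missing.
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null) return null;
        String[] columns = csvLine.split(",");
        if (csvColumn < 0 || csvColumn >= columns.length) return null;
        return columns[csvColumn];
    }

    static List<String> csvFileToTweets(String pathToCSVFile, int tweetColumn) {
        if (pathToCSVFile == null) {
            throw new IllegalArgumentException("path is null");
        }
        Path path = Paths.get(pathToCSVFile);
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("no such file: " + pathToCSVFile);
        }
        List<String> tweets = new ArrayList<>();
        // BufferedReader is used here in place of the project's FileLineIterator.
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String tweet = extractColumn(line, tweetColumn);
                if (tweet != null) { // skip lines where the column is out of bounds
                    tweets.add(tweet);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tweets;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tweets", ".csv");
        Files.write(tmp, Arrays.asList("0,first tweet", "1", "2,second tweet"));
        System.out.println(csvFileToTweets(tmp.toString(), 1)); // [first tweet, second tweet]
        Files.delete(tmp);
    }
}
```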
    • parseAndCleanSentence

      static List<String> parseAndCleanSentence​(String sentence)
      Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.

      Hint: use the String split method and the cleanWord helper defined above. Split on a single space character, since words are delimited by spaces.

      Parameters:
      sentence - a (non-null) String representing one sentence with no end punctuation from a tweet
      Returns:
      a (non-null) list of clean words in the order they appear in the sentence. Any "bad" words are just dropped.
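
      The hint above can be sketched as follows. The cleanWord stand-in and its regex are assumptions reconstructed from the descriptions earlier on this page, and the class name SentenceSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSketch {
    // Assumed value, reconstructed from the BADWORD_REGEX description.
    static final String BADWORD_REGEX = ".*[\\W&&[^']].*";

    // Stand-in for cleanWord: null for empty or "bad" words.
    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        return (cleaned.isEmpty() || cleaned.matches(BADWORD_REGEX)) ? null : cleaned;
    }

    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        // Splitting on a single space can yield empty strings when spaces
        // repeat; cleanWord maps those to null, so they are dropped too.
        for (String raw : sentence.split(" ")) {
            String cleaned = cleanWord(raw);
            if (cleaned != null) {
                words.add(cleaned);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [hello, how's, it, going]
        System.out.println(parseAndCleanSentence("Hello @user how's it   going"));
    }
}
```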
    • parseAndCleanTweet

      static List<List<String>> parseAndCleanTweet​(String tweet)
      Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words. Before breaking up the tweet into sentences, this method uses removeURLs to sanitize the tweet.

      Hint: use removeURLs, followed by sentenceSplit and parseAndCleanSentence.

      Parameters:
      tweet - a String that will be split into sentences, each of which is cleaned as described above (assumed to be non-null)
      Returns:
      a (non-null) list of sentences, each of which is a (non-empty) sequence of clean words drawn from the tweet.
    • csvFileToTrainingData

      public static List<List<String>> csvFileToTrainingData​(String pathToCSVFile, int tweetColumn)
      Given a path to a CSV file and the column from which to extract the tweet data, computes a training set. The training set is a list of sentences, each of which is a list of words. The sentences have been cleaned up by removing URLs and non-word characters, putting all words into lower case, and stripping out punctuation.
      Parameters:
      pathToCSVFile - a String representing a path to a CSV file containing tweets
      tweetColumn - the number of the column in the CSV file that contains the tweet
      Returns:
      a list of training data examples
      Throws:
      IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist