TweetParser

public class TweetParser
extends Object

TweetParser.csvFileToTrainingData() takes in a CSV file that contains tweets and iterates through the file, one tweet at a time, removing parts of the tweets that would be bad inputs to MarkovChain (for example, a URL). It then parses tweets into sentences and returns those sentences as lists of cleaned-up words.

Note: TweetParser's public methods are csvFileToTrainingData() and getPunctuation(). These are the only methods that other classes should call. All of the other methods provided are helper methods that build up the code you'll need to write those public methods. They have "package" (default, no modifier) visibility, which lets us write test cases for them as long as those test cases are in the same package.

Field Summary

Fields
Modifier and Type	Field	Description
`private static final String`	`BADWORD_REGEX`	Regular Expressions
`private static final char[]`	`PUNCS`	Valid punctuation marks.
`private static final String`	`URL_REGEX`

Constructor Summary

Constructors
Constructor	Description
`TweetParser()`

Method Summary

Modifier and Type	Method	Description
`(package private) static String`	`cleanWord(String word)`	Do not modify this method.
`static List<List<String>>`	`csvFileToTrainingData(String pathToCSVFile, int tweetColumn)`	Given a path to a CSV file and the column from which to extract the tweet data, computes a training set.
`(package private) static List<String>`	`csvFileToTweets(String pathToCSVFile, int tweetColumn)`	Given the argument pathToCSVFile and the column that the tweets are in, uses extractColumn and a FileLineIterator to extract every tweet from the CSV.
`(package private) static String`	`extractColumn(String csvLine, int csvColumn)`	Given a String that represents a line extracted from a CSV file and an int that represents the column of the CSV file that we want to extract from, return the contents of that column from the String.
`static char[]`	`getPunctuation()`
`(package private) static List<String>`	`parseAndCleanSentence(String sentence)`	Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
`(package private) static List<List<String>>`	`parseAndCleanTweet(String tweet)`	Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words.
`(package private) static String`	`removeURLs(String s)`	Do not modify this method.
`(package private) static String`	`replacePunctuation(String tweet)`	Do not modify this method.
`(package private) static List<String>`	`sentenceSplit(String tweet)`	Do not modify this method.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- BADWORD_REGEX
  
  private static final String BADWORD_REGEX
  
  Regular Expressions
  For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9] and apostrophes [']. A word is "bad" if it contains any other character. (In particular, Twitter mentions like "@user" are "bad".)
  The regular expression BADWORD_REGEX expresses those constraints: any String that matches it is considered "bad" and will be removed from the training data.
  The regular expression "[\\W&&[^']]" matches a single non-word character other than an apostrophe. The regular expression ".*" matches _any_ sequence of characters, including the empty one. Concatenated into the full regular expression, they match any sequence of characters, followed by a non-word character, followed by any sequence of characters; that is, any string containing a non-word character.
  Similarly, URL_REGEX matches any substring that starts a word with "http" and continues until whitespace occurs. See the removeURLs static method.
  See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more details about Java's regular expressions.
  TL;DR: use word.matches(BADWORD_REGEX) to determine whether word is a bad String.
  
  See Also:
  
  Constant Field Values
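
The matching rule described for BADWORD_REGEX can be tried out in isolation. The sketch below is illustrative only: the pattern string is reconstructed from the prose above rather than copied from the TweetParser source, and the class and method names (BadWordRegexSketch, isBad) are hypothetical.

```java
// Sketch only: ASSUMED_BADWORD_REGEX is reconstructed from the description,
// not taken from the real TweetParser class.
final class BadWordRegexSketch {
    // ".*" + one character that is neither a word character nor an apostrophe + ".*"
    static final String ASSUMED_BADWORD_REGEX = ".*[\\W&&[^']].*";

    // String.matches only succeeds if the pattern matches the whole string,
    // which is why the pattern is wrapped in ".*" on both sides.
    static boolean isBad(String word) {
        return word.matches(ASSUMED_BADWORD_REGEX);
    }
}
```

Under this assumed pattern, isBad("@user") and isBad("a.b") are true, while isBad("hello") and isBad("don't") are false. Note that Java's \W treats the underscore as a word character, so a string such as "a_b" would not be flagged by this pattern.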
- URL_REGEX
  
  private static final String URL_REGEX
  
  See Also:
  
  Constant Field Values
- PUNCS
  
  private static final char[] PUNCS
  
  Valid punctuation marks.
Constructor Details
- TweetParser
  
  public TweetParser()
Method Details
- removeURLs
  
  static String removeURLs(String s)
  
  Do not modify this method.
  Given a String, removes all substrings that look like a URL. Any word that begins with the character sequence 'http' is replaced with the empty string.
  
  Parameters:
  
  s - a String from which URL-like words should be removed
  
  Returns:
  
  s where each "URL-like" string has been deleted
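
As an illustration of the behavior described for removeURLs, here is a minimal sketch. The pattern is an assumption inferred from the description of URL_REGEX (a word starting with "http", continuing until whitespace); the real constant may differ, and the class name RemoveUrlsSketch is hypothetical.

```java
// Sketch only: ASSUMED_URL_REGEX is a guess based on the documentation,
// not the actual URL_REGEX constant.
final class RemoveUrlsSketch {
    // A word boundary, the literal "http", then every following non-whitespace character.
    static final String ASSUMED_URL_REGEX = "\\bhttp\\S*";

    static String removeURLs(String s) {
        // Each URL-like word is replaced with the empty string;
        // the whitespace around it is left untouched.
        return s.replaceAll(ASSUMED_URL_REGEX, "");
    }
}
```

Because only the URL itself is deleted, the spaces on either side of it remain, so a later split on single spaces can produce an empty token that still has to be filtered out.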
- cleanWord
  
  static String cleanWord(String word)
  
  Do not modify this method.
  Cleans a word by removing leading and trailing whitespace and converting it to lower case. If the word matches the BADWORD_REGEX or is the empty String, returns null instead.
  
  Parameters:
  
  word - a (non-null) String to clean
  
  Returns:
  
  a trimmed, lowercase version of the word if it contains no illegal characters and is not empty, and null otherwise.
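
The contract above can be sketched as follows. This is not the provided implementation: it assumes the bad-word check runs after trimming, and it uses a pattern reconstructed from the BADWORD_REGEX description. The class name CleanWordSketch is hypothetical.

```java
// Sketch only: assumes trimming and lower-casing happen before the bad-word check.
final class CleanWordSketch {
    // Reconstructed from the field documentation; the real constant may differ.
    static final String ASSUMED_BADWORD_REGEX = ".*[\\W&&[^']].*";

    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        // Empty strings and strings containing a non-word character are rejected.
        if (cleaned.isEmpty() || cleaned.matches(ASSUMED_BADWORD_REGEX)) {
            return null;
        }
        return cleaned;
    }
}
```

Callers need to check for null before using the result, since null is the signal that the word should be dropped.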
- getPunctuation
  
  public static char[] getPunctuation()
  
  Returns:
  
  an array containing the punctuation marks used by the parser.
- replacePunctuation
  
  static String replacePunctuation(String tweet)
  
  Do not modify this method.
  Given a string, replaces all of the punctuation with periods.
  
  Parameters:
  
  tweet - a String representing a tweet
  
  Returns:
  
  A String with all of the punctuation replaced with periods
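
A minimal sketch of this replacement step is shown below. The contents of PUNCS are not listed on this page, so the array used here is a hypothetical stand-in, as is the class name ReplacePunctuationSketch.

```java
// Sketch only: ASSUMED_PUNCS is a hypothetical stand-in for the real PUNCS array.
final class ReplacePunctuationSketch {
    static final char[] ASSUMED_PUNCS = {'.', '?', '!', ';'};

    static String replacePunctuation(String tweet) {
        // Normalize every punctuation mark to a period so that
        // a later step can split sentences on a single character.
        for (char c : ASSUMED_PUNCS) {
            tweet = tweet.replace(c, '.');
        }
        return tweet;
    }
}
```

With these assumed marks, "Hi! How are you?" becomes "Hi. How are you.".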
- sentenceSplit
  
  static List<String> sentenceSplit(String tweet)
  
  Do not modify this method.
  Given a tweet, splits the tweet into sentences (without end punctuation) and inserts each sentence into a list.
  Use this as a helper function for parseAndCleanTweet().
  
  Parameters:
  
  tweet - a String representing a tweet
  
  Returns:
  
  A List of Strings where each String is a (non-empty) sentence from the tweet
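
One way the splitting described above could work is sketched here. It assumes that punctuation is first normalized to periods, that each piece is trimmed, and that empty pieces are discarded; the punctuation set and the class name SentenceSplitSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the punctuation set and the trimming behavior are assumptions.
final class SentenceSplitSketch {
    static final char[] ASSUMED_PUNCS = {'.', '?', '!', ';'};

    static List<String> sentenceSplit(String tweet) {
        // Step 1: turn every punctuation mark into a period.
        for (char c : ASSUMED_PUNCS) {
            tweet = tweet.replace(c, '.');
        }
        // Step 2: split on periods and keep only non-empty sentences.
        List<String> sentences = new ArrayList<>();
        for (String piece : tweet.split("\\.")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

The period must be escaped in the split pattern because String.split takes a regular expression, in which an unescaped "." matches any character.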
- extractColumn
  
  static String extractColumn(String csvLine, int csvColumn)
  
  Given a String that represents a line extracted from a CSV file and an int that represents the column of the CSV file that we want to extract from, return the contents of that column from the String. Columns in the CSV file are zero indexed.
  You may find the String.split() method useful here. Your solution should be relatively short.
  You may assume that the column contents themselves don't have any commas.
  
  Parameters:
  
  csvLine - a line extracted from a CSV file
  
  csvColumn - the zero-indexed column of the line whose contents ought to be returned
  
  Returns:
  
  the contents of column csvColumn in csvLine, or null if csvLine is null or has no such column
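
The hint about String.split suggests something along these lines. This is one possible sketch rather than the expected solution; the class name ExtractColumnSketch is hypothetical, and treating a negative column as out of bounds is an assumption.

```java
// Sketch only: one possible reading of the contract above.
final class ExtractColumnSketch {
    static String extractColumn(String csvLine, int csvColumn) {
        // Assumption: a negative column index counts as "no appropriate column".
        if (csvLine == null || csvColumn < 0) {
            return null;
        }
        // Columns are assumed to contain no commas, so a plain split is enough.
        String[] columns = csvLine.split(",");
        return csvColumn < columns.length ? columns[csvColumn] : null;
    }
}
```

One detail worth knowing: String.split discards trailing empty strings, so for the line "a,b," this sketch reports column 2 as missing and returns null.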
- csvFileToTweets
  
  static List<String> csvFileToTweets(String pathToCSVFile, int tweetColumn)
  
  Given the argument pathToCSVFile and the column that the tweets are in, use extractColumn and a FileLineIterator to extract every tweet from the CSV. (Recall that extractColumn returns null if there is no data at that column.) You should skip lines in the CSV for which the tweetColumn is out of bounds.
  
  Parameters:
  
  pathToCSVFile - a String representing a path to a CSV file containing tweets
  
  tweetColumn - the number of the column in the CSV file that contains the tweet
  
  Returns:
  
  a List of tweet Strings, none of which is null (but which are not yet cleaned)
  
  Throws:
  
  IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist
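
The loop described above might look roughly like the following sketch. FileLineIterator is a project class whose API is not shown on this page, so a BufferedReader stands in for it here; the helper tweetsFromLines and the class name CsvFileToTweetsSketch are hypothetical, and extractColumn is a simplified stand-in.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch only: BufferedReader replaces the project's FileLineIterator.
final class CsvFileToTweetsSketch {
    // Simplified stand-in for TweetParser.extractColumn.
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null || csvColumn < 0) return null;
        String[] columns = csvLine.split(",");
        return csvColumn < columns.length ? columns[csvColumn] : null;
    }

    // Hypothetical helper: collects the tweet column from each line,
    // skipping lines where that column is out of bounds.
    static List<String> tweetsFromLines(Iterator<String> lines, int tweetColumn) {
        List<String> tweets = new ArrayList<>();
        while (lines.hasNext()) {
            String tweet = extractColumn(lines.next(), tweetColumn);
            if (tweet != null) {
                tweets.add(tweet);
            }
        }
        return tweets;
    }

    static List<String> csvFileToTweets(String pathToCSVFile, int tweetColumn) {
        if (pathToCSVFile == null) {
            throw new IllegalArgumentException("pathToCSVFile is null");
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(pathToCSVFile))) {
            return tweetsFromLines(reader.lines().iterator(), tweetColumn);
        } catch (IOException | UncheckedIOException e) {
            // A missing or unreadable file is reported as an illegal argument.
            throw new IllegalArgumentException("cannot read " + pathToCSVFile, e);
        }
    }
}
```

Separating the line-by-line loop from the file handling keeps the skipping logic testable without touching the file system.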
- parseAndCleanSentence
  
  static List<String> parseAndCleanSentence(String sentence)
  
  Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
  Hint: use the String split method and the cleanWord helper defined above. Split on a single space character, since words are delimited by spaces.
  
  Parameters:
  
  sentence - a (non-null) String representing one sentence with no end punctuation from a tweet
  
  Returns:
  
  a (non-null) list of clean words in the order they appear in the sentence. Any "bad" words are just dropped.
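
Following the hint, a sketch of this method could split on a single space and keep whatever cleanWord accepts. The cleanWord shown here is a simplified stand-in built on a reconstructed pattern, and the class name ParseSentenceSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: cleanWord is a simplified stand-in for the provided helper.
final class ParseSentenceSketch {
    static final String ASSUMED_BADWORD_REGEX = ".*[\\W&&[^']].*";

    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        return (cleaned.isEmpty() || cleaned.matches(ASSUMED_BADWORD_REGEX)) ? null : cleaned;
    }

    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.split(" ")) {
            String cleaned = cleanWord(raw);
            // cleanWord returns null for bad or empty words, which are dropped.
            if (cleaned != null) {
                words.add(cleaned);
            }
        }
        return words;
    }
}
```

Splitting on a single space means that consecutive spaces produce empty tokens; those are harmless here because cleanWord rejects the empty String.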
- parseAndCleanTweet
  
  static List<List<String>> parseAndCleanTweet(String tweet)
  
  Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words. Before breaking up the tweet into sentences, this method uses removeURLs to sanitize the tweet.
  Hint: use removeURLs followed by sentenceSplit and parseAndCleanSentence.
  
  Parameters:
  
  tweet - a String that will be split into sentences, each of which is cleaned as described above (assumed to be non-null)
  
  Returns:
  
  a (non-null) list of sentences, each of which is a (non-empty) sequence of clean words drawn from the tweet.
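
The pipeline in the hint (removeURLs, then sentenceSplit, then parseAndCleanSentence) can be sketched end to end as below. Every helper here is a simplified stand-in: the regular expressions and punctuation set are assumptions, not the real constants, and the class name TweetPipelineSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: all helpers are simplified stand-ins for the provided methods.
final class TweetPipelineSketch {
    static final String ASSUMED_URL_REGEX = "\\bhttp\\S*";
    static final String ASSUMED_BADWORD_REGEX = ".*[\\W&&[^']].*";
    static final char[] ASSUMED_PUNCS = {'.', '?', '!', ';'};

    static String removeURLs(String s) {
        return s.replaceAll(ASSUMED_URL_REGEX, "");
    }

    static List<String> sentenceSplit(String tweet) {
        for (char c : ASSUMED_PUNCS) tweet = tweet.replace(c, '.');
        List<String> sentences = new ArrayList<>();
        for (String piece : tweet.split("\\.")) {
            if (!piece.trim().isEmpty()) sentences.add(piece.trim());
        }
        return sentences;
    }

    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        return (cleaned.isEmpty() || cleaned.matches(ASSUMED_BADWORD_REGEX)) ? null : cleaned;
    }

    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.split(" ")) {
            String cleaned = cleanWord(raw);
            if (cleaned != null) words.add(cleaned);
        }
        return words;
    }

    static List<List<String>> parseAndCleanTweet(String tweet) {
        List<List<String>> result = new ArrayList<>();
        for (String sentence : sentenceSplit(removeURLs(tweet))) {
            List<String> words = parseAndCleanSentence(sentence);
            // A sentence made up entirely of bad words is omitted,
            // so every returned sentence is non-empty.
            if (!words.isEmpty()) result.add(words);
        }
        return result;
    }
}
```

The emptiness check at the end is what upholds the "(non-empty)" guarantee in the return description: sentenceSplit only promises non-empty Strings, and a sentence can still lose all of its words during cleaning.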
- csvFileToTrainingData
  
  public static List<List<String>> csvFileToTrainingData(String pathToCSVFile, int tweetColumn)
  
  Given a path to a CSV file and the column from which to extract the tweet data, computes a training set. The training set is a list of sentences, each of which is a list of words. The sentences have been cleaned up by removing URLs, dropping words that contain non-word characters, putting all words into lower case, and stripping out punctuation.
  
  Parameters:
  
  pathToCSVFile - a String representing a path to a CSV file containing tweets
  
  tweetColumn - the number of the column in the CSV file that contains the tweet
  
  Returns:
  
  a list of training data examples
  
  Throws:
  
  IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist
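
Since each tweet yields a list of sentences, the training set is the concatenation of those per-tweet lists. The sketch below shows only that flattening step; the tweet list would come from csvFileToTweets, and the parseTweet parameter is a hypothetical stand-in for parseAndCleanTweet. The class name TrainingDataSketch and the method toTrainingData are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch only: parseTweet stands in for TweetParser.parseAndCleanTweet.
final class TrainingDataSketch {
    static List<List<String>> toTrainingData(
            List<String> tweets, Function<String, List<List<String>>> parseTweet) {
        List<List<String>> trainingData = new ArrayList<>();
        for (String tweet : tweets) {
            // Each tweet contributes zero or more sentences, kept in order.
            trainingData.addAll(parseTweet.apply(tweet));
        }
        return trainingData;
    }
}
```

Using addAll rather than add keeps the result two levels deep (sentences of words) instead of nesting one list per tweet.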
