Class TweetParser
public class TweetParser extends Object
Note: TweetParser's public methods are csvFileToTrainingData() and getPunctuation(). These are the only methods that other classes should call. All of the other methods provided are helper methods that build up the code you'll need to write those public methods. They have "package" (default, no modifier) visibility, which lets us write test cases for them as long as those test cases are in the same package.
-
Field Summary
Fields
Modifier and Type | Field | Description
private static String | BADWORD_REGEX | Regular Expressions
private static char[] | PUNCS | Valid punctuation marks.
private static String | URL_REGEX |
-
Constructor Summary
Constructors
Constructor | Description
TweetParser() |
-
Method Summary
Modifier and Type | Method | Description
(package private) static String | cleanWord(String word) | Do not modify this method.
static List<List<String>> | csvFileToTrainingData(String pathToCSVFile, int tweetColumn) | Given a path to a CSV file and the column from which to extract the tweet data, computes a training set.
(package private) static List<String> | csvFileToTweets(String pathToCSVFile, int tweetColumn) | Given the argument pathToCSVFile and the column that the tweets are in, use the extractColumn and a FileLineIterator to extract every tweet from the CSV.
(package private) static String | extractColumn(String csvLine, int csvColumn) | Given a String that represents a line extracted from a CSV file and an int that represents the column of the CSV file that we want to extract from, return the contents of that column from the String.
static char[] | getPunctuation() |
(package private) static List<String> | parseAndCleanSentence(String sentence) | Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
(package private) static List<List<String>> | parseAndCleanTweet(String tweet) | Processes a tweet in to a list of sentences, where each sentence is itself a (non-empty) list of cleaned words.
(package private) static String | removeURLs(String s) | Do not modify this method
(package private) static String | replacePunctuation(String tweet) | Do not modify this method.
(package private) static List<String> | sentenceSplit(String tweet) | Do not modify this method.
-
Field Details
-
BADWORD_REGEX
Regular Expressions
For the purposes of this project, we consider "word characters" to be alphanumeric characters [a-zA-Z0-9] and apostrophes [']. A word is "bad" if it contains any other character. (In particular, Twitter mentions like "@user" are "bad".)
The regular expression BADWORD_REGEX expresses those constraints: any String that matches it is considered "bad" and will be removed from the training data.
The regular expression
"[\\W&&[^']]"
matches a single non-word character. The regular expression ".*" matches any sequence of characters. When concatenated into the full regular expression, they match any sequence of characters followed by a non-word character followed again by any sequence of characters; that is, any string containing a non-word character.
Similarly, the URL_REGEX matches any substring that starts a word with "http" and continues until some whitespace occurs. See the removeURLs static method.
See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more details about Java's regular expressions.
tl;dr: use word.matches(BADWORD_REGEX) to determine whether word is a bad String.
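As a quick illustration of the matches() check, the sketch below uses a hypothetical constant BAD, built by wrapping the character class shown above in ".*" on both sides. The actual value of BADWORD_REGEX is not reproduced on this page, so BAD and the helper isBad are assumptions made for the example.

```java
class BadWordDemo {
    // Assumed stand-in for BADWORD_REGEX: the character class from the text wrapped in ".*".
    static final String BAD = ".*[\\W&&[^']].*";

    // String.matches tests the whole string, which is why the ".*" wrappers are needed.
    static boolean isBad(String word) {
        return word.matches(BAD);
    }

    public static void main(String[] args) {
        System.out.println(isBad("hello")); // false: letters only
        System.out.println(isBad("don't")); // false: apostrophes are allowed
        System.out.println(isBad("@user")); // true: '@' is neither a word character nor an apostrophe
    }
}
```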
- See Also:
- Constant Field Values
-
URL_REGEX
- See Also:
- Constant Field Values
-
PUNCS
private static final char[] PUNCS
Valid punctuation marks.
-
-
Constructor Details
-
TweetParser
public TweetParser()
-
-
Method Details
-
removeURLs
Do not modify this method.
Given a String, remove all substrings that look like a URL. Any word that begins with the character sequence 'http' is simply replaced with the empty string.
- Parameters:
- s - a String from which URL-like words should be removed
- Returns:
- s where each "URL-like" string has been deleted
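To make the described behavior concrete, here is a minimal sketch. The pattern URL is an assumed stand-in for URL_REGEX (whose value is not shown on this page): "http" at a word boundary followed by any run of non-whitespace characters.

```java
class RemoveUrlsDemo {
    // Assumed stand-in for URL_REGEX; the real constant may differ.
    static final String URL = "\\bhttp\\S*";

    // Replace every URL-like word with the empty string.
    static String removeURLs(String s) {
        return s.replaceAll(URL, "");
    }

    public static void main(String[] args) {
        // The spaces on both sides of the removed URL are left in place.
        System.out.println(removeURLs("see http://a.b/c now"));
    }
}
```

Note that only the URL itself is deleted; surrounding whitespace stays, so later steps have to tolerate consecutive spaces.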
-
cleanWord
Do not modify this method.
Cleans a word by removing leading and trailing whitespace and converting it to lower case. If the word matches the BADWORD_REGEX or is the empty String, returns null instead.
- Parameters:
- word - a (non-null) String to clean
- Returns:
- a trimmed, lowercase version of the word if it contains no illegal characters and is not empty, and null otherwise.
-
getPunctuation
public static char[] getPunctuation()
- Returns:
- an array containing the punctuation marks used by the parser.
-
replacePunctuation
Do not modify this method.
Given a string, replaces all of the punctuation with periods.
- Parameters:
- tweet - a String representing a tweet
- Returns:
- A String with all of the punctuation replaced with periods
-
sentenceSplit
Do not modify this method.
Given a tweet, splits the tweet into sentences (without end punctuation) and inserts each sentence into a list.
Use this as a helper function for parseAndCleanTweet().
- Parameters:
- tweet - a String representing a tweet
- Returns:
- A List of Strings where each String is a (non-empty) sentence from the tweet
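One way the punctuation replacement and sentence splitting could fit together is sketched below. The contents of PUNCS are an assumption (the actual array is not listed on this page), and the provided implementation may differ in details such as how it trims whitespace.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitDemo {
    // Assumed punctuation set; the real PUNCS array is not listed on this page.
    static final char[] PUNCS = {'.', '?', '!', ';'};

    // Replace every assumed punctuation mark with a period.
    static String replacePunctuation(String tweet) {
        for (char c : PUNCS) {
            tweet = tweet.replace(c, '.');
        }
        return tweet;
    }

    // Split on periods, trim each piece, and keep only non-empty sentences.
    static List<String> sentenceSplit(String tweet) {
        List<String> sentences = new ArrayList<>();
        for (String piece : replacePunctuation(tweet).split("\\.")) {
            String sentence = piece.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(sentenceSplit("Hello world! How are you?"));
    }
}
```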
-
extractColumn
Given a String that represents a line extracted from a CSV file and an int that represents the column of the CSV file that we want to extract from, return the contents of that column from the String. Columns in the CSV file are zero indexed.
You may find the String.split() method useful here. Your solution should be relatively short.
You may assume that the column contents themselves don't have any commas.
- Parameters:
- csvLine - a line extracted from a CSV file
- csvColumn - the column of the line whose contents ought to be returned
- Returns:
- the portion of csvLine corresponding to column csvColumn. If csvLine is null or has no such column, return null.
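The contract above can be sketched with String.split as follows. This is one possible reading, not the reference solution; in particular, treating a negative column index as "no appropriate column" is an assumption.

```java
class ExtractColumnDemo {
    // Return the zero-indexed column of a comma-separated line, or null if unavailable.
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null) {
            return null;
        }
        String[] columns = csvLine.split(",");
        if (csvColumn < 0 || csvColumn >= columns.length) {
            return null; // column is out of bounds for this line
        }
        return columns[csvColumn];
    }

    public static void main(String[] args) {
        System.out.println(extractColumn("0,hello world,en", 1)); // hello world
        System.out.println(extractColumn("0,hello world,en", 5)); // null
    }
}
```

Be aware that split(",") discards trailing empty strings, so a line such as "a,b," has only two columns under this sketch. Calling split(",", -1) would keep the empty third column; which behavior is wanted is not specified here.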
-
csvFileToTweets
Given the argument pathToCSVFile and the column that the tweets are in, use extractColumn and a FileLineIterator to extract every tweet from the CSV. (Recall that extractColumn returns null if there is no data at that column.) You should skip lines in the CSV for which the tweetColumn is out of bounds.
- Parameters:
- pathToCSVFile - a String representing a path to a CSV file containing tweets
- tweetColumn - the number of the column in the CSV file that contains the tweet
- Returns:
- a List of tweet Strings, none of which are null (but that are not yet cleaned)
- Throws:
- IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist
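The line-by-line extraction can be sketched as below. FileLineIterator is a project class that is not shown on this page, so the sketch substitutes a standard BufferedReader for it; the extractColumn helper here is likewise an assumed implementation included only to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class CsvToTweetsDemo {
    // Assumed helper: zero-indexed column of a comma-separated line, or null.
    static String extractColumn(String csvLine, int csvColumn) {
        if (csvLine == null) {
            return null;
        }
        String[] columns = csvLine.split(",");
        return (csvColumn < 0 || csvColumn >= columns.length) ? null : columns[csvColumn];
    }

    // BufferedReader stands in for the project's FileLineIterator.
    static List<String> csvFileToTweets(String pathToCSVFile, int tweetColumn) {
        if (pathToCSVFile == null) {
            throw new IllegalArgumentException("path is null");
        }
        List<String> tweets = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(pathToCSVFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String tweet = extractColumn(line, tweetColumn);
                if (tweet != null) { // skip lines where the column is out of bounds
                    tweets.add(tweet);
                }
            }
        } catch (IOException e) {
            // Covers a missing file as well as other read failures.
            throw new IllegalArgumentException("cannot read " + pathToCSVFile, e);
        }
        return tweets;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tweets", ".csv");
        Files.write(tmp, List.of("0,first tweet", "no comma here", "1,second tweet"));
        System.out.println(csvFileToTweets(tmp.toString(), 1));
        Files.delete(tmp);
    }
}
```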
-
parseAndCleanSentence
Splits a String representing a sentence into a sequence of words, filtering out any "bad" words from the sentence.
Hint: use the String split method and the cleanWord helper defined above. Split on a single space character, since words are delimited by spaces.
- Parameters:
- sentence - a (non-null) String representing one sentence with no end punctuation from a tweet
- Returns:
- a (non-null) list of clean words in the order they appear in the sentence. Any "bad" words are just dropped.
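The hint above might translate into something like the following sketch. The cleanWord shown here is an assumed re-implementation based on its description (trim, lowercase, null for bad or empty words), and BAD is an assumed stand-in for BADWORD_REGEX; neither is the provided code.

```java
import java.util.ArrayList;
import java.util.List;

class ParseSentenceDemo {
    // Assumed stand-in for BADWORD_REGEX.
    static final String BAD = ".*[\\W&&[^']].*";

    // Assumed behavior of cleanWord, reconstructed from its description.
    static String cleanWord(String word) {
        String cleaned = word.trim().toLowerCase();
        if (cleaned.isEmpty() || cleaned.matches(BAD)) {
            return null;
        }
        return cleaned;
    }

    // Split on single spaces and keep only the words that cleanWord accepts.
    static List<String> parseAndCleanSentence(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.split(" ")) {
            String word = cleanWord(raw);
            if (word != null) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(parseAndCleanSentence("Hello @bob it's  GREAT"));
    }
}
```

Splitting on a single space can produce empty strings when two spaces are adjacent; in this sketch those are dropped because cleanWord returns null for the empty String.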
-
parseAndCleanTweet
Processes a tweet into a list of sentences, where each sentence is itself a (non-empty) list of cleaned words. Before breaking up the tweet into sentences, this method uses removeURLs to sanitize the tweet.
Hint: use removeURLs followed by sentenceSplit and parseAndCleanSentence.
- Parameters:
- tweet - a String that will be split into sentences, each of which is cleaned as described above (assumed to be non-null)
- Returns:
- a (non-null) list of sentences, each of which is a (non-empty) sequence of clean words drawn from the tweet.
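The order of the three steps in the hint can be sketched as a small pipeline. To keep the example self-contained, the three helpers are passed in as functions; that parameter list is purely an illustration device and not the real signature, which takes only the tweet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ParseTweetDemo {
    // Hypothetical signature: the real method calls the class's own static helpers.
    static List<List<String>> parseAndCleanTweet(
            String tweet,
            Function<String, String> removeURLs,
            Function<String, List<String>> sentenceSplit,
            Function<String, List<String>> parseAndCleanSentence) {
        List<List<String>> sentences = new ArrayList<>();
        String withoutUrls = removeURLs.apply(tweet);
        for (String sentence : sentenceSplit.apply(withoutUrls)) {
            List<String> words = parseAndCleanSentence.apply(sentence);
            if (!words.isEmpty()) { // a sentence with no clean words is dropped
                sentences.add(words);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Trivial stand-in helpers, just to exercise the pipeline.
        List<List<String>> result = parseAndCleanTweet(
                "a b. c",
                s -> s,
                s -> List.of(s.split("\\.")),
                s -> List.of(s.trim().split(" ")));
        System.out.println(result);
    }
}
```

The emptiness check matters because a sentence made up entirely of "bad" words would otherwise leave an empty list in the result, which the return contract rules out.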
-
csvFileToTrainingData
Given a path to a CSV file and the column from which to extract the tweet data, computes a training set. The training set is a list of sentences, each of which is a list of words. The sentences have been cleaned up by removing URLs and non-word characters, putting all words into lower case, and stripping out punctuation.
- Parameters:
- pathToCSVFile - a String representing a path to a CSV file containing tweets
- tweetColumn - the number of the column in the CSV file that contains the tweet
- Returns:
- a list of training data examples
- Throws:
- IllegalArgumentException - if pathToCSVFile is null or if the file doesn't exist
-