Text Stats
In this lab, we’ll practice our IO skills by writing a program to analyze the structure of a text file.
Write a program in a file textstats.py that prompts the user for a filename, reads in that file as input, and then prints basic stats about that file:
- The number of lines in the text file.
- The total number of words/tokens in the text file.
- The total number of characters (not including whitespace) in the text file.
- The average word/token size in the text file.
- The top 5 most common words/tokens in the text file.
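As a starting point, the first four statistics can be sketched as below. This is only one possible approach, under two assumptions that the lab does not state outright: a word/token is any whitespace-delimited chunk (what str.split() returns), and the average word size is reported as a whole number via integer division. The function name basic_stats is hypothetical.

```python
def basic_stats(text):
    """Return (lines, words, characters, average word size) for a string.

    Assumptions: words are whitespace-delimited tokens, and the average
    is truncated to an integer.
    """
    lines = text.splitlines()
    words = text.split()
    # Counting the characters inside each token excludes all whitespace.
    num_chars = sum(len(word) for word in words)
    # Guard against division by zero for an empty file.
    average = num_chars // len(words) if words else 0
    return len(lines), len(words), num_chars, average


sample = "the quick brown fox\njumps over the lazy dog\n"
print(basic_stats(sample))  # (2, 9, 35, 3)
```

In the real program, text would come from reading the file the user named, for example with open(filename) and read().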
For example, Project Gutenberg provides a number of classic books and novels in text format for free. Running textstats on The Adventures of Huckleberry Finn yields the following output:
Filename? pg76.txt
Statistics for pg76.txt
Number of lines: 12361
Number of words: 114266
Number of characters: 480840
Average word size: 4
Five most frequent words:
[('and', 6050), ('the', 4708), ('a', 2935), ('to', 2903), ('I', 2476)]
Your program should mimic this output exactly. To calculate the five most frequent words, use a dictionary. You will find the built-in sorted() function useful for this task.
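One way to combine a dictionary with sorted() is sketched here. The function name top_words and the sample data are hypothetical, and the handling of ties (words with equal counts keep the order in which they first appeared) is an assumption, since the lab does not specify it.

```python
def top_words(words, n=5):
    """Return the n most frequent words as a list of (word, count) pairs."""
    counts = {}
    for word in words:
        # get() supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    # Sort the (word, count) pairs by count, largest first.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


sample = "a b a c b a".split()
print(top_words(sample, 2))  # [('a', 3), ('b', 2)]
```

Printing the returned list directly gives the same list-of-tuples format shown in the sample output.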
Your program should also test if the filename given (1) exists and (2) is the name of a valid file (versus a directory). If one of these conditions is not met, you should inform the user and exit the program. For example:
> python textstats.py
Please enter a filename to analyze: doesnotexist.txt
File doesnotexist.txt does not exist!
> python textstats.py
Please enter a filename to analyze: isadir
File isadir is not a file!
Use the exists and isfile functions of the os.path module to accomplish this goal.
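A minimal sketch of the two checks follows. The helper name check_filename is hypothetical, and the wording of the "not a file" message is an assumption, so adjust it to whatever your instructor expects.

```python
import os.path


def check_filename(filename):
    """Return an error message if filename cannot be analyzed, else None."""
    if not os.path.exists(filename):
        return f"File {filename} does not exist!"
    if not os.path.isfile(filename):
        # The path exists but is something else, such as a directory.
        return f"File {filename} is not a file!"
    return None


# The current directory exists but is not a regular file.
print(check_filename(os.curdir))
```

In the full program, a non-None result would be printed and followed by a call to sys.exit() so that no further processing happens.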
Once you are done, you should try your program on a variety of texts from Project Gutenberg. Try your program on these books: