In the past couple of weeks I’ve been helping to update the curriculum for a fantastic project called DH Bridge. This curriculum includes a one-day programming bootcamp where people with no computer science experience (particularly those involved in the humanities) can learn some basic Python skills. I’ve had so much fun doing the tutorial along the way because it focuses on text analysis using the Natural Language Toolkit (NLTK). I wasn’t previously familiar with NLTK, but it includes some really cool tools for natural language processing. You can download it for free and use its many modules to do text analysis day and night! Here are a few of the things I learned:
- NLTK has a built-in method for getting word frequencies. It’ll spit out the n most common words in a text (you decide what n is) along with the number of times each word appears, in order from most to least frequent. Nothing too complicated, but it’s a great (and very useful) starting place.
- Want to see the context in which a certain word appears throughout a text? NLTK’s concordance method takes a single word as a parameter and prints out each instance of that word within its surrounding text. For example, here’s every instance of the word “trial” in Harper Lee’s To Kill a Mockingbird.
This is a great way to get a sense of how a word is being used throughout a text without having to Control+F your way through the whole thing.
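To make the two ideas above concrete, here is a minimal sketch of word counts and a concordance view that uses only the Python standard library. It is a simplified stand-in, not NLTK's own code: in NLTK the equivalent calls are, as far as I know, `FreqDist(tokens).most_common(n)` and `Text(tokens).concordance(word)`. The helper names (`tokenize`, `most_common_words`, `concordance`) and the sample sentence are made up for illustration, and the sample is not text from the novel.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(tokens, n):
    """Return the n most frequent tokens as (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)


def concordance(tokens, word, window=4):
    """Return one line per occurrence of `word`, with up to `window` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # Upper-case the matched word so it stands out in its context.
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines


# Invented placeholder text, used only to exercise the functions.
sample = ("The trial began on Monday. The whole town came to the trial, "
          "and the newspaper covered the trial every day.")
tokens = tokenize(sample)

top_words = most_common_words(tokens, 2)
trial_lines = concordance(tokens, "trial", window=3)

print(top_words)
for line in trial_lines:
    print(line)
```

Swapping `sample` for the full text of a book gives the same kind of output as the NLTK methods, just without NLTK's nicer alignment and smarter tokenizing.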
- This one is my favorite because I think it’s so cool. You give NLTK’s similar method a word and it returns the twenty words that are “most similar” to that word in the text. I haven’t looked too far into how it works, but the method somehow determines which words are most often used in a similar context to the given word. For example, here are the results for the word “trial” in To Kill a Mockingbird.
Some words, like “court” and “newspaper,” are pretty self-explanatory, but we may question why a word like “family” is so closely associated with the word “trial” in this novel.
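One plausible way to read “used in a similar context” is: two words are similar if they show up between the same pair of neighboring words. The sketch below implements that idea with the standard library only. It is an assumption about the general approach, not a copy of what NLTK does internally (in NLTK the call itself would be `Text(tokens).similar(word)`), and the function name `similar_words` and the toy sentence are invented for illustration.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=20):
    """Rank other words by how many (previous word, next word) contexts they share with `word`."""
    # Map each word to the set of (left neighbor, right neighbor) pairs it appears between.
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    target = contexts.get(word, set())
    scores = Counter()
    for other, ctxs in contexts.items():
        if other == word:
            continue
        shared = len(ctxs & target)
        if shared:
            scores[other] = shared

    # Words sharing the most contexts come first.
    return [w for w, _ in scores.most_common(num)]


# Invented toy text: "court" and "family" both appear between "the" and "was", like "trial".
tokens = ("the trial was long the court was long the trial was fair "
          "the family was fair the dog ran away").split()

result = similar_words(tokens, "trial")
print(result)
```

Under this reading, a word like “family” can land near “trial” simply because the two keep appearing in the same grammatical slots, which is a useful thing to keep in mind when interpreting the results.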
Even with these very simple searches, it’s already easy to see the kind of information you can get out of a text that the human eye wouldn’t necessarily be able to see. Yay digital text analysis!