Getting Started
I'll mostly be following along with the online book Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper while referring to Machine Learning in Python by Michael Bowles, but I'll be aiming to apply the lessons to problems that interest me personally. You can find the code associated with this blog series on my GitHub.
The ultimate goal of this machine learning practice is to produce a binary classifier for determining whether a news article is related to natural resource conservation. I hope to use this classifier to learn more about current trends in the field of conservation, which may also reveal interesting patterns in how grant money is allocated across different study areas.
Examining the Problem
Ultimately I'll be using text processing to determine the subject of an article. One of the challenges involved in text processing is dealing with the large number of attributes. Essentially, every word in an article could be considered its own attribute, and the number of attributes grows even larger if you also decide to include combinations of words and phrases.
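To make that concrete, here's a minimal sketch, using only the standard library and a made-up sentence, of how quickly the attribute count grows once every distinct word and adjacent word pair is treated as its own attribute:

from collections import Counter

# A made-up example sentence; every distinct word becomes its own attribute
sentence = "protecting wetlands helps protect migratory birds".split()
word_attributes = Counter(sentence)

# Including adjacent word pairs (bigrams) grows the attribute count even further
bigram_attributes = Counter(zip(sentence, sentence[1:]))

print(len(word_attributes) + len(bigram_attributes))  # attributes from a single short sentence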
Generally speaking, when a machine learning problem has many attributes, you'll need to feed the algorithm more training data in order to build a reliable model. This is a problem in its own right, since I don't yet have a training dataset. Essentially, I'll need to manually find articles I consider to be related to natural resource conservation and label them as such. The words used in these articles can then be extracted so the algorithm can begin to associate them with the category of conservation.
Some algorithms handle this better than others, depending on the number of attributes and the size of the dataset you can supply. Since we already know there will be lots of attributes, and I likely won't have the time to manually label a huge list of conservation-related articles, there's a good chance a penalized linear regression algorithm will be our best choice.
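As a rough illustration of where this is headed, here's a minimal sketch of fitting a penalized linear model to a couple of hypothetical labeled articles. It assumes scikit-learn, which isn't part of this post's requirements, and it uses penalized logistic regression as the classification analogue of the penalized linear regression mentioned above; the articles and labels are placeholders, not real data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, manually labeled articles: 1 = conservation-related, 0 = not
articles = [
    "river restoration project receives new grant funding",
    "stock markets rally after strong earnings reports",
]
labels = [1, 0]

# Every word becomes its own attribute (a column of word counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(articles)

# An L2 penalty keeps the model in check when attributes outnumber examples
model = LogisticRegression(penalty="l2", C=1.0)
model.fit(X, labels)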
Setting Up
I'll be using Python 3.5.3. For a full list of other requirements, check out the requirements.txt file on the GitHub repo.
To start, I'll be installing the NLTK library and Jupyter so I can work in a notebook:
pip install nltk
pip install jupyter
Using an Example Dataset
Before trying to create my own dataset of labeled news articles, it probably makes more sense to use an existing example dataset so we can get a better idea of how to organize our own.
To get some example data, you can use the datasets that accompany the NLTK book. In your interpreter, simply import nltk and use the download_gui() function. I'll only be downloading the "book" collection for now.
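In the interpreter, that looks like this:

import nltk
nltk.download_gui()  # opens the interactive downloader window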
These examples will give us a lot to work with, but "text7" is probably the most relevant to our ultimate goal since it is an extract of the Wall Street Journal.
Exploring the Data
We can first get an idea of how much information we have by using the len() function on our text data:
from nltk.book import text7
print(len(text7))
We find that the text7 object contains 100,676 words and punctuation marks. Each of these items is known as a token, and the majority of them won't be very useful in determining the genre of the content. We can use some of the features of the nltk library to begin paring the data down to just the useful stuff.
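To see what those tokens actually look like, you can slice the text7 object like an ordinary Python list:

print(text7[:10])       # the first few tokens are a mix of words and punctuation
print(len(set(text7)))  # the number of distinct tokens is much smaller than the total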
Logically, if we want to extract the subject of a body of text, we can probably eliminate things like punctuation and duplicate words. Doing so will also greatly reduce the amount of data we ultimately have to save and sort through when we create our own dataset. That said, the frequency of word occurrence could provide valuable information about the nature of the text. We can compromise by using something like this to extract the more meaningful words:
from nltk import FreqDist
fdist = FreqDist(text7)
# Create a list of hopefully important words
focused_words = sorted(w for w in set(text7) if len(w) > 5 and fdist[w] > 7)
print(focused_words)
Here we create a sorted list of words that are more than 5 characters long and occur in the text more than 7 times. Browsing through the list, we can see that we've successfully eliminated the majority of "filler" words in the English language and are left mostly with words that carry real meaning.
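If you want to look at raw frequencies directly, the FreqDist object built above behaves like a counter, so a quick check like this shows which tokens dominate the text:

# The most common tokens are mostly punctuation and filler words,
# which is exactly what the length and frequency filters above screen out
print(fdist.most_common(20))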