Building a Dataset
We were introduced to a few basic concepts in part one of this series, but going forward it will probably be easier to start experimenting with data that is actually relevant to the problem we want to solve. As a refresher, that problem is to scan news articles and determine whether they are relevant to natural resource conservation.
To start simple, we'll grab two articles: one about an ecological conservation topic, and another that has no relation to the topic. To do this, we'll use the newspaper library (more specifically newspaper3k, since I'm using Python 3; it can be installed with pip install newspaper3k):
from newspaper import Article
# Create the article objects
conservation_article = Article('http://www.sltrib.com/news/environment/2017/11/24/restoring-utahs-damaged-landscapes-ephraim-center-provides-vital-seed-stocks-to-heal-public-lands/')
regular_article = Article('http://www.sltrib.com/news/nation-world/2017/11/26/guns-were-black-friday-must-haves-going-by-the-fbis-record-203086-background-check-requests/')
Now that we have chosen our articles, we need to download and parse them before we can do much with them:
# Download and parse the articles
conservation_article.download()
regular_article.download()
conservation_article.parse()
regular_article.parse()
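One thing to be aware of is that download() makes a network request, so it can fail if a URL has moved or a site is temporarily unreachable. A small defensive wrapper like the sketch below can help; the fetch_article helper is just an illustration, and it catches exceptions broadly since the exact exception type can vary between library versions:
from newspaper import Article

def fetch_article(url):
    # Download and parse a single article, returning None if anything goes wrong
    article = Article(url)
    try:
        article.download()
        article.parse()
    except Exception as exc:
        print('Could not fetch', url, '-', exc)
        return None
    return article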
At this point we can use some of the tools built into newspaper to easily extract information that might be important to us. We'll start by just taking a look at the summary data:
# Use newspaper's built-in natural language processing
conservation_article.nlp()
regular_article.nlp()
# Now we should have access to some interesting info
print('Conservation article summary:', conservation_article.summary)
print('\nRegular article summary:', regular_article.summary)
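Alongside the summary, nlp() also populates a keywords attribute for each article, which gives another quick way to compare the two (the exact keywords will of course vary with the article text):
# nlp() also extracts a list of keywords for each article
print('Conservation article keywords:', conservation_article.keywords)
print('\nRegular article keywords:', regular_article.keywords)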
While the summaries don't read perfectly as human-written prose, they do manage to pull out key ideas that clearly illustrate the differences between the two articles. In the summary of the conservation article we can also see some words and phrases that will likely be important for classifying conservation articles (e.g. 'rehabilitation', 'reseeding', 'habitat restoration').
At this point it makes sense to start gathering some articles and manually labelling them so we can get a better sense of which words are most important for training a classifier. To do this, we'll make some small changes and additions to the way we gather articles. For one, we'll create two lists to divide the articles that we gather:
from newspaper import Article
conservation_articles = []
regular_articles = []
# Conservation articles
conservation_article = Article('http://www.sltrib.com/news/environment/2017/11/24/restoring-utahs-damaged-landscapes-ephraim-center-provides-vital-seed-stocks-to-heal-public-lands/')
conservation_articles.append(conservation_article)
con_art1 = Article('https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/newsroom/features/?cid=nrcseprd1367450')
conservation_articles.append(con_art1)
# Regular articles
regular_article = Article('http://www.sltrib.com/news/nation-world/2017/11/26/guns-were-black-friday-must-haves-going-by-the-fbis-record-203086-background-check-requests/')
regular_articles.append(regular_article)
reg_art1 = Article('http://thehill.com/homenews/administration/362121-democrats-pull-out-of-white-house-meeting-with-trump')
regular_articles.append(reg_art1)
This will make it easy to label the articles by type in the form of tuples:
labeled_articles = [('conservation', article) for article in conservation_articles]
labeled_articles += [('regular', article) for article in regular_articles]
print(labeled_articles)
And now we can neatly handle downloading and parsing all of our articles at once:
for label, article in labeled_articles:
    article.download()
    article.parse()
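With everything downloaded and parsed, it can also be handy to pull the raw text out into simple (label, text) pairs that we can feed into a classifier later on; the corpus name below is just an illustration:
# Collect each article's text alongside its label for later experiments
corpus = [(label, article.text) for label, article in labeled_articles]

# Quick sanity check: print the label and the first 100 characters of each article
for label, text in corpus:
    print(label, '-', text[:100])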
We'll want to keep adding to this list of articles, being careful to keep the two lists roughly the same length. While it would be easy to populate our list of regular articles by grabbing whatever we can find from a random news source, we would end up with a highly imbalanced dataset. Training a classifier on imbalanced data typically results in poor predictions, since the classifier can achieve high accuracy simply by always predicting the majority class. We want a balanced number of labels so the classifier can focus on the things that are actually informative for the problem: the words in the articles themselves.
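As the collection grows, a quick way to keep an eye on that balance is to simply count the labels. This is a small sketch assuming the labeled_articles list from above:
from collections import Counter

# Count how many articles we have under each label
label_counts = Counter(label for label, article in labeled_articles)
print(label_counts)  # e.g. Counter({'conservation': 2, 'regular': 2})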