Building a Sentiment Analysis report using NLTK and Altair

Using the UCI News dataset we’ll calculate positivity of news headlines, visualize the results and package them into an interactive report

John Micah Reid
Towards Data Science



(Disclaimer: I work as a product manager at Datapane)

Data analysis often starts with structured data that’s already stored as numbers, dates, categories etc. However, unstructured data can yield crucial insights if you use appropriate techniques. In this tutorial, we'll run sentiment analysis on a textual dataset to calculate positive/negative sentiment, and turn the results into an interactive report.

Running sentiment analysis

Let’s imagine we’re data scientists working for a news company, trying to figure out how ‘positive’ our news headlines are compared to the rest of the industry. We’ll start with the UCI News Aggregator dataset (CC0: Public Domain) [1], a collection of news headlines from different publications in 2014. This is a fun dataset because it covers a wide range of publishers and contains useful metadata.

After downloading and cleaning up the data, we get the following result:

[Image by author: preview of the cleaned dataset]

We have 8 columns and about 400k rows. We’ll use the ‘Title’ for the actual sentiment analysis, and group the results by ‘Publisher’, ‘Category’ and ‘Timestamp’.

Classifying the headlines

Through the magic of open source, we can use someone else’s hard-earned knowledge in our analysis: in this case, a pretrained model called VADER (Valence Aware Dictionary and sEntiment Reasoner), available through the popular NLTK library.

To build the model, the authors gathered a list of common words and asked a panel of human raters to score each one on valence (is it positive or negative?) and intensity (how strong is the sentiment?). As the original paper says:

[After stripping out irrelevant words] this left us with just over 7,500 lexical features with validated valence scores that indicated both the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word “okay” has a positive valence of 0.9, “good” is 1.9, and “great” is 3.1, whereas “horrible” is –2.5, the frowning emoticon “:(” is –2.2, and “sucks” and “sux” are both –1.5.

To classify a piece of text, the model calculates the valence score for each word, applies some grammatical rules e.g. distinguishing between ‘great’ and ‘not great’, and then sums up the result.
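We can sanity-check that negation rule directly. Below is a minimal sketch using NLTK’s SentimentIntensityAnalyzer (it downloads the VADER lexicon on first run):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # fetch the word list on first run

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("great"))      # positive compound score
print(sia.polarity_scores("not great"))  # the negation rule flips it negative
</antml>```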

Interestingly, this simple lexicon-based approach has equal or better accuracy compared to machine-learning approaches, and is much faster. Let’s see how it works!

In this code we import the library, classify each title in our dataset, and append the results to our original dataframe, adding four new columns:

  • pos: positive score component
  • neu: neutral score component
  • neg: negative score component
  • compound: a normalized, weighted composite of the three, ranging from –1 (most negative) to +1 (most positive)
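A minimal sketch of that step (the TITLE column name follows the UCI dataset; the tiny dataframe here is a stand-in for the real one):

```python
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

# stand-in for the cleaned UCI dataframe
df = pd.DataFrame({"TITLE": [
    "I hate cancer",
    "THANK HEAVENS",
    "Fed sees high bar for change in pace of tapering",
]})

sia = SentimentIntensityAnalyzer()

# score every headline, expand the result dicts into columns, and append them
scores = df["TITLE"].apply(sia.polarity_scores).apply(pd.Series)
df = pd.concat([df, scores], axis=1)  # adds pos, neu, neg, compound
</antml>```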

As a sanity check, let’s take a look at the most negative, neutral and positive headlines in the dataset using pandas idxmax:

negative = df.loc[df.neg.idxmax()]  # idxmax returns an index label, so use loc
neutral = df.loc[df.neu.idxmax()]
positive = df.loc[df.pos.idxmax()]
print(f'Most negative: {negative.TITLE} ({negative.PUBLISHER})')
print(f'Most neutral: {neutral.TITLE} ({neutral.PUBLISHER})')
print(f'Most positive: {positive.TITLE} ({positive.PUBLISHER})')

Running that code gives us the following result:

Most negative: I hate cancer (Las Vegas Review-Journal (blog))
Most neutral: Fed's Charles Plosser sees high bar for change in pace of tapering (Livemint)
Most positive: THANK HEAVENS (Daily Beast)

Fair enough — ‘THANK HEAVENS’ is a lot more positive than ‘I hate cancer’!

Visualizing the results

What does the distribution of our scores look like? Let’s visualize this in a couple of ways using the interactive plotting library Altair:
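A sketch of those two charts (synthetic data stands in for the scored dataframe, and the thresholds used to label each headline’s sentiment are an assumption, not from the analysis):

```python
import altair as alt
import numpy as np
import pandas as pd

# synthetic stand-in for the scored headlines
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "compound": rng.uniform(-1, 1, 500),
    "CATEGORY": rng.choice(["business", "sci/tech", "entertainment", "health"], 500),
})

# label each row so the stacked chart has something to color by (thresholds assumed)
df["sentiment"] = pd.cut(
    df["compound"], [-1, -0.05, 0.05, 1],
    labels=["negative", "neutral", "positive"], include_lowest=True,
)

# histogram of the overall compound-score distribution
hist = alt.Chart(df).mark_bar().encode(
    alt.X("compound:Q", bin=alt.Bin(maxbins=30)),
    y="count()",
)

# 100% stacked bar chart of sentiment share per category
stacked = alt.Chart(df).mark_bar().encode(
    x="CATEGORY:N",
    y=alt.Y("count()", stack="normalize"),
    color="sentiment:N",
)

chart = hist | stacked  # place the two side by side
</antml>```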

Here we’re showing both a histogram for the overall distribution, as well as a 100% stacked bar chart grouped by category. Running that code, we get the following result:

[Image by author: histogram of compound scores and 100% stacked bar chart by category]

Seems like most headlines are neutral, and health has more negative articles overall than the other categories.

To give more insight into how our model is classifying the articles, we can create two more plots, one showing a sample of how the model classifies particular headlines, and another showing the average sentiment score for our largest publishers over time:

This is where the declarative syntax of Altair really shines — we just change a few keywords, e.g. mark_bar to mark_point, and we get a completely different yet still meaningful result:
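As a sketch of that idea (the publishers, dates, and scores below are made up for illustration):

```python
import altair as alt
import pandas as pd

# hypothetical scored sample
df = pd.DataFrame({
    "TITLE": ["Headline A", "Headline B", "Headline C", "Headline D"],
    "PUBLISHER": ["HuffPost", "RTT", "HuffPost", "RTT"],
    "TIMESTAMP": pd.to_datetime(["2014-03-01", "2014-03-05", "2014-04-02", "2014-04-09"]),
    "compound": [-0.4, 0.5, -0.2, 0.6],
})

# swap mark_bar for mark_point to plot individual headlines,
# with a tooltip so viewers can inspect each one
points = alt.Chart(df).mark_point().encode(
    x="compound:Q",
    y="PUBLISHER:N",
    tooltip="TITLE",
)

# average compound score per publisher over time
lines = alt.Chart(df).mark_line().encode(
    x="yearmonth(TIMESTAMP):T",
    y="mean(compound):Q",
    color="PUBLISHER:N",
)
</antml>```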

By creating interactive visualizations, you enable viewers to explore the data directly. They’ll be much more likely to trust your overall conclusions if they can drill down to the original datapoints.

Looking at the publishers chart it seems that HuffPost is consistently more negative and RTT more positive. Hmmm, seems like they have different editorial strategies…

Creating a report

The final step is to package the results into an interactive report. As data scientists, we often forget to communicate our results effectively. I’ve often made the mistake of spending hours analyzing data to answer somebody’s question, then sending over a screenshot of a chart with a one-line explanation. My viewer then doesn’t understand the results and may not use them to actually make decisions.

Rule #1: always assume that someone looking at your work has zero background and needs to understand it from scratch. It’s always worth spending extra time and effort to write up the context and implications of your work.

For this tutorial we’ll use a library called Datapane to create a shareable report from these interactive visualizations. To do this, we’ll first need to create an account on Datapane, wrap our charts inside Datapane blocks and then upload the report.

We write chunks of text in Markdown format to give context, interspersed with our actual plots and data. You can see an embedded version of the chart with interactive visualizations here:

This is a minimal report that you could share with stakeholders (your boss, your mom etc) to show overall sentiment across the media landscape. From here you can explore comparisons across time, industry and publishers, with the goal of making recommendations for how your organization can improve.

Conclusion

We can summarize what we’ve learned in this tutorial through three points:

  1. Use sentiment analysis to extract value from unstructured text
  2. Build charts in interactive libraries like Altair so your viewers can explore the data themselves, reinforcing your overall conclusions
  3. Spend extra time and effort writing up your results into a report with context

Now go get ’em!

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
