Lecture 5

02806 Course Wiki

Page last edited by Sune Lehmann Jørgensen (sljo) 24/03-2014

Location

Building 324, Room 060.

Learning objectives.

Learn to use word-list-based sentiment analysis techniques.
Learn the principles of Naive Bayes and apply them to a Twitter corpus to create your own word list.

Reading

Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, Christopher M. Danforth. Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12): e26752. [paper link] [word list link].
T. Segaran. Programming Collective Intelligence. Chapter 6, pp 117-127. [Download here]
Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. University of Florida: the Center for Research in Psychophysiology. [paper link] [word list link]

Program

Exercise 1. Learn about Naive Bayes

Read PCI Chapter 6 pp 117-127 carefully.
Work through the book's small spam-filtering example and include it in your notebook. Create the docclass.py file on your own, based on the code in the book. Keep docclass.py in the same directory as your IPython notebook so that you can import its classes and functions into the notebook.
Explain what happens when you run sampletrain(cl) on p 121.
Express in your own words what a conditional probability is. In the spam-example, if a word has conditional probability = 0.5 of being "good", what does that mean about that word in the training data?
What is the "assumed probability" on page 122?
How is the prior probability of a category (e.g. good/bad) defined?
Explain Bayes' Theorem in your own words.
In what sense is Naive Bayes "naive"?
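
The quantities these questions ask about can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's docclass.py: the function names (cond_prob, weighted_prob, prior, naive_score) and the toy counts are made up for this example, and the weighted average uses a weight of 1 and an assumed probability of 0.5.

```python
# Toy counts (made up): how often each word appeared in documents of each category.
fc = {"money": {"good": 1, "bad": 3}, "python": {"good": 2, "bad": 0}}
cc = {"good": 3, "bad": 3}  # number of training documents per category


def cond_prob(word, cat):
    """P(word | cat): fraction of documents in cat that contain word."""
    if cc.get(cat, 0) == 0:
        return 0.0
    return fc.get(word, {}).get(cat, 0) / float(cc[cat])


def weighted_prob(word, cat, weight=1.0, assumed=0.5):
    """Blend the assumed probability with the observed one.

    Rarely seen words stay close to `assumed`; frequently seen words
    approach the observed conditional probability.
    """
    basic = cond_prob(word, cat)
    total = sum(fc.get(word, {}).values())  # appearances in all categories
    return (weight * assumed + total * basic) / (weight + total)


def prior(cat):
    """P(cat): share of all training documents that belong to cat."""
    return cc[cat] / float(sum(cc.values()))


def naive_score(words, cat):
    """P(cat) times the product of P(word | cat), treating words as independent."""
    score = prior(cat)
    for w in words:
        score *= weighted_prob(w, cat)
    return score
```

Multiplying the per-word probabilities as if the words were independent of each other is the "naive" step. A word that never appeared in training gets weighted_prob equal to the assumed probability, so it does not force the whole product to zero.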

Exercise 2. Apply Naive Bayes to a real dataset. (You will also secretly generate your own word list)

Download some tweets. [link here] Each tweet in this sample has an emoticon happy/sad. Happy emoticons [":-)" or ":)"] belong to the class "happy" and sad emoticons [":-(" or ":("] belong to the class "sad". What is the probability of a happy vs a sad tweet? How does this probability relate to PCI chapter 6?
Use the code you wrote in Exercise 1 (above) to train your classifier to separate "happy" from "sad" tweets. You can use the in operator to label the tweets: the expression ":-)" in tweet_text evaluates to True if the tweet contains the string ":-)". You may decide what to do with tweets that contain both happy and sad emoticons (there is no right answer here, but justify your choice).
The getwords(doc) function (PCI p 118) should get rid of emoticons for learning, but you may want to first remove Twitter usernames (everything that starts with @, e.g. @suneman). Should you also get rid of links to web pages (explain your answer)? Train the classifier on a random 50% of the tweets and see how well you do on classifying the remaining 50%.
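
The labeling, cleaning, and splitting steps above could look roughly like this. It is a sketch, not a reference solution: the helper names (label_tweet, strip_usernames, split_half) are hypothetical, and dropping tweets that contain both kinds of emoticon is just one of several defensible choices.

```python
import random
import re

HAPPY = (":-)", ":)")
SAD = (":-(", ":(")


def label_tweet(text):
    """Return 'happy', 'sad', or None for tweets with no or mixed emoticons."""
    is_happy = any(e in text for e in HAPPY)
    is_sad = any(e in text for e in SAD)
    if is_happy == is_sad:  # neither or both: skip (one possible choice)
        return None
    return "happy" if is_happy else "sad"


def strip_usernames(text):
    """Remove @mentions such as @suneman."""
    return re.sub(r"@\w+", "", text)


def split_half(items, seed=0):
    """Shuffle a copy of items and split it into two halves (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```

Each labeled tweet in the training half can then be passed to the classifier from Exercise 1, and the predictions on the test half compared with the emoticon labels to estimate accuracy.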

Exercise 3. Review and evaluate the LabMT word list

Read the Dodds et al. paper and explain how the LabMT word list was generated. Briefly explain how the authors validate the list.
Download a list of human rated tweets here. The format is one tweet per line followed by 10 human ratings (tab separated). Start by calculating the average rating for each tweet.
Calculate the LabMT score for each tweet and create a scatter plot of "average human" vs the LabMT score.
Now look at the correlation between "average human" and LabMT evaluation. Try both Pearson and Spearman correlation. Which one gives the higher correlation - can you explain why?
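
The steps of this exercise can be sketched in plain Python as follows. The function names are hypothetical, the tweet score is assumed here to be the unweighted mean valence of the words found in the list (check the paper for the authors' exact definition), and splitting on whitespace is a deliberately crude tokenizer.

```python
def average_rating(line):
    """Split 'tweet<TAB>r1<TAB>...<TAB>r10' into (tweet, mean rating)."""
    fields = line.rstrip("\n").split("\t")
    ratings = [float(r) for r in fields[1:]]
    return fields[0], sum(ratings) / len(ratings)


def wordlist_score(text, valence):
    """Mean valence of the words in text that appear in the word list."""
    hits = [valence[w] for w in text.lower().split() if w in valence]
    return sum(hits) / len(hits) if hits else None


def pearson(x, y):
    """Pearson correlation: measures linear association."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because Spearman only looks at ranks, it rewards any monotonic relationship, whereas Pearson rewards a straight-line one. Tweets with no words in the list get a score of None and have to be dropped before computing either correlation.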

Exercise 4. What about the ANEW word list?

Read the ANEW paper and describe the main differences between the two word lists.
Which one do you expect to perform better? Why?
Redo your analysis of the human rated tweets using the ANEW list. Which list do you now think is better - explain why?

Exercise 5. Your Naive Bayes word list

The single-word conditional probability P("happy"|some-word) from your Naive Bayes classifier can be interpreted as a sentiment valence for single words. You can quickly approximate this from the variable fc, which contains the word counts.

Generate a new word list by creating a text file with two columns (with fc trained on all the tweets):

word fc[word].get('happy',0) / ( fc[word].get('happy',0) + fc[word].get('sad',0) )

To avoid noise, limit the list to words that occur 5 times or more (experiment with this threshold). How did you deal with zeros? Explain your strategy.
Inspect the list - does it make sense? Try adding 5 to both the happy and sad count, e.g.

word (fc[word].get('happy',0) + 5) / ( fc[word].get('happy',0) + fc[word].get('sad',0) + 10)

How does this influence the top 10 saddest and happiest words? Does the list make more sense now? Can you explain why?
For the words in your list that are also in LabMT, compare the LabMT sentiment values (valences) to those derived from your Twitter corpus using a scatter plot. What is the correlation (and is Pearson or Spearman appropriate)? Can you spot any outliers - and what are they? Which list do you think is best for evaluating sentiment in tweets?
Which sentiment list (ANEW, LabMT, Naive Bayes Probabilities) has the best correlation with average human rating?
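
One way to build the word list from fc is sketched below. The function name valence_list and its parameters are hypothetical, and fc is assumed to map each word to a dictionary of per-class counts, as in the two formulas above. The float() call keeps the division from truncating to an integer under Python 2.

```python
def valence_list(fc, min_count=5, pseudo=0):
    """Return {word: P(happy | word)} from happy/sad counts.

    Words seen fewer than min_count times are skipped. pseudo is added
    to both counts, so pseudo=5 gives the '+5 / +10' variant.
    """
    scores = {}
    for word, counts in fc.items():
        happy = counts.get("happy", 0)
        sad = counts.get("sad", 0)
        if happy + sad < min_count:
            continue
        scores[word] = (happy + pseudo) / float(happy + sad + 2 * pseudo)
    return scores
```

With pseudo=0, a word seen only in sad tweets scores exactly 0 no matter how few times it occurred. Adding pseudo-counts pulls sparsely observed words toward 0.5, which tends to change which words land in the top 10. Keep min_count at 1 or more when pseudo is 0, otherwise a word with no counts would cause a division by zero.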

Exercise 6 (Bonus Exercise). Download and understand the Wikipedia Talk Pages Conversations corpus.

Get the dataset here.
Go over the README file (wikipedia.talkpages.README.v1.01.txt) carefully.
Write a little script to extract all the timestamps in the wikipedia.talkpages.conversations.txt file.
Use matplotlib to plot the time-series of number-of-edits-per-day.
Based on the CLEAN_TEXT field, calculate the sentiment of Wikipedia edits per day.
Calculate the average sentiment for all users with more than 50 edits (hint, you can find info in the wikipedia.talkpages.userinfo.txt file) using the best word-list.
Visualize your result (using techniques from the class - and justify your choice). Also, comment on the difference between the happiest and saddest editor.
What is the ratio of male to female users on Wikipedia? Do men write friendlier messages than women (or the other way around)? The userinfo file has gender info.
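
Once the timestamps are extracted, counting edits per day takes only a few lines. The sketch below assumes the timestamps are Unix epoch seconds; that is an assumption, so check the README for the actual format and adapt the parsing accordingly. The function name edits_per_day is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def edits_per_day(timestamps):
    """Count edits per calendar day (UTC) from Unix-epoch timestamps."""
    days = Counter()
    for ts in timestamps:
        # Assumption: ts is seconds since 1970-01-01 UTC (number or numeric string).
        day = datetime.fromtimestamp(float(ts), tz=timezone.utc).date()
        days[day] += 1
    return sorted(days.items())
```

The result is a list of (date, count) pairs in chronological order; the dates and counts can be unpacked into two sequences and passed to matplotlib for the time-series plot. The same per-day grouping works for sentiment if the counter is replaced by a list of scores per day.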
