02806 Course Wiki

Lecture 5
Page last edited by Sune Lehmann Jørgensen (sljo) 24/03-2014

Location

  • Building 324, Room 060.

Learning objectives.

  • Learn to use word-list based sentiment analysis techniques
  • Learn the principles of Naive Bayes and apply to a Twitter corpus to create your own word list

Reading

  • Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, Christopher M. Danforth. Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12): e26752. [paper link] [word list link]
  • T. Segaran. Programming Collective Intelligence. Chapter 6, pp 117-127. [Download here]
  • Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. University of Florida: the Center for Research in Psychophysiology. [paper link] [word list link]

Program

Exercise 1. Learn about Naive Bayes

  • Read PCI Chapter 6 pp 117-127 carefully.
  • Work through the book's little spam filtering example (include it in your notebook). Create the docclass.py file on your own, based on the code in the book. The docclass.py file should be in the same directory as your IPython notebook, so that you can import its classes/functions into your notebook.
  • Explain what happens when you run sampletrain(cl) on p 121.
  • Express in your own words what a conditional probability is. In the spam-example, if a word has conditional probability = 0.5 of being "good", what does that mean about that word in the training data?
  • What is the "assumed probability" on page 122?
  • How is the prior probability of a category (e.g. good/bad) defined?
  • Explain Bayes' Theorem in your own words.
  • In what sense is Naive Bayes "naive"?
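
The quantities asked about above can be sketched in a few lines. This is a minimal illustration, not the book's docclass.py: the names fc and cc follow the text, while train, fprob, weighted_prob and the toy training documents are assumptions made for this example. The weighting uses an assumed probability of 0.5 with weight 1.

```python
from collections import defaultdict

# fc[word][category]: number of training documents in `category` containing `word`
fc = defaultdict(lambda: defaultdict(int))
# cc[category]: number of training documents in `category`
cc = defaultdict(int)

def train(words, category):
    """Count each distinct word once per document, then count the document."""
    for w in set(words):
        fc[w][category] += 1
    cc[category] += 1

def fprob(word, category):
    """Conditional probability P(word | category) estimated from the counts."""
    if cc[category] == 0:
        return 0.0
    return fc[word][category] / cc[category]

def weighted_prob(word, category, weight=1.0, assumed=0.5):
    """Pull the estimate towards `assumed` when the word has been seen rarely."""
    basic = fprob(word, category)
    total = sum(fc[word].values())  # appearances of the word in all categories
    return (weight * assumed + total * basic) / (weight + total)

# Toy training data (hypothetical, not the book's sample)
train(["cheap", "pills"], "bad")
train(["meeting", "tomorrow"], "good")
train(["cheap", "flights", "tomorrow"], "good")

p_cheap_good = fprob("cheap", "good")        # 1 of 2 "good" documents
p_unseen = weighted_prob("casino", "good")   # never seen: falls back to 0.5
p_pills_bad = weighted_prob("pills", "bad")  # (1*0.5 + 1*1.0) / (1 + 1)
```

A word that was never seen gets exactly the assumed probability, and the more often a word appears, the closer the weighted value moves to the raw estimate.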

Exercise 2.  Apply Naive Bayes to a real dataset. (You will also secretly generate your own word list)

  • Download some tweets. [link here] Each tweet in this sample has a happy or sad emoticon. Happy emoticons [":-)" or ":)"] belong to the class "happy" and sad emoticons [":-(" or ":("] belong to the class "sad". What is the probability of a happy vs a sad tweet? How does this probability relate to PCI chapter 6?
  • Use the code you wrote in Exercise 1 (above) to train your classifier to separate "happy" from "sad" tweets. You can use the in operator to detect emoticons: the expression ":-)" in tweet_text evaluates to True if the tweet contains the string ":-)". You may decide what to do with tweets that contain both happy and sad emoticons (no right answer here, but justify your choice).
  • The getwords(doc) function (PCI p 118) should get rid of emoticons for learning, but you may want to first remove Twitter usernames (everything that starts with @, e.g. @suneman). Should you get rid of web pages (explain your answer)? Train the classifier on a random 50% of the tweets and see how well you do on classifying the remaining 50%.
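
The preprocessing steps above could be sketched as follows. All function names here (label_tweet, strip_usernames, split_half, accuracy) are hypothetical helpers, not part of PCI. Dropping tweets that contain both kinds of emoticon is just one possible choice.

```python
import random
import re

HAPPY = (":-)", ":)")
SAD = (":-(", ":(")

def label_tweet(text):
    """Return 'happy', 'sad', or None for tweets with both or neither kind."""
    is_happy = any(e in text for e in HAPPY)
    is_sad = any(e in text for e in SAD)
    if is_happy == is_sad:
        return None  # ambiguous or unlabeled: dropped in this sketch
    return "happy" if is_happy else "sad"

def strip_usernames(text):
    """Remove Twitter usernames such as @suneman."""
    return re.sub(r"@\w+", "", text)

def split_half(items, seed=0):
    """Shuffle a copy of the items and split it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Train on the first half returned by split_half, classify the second half, and pass the predicted and emoticon-derived labels to accuracy. Compare the result with the accuracy of always guessing the most common class.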

Exercise 3. Review and evaluate the LabMT word list

  • Read the Dodds et al. paper and explain how the LabMT word list was generated. Briefly explain how the authors validate the list.
  • Download a list of human-rated tweets here. The format is one tweet per line followed by 10 human ratings (tab separated). Start by calculating the average rating for each tweet.
  • Calculate the LabMT score for each tweet and create a scatter plot of "average human" vs the LabMT score.
  • Now look at the correlation between "average human" and LabMT evaluation. Try both Pearson and Spearman correlation. Which one gives the highest correlation - can you explain why?
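
A minimal standard-library sketch of the calculations above. It assumes the tweet text itself contains no tab characters and that happiness is a dict mapping words to their LabMT scores. The function names are hypothetical, and labmt_score is a plain mean over matched word occurrences, without any further filtering the paper may describe.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def parse_rated_line(line):
    """Split 'tweet<TAB>r1<TAB>...<TAB>rN' into (tweet, average rating)."""
    fields = line.rstrip("\n").split("\t")
    ratings = [float(r) for r in fields[1:]]
    return fields[0], mean(ratings)

def labmt_score(text, happiness):
    """Mean score of the words in `text` that appear in `happiness`."""
    scores = [happiness[w] for w in text.lower().split() if w in happiness]
    return mean(scores) if scores else None

def pearson(xs, ys):
    """Pearson correlation: measures linear association."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Spearman only looks at the ordering of the values, so a monotonic but non-linear relationship (for example y = x squared on positive x) gives a Spearman correlation of 1 while Pearson stays below 1. If SciPy is available, scipy.stats.pearsonr and scipy.stats.spearmanr can be used to cross-check the results.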

Exercise 4. What about the ANEW word list?

  • Read the ANEW paper and describe the main differences between the two word lists.
  • Which one do you expect to perform better? Why?
  • Redo your analysis of the human-rated tweets using the ANEW list. Which list do you now think is better? Explain why.

Exercise 5. Your Naive Bayes word list. The single-word conditional probability, P("happy"|some-word), from your Naive Bayes classifier can be interpreted as the sentiment valence of a single word. You can quickly approximate this from the variable fc, which contains the word counts.

  • Generate a new word list by creating a text file with two columns (with fc trained on all the tweets):

word    fc[word].get('happy',0) / ( fc[word].get('happy',0) +  fc[word].get('sad',0) )

  • To avoid noise, limit the list to words that occur 5 times or more (experiment with this threshold). How did you deal with zeros - explain your strategy.
  • Inspect the list - does it make sense? Try adding 5 to both the happy and sad count, e.g.

word    (fc[word].get('happy',0) + 5) / ( fc[word].get('happy',0) +   fc[word].get('sad',0) + 10)

  • How does this influence the top 10 saddest and happiest words? Does the list make more sense now? Can you explain why?
  • For the words in your list that are also in LabMT, compare the LabMT sentiment values (valences) to those derived from your Twitter corpus using a scatter plot. What is the correlation (and is Pearson or Spearman appropriate)? Can you spot any outliers - and what are they? Which list do you think is best for evaluating sentiment in tweets?
  • Which sentiment list (ANEW, LabMT, Naive Bayes Probabilities) has the best correlation with average human rating? 
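
The word-list construction in this exercise could be sketched as below, assuming fc is a dict of the form {word: {'happy': count, 'sad': count}} as in PCI. The function names and the tab-separated output format are assumptions. The expressions assume Python 3 division; under Python 2 the counts would need converting to float first.

```python
def valence(counts, pseudo=0):
    """Happy fraction of a word, with `pseudo` added to both class counts."""
    happy = counts.get("happy", 0) + pseudo
    sad = counts.get("sad", 0) + pseudo
    return happy / (happy + sad)

def build_word_list(fc, min_count=5, pseudo=0):
    """Return (word, valence) pairs for frequent words, happiest first."""
    rows = []
    for word, counts in fc.items():
        total = counts.get("happy", 0) + counts.get("sad", 0)
        # min_count >= 1 also protects valence() from dividing by zero
        if total >= min_count:
            rows.append((word, valence(counts, pseudo)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

def write_word_list(rows, path):
    """Write one 'word<TAB>valence' line per row."""
    with open(path, "w", encoding="utf-8") as f:
        for word, score in rows:
            f.write(f"{word}\t{score:.4f}\n")
```

Calling build_word_list(fc, pseudo=5) reproduces the second formula above: a word seen 6 times, always in sad tweets, moves from 0.0 to 5/16, while a word seen thousands of times barely changes.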

Exercise 6 (Bonus Exercise). Download and understand the Wikipedia Talk Pages Conversations corpus

  • Get the dataset here.
  • Go over the README file (wikipedia.talkpages.README.v1.01.txt) carefully.
  • Write a little script to extract all the time-stamps in the wikipedia.talkpages.conversations.txt file.
  • Use matplotlib to plot the time-series of number-of-edits-per-day.
  • Based on the CLEAN_TEXT field, calculate the sentiment of Wikipedia edits per day.
  • Calculate the average sentiment for all users with more than 50 edits (hint, you can find info in the wikipedia.talkpages.userinfo.txt file) using the best word-list.
  • Visualize your result (using techniques from the class - and justify your choice). Also, comment on the difference between the happiest and saddest editor.
  • What is the ratio of male to female users on Wikipedia? Do men write friendlier messages than women (or the other way around)? The userinfo file has gender info.
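
Once the time-stamps are extracted, counting edits per day takes only a few lines. This sketch assumes the time-stamps have already been parsed into Unix epoch seconds; that is an assumption, so check the README for the actual format. The function name edits_per_day is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def edits_per_day(timestamps):
    """Return a sorted list of (date, number of edits) pairs, using UTC days."""
    days = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps
    )
    return sorted(days.items())
```

The dates and counts can then be unzipped into two lists and passed to matplotlib for the time-series plot.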

 

 
