Page last edited by Sune Lehmann Jørgensen (sljo) 24/03-2014
Location
Learning objectives.
- Learn to use word-list based sentiment analysis techniques
- Learn the principles of Naive Bayes and apply to a Twitter
corpus to create your own word list
Reading
- Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann,
Catherine A. Bliss, Christopher M. Danforth. Temporal Patterns of
Happiness and Information in a Global Social Network: Hedonometrics
and Twitter. PLoS ONE 6(12): e26752. [paper link] [word list link].
- T. Segaran. Programming Collective Intelligence. Chapter 6, pp
117-127. [Download here]
- Bradley, M. M., & Lang, P. J.
(1999). Affective norms for English words (ANEW): Instruction
manual and affective ratings. University of Florida: the Center for
Research in Psychophysiology. [paper link]
[word list link]
Program
Exercise 1. Learn about Naive Bayes
- Read PCI Chapter 6 pp 117-127 carefully.
- Work through the book's little spam filtering example (include it
in your notebook). Create the docclass.py file on your own, based
on the code in the book. The docclass.py file should be in the
same directory as your IPython notebook; that will allow you
to import the classes/functions into your notebook.
- Explain what happens when you run sampletrain(cl) on p 121.
- Express in your own words what a conditional probability is. In
the spam-example, if a word has conditional probability = 0.5 of
being "good", what does that mean about that word in the training
data?
- What is the "assumed probability" on page 122?
- How is the prior probability of a category (e.g. good/bad)
defined?
- Explain Bayes' Theorem in your own words.
- In what sense is Naive Bayes "naive"?
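To make the counting behind these questions concrete, here is a minimal Python 3 sketch of count-based conditional and prior probabilities. It is not the book's docclass.py: the function names (train, cond_prob, prior), the whitespace tokenisation and the five short toy documents are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict

def train(samples):
    """Count, per word and category, how many documents contain the word."""
    fc = defaultdict(Counter)  # fc[word][category] -> number of documents
    cc = Counter()             # cc[category] -> number of documents
    for text, cat in samples:
        # set() so that a word counts at most once per document
        for word in set(text.lower().split()):
            fc[word][cat] += 1
        cc[cat] += 1
    return fc, cc

def cond_prob(fc, cc, word, cat):
    """Estimate Pr(word | category) as a fraction of that category's documents."""
    if cc[cat] == 0:
        return 0.0
    return fc[word][cat] / cc[cat]

def prior(cc, cat):
    """Estimate Pr(category) as its share of all training documents."""
    return cc[cat] / sum(cc.values())

# Hypothetical toy training data (assumption, for illustration only).
samples = [
    ("nobody owns the water", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]
fc, cc = train(samples)
print(cond_prob(fc, cc, "quick", "good"))  # 2 of the 3 "good" documents contain "quick"
print(prior(cc, "good"))                   # 3 of the 5 documents are "good"
```

A word that never appears in a category gets probability 0 under this raw estimate, which is the problem the "assumed probability" question is pointing at.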
Exercise 2. Apply Naive Bayes to a real dataset. (You will also
secretly generate your own word list)
- Download some tweets. [link here] Each tweet in this sample
contains a happy or a sad emoticon. Happy emoticons [":-)" or ":)"]
belong to the class "happy" and sad emoticons [":-(" or ":("] belong
to the class "sad". What is the probability of a happy vs a sad
tweet? How does this probability relate to PCI chapter 6?
- Use the code you generated in exercise 1 (above) to train your
classifier to separate "happy" from "sad" tweets. You can use the
in operator to do this: the expression ":-)" in tweet_text will
evaluate to True if the tweet contains the string ":-)". You may
decide what to do with tweets that contain both happy and sad
emoticons (no right answer here, but justify your choice).
- The getwords(doc) function (PCI p 118) should get rid of emoticons
for learning, but you may want to first remove Twitter usernames
(everything that starts with @, e.g. @suneman). Should you also get
rid of links to web pages (explain your answer)?
- Train the classifier on a random 50% of the tweets and see how well
you do on classifying the remaining 50%.
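The labelling and splitting steps above could be sketched as follows (Python 3). The helper names label_tweet, strip_usernames and split_half are hypothetical, and dropping tweets that contain both kinds of emoticon is only one of several defensible choices, so treat this as a starting point rather than the expected solution.

```python
import random
import re

HAPPY = (":-)", ":)")
SAD = (":-(", ":(")

def label_tweet(text):
    """Return 'happy', 'sad', or None if the tweet has neither or both kinds."""
    is_happy = any(e in text for e in HAPPY)
    is_sad = any(e in text for e in SAD)
    if is_happy == is_sad:  # neither, or both: treated as ambiguous here
        return None
    return "happy" if is_happy else "sad"

def strip_usernames(text):
    """Remove Twitter usernames, i.e. '@' followed by word characters."""
    return re.sub(r"@\w+", "", text)

def split_half(items, seed=0):
    """Shuffle a copy of items and return (first half, second half)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]

print(label_tweet("what a great day :)"))
print(strip_usernames("@suneman thanks for the lecture"))
```

The fixed seed makes the 50/50 split reproducible, which helps when comparing classifier variants on the same held-out tweets.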
Exercise 3. Review and evaluate the LabMT word list
- Read the Dodds et al. paper and explain how the LabMT word list was
generated. Briefly explain how the authors validate the
list.
- Download a list of human-rated tweets here. The format is one
tweet per line followed by 10 human ratings (tab separated). Start
by calculating the average rating for each tweet.
- Calculate the LabMT score for each tweet and create a scatter
plot of "average human" vs the LabMT score.
- Now look at the correlation between "average human" and LabMT
evaluation. Try both Pearson and Spearman correlation. Which one
gives the highest correlation - can you explain why?
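A minimal, standard-library-only sketch of the steps in this exercise (Python 3). It assumes the tab-separated layout described above, scores a tweet as the frequency-weighted mean of the word-list values of its words, and uses plain whitespace tokenisation; the function names and the tiny scores dictionary in the usage lines are hypothetical, and real LabMT values must be loaded from the word list.

```python
from collections import Counter

def parse_rated_line(line):
    """Split 'tweet<TAB>r1<TAB>r2...' into (tweet, mean of the ratings)."""
    fields = line.rstrip("\n").split("\t")
    ratings = [float(r) for r in fields[1:]]
    return fields[0], sum(ratings) / len(ratings)

def wordlist_score(text, scores):
    """Frequency-weighted mean score of words found in scores; None if none match."""
    counts = Counter(w for w in text.lower().split() if w in scores)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(scores[w] * n for w, n in counts.items()) / total

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores, for illustration only.
print(wordlist_score("happy happy sad", {"happy": 8.0, "sad": 2.0}))
print(pearson([1, 2, 3, 4], [1, 4, 9, 16]), spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

The last line shows a monotonic but non-linear relationship: Spearman only looks at the ordering, while Pearson measures how well a straight line fits, which is worth keeping in mind when comparing the two on the tweet data.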
Exercise 4. What about the ANEW word list?
- Read the ANEW paper and describe the main differences between
the two word lists.
- Which one do you expect to perform better? Why?
- Redo your analysis of the human-rated tweets using the ANEW
list. Which list do you now think is better? Explain why.
Exercise 5. Your Naive Bayes word-list.
The single-word conditional probability, P("happy"|some-word), from
your Naive Bayes classifier can be interpreted as sentiment valence
for single words. You can quickly approximate this from the variable
fc, which contains the word counts.
- Generate a new word list by creating a text file with two
columns (with fc trained on all the tweets):
word    fc[word].get('happy',0) / ( fc[word].get('happy',0) + fc[word].get('sad',0) )
- To avoid noise, limit the list to words that occur 5 times or
more (experiment with this threshold). How did you deal with
zeros? Explain your strategy.
- Inspect the list - does it make sense? Try adding 5 to both the
happy and sad count, e.g.
word    (fc[word].get('happy',0) + 5) / ( fc[word].get('happy',0) + fc[word].get('sad',0) + 10 )
- How does this influence the top 10 saddest and happiest words?
Does the list make more sense now? Can you explain why?
- For the words in your list that are also in LabMT compare
sentiment values (valences) to those derived from your Twitter
corpus using a scatter plot. What is the correlation (and is
Pearson or Spearman appropriate)? Can you spot any outliers - and
what are they? Which list do you think is best for evaluating
sentiment in tweets?
- Which sentiment list (ANEW, LabMT, Naive Bayes Probabilities)
has the best correlation with average human rating?
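The two formulas in this exercise can be wrapped in one small helper, sketched here for Python 3, where / is true division (under Python 2 the integer counts would need a float() conversion first). It assumes fc is a dictionary of the form fc[word][category] = count; the names valence and build_word_list, the pseudo parameter, and the toy counts are hypothetical.

```python
def valence(fc, word, pseudo=0):
    """(happy + pseudo) / (happy + sad + 2 * pseudo); None if the denominator is 0."""
    counts = fc.get(word, {})
    happy = counts.get("happy", 0)
    sad = counts.get("sad", 0)
    total = happy + sad + 2 * pseudo
    if total == 0:
        return None
    return (happy + pseudo) / total

def build_word_list(fc, min_count=5, pseudo=0):
    """Return (word, valence) pairs for words seen at least min_count times, saddest first."""
    rows = []
    for word, counts in fc.items():
        if counts.get("happy", 0) + counts.get("sad", 0) >= min_count:
            rows.append((word, valence(fc, word, pseudo)))
    return sorted(rows, key=lambda row: row[1])

# Hypothetical toy counts, for illustration only.
fc = {
    "love": {"happy": 9, "sad": 1},
    "ugh": {"sad": 6},
    "rare": {"happy": 1},
}
print(build_word_list(fc))            # "rare" is dropped by the threshold
print(build_word_list(fc, pseudo=5))  # extremes are pulled towards 0.5
```

With pseudo=5 a word seen only in sad tweets no longer scores exactly 0, and words with few observations move closest to 0.5, which is one way to think about why the smoothed list may look more sensible.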
Exercise 6 (Bonus Exercise). Download and understand the Wikipedia
Talk Pages Conversations corpus.
- Get the dataset here.
- Go over the README file (wikipedia.talkpages.README.v1.01.txt)
carefully.
- Write a little script to extract all the timestamps in the
wikipedia.talkpages.conversations.txt file.
- Use matplotlib to plot the time-series of
number-of-edits-per-day.
- Based on the CLEAN_TEXT field, calculate the sentiment of
Wikipedia edits per day.
- Calculate the average sentiment for all users with more than 50
edits (hint, you can find info in the
wikipedia.talkpages.userinfo.txt file) using the best
word-list.
- Visualize your result (using techniques from the class - and
justify your choice). Also, comment on the difference between the
happiest and saddest editor.
- What is the ratio of male to female users on wikipedia? Do men
write friendlier messages than women (or the other way around)? The
userinfo file has gender info.
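For the edits-per-day time series, a minimal Python 3 sketch is given below. It assumes the timestamps have already been extracted and are Unix epoch seconds; that is an assumption, so check the README for the actual field layout and time format. The function name edits_per_day is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def edits_per_day(timestamps):
    """Count events per UTC calendar day; returns sorted (date, count) pairs."""
    days = Counter()
    for ts in timestamps:
        # Assumption: ts is a Unix timestamp in seconds (int, float or numeric string).
        day = datetime.fromtimestamp(float(ts), tz=timezone.utc).date()
        days[day] += 1
    return sorted(days.items())

# Three made-up timestamps: two on 1970-01-01 and one on 1970-01-02.
for day, count in edits_per_day([0, 60, 86400]):
    print(day, count)
```

The sorted (date, count) pairs can be unzipped into two lists and passed to matplotlib's plot function to draw the time series.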