Learning objectives (specific)
- Extract (read & analyze) main findings from actual
scientific papers.
- Gain familiarity with various list-based sentiment analysis techniques.
- Apply word-list based sentiment analysis to large data
sets.
- Explain techniques for estimating the robustness of the word-list approach.
- Compare and analyze word-list techniques.
Reading
The texts below are highly recommended - and you will have to
read them in order to complete the exercises. The articles are:
Here are the sentiment-scoring word lists.
And the Google n-gram data files.
Program
Today's lecture is about sentiment analysis. I'll talk you guys
through a case that features sentiment analysis, visualization, and
other elements from social data analysis.
Exercise 1. In depth with the papers.
- What are the main findings in Quantitative Analysis of
Culture Using Millions of Digitized Books? There's a lot of
material in the article, so you will have to prioritize (that is
difficult and part of the exercise). Use at most one column in the
standard hand-in format for answering this question.
- What about the two other dimensions in Affective Norms for
English Words: what are "arousal" and "dominance"?
- How was the labMT word list generated?
- Robustness of word lists. Explain how Figure 2A
in Temporal patterns of happiness and information in a
global social network: Hedonometrics and Twitter
[TPHIGSNHT] was generated. How does that prove the
robustness of the labMT word list?
- Explain in your own words how the word-shift graphs in Figure
4 of TPHIGSNHT are generated. What is the basic
idea?
- How was the AFINN word list generated?
- Based on the article Word String frequency distributions, it seems
that the Google n-gram files contain quite a lot of errors. Do you think
the errors influence the findings in Quantitative Analysis of Culture
Using Millions of Digitized Books? Justify your answer. Use your Python
hacking skills to find an example similar to "copy" (or "succeed") from
the blog posting (http://languagelog.ldc.upenn.edu/nll/?p=4456) in one
of the Google data files; a small sketch of one way to start is given
just below this list.
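
A minimal sketch of such a scan, assuming a gzipped 1-gram file where each
tab-separated line starts with the word, the year and the match_count (check
the layout of the files you actually downloaded). The file name and the
candidate misspelling "fucceed" (a long-s-style corruption of "succeed") are
placeholders; pick your own pair of spellings.

    import gzip
    from collections import defaultdict

    # Hypothetical file name -- substitute the 1-gram file you downloaded.
    FNAME = "googlebooks-eng-us-all-1gram.csv.gz"

    # Compare the yearly counts of a real word with a plausible OCR
    # corruption; both spellings are assumptions, choose your own.
    TARGETS = {"succeed", "fucceed"}

    counts = {w: defaultdict(int) for w in TARGETS}

    with gzip.open(FNAME, "rt", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            word, year, match_count = parts[0], parts[1], parts[2]
            if word.lower() in TARGETS:
                counts[word.lower()][int(year)] += int(match_count)

    # Print the two time series side by side so spurious spikes stand out.
    years = sorted(set(y for w in TARGETS for y in counts[w]))
    for y in years:
        print(y, *(counts[w].get(y, 0) for w in sorted(TARGETS)))

Streaming through the compressed file line by line keeps memory use constant,
which matters since the decompressed 1-gram files are several gigabytes.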
Exercise 2. Historymood.
- Create the "historymood" plot. Based on the Google 1-gram
corpus and the labMT word list, calculate the "mood" as a function
of time (one data point per year) for the US and British English corpora
(make sure you take the counts [match_count] of each
word into account). Plot your results. Do you recognize any
historic events? What are the differences between the US and
British English plots? A minimal sketch of the calculation is given at
the end of this exercise.
- Hint 1: To manage the large files, consider using the
gzip module instead of decompressing them.
- Hint 2: The equation used to calculate the mood was part of the
lecture, but you can also find it in TPHIGSNHT, equation 1.
- Word-shifts. Choose a historic event that you like in one of
the corpora. Use the word-shift technique explained in Section 4.2
of TPHIGSNHT to figure out which words are responsible for the
change in sentiment (see the second sketch at the end of this
exercise). You don't have to create fancy plots like the
ones in the paper (although it would be cool if you did) - it's the
analysis that counts, so any understandable representation of your
results is fine. Explain and interpret your findings.
- Comparing word lists. Redo your analyses with one of the other
word-lists. What are the main differences between your results for
the two word lists? Which list do you prefer? Why?
- [Extra Credit] Robustness. We know that labMT is robust on the
Twitter corpus (as demonstrated in Figure 2 of TPHIGSNHT). Take
your two word lists and, for each list, perform a robustness
analysis similar to the one in Figure 2A of TPHIGSNHT, but with
the x-axis running over the time span contained in the
n-gram corpus. Are there differences between the lists? Does one
seem more robust than the other?
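
Below are two small Python sketches for Exercise 2. They are illustrations
under stated assumptions, not the official solution. The first computes the
yearly mood with equation 1 of TPHIGSNHT, i.e. the frequency-weighted average
h_avg(year) = sum_i(f_i * h_i) / sum_i(f_i) over the labMT words. The file
names, the labMT column indices and the 1-gram line layout (word, year,
match_count) are assumptions you will need to adapt to your own copies of
the data.

    import gzip
    from collections import defaultdict

    import matplotlib.pyplot as plt

    def load_labmt(path, word_col=0, score_col=2, skip=1):
        # Assumption: a tab-separated file with the word in one column and
        # the average happiness score in another; adjust the indices and the
        # number of header lines to match the copy of labMT you have.
        scores = {}
        with open(path, encoding="utf-8") as fh:
            for _ in range(skip):
                next(fh)
            for line in fh:
                parts = line.rstrip("\n").split("\t")
                try:
                    scores[parts[word_col].lower()] = float(parts[score_col])
                except (IndexError, ValueError):
                    continue
        return scores

    def yearly_mood(ngram_gz, scores):
        weighted = defaultdict(float)   # sum over words of match_count * h_i
        total = defaultdict(float)      # sum over words of match_count
        with gzip.open(ngram_gz, "rt", encoding="utf-8",
                       errors="ignore") as fh:
            for line in fh:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 3:
                    continue
                h = scores.get(parts[0].lower())
                if h is None:
                    continue
                year, count = int(parts[1]), float(parts[2])
                weighted[year] += count * h
                total[year] += count
        # Equation 1 of TPHIGSNHT: h_avg(year) = sum_i f_i h_i / sum_i f_i
        return {y: weighted[y] / total[y] for y in total if total[y] > 0}

    if __name__ == "__main__":
        scores = load_labmt("labMT.txt")                      # placeholder
        mood = yearly_mood("google-1gram-eng-us.gz", scores)  # placeholder
        years = sorted(mood)
        plt.plot(years, [mood[y] for y in years])
        plt.xlabel("year")
        plt.ylabel("average happiness (labMT)")
        plt.show()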
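
The second sketch computes per-word contributions to the change in average
happiness between a reference year and a comparison year, following my
reading of Section 4.2 of TPHIGSNHT: the shift for word i is proportional to
(h_i - h_ref) * (p_i_comp - p_i_ref), where the p's are frequencies
normalised over the scored words. The function name word_shift, the
freqs_by_year container and the example years are made up for illustration.

    def word_shift(freq_ref, freq_comp, scores):
        """Per-word contributions to the change in average happiness.

        freq_ref, freq_comp: dicts word -> match_count for the two years.
        scores: labMT dict word -> happiness score.
        """
        def normalise(freqs):
            tot = sum(c for w, c in freqs.items() if w in scores)
            return {w: c / tot for w, c in freqs.items() if w in scores}

        p_ref, p_comp = normalise(freq_ref), normalise(freq_comp)
        # Average happiness of the reference year (over scored words only).
        h_ref = sum(scores[w] * p for w, p in p_ref.items())

        shifts = {}
        for w in set(p_ref) | set(p_comp):
            dp = p_comp.get(w, 0.0) - p_ref.get(w, 0.0)
            shifts[w] = (scores[w] - h_ref) * dp
        return shifts

    # Example use: the 15 words pushing the mood up or down the most between
    # two years (freqs_by_year could be built while streaming the 1-gram
    # file, as in the previous sketch).
    # shifts = word_shift(freqs_by_year[1935], freqs_by_year[1945], scores)
    # for w in sorted(shifts, key=lambda w: abs(shifts[w]), reverse=True)[:15]:
    #     print(f"{w:15s} {shifts[w]:+.5f}")

Sorting the words by the absolute value of their contribution gives you the
ranked list that a word-shift graph visualises.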