Social Data Modeling

Lecture 7

Learning objectives (specific)

  • Extract (read & analyze) main findings from actual scientific papers.
  • Gain familiarity with various list-based sentiment analysis techniques.
  • Apply word-list-based sentiment analysis to large data sets.
  • Explain techniques for estimating the robustness of the word-list approach.
  • Compare and analyze word-list techniques.

Reading

The texts below are highly recommended, and you will have to read them in order to complete the exercises. The articles (all cited in the exercises below) are:

  • Quantitative Analysis of Culture Using Millions of Digitized Books
  • Affective norms for English words
  • Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter
  • Word string frequency distributions

Here are the sentiment-scoring word lists.

And the Google n-gram data files.


Program

Today's lecture is about sentiment analysis. I'll talk you guys through a case that features sentiment analysis, visualization, and other elements of social data analysis.


Exercise 1. In depth with the papers.

  • What are the main findings in Quantitative Analysis of Culture Using Millions of Digitized Books? There's a lot of material in the article, so you will have to prioritize (that is difficult and part of the exercise). Use at most one column in the standard hand-in format for answering this question.
  • What about the two other dimensions in Affective norms for English words: what are "arousal" and "dominance"?
  • How was the labMT word list generated?
  • Robustness of word lists. Explain how Figure 2A in Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter [TPHIGSNHT] was generated. How does that demonstrate the robustness of the labMT word list?
  • Explain in your own words how the word-shift graphs in Figure 4 of TPHIGSNHT are generated. What is the basic idea?
  • How was the AFINN word list generated?
  • Based on the article Word string frequency distributions, it seems that the Google n-gram files contain quite a lot of errors. Do you think the errors influence the findings in Quantitative Analysis of Culture Using Millions of Digitized Books? Justify your answer. Use your Python hacking skills to find an example similar to "copy" (or "succeed") from the blog posting (http://languagelog.ldc.upenn.edu/nll/?p=4456) in one of the Google data files (a sketch of this kind of search follows the list).
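
For that last question, here is a minimal sketch of how one might scan a 1-gram file for a suspected OCR artifact. It assumes the file is gzipped and tab-separated with word, year, and match_count as the first three columns; the file name and the token "fucceed" (the long-s misreading of "succeed") are hypothetical placeholders, not specifics taken from the blog post.

    # Sketch: yearly counts for a word vs. a suspected OCR misreading of it.
    # The file name and the token "fucceed" are hypothetical placeholders.
    import gzip
    from collections import defaultdict

    def yearly_counts(path, targets):
        """Sum match_count per (word, year) for the given target words."""
        counts = defaultdict(lambda: defaultdict(int))
        with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                if len(fields) < 3 or fields[0] not in targets:
                    continue
                counts[fields[0]][int(fields[1])] += int(fields[2])
        return counts

    counts = yearly_counts('1gram-file.gz', targets={'succeed', 'fucceed'})
    for year in sorted(set(counts['succeed']) | set(counts['fucceed'])):
        print(year, counts['succeed'].get(year, 0), counts['fucceed'].get(year, 0))

A year in which the misreading rivals (or outnumbers) the real word is a good candidate example.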

Exercise 2. Historymood.  

  • Create the "historymood" plot. Based on the Google 1-gram corpus and the labMT word list, calculate the "mood" as a function of time (one data point per year) for the US and British English corpora (make sure you take the counts [match_count] of each word into account). Plot your results. Do you recognize any historic events? What are the differences between the US and British English plots?
  • Hint 1: To manage the large files, consider using the gzip module to read them directly instead of decompressing them first.
  • Hint 2: The equation used to calculate the mood was part of the lecture, but you can also find it in TPHIGSNHT, equation 1 (it is restated, with a sketch, after this list).
  • Word-shifts. Choose a historic event that you like in one of the corpora. Use the word-shift technique explained in Section 4.2 of TPHIGSNHT to figure out which words are responsible for the change in sentiment (a sketch follows the list). You don't have to create fancy plots like the ones in the paper (although it would be cool if you did) - it's the analysis that counts, so any understandable representation of your results is fine. Explain and interpret your findings.
  • Comparing word lists. Redo your analyses with one of the other word lists. What are the main differences between your results for the two word lists? Which list do you prefer? Why?
  • [Extra Credit] Robustness. We know that labMT is robust on the Twitter corpus (as demonstrated in Figure 2 of TPHIGSNHT). Take your two word lists and, for each list, perform a robustness analysis similar to the one in Figure 2A of TPHIGSNHT, but with the x-axis running over the time span contained in the n-gram corpus. Are there differences between the lists? Does one seem more robust than the other? (See the last sketch below.)
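
For the historymood calculation, equation 1 of TPHIGSNHT defines the mood of a text T as the frequency-weighted average happiness, h(T) = Σ_i h(w_i) f_i / Σ_j f_j, where f_i is the count of word w_i (here: match_count) and h(w_i) is its labMT happiness score. Below is a minimal sketch. It assumes the labMT file is tab-separated with the word in column 1 and its average happiness in column 3 (as in the labMT 1.0 release), that 1-gram lines are tab-separated word, year, match_count, ..., and that all file names are placeholders.

    # Sketch: "historymood" time series from Google 1-grams + labMT.
    import gzip
    from collections import defaultdict

    def load_labmt(path):
        """Parse labMT into {word: average happiness}.
        Assumes tab-separated lines: word, rank, happiness_average, ..."""
        scores = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                if len(fields) < 3 or fields[0] == 'word':  # skip header
                    continue
                scores[fields[0]] = float(fields[2])
        return scores

    def mood_by_year(ngram_path, scores):
        """Weighted average happiness per year (TPHIGSNHT eq. 1)."""
        num = defaultdict(float)  # sum over words of h(w) * match_count
        den = defaultdict(float)  # sum over words of match_count
        with gzip.open(ngram_path, 'rt', encoding='utf-8', errors='replace') as f:
            for line in f:
                fields = line.split('\t')
                if len(fields) < 3:
                    continue
                word = fields[0].lower()  # labMT words are lowercase
                if word not in scores:
                    continue
                year, count = int(fields[1]), int(fields[2])
                num[year] += scores[word] * count
                den[year] += count
        return {y: num[y] / den[y] for y in den}

    scores = load_labmt('labMT.txt')                    # placeholder path
    mood = mood_by_year('eng-us-1gram.gz', scores)      # placeholder path
    for year in sorted(mood):
        print(year, mood[year])

Plot the resulting dict with matplotlib (years on the x-axis, mood on the y-axis) and repeat for the British English corpus.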
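For the word-shift exercise, the key identity behind Section 4.2 of TPHIGSNHT is that the per-word contributions δh_i = [h(w_i) - h_ref] · [p_i(comp) - p_i(ref)] sum exactly to h(comp) - h(ref), where p_i is a word's normalized frequency in each year. A minimal sketch, assuming freqs_ref and freqs_comp map words to match_counts for your reference and comparison years, and scores is the labMT dict from the sketch above:

    # Sketch: per-word contributions to a sentiment shift between two years.
    def word_shift(freqs_ref, freqs_comp, scores):
        """Return [(word, contribution)] sorted by |contribution|.
        The contributions sum to h(comp) - h(ref)."""
        def norm(freqs):
            total = sum(c for w, c in freqs.items() if w in scores)
            return {w: c / total for w, c in freqs.items() if w in scores}
        p_ref, p_comp = norm(freqs_ref), norm(freqs_comp)
        h_ref = sum(scores[w] * p for w, p in p_ref.items())
        words = set(p_ref) | set(p_comp)
        shifts = [(w, (scores[w] - h_ref) * (p_comp.get(w, 0.0) - p_ref.get(w, 0.0)))
                  for w in words]
        return sorted(shifts, key=lambda s: abs(s[1]), reverse=True)

The top entries of the returned list (sign included) are the words a word-shift graph would highlight.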
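For the extra-credit robustness check, the idea behind Figure 2A of TPHIGSNHT is to exclude near-neutral words (happiness within 5 ± Δh) and see how strongly the recomputed series correlates with the full one. A sketch reusing load_labmt and mood_by_year from above (paths are placeholders; rereading the n-gram file for every Δh is slow, so in practice you will want to cache per-word yearly counts first):

    # Sketch: robustness of the mood series under a "lens" excluding
    # near-neutral words, in the spirit of TPHIGSNHT Figure 2A.
    import numpy as np

    def lens(scores, delta):
        """Keep only words with |h(w) - 5| >= delta."""
        return {w: h for w, h in scores.items() if abs(h - 5.0) >= delta}

    scores = load_labmt('labMT.txt')                    # placeholder path
    full = mood_by_year('eng-us-1gram.gz', scores)      # placeholder path
    for delta in (0.5, 1.0, 1.5, 2.0):
        filtered = mood_by_year('eng-us-1gram.gz', lens(scores, delta))
        common = sorted(set(full) & set(filtered))
        r = np.corrcoef([full[y] for y in common],
                        [filtered[y] for y in common])[0, 1]
        print('delta=%.1f  correlation with full series r=%.3f' % (delta, r))

A list whose correlations stay high as Δh grows is the more robust one.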
