Social Data Modeling

Assignment 3
Page last edited by Vedran Sekara (vese) 02/04-2013

 

Formalia: Please read http://dtu.cnwiki.dk/02822/page/666/assignments-grading carefully before proceeding. This page contains information about formatting (including restrictions on size, etc.), group sizes, and many other aspects of handing in the assignment. If you fail to follow these simple instructions, it will negatively impact your grade!

 

Due date and time: The assignment is due on Monday April 8th at 23:59.

Exercise 1. Understanding Sentiment Analysis

  • What are the main findings in Quantitative Analysis of Culture Using Millions of Digitized Books? There's a lot of material in the article, so you will have to prioritize (that is difficult and part of the exercise). Use at most one column in the standard hand-in format for answering this question.
  • What about the two other dimensions in Affective norms for English words: what are "arousal" and "dominance"?
  • How was the labMT word list generated?
  • Robustness of word lists. Explain how Figure 2A in Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter [TPHIGSNHT] was generated. How does that prove the robustness of the labMT word list?
  • How was the AFINN word list generated?
  • Based on the article Word String frequency distributions, it seems that the Google n-gram files contain quite a lot of errors. Do you think the errors influence the findings in Quantitative Analysis of Culture Using Millions of Digitized Books? Justify your answer. Use your Python hacking skills to find an example similar to "copy" (or "succeed") from the blog posting (http://languagelog.ldc.upenn.edu/nll/?p=4456) in one of the Google data files.

Exercise 2. Historymood.  

  • Create the "historymood" plot. Based on the Google 1-gram corpus and the labMT word list, calculate the "mood" as a function of time (one data point per year) for the US and British English corpora (make sure you take the counts [match_count] of each word into account). Plot your results. Do you recognize any historic events? What are the differences between the US and British English plots?
  • Hint 1: To manage the large files, consider using the module gzip instead of decompressing them.
  • Hint 2: The equation used to calculate the mood was part of the lecture, but you can also find it in TPHIGSNHT, equation 1.
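
The two hints can be combined roughly as follows. This is only a minimal sketch, not the reference solution. It assumes that each row of a 1-gram file is tab-separated and starts with the word, the year and match_count (check this against the files you download), that the word list has already been loaded into a dict mapping each word to its happiness score, and that words should be lower-cased before lookup. The mood of a year is taken as the average happiness score weighted by match_count. The function names and the toy data are made up for illustration.

```python
import gzip
from collections import defaultdict


def yearly_mood(lines, happiness):
    """Return {year: match_count-weighted average happiness} for 1-gram rows."""
    weighted = defaultdict(float)  # year -> sum of score * match_count
    total = defaultdict(int)       # year -> sum of match_count over scored words
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        # Assumed column order: word, year, match_count, (further columns ignored)
        word, year, match_count = fields[0].lower(), int(fields[1]), int(fields[2])
        score = happiness.get(word)
        if score is None:
            continue  # word is not in the word list
        weighted[year] += score * match_count
        total[year] += match_count
    return {year: weighted[year] / total[year] for year in total}


def yearly_mood_from_gzip(path, happiness):
    """Stream a gzipped 1-gram file line by line without decompressing it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return yearly_mood(handle, happiness)


# Toy example with made-up scores and rows (not real corpus data).
toy_scores = {"happy": 8.0, "war": 2.0}
toy_rows = [
    "happy\t1900\t3\t2\t1",
    "war\t1900\t1\t1\t1",
    "war\t1914\t5\t4\t2",
    "the\t1914\t100\t50\t9",  # ignored: not in toy_scores
]
toy_mood = yearly_mood(toy_rows, toy_scores)
print(toy_mood)
```

Note that the corpus is split over several files, so for the real plot the two sums per year have to be accumulated across all files before dividing.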

 

Exercise 3. A couple of questions on network models

  • What does clustering mean to a network scientist?
  • What does it mean that a network is a "small-world"? (hint: remember to include clustering in your answer).
  • What is the network degree distribution?
  • What is a power-law?
  • Name a few examples of networks with power-law degree distributions.

Exercise 4. The Barabasi-Albert (BA) model in NetworkX.

  • Use NetworkX to code up the BA model. Start with 5 nodes connected at random. Add one node at a time and have each new node connect to 3 existing nodes. Keep track of the age of each node (e.g. by naming each node after the time step at which it was introduced). Hint: The trickiest thing about coding up this model is choosing a node with probability proportional to its degree. First, you will need to be able to choose stuff at random: use this module. Now, the easiest way that I can think of to do this is to create a list in which each node occurs as many times as its degree and simply pick a random node from this list (but maybe you can find a better way). [It's possible to create BA networks with a built-in NetworkX function - that's not ok - you must write the code on your own].
  • Once you've generated a network of 300 nodes, use NetworkX and matplotlib to plot the network.
  • Now, create a new network of 5000 nodes and plot the degree distribution (use both log-log and linear scales). Hint: To see an example of plotting a network degree distribution, check out pp. 26-28 here: http://www.stanford.edu/class/cs224w/nx_tutorial.pdf .
  • Fit the data in the 5000 node network to find the slope of the straight line in the log-log plot. [Hint: linear regression in Python]. Generate 200 networks and fit each one - what are the average value and the variance of the slopes across all networks? Does that answer correspond to what you expected to find, according to the theory?
  • Age. Calculate the average degree as a function of node-age for your 200 networks. (What is the average degree of all of the oldest nodes, what's the average degree of all the second-oldest nodes, what's the average degree of all the third-oldest nodes, etc.). Create a plot of average degree on the y-axis and node-age on the x-axis. Is this the picture you expect to see in real-world networks? Justify your answer.
  • Do you think the BA model is a good model for real-world networks? Explain the reasons for your answer.
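
The two hints in this exercise (the repeated-node list and the linear regression on the log-log plot) can be sketched as below. This is an illustrative sketch under assumptions, not the full model: it leaves out the NetworkX graph and the growth loop, it assumes Python's standard random module is an acceptable source of randomness, and it assumes that the three targets of a new node must be distinct. The helper names pick_targets and loglog_slope and the toy data are made up. The advanced reading below discusses problems with this simple kind of fit.

```python
import math
import random
from collections import Counter


def pick_targets(degree_list, m, rng=random):
    """Pick m distinct nodes; each draw is proportional to node degree.

    degree_list holds every node once per edge end it has, so a uniform
    draw from it selects a node with probability proportional to its degree.
    Assumes at least m distinct nodes are present in degree_list.
    """
    targets = set()
    while len(targets) < m:
        targets.add(rng.choice(degree_list))
    return targets


def loglog_slope(degrees):
    """Least-squares slope of log10(count) against log10(degree)."""
    counts = Counter(d for d in degrees if d > 0)
    ks = sorted(counts)
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(counts[k]) for k in ks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Toy graph with edges (0,1), (1,2), (0,2), (2,3): node 2 has degree 3.
toy_degree_list = [0, 0, 1, 1, 2, 2, 2, 3]
toy_targets = pick_targets(toy_degree_list, 3)

# Made-up degree sequence with count(k) = 16 * k**-2, so the slope is about -2.
toy_degrees = [1] * 16 + [2] * 4 + [4] * 1
toy_slope = loglog_slope(toy_degrees)
print(toy_targets, toy_slope)
```

After a new node has been connected to its targets, the new node and each target would be appended to the list once per new edge, which keeps the list consistent with the degrees.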

 

Advanced reading (will help you answer some of the questions for exercise 4 above)

  • Adamic and Huberman (2000). Power law distribution of the World Wide Web. Science 287:2115. Download here. Helps with the age question.
  • Goldstein, Morris, Yen (2004). Problems with fitting to the power-law distribution. Eur. Phys. J. B 41, 255–258. Download here. Helps with the question on fitting power laws.