|
Page last edited by Vedran Sekara (vese) 02/04-2013
Formalia: Please read http://dtu.cnwiki.dk/02822/page/666/assignments-grading carefully
before proceeding. This page contains information about formatting
(including restrictions on size, etc), group sizes, and many other
aspects of handing in the assignment. If you fail to
follow these simple instructions, it will negatively impact your
grade!
Due date and time: The assignment is due
on Monday April 8th at 23:59.
Exercise 1. Understanding Sentiment Analysis
- What are the main findings in Quantitative Analysis of
Culture Using Millions of Digitized Books? There's a lot of
material in the article, so you will have to prioritize (that is
difficult and part of the exercise). Use at most one column in the
standard hand-in format for answering this question.
- What about the two other dimensions in Affective norms
for English words: What are "arousal" and
"dominance"?
- How was the labMT word list generated?
- Robustness of word lists. Explain how Figure 2A
in Temporal patterns of happiness and information in a
global social network: Hedonometrics and twitter
[TPHIGSNHT] was generated. How does that prove the
robustness of the labMT word list?
- How was the AFINN word list generated?
- Based on the article Word String frequency distributions, it seems that the
Google n-gram files contain quite a lot of errors. Do
you think the errors influence the findings
in Quantitative Analysis of Culture Using Millions of
Digitized Books? Justify your answer. Use your Python hacking
skills to find an example similar to "copy" (or "succeed") from the
blog posting (http://languagelog.ldc.upenn.edu/nll/?p=4456)
in one of the Google data files.
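As a starting point for the last question, here is a minimal sketch of how one word's yearly counts could be pulled out of a compressed 1-gram file without unpacking it. It assumes each line is tab-separated and begins with the fields ngram, year, match_count; check this against the files you actually download, since the column layout is an assumption here. The function name yearly_counts is made up for this example.

```python
import gzip
from collections import defaultdict


def yearly_counts(path, word):
    """Return {year: match_count} for one word in a gzipped 1-gram file.

    Assumption: tab-separated lines whose first three fields are
    ngram, year, match_count. Any further columns are ignored.
    """
    counts = defaultdict(int)
    # Mode "rt" reads the compressed file as text, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip malformed lines and lines for other words (exact, case-sensitive match).
            if len(fields) < 3 or fields[0] != word:
                continue
            counts[int(fields[1])] += int(fields[2])
    return dict(counts)
```

Plotting the returned counts against year for a handful of candidate words is one way to spot suspicious jumps like the ones discussed in the blog posting.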
Exercise 2. Historymood.
- Create the "historymood" plot. Based on the Google 1-gram
corpus and the labMT word list, calculate the "mood" as a function
of time (one data point per year) for the US and British English corpora
(make sure you take the counts [match_count] of each
word into account). Plot your results. Do you recognize any
historic events? What are the differences between the US and
British English plots?
- Hint 1: To manage the large files, consider using the module
gzip instead of decompressing them.
- Hint 2: The equation used to calculate the mood was part of the
lecture, but you can also find it in TPHIGSNHT equation 1.
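The mood calculation can be sketched as follows, assuming the mood of a year is the count-weighted average of word happiness scores, i.e. sum(h(w) * count(w)) / sum(count(w)) over the words that have a score. Verify this against the equation from the lecture before relying on it. The function name mood_by_year and the scores in the usage note are invented for illustration and are not labMT values.

```python
from collections import defaultdict


def mood_by_year(records, happiness):
    """Count-weighted average happiness per year.

    records: iterable of (word, year, match_count) tuples.
    happiness: dict mapping a word to its happiness score.
    Words that have no score are skipped entirely.
    """
    weighted = defaultdict(float)  # year -> sum of score * count
    total = defaultdict(int)       # year -> sum of counts of scored words
    for word, year, count in records:
        score = happiness.get(word)
        if score is None:
            continue
        weighted[year] += score * count
        total[year] += count
    # Years whose scored words all have zero count are left out.
    return {year: weighted[year] / total[year] for year in total if total[year] > 0}
```

For example, with the made-up scores {"happy": 8.0, "sad": 2.0}, three occurrences of "happy" and one of "sad" in the same year give a mood of (3 * 8.0 + 1 * 2.0) / 4 = 6.5 for that year.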
Exercise 3. A couple of questions on network models
- What does clustering mean to a network scientist?
- What does it mean that a network is a "small-world"?
(hint: remember to include clustering in your answer).
- What is the network degree distribution?
- What is a power-law?
- Name a few examples of networks with power-law degree
distributions.
Exercise 4. The Barabasi-Albert (BA) model in NetworkX.
- Use NetworkX to code up the BA model. Start with 5 nodes
connected at random. Add one node at a time and have each new node
connect to 3 existing nodes. Keep track of the age of each node
(e.g. by naming the nodes by the time-step at which they were
introduced). Hint: The trickiest thing
about coding up this model is choosing a node with probability
proportional to its degree. First, you will need to be able to
choose stuff at random: use this
module. Now, the easiest way that I can think of to do this is
to create a list in which each node occurs as many times as its degree and
simply pick a random node from this list (but maybe you can find
a better way). [It's possible to create BA networks with a built-in
NetworkX function - that's not ok - you
must write the code on your own].
- Once you've generated a network of 300 nodes, use NetworkX and
matplotlib to plot the network.
- Now, create a new network of 5000 nodes and plot the degree
distribution (use both loglog and linear
scales). Hint: To see an example of plotting
a network degree distribution, check out pp 26-28 here: http://www.stanford.edu/class/cs224w/nx_tutorial.pdf .
- Fit the data in the 5000 node network to find the slope of the
straight line in the log-log plot. [Hint:
linear regression in Python]. Generate 200 networks and fit
each one - what are the average value and the variance across all
slopes? Does that answer correspond to what you expected to
find, according to the theory?
- Age. Calculate the average degree as a function of node-age for
your 200 networks. (What is the average degree of all of the oldest
nodes, what's the average degree of all the second-oldest nodes,
what's the average degree of all the third-oldest nodes, etc).
Create a plot of average degree on the y-axis and
node-age on the x-axis. Is this the picture you
expect to see in real-world networks? Justify your answer.
- Do you think the BA model is a good model for real-world
networks? Explain the reasons for your answer.
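Two of the building blocks mentioned in the hints above can be sketched in plain Python. The first picks nodes with probability proportional to degree using the repeated-list idea; the second estimates the slope of a straight line in a log-log plot by ordinary least squares. Both function names are made up for this example, the degrees are passed in as a plain dict rather than read from a NetworkX graph, and this is only one possible approach, not a complete BA model.

```python
import math
import random


def pick_targets(degrees, m, rng=random):
    """Pick m distinct nodes, each draw proportional to node degree.

    degrees: dict mapping node -> degree.
    Note: rejecting repeated picks to keep the targets distinct is a
    simplification and slightly distorts strict proportionality.
    """
    # Each node appears once per unit of degree, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    pool = [node for node, k in degrees.items() for _ in range(k)]
    if len(set(pool)) < m:
        raise ValueError("fewer than m nodes with positive degree")
    targets = set()
    while len(targets) < m:
        targets.add(rng.choice(pool))
    return targets


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    Pairs with a non-positive x or y are skipped, since their
    logarithm is undefined.
    """
    points = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    return sxy / sxx
```

Passing a seeded random.Random instance as rng makes runs reproducible, which helps when comparing the 200 networks. The second reference under Advanced reading discusses why a least-squares fit of this kind can be problematic for power laws.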
Advanced reading (will help you answer some of the questions for exercise 4 above)
- Adamic and Huberman (2000). Power law distribution of the World
Wide Web. Science, 287:2115. Download here. Helps with the age
question.
- Goldstein, Morris, Yen (2004). Problems with fitting to the
power-law distribution. Eur. Phys. J. B 41, 255–258. Download here.
Helps with the question about problems with fitting power laws.
|