02806 Course Wiki

Assignment 1
Page last edited by Andrea Cuttone (ancu) 25/02-2014

Formalia: Please read http://dtu.cnwiki.dk/02806/page/1159/assignments-project carefully before proceeding. This page contains information about formatting (including restrictions on group size, etc), and many other aspects of handing in the assignment. If you fail to follow these simple instructions, it will negatively impact your grade! 

 

Due date: The due date is March 3rd 23:59. Hand in via CampusNet upload of relevant .ipynb files, etc.

 


 

Part 1: Movie runtimes

First we analyze the duration of films from the DBpedia film dataset. DBpedia is a structured dump taken from Wikipedia. We will work with a dataset in CSV format. Start by downloading the file film.csv from the CampusNet shared folder, extract the archive, and place the CSV in your IPython Notebook folder.

a) Open the CSV file and take a look at the first 5 lines. Lines 1-4 contain metadata; line 1 is the list of fields of the CSV file. Print the position of each field, for example:

0 URI
1 rdf-schema#label
2 rdf-schema#comment
3 basedOn_label
4 basedOn
5 budget
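
One possible way to print the positions is sketched below. It assumes the header is the first row of a comma-separated file; the sample rows are made up for illustration and only stand in for film.csv, and the helper name field_positions is hypothetical.

```python
import csv
import io


def field_positions(csv_file):
    """Return (index, name) pairs for the header row of an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)  # line 1 holds the field names
    return list(enumerate(header))


# Illustrative stand-in for film.csv (not the real data).
sample = io.StringIO(
    'URI,rdf-schema#label,rdf-schema#comment\n'
    'http://example.org/a,"Film A","A comment"\n'
)

for position, name in field_positions(sample):
    print(position, name)
```

With the real file, an open file handle would be passed in place of the io.StringIO object.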

 

b) Extract the name and runtime of the first 10 movies.

 

c) Save the titles and runtimes for all movies  into lists, discarding the ones with missing or invalid values.  Explain in your own words: how did you define missing or invalid values? Justify your choices. 
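
As a starting point, the sketch below shows one possible definition of "missing or invalid"; it is not the only reasonable one, and you still need to justify your own choice. It assumes each row has already been reduced to a (title, raw runtime string) pair. The 'NULL' marker in the comment and the function names are hypothetical.

```python
import math


def parse_runtime(raw):
    """Return the runtime as a float, or None if it is missing or invalid."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # e.g. empty strings, None, or a marker such as 'NULL'
    # float() also accepts 'nan' and 'inf'; reject those and non-positive values.
    if not math.isfinite(value) or value <= 0:
        return None
    return value


def clean(rows):
    """Split (title, raw_runtime) pairs into two aligned lists of valid entries."""
    titles, runtimes = [], []
    for title, raw in rows:
        runtime = parse_runtime(raw)
        if title and runtime is not None:
            titles.append(title)
            runtimes.append(runtime)
    return titles, runtimes
```

Keeping the two lists aligned (same index, same movie) makes the later sorting and plotting steps easier.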

 

d) Print the titles of the 10 longest-running and the 10 shortest-running movies.

 

e) We can find movies as long as thousands of hours, and as short as a few seconds. They are probably outliers, as we know that movies normally last 1-3 hours. Take a look at the overall distribution by plotting the runtimes on the y axis, sorted in decreasing order. [Hint: use plt.yscale('log') to use a log scale.] Explain in your own words: why is it a good idea to use a log scale?
 

f) Read about the cumulative distribution function (CDF). Explain in your own words (EIYOW): what is a CDF and what can it be used for? Plot the CDF using statsmodels. EIYOW: what does the CDF plot indicate about the distribution of runtimes? As we thought, most of the movies have a runtime of less than 10^4 seconds. Create a new list for runtimes smaller than 10^4. EIYOW: how many movies have a runtime >= 10^4?
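
To make the idea concrete, here is a minimal hand-rolled sketch of an empirical CDF in plain Python. It is not the statsmodels routine the exercise asks for, only an illustration of what such a routine computes. The function names and the sample runtimes are hypothetical.

```python
def ecdf(values):
    """Return sorted values xs and, for each, the fraction of observations <= x.

    For tied values, the last point of the tie carries the CDF value.
    """
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def split_at(values, threshold=10**4):
    """Return the values below the threshold and the count of values >= threshold."""
    below = [v for v in values if v < threshold]
    return below, len(values) - len(below)


# Made-up runtimes in seconds, for illustration only.
runtimes = [5400, 20000, 10000, 60]
xs, ys = ecdf(runtimes)
short_runtimes, n_long = split_at(runtimes)
print(list(zip(xs, ys)))
print(short_runtimes, n_long)
```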

 

g) Plot the histogram of the runtimes < 10^4 with 10, 50, and 100 bins. EIYOW: what is the effect of changing the number of bins? What do you think is a more appropriate number of bins? Motivate your answers.

 

h) Read about kernel density estimation (KDE). EIYOW: what is KDE and what can it be used for? Plot a KDE of the movie data using scipy. Try the effect of different kernel bandwidths. EIYOW: what does the KDE show? What is the difference between the KDE and the histograms? EIYOW: what is the effect of changing the kernel bandwidth?
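
The following plain-Python sketch shows what a Gaussian KDE computes at a single point. It is an illustration, not the scipy implementation. Here the hypothetical parameter bandwidth is the absolute standard deviation of each kernel; scipy's gaussian_kde instead takes a factor that is scaled by the spread of the data, so the numbers are not directly interchangeable.

```python
import math


def gaussian_kde_at(x, data, bandwidth):
    """Kernel density estimate at point x: the average of Gaussian bumps centred on the data."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm


# Made-up runtimes in seconds, for illustration only.
data = [5200.0, 5400.0, 5600.0, 7200.0]
for bandwidth in (100.0, 500.0, 2000.0):
    print(bandwidth, gaussian_kde_at(5400.0, data, bandwidth))
```

Evaluating the function on a grid of x values and plotting the result gives the smooth curve; a larger bandwidth spreads each bump out and flattens the estimate.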

 

i) Briefly summarize the insights obtained from this analysis about the film dataset

Part 2: Visualization background & theory

a) First, encodings. Choose 3-4 visualizations from the websites below and for each describe how encodings are mapped to the data:

b) Select 2-3 examples from http://wtfviz.net/ and comment on the issues with regard to Tufte's and the ACCENT guidelines.

 

Part 3: Timeseries

a) Import data. Now for the time series: the search volumes of "android" and "iphone" from Google Trends. You may substitute these with your own search queries if you're interested (if you choose this option, it's an additional exercise to modify the exercises below in a reasonable way so they work for your dataset). Download the iphonevsandroid.csv file from the CampusNet shared folder Week03, and print the first 2 lines. The first column is a period; the second and third contain the search volume between 0 and 100 for iphone and android respectively. Parse the first field to convert it to a single date object (hint: use datetime.strptime). You can choose either of the two dates.
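
A minimal sketch of the parsing step follows. It assumes the period field looks like "2004-01-04 - 2004-01-10" (two ISO dates separated by " - "); check the actual file and adjust the separator and format string if it differs. The function name parse_period is hypothetical.

```python
from datetime import datetime


def parse_period(period, use_start=True):
    """Convert a 'start - end' period string into a single datetime object."""
    start, end = period.split(' - ')
    chosen = start if use_start else end
    return datetime.strptime(chosen.strip(), '%Y-%m-%d')


# Assumed example value, not taken from the real file.
print(parse_period('2004-01-04 - 2004-01-10'))
```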
 
 
b) Plot the search volume over time. Verify that your result looks similar to http://www.google.dk/trends/explore#q=android%2C%20iphone
 
 
c) Smooth the trends using a moving average function with different window sizes. EIYOW: what is the effect of different window sizes?
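
One simple form of moving average is sketched below: each output value is the mean of a run of consecutive inputs, with no padding at the edges, so the result is shorter than the input. Other conventions (centred windows, padded edges) are equally valid. The function name and the sample values are hypothetical.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (no padding at the ends)."""
    if window < 1 or window > len(values):
        raise ValueError('window must be between 1 and len(values)')
    total = sum(values[:window])
    result = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        total += values[i] - values[i - window]
        result.append(total / window)
    return result


# Made-up search volumes, for illustration only.
volumes = [10, 12, 30, 11, 13, 12, 40, 14]
for window in (2, 4):
    print(window, moving_average(volumes, window))
```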
 
 
d) Read about LOWESS. Plot a LOWESS smoothed line using statsmodels with frac=0.2. EIYOW: what is the difference between LOWESS and running average?
 
 
e) Now we want to look for a yearly pattern in the Google Trends search volume for "weight loss". Load wloss.csv from CampusNet and plot the search volumes from the years 2011-2013 in 3 subplots, one year per subplot. EIYOW: Do you observe a pattern by month? Can you think of a possible explanation for such patterns?
 
Part 4: Visualizations for the web
 
 
We analyze a dataset about meteors that fell on the Earth. Before proceeding to the actual visualizations, we need to preprocess the data in python.
 
a) Download meteors.json from the CampusNet Week04 shared folder. Parse the JSON file using the json module (http://docs.python.org/2/library/json.html). Print the first element to understand its format. Parse the year, mass, lat and long fields and put them into 4 separate lists. Each list should have 1180 elements. Count the number of meteor observations for every 10 years, starting from 1500. Print the counts.
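
The decade counting could be sketched as follows, assuming the years have already been parsed into a list of integers. The function name and the sample years are hypothetical. Note that this version omits decades with no observations; fill them in with zeros if you want an evenly spaced axis.

```python
from collections import Counter


def counts_per_decade(years, start=1500):
    """Count observations per 10-year bin, keyed by the first year of each decade."""
    counts = Counter((year // 10) * 10 for year in years if year >= start)
    return sorted(counts.items())


# Made-up years, for illustration only.
for decade, count in counts_per_decade([1492, 1501, 1509, 1510, 2003]):
    print(decade, count)
```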
 
Google Charts.  For this part, simply include your html code as "raw text" in the IPython Notebook, as well as the images produced (notes on how to include content here).
 
b) We will start by plotting the mass of meteors versus the year using a scatter plot. Look at the scatterchart example https://google-developers.appspot.com/chart/interactive/docs/gallery/scatterchart. Create a local html file, paste the example into it, and open it with a browser. You should see the same chart as the one on the website. Try to change the data in the chart, and experiment with the chart parameters (axis, labels, colors). Include a couple of examples in your Notebook.
 
c) Now we need to plug in our meteors data into the Google Chart. To do so, we need to generate text in the format:
 
    [ x1,      y1],
    [ x2,      y2],
    ...
    [ xn,     yn]
 
Using python, print the years and the masses as x and y, limiting the data only for years > 1900. Copy the text output, and paste it into the html file in the right place. Reload the html, and you should see that the graph has been updated.
Since the mass values are of different orders of magnitude, we need to set the log scale on the vertical axis. To do so, add logScale:true to the vAxis in the options object. EIYOW: can you see any pattern in this graph? (hint: remember the linear correlation discussion)
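
A small sketch of generating the row text is shown below, assuming years and masses are two aligned lists of numbers. The helper name chart_rows, its min_x parameter, and the sample values are hypothetical.

```python
def chart_rows(xs, ys, min_x=None):
    """Format (x, y) pairs as '[x, y]' rows, keeping only x > min_x when min_x is given."""
    rows = [
        '    [{}, {}]'.format(x, y)
        for x, y in zip(xs, ys)
        if min_x is None or x > min_x
    ]
    # Join with commas so that the last row has no trailing comma.
    return ',\n'.join(rows)


# Made-up values, for illustration only.
years = [1850, 1950, 2000]
masses = [10, 250.5, 3]
print(chart_rows(years, masses, min_x=1900))
```

The printed text can then be pasted in place of the data rows of the example html file.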
 
d) We now want to plot the number of meteors per decade using a column chart. Look at the example https://google-developers.appspot.com/chart/interactive/docs/gallery/columnchart. Create a local html file, paste the example into it, and open it with a browser. You should see the same chart as the one on the website. Try to change the data in the chart, and experiment with the chart parameters (axis, labels, colors). Include a couple of examples in your Notebook.
 
Now we need to plug in our meteors data into the Google Chart. To do so, we need to generate text in the format:
 
    [ x1,      y1],
    [ x2,      y2],
    ...
    [ xn,     yn]
 
Using python, print the years and the counts as x and y, using the histogram result from before. Copy the text output, and paste it into the html file in the right place. Reload the html, and you should see that the graph has been updated. Add bar: {groupWidth: "90%"} to your options object to tweak the bars' appearance. EIYOW: can you see the increasing trend in the last 200 years? Does this mean the world is ending? Or can you think of some bias in the data?
 
e) Finally we will plot the locations of the meteors on a map. Look at the Markers example at https://google-developers.appspot.com/chart/interactive/docs/gallery/geochart. Create a local html file, paste the example into it, and open it with a browser. You should see the same chart as the one on the website. Try to change the data in the chart, and experiment with the chart parameters (axis, labels, colors).
 
Now we need to plug in our meteors data into the Google Chart. To do so, we need to generate text in the format:
 
    [lat1, lon1, mass1],
    [lat2, lon2, mass2],
    ...
    [latn, lonn, massn],
 
Using python, print the latitudes, longitudes and masses. Copy the text output, and paste it into the html file in the right place. Reload the html, and you should see that the graph has been updated. Change region: 'world' in the options object to show the whole map.
 
IMPORTANT NOTE. For part 4, you may also hand in the advanced exercise.
 
 