Assignment 1
Page last edited by Andrea Cuttone (ancu) 25/02-2014
Formalia: Please read http://dtu.cnwiki.dk/02806/page/1159/assignments-project carefully before proceeding. This page contains information about formatting (including restrictions on group size, etc), and many other aspects of handing in the assignment. If you fail to follow these simple instructions, it will negatively impact your grade!
Due date: The due date is March 3rd 23:59. Hand in via CampusNet upload of relevant .ipynb files, etc.
Part 1: Movie runtimes
First we analyze the duration of films from the dbpedia film dataset. dbpedia is a structured dump taken from wikipedia. We will work with a dataset in CSV format. Start by downloading the file film.csv from the Campusnet shared folder, extract the archive and place the csv in your ipython notebook folder.
a) Open the csv file and take a look at the first 5 lines. Lines 1-4 contain metadata. Line 1 is the list of fields of the CSV file. Print the position of the fields, for example:
0 URI
1 rdf-schema#label
2 rdf-schema#comment
3 basedOn_label
4 basedOn
5 budget
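A minimal sketch of how the field positions in a) could be printed, assuming Python 3 and the standard csv module. The function name field_positions and the in-memory sample line are illustrative only; with the real data you would pass an open handle to film.csv instead.

```python
import csv
import io

def field_positions(csv_file):
    """Return (index, name) pairs for the header row of an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)  # line 1 holds the field names
    return list(enumerate(header))

# Hypothetical in-memory stand-in for the first line of film.csv
sample = io.StringIO('"URI","rdf-schema#label","rdf-schema#comment","basedOn_label"\n')
for index, name in field_positions(sample):
    print(index, name)
```

Using csv.reader rather than splitting on commas keeps quoted fields that contain commas intact.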
b) Extract name and runtime for the first 10 movies
c) Save the titles and runtimes for all movies into lists, discarding the ones with missing or invalid values. Explain in your own words: how did you define missing or invalid values? Justify your choices.
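One possible way to approach c) is sketched below. It treats a row as invalid when the title is empty or the runtime cannot be parsed as a positive, finite number. That definition, the function name clean_runtimes, the column indices and the "NULL" placeholder in the sample rows are all assumptions for illustration; check what film.csv actually contains and justify your own rule.

```python
import math

def clean_runtimes(rows, title_col, runtime_col):
    """Collect titles and runtimes, skipping rows with unusable values."""
    titles, runtimes = [], []
    for row in rows:
        if len(row) <= max(title_col, runtime_col):
            continue  # row is too short to hold both fields
        title = row[title_col].strip()
        try:
            runtime = float(row[runtime_col])
        except ValueError:
            continue  # not a number, e.g. a placeholder string
        if not title or not math.isfinite(runtime) or runtime <= 0:
            continue  # empty title, nan/inf, or non-positive runtime
        titles.append(title)
        runtimes.append(runtime)
    return titles, runtimes

# Hypothetical rows: (title, runtime in seconds)
rows = [["Film A", "5400.0"], ["Film B", "NULL"], ["", "100"], ["Film C", "-1"]]
print(clean_runtimes(rows, title_col=0, runtime_col=1))
```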
d) Print the titles of the 10 longest and the 10 shortest running movies.
f) Read about the Cumulative distribution function (CDF). Explain in your own words: what is a CDF and what can it be used for? Plot the CDF using statsmodels. Explain in your own words (EIYOW): What does the CDF plot indicate about the distribution of runtimes? As we thought, most of the movies have a runtime of less than 10^4 seconds. Create a new list for runtimes smaller than 10^4. EIYOW: how many movies have runtime >= 10^4?
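To make the idea behind f) concrete, here is a minimal pure-Python sketch of an empirical CDF together with the 10^4 split. It is not a replacement for the statsmodels plot the exercise asks for; the function name ecdf and the sample runtimes are made up for illustration.

```python
def ecdf(values):
    """Return sorted values and the fraction of data <= each of them."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

# Hypothetical runtimes in seconds
runtimes = [5400.0, 7200.0, 6000.0, 36000.0]
xs, ys = ecdf(runtimes)
print(list(zip(xs, ys)))

# Split at 10^4 seconds, as the exercise asks
short_runtimes = [r for r in runtimes if r < 10**4]
n_long = len(runtimes) - len(short_runtimes)
print(len(short_runtimes), n_long)
```

For tied values the step function takes the largest of the y values listed for that x.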
g) Plot the histogram of the runtimes < 10^4 with 10, 50, 100 bins. EIYOW: what is the effect of changing the number of bins? What do you think is a more appropriate number of bins? Motivate your answers.
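Before plotting, it can help to see what a histogram actually counts. The sketch below computes equal-width bin counts in plain Python. The helper name bin_counts and the sample values are illustrative, and putting the maximum value into the last bin is one common convention that is assumed here.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for bins in (2, 5):
    print(bins, bin_counts(values, bins))
```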
h) Read about kernel density estimation (KDE). EIYOW: What is KDE and what can it be used for? Plot a KDE of the movie data using scipy. Try the effect of different kernel bandwidths. EIYOW: What does the KDE show? What is the difference between the KDE and the histograms? EIYOW: What is the effect of changing the kernel bandwidth?
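As a rough illustration of what a KDE computes, the sketch below evaluates a Gaussian kernel density estimate by hand at a single point. The function name gaussian_kde_at and the sample data are assumptions for illustration. Note also that the bandwidth here is an absolute width in data units, whereas scipy's gaussian_kde takes a bw_method factor that scales the spread of the data, so the numbers are not directly interchangeable.

```python
import math

def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian KDE at x: the average of one bump per sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )

# Hypothetical runtimes in seconds
samples = [5400.0, 6000.0, 7200.0]
for bandwidth in (100.0, 500.0, 2000.0):
    print(bandwidth, gaussian_kde_at(6000.0, samples, bandwidth))
```

A small bandwidth gives a spiky estimate that follows individual samples; a large one smooths them into a single broad bump.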
i) Briefly summarize the insights obtained from this analysis about the film dataset.
Part 2: Visualization background & theory
a) First, encodings. Choose 3-4 visualizations from the websites below and for each describe how encodings are mapped to the data:
b) Select 2-3 examples from http://wtfviz.net/ and comment on the issues with regard to Tufte's and ACCENT guidelines.
Part 3: Timeseries
a) Import data. Now for the time
series: the search volumes of "android" and "iphone" from Google
Trends. You may substitute these with your own search queries, if
you're interested (if you choose this option, it's an additional
exercise to modify the exercises below in a reasonable way so they
work for your dataset). Download the iphonevsandroid.csv file from
Campusnet shared folder Week03, and print the first 2
lines. The first column
is a period, the second and the third contain the search volume
between 0 and 100 for iphone and android respectively. Parse the
first field to convert it to a single date object (hint: use
datetime.strptime). You can choose either of the two
dates.
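A minimal sketch of the date parsing in a), assuming the period field looks like "2004-01-04 - 2004-01-10" (two ISO dates separated by " - "). That layout is an assumption; print the first lines of iphonevsandroid.csv and adjust the separator and format string to what you actually see. The function name parse_period_start is illustrative.

```python
from datetime import datetime

def parse_period_start(period):
    """Parse the first date of a 'start - end' period string."""
    start_text = period.split(" - ")[0]
    return datetime.strptime(start_text, "%Y-%m-%d")

print(parse_period_start("2004-01-04 - 2004-01-10"))
```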
b) Plot the search volume over time. Verify that your result
looks similar to
http://www.google.dk/trends/explore#q=android%2C%20iphone
c) Smooth the trends using a moving average function with
different window sizes. EIYOW: what is the effect
of different window sizes?
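One simple form of moving average is sketched below: each output point is the mean of a full window of consecutive values, so the result is shorter than the input by window - 1 points. The function name moving_average and the sample values are illustrative; other conventions (centered windows, padding at the edges) are equally valid.

```python
def moving_average(values, window):
    """Mean of every run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

volumes = [1, 2, 3, 4, 5]
for window in (1, 3, 5):
    print(window, moving_average(volumes, window))
```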
d) Read about LOWESS. Plot a LOWESS smoothed line using
statsmodels with frac=0.2. EIYOW: what is the difference between LOWESS
and running average?
e) Now we want to look for a yearly pattern in the Google
Trend search volume for "weight loss". Load the wloss.csv from
Campusnet and plot the search volumes from the years 2011-2013 in 3
subplots, one year per subplot. EIYOW:
Do you observe a pattern by month? Can you think of a possible
explanation for such patterns?
Part 4: Visualizations for the web
We analyze a dataset about
meteors that fell on the Earth. Before proceeding to the actual
visualizations, we need to preprocess the data in
python.
a) Download meteors.json from Campusnet Week04 shared folder.
Parse the json file using the json module
(http://docs.python.org/2/library/json.html). Print the first
element to understand its format. Parse the year, mass, lat and long fields
and put them into 4 separate lists. Each list should have 1180
elements. Count the number
of meteor observations for every 10 years, starting from 1500.
Print the counts.
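The counting step in a) could be sketched as follows. The JSON snippet is a hypothetical stand-in: the real key names and value types in meteors.json may differ, which is why the exercise asks you to print the first element first. The function name decade_counts is also illustrative.

```python
import json
from collections import Counter

# Hypothetical records, not taken from meteors.json
raw = '[{"year": "1501"}, {"year": "1505"}, {"year": "1512"}, {"year": "1499"}]'
records = json.loads(raw)
years = [int(record["year"]) for record in records]

def decade_counts(years, start=1500):
    """Count observations per decade, keyed by the decade's first year."""
    counts = Counter((y // 10) * 10 for y in years if y >= start)
    return sorted(counts.items())

for decade, count in decade_counts(years):
    print(decade, count)
```

Decades without any observation are simply absent from this result; fill them in with zero if your chart needs every decade.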
Google Charts. For this part, simply include
your html code as "raw text" in the IPython Notebook, as well as
the images produced (notes on how to include content
here).
b) We will start plotting the mass of meteors versus the year
using a scatter plot. Look at the scatterchart example
https://google-developers.appspot.com/chart/interactive/docs/gallery/scatterchart.
Create a local html file, paste the example into it, and open it
with a browser. You should see the same chart as the one on the
website. Try to change the data in the chart, and experiment with
the chart parameters (axis, labels, colors). Include a couple of
examples in your Notebook.
c) Now we need to plug in our meteors data into the Google
Chart. To do so, we need to generate text in the format:
[ x1, y1],
[ x2, y2],
...
[ xn, yn]
Using python, print the years and the masses as x and y,
limiting the data only for years > 1900. Copy the text output,
and paste it into the html file in the right place. Reload the
html, and you should see that the graph has been updated.
Since the mass values are of different orders of magnitude, we
need to set the log scale on the vertical axis. To do so, add
logScale:true to the vAxis in the options object. EIYOW:
can you see any pattern in this graph? (hint: remember the linear
correlation discussion)
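The text for c) can be generated along these lines. The function name scatter_rows and the sample values are made up for illustration; the row layout follows the "[ x1, y1]," format shown above, with no comma after the last row.

```python
def scatter_rows(years, masses, min_year=1900):
    """Format (year, mass) pairs with year > min_year as chart rows."""
    rows = [
        "[ {}, {}]".format(year, mass)
        for year, mass in zip(years, masses)
        if year > min_year
    ]
    return ",\n".join(rows)

print(scatter_rows([1880, 1901, 1950], [21.0, 10.0, 2500.0]))
```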
d) We now want to plot the number of meteors per decade using
a column chart. Look at the example
https://google-developers.appspot.com/chart/interactive/docs/gallery/columnchart.
Create a local html file, paste the example into it, and open it
with a browser. You should see the same chart as the one on the
website. Try to change the data in the chart, and experiment with
the chart parameters (axis, labels, colors). Include a couple
of examples in your Notebook.
d) Now we need to plug in our meteors data into the Google
Chart. To do so, we need to generate text in the format:
[ x1, y1],
[ x2, y2],
...
[ xn, yn]
Using python, print the years and the counts as x and y, using
the histogram result from before. Copy the text output, and paste
it into the html file in the right place. Reload the html, and you
should see that the graph has been updated. Add bar: {groupWidth:
"90%"} to your
options object to tweak the bars' appearance. EIYOW: can you see the
increasing trend in the last 200 years? Does this mean the world is
ending? Or can you think of some bias in the data?
e) Finally we will plot the locations of the meteors on a map.
Look at the Markers example at
https://google-developers.appspot.com/chart/interactive/docs/gallery/geochart.
Create a local html file, paste the example into it, and open it
with a browser. You should see the same chart as the one on the
website. Try to change the data in the chart, and experiment with
the chart parameters (axis, labels, colors).
Now we need to plug in our meteors data into the Google Chart.
To do so, we need to generate text in the format:
[lat1, lon1, mass1],
[lat2, lon2, mass2],
...
[latn, lonn, massn],
Using python, print the latitudes, longitudes and masses. Copy
the text output, and paste it into the html file in the right
place. Reload the html, and you should see that the graph has been
updated. Set region: 'world' in the options object to show the
whole map.
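For the three-column rows in e), one option is to let json.dumps serialize each row, since a JSON array of numbers is also a valid JavaScript array literal. The function name geo_rows and the coordinates below are hypothetical; each row ends with a comma, matching the format shown above.

```python
import json

def geo_rows(lats, lons, masses):
    """Format (lat, lon, mass) triples as one chart row per line."""
    return "\n".join(
        json.dumps([lat, lon, mass]) + ","
        for lat, lon, mass in zip(lats, lons, masses)
    )

print(geo_rows([55.7, -33.9], [12.5, 18.4], [100, 2500]))
```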
IMPORTANT NOTE. For part 4, you may also hand in the
advanced exercise.