Page last edited by Ole Winther (olwi) 21/02-2013
Learning objectives.
- Practice API skills by working through additional
examples.
- Work with binning of geo-spatial data for alternatives to
standard dot-visualization.
- Begin to work with text extracted from the Flickr
API.
- Play with simple word-clouds (the statistics underlying the
word-clouds will lead towards the work on machine learning coming
up in the subsequent lectures).
Program
Today we'll start with a more classical lecture, where I'll
discuss my take on visualization as a gateway to understanding data
and generating hypotheses. We'll see why the human eye can easily
do things that are difficult using mathematics.
For the exercises we're back with Flickr. We'll continue working
with data from the Flickr API.
Exercise 1. Use
hexbin (from the matplotlib package) to divide and plot the
geographical location of the pictures from Flickr into bins.
- What are the advantages of binning rather than just plotting
the points? Can you think of a specific situation where binning
might help you see structure that you can't see by plotting simple
x,y coordinates? [Hint: find inspiration here.]
- Extract the location data from all pictures. If you did not
finish the exercise last week, a fast way to obtain the
geographical data is to call
flickr.photos_search(tags=[tags], bbox=[Coordinates for
Denmark], page=#, extras='geo',
format='json') and iterate over the pages.
[Hint: If you can't figure out how to iterate over
pages, talk to one of the instructors!]
- With the data at hand, plot the hexbin over a basemap.
Remember to use the extent option for
hexbin.
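The binning step of Exercise 1 could look roughly like the sketch below. It is a minimal illustration, not the course's reference solution: the Denmark bounding box is an approximate value chosen for this example, the helper names are hypothetical, and the photo records are assumed to be dictionaries with 'longitude' and 'latitude' fields as returned with extras='geo'. The plot is drawn on plain matplotlib axes; drawing it over a basemap is left to the exercise.

```python
# Approximate bounding box for Denmark (an illustrative assumption), in the
# order used by the Flickr bbox argument: min_lon, min_lat, max_lon, max_lat.
DENMARK_BBOX = (8.0, 54.5, 15.2, 57.8)


def bbox_to_extent(bbox):
    """Reorder a Flickr-style bbox into hexbin's (xmin, xmax, ymin, ymax)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon, max_lon, min_lat, max_lat)


def photo_coordinates(photos):
    """Return (lons, lats) for the photos that carry usable coordinates."""
    lons, lats = [], []
    for photo in photos:
        try:
            lon = float(photo["longitude"])
            lat = float(photo["latitude"])
        except (KeyError, TypeError, ValueError):
            continue  # no coordinates in this record
        if lon == 0.0 and lat == 0.0:
            continue  # assumption: 0,0 marks a photo without a location
        lons.append(lon)
        lats.append(lat)
    return lons, lats


def plot_hexbin(lons, lats, bbox=DENMARK_BBOX, gridsize=40,
                outfile="hexbin.png"):
    """Draw a hexagonal-bin density plot of the points and save it."""
    # Imported here so the helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    hb = ax.hexbin(lons, lats, gridsize=gridsize,
                   extent=bbox_to_extent(bbox),  # fix the binned area
                   bins="log",                   # log scale for skewed counts
                   mincnt=1)                     # leave empty bins blank
    fig.colorbar(hb, ax=ax, label="photos per bin (log scale)")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    fig.savefig(outfile)
```

Passing the extent explicitly keeps the hexagon grid fixed to the map area, so plots made from different tag searches stay comparable.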
Exercise 2. Download tags for all pictures taken
in Denmark and tagged with 'Denmark'.
- To get the tags, import the flickrapi module and use
flickr.photos_search with arguments bbox, tags,
format, extras, and page to download all
tags.
[Hint: You can
use flickr.photos_search(tags='Denmark', bbox=[Coordinates for
Denmark], page=k, extras='tags', format='json'). Note that
this strategy is very similar to the strategy in exercise 1;
basically extras='geo' is replaced
by extras='tags' in the query. Full
documentation for flickr.photos_search is
here]
- Remember to iterate over all the pages by using the argument
page=[number].
- Again, save all data; you will need it later in the class (when
we get to the machine learning part).
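Iterating over the result pages and saving the data can be sketched as follows. This is only one possible arrangement: fetch_page is a hypothetical callable that you would write yourself, for example as a thin wrapper around flickr.photos_search(..., page=k) that returns one page of results as a parsed dictionary. The sketch assumes the usual layout of a Flickr JSON search response, with the photo list under 'photos' → 'photo' and the page count under 'photos' → 'pages'.

```python
import json


def fetch_all_pages(fetch_page, max_pages=None):
    """Collect the photo records from every result page.

    fetch_page(page) is a hypothetical callable returning the parsed JSON
    dictionary for one page of search results.
    """
    photos = []
    page = 1
    while True:
        block = fetch_page(page)["photos"]
        photos.extend(block["photo"])
        total_pages = int(block["pages"])
        reached_cap = max_pages is not None and page >= max_pages
        if page >= total_pages or reached_cap:
            break
        page += 1
    return photos


def save_photos(photos, path):
    """Write the collected records to disk as JSON for later exercises."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(photos, fh)
```

The optional max_pages cap is handy while testing, so that a mistake in the loop does not trigger hundreds of API calls.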
Exercise 3. Generate a word cloud for the photos.
- Extract all tags from the newly obtained dataset: loop over the
files/dictionaries and add individual tags to a list ... or save to
a file.
- Use Wordle (or
similar; you can also use Python packages, etc.) to create a Word
Cloud (disregard 'Denmark' since that one is in all the
tag lists).
- Which other tags are popular?
- Which other things can you learn about DK photos from the
word-cloud?
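The tag extraction and counting behind a word cloud can be sketched with the standard library alone. The sketch assumes each photo record is a dictionary whose 'tags' field holds the tags as a single space-separated string, which is how they arrive when requested with extras='tags'; the function name is made up for this example.

```python
from collections import Counter


def count_tags(photos, ignore=("denmark",)):
    """Count how often each tag occurs, skipping the ignored tags."""
    ignored = {tag.lower() for tag in ignore}
    counts = Counter()
    for photo in photos:
        # Records without a 'tags' field contribute nothing.
        for tag in photo.get("tags", "").split():
            tag = tag.lower()
            if tag not in ignored:
                counts[tag] += 1
    return counts
```

Calling count_tags(photos).most_common(20) then lists the twenty most popular tags with their counts, which is the information a word cloud turns into font sizes.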
A note about optional exercises: It is not
necessary to complete the optional exercises! It is possible to
get a perfect grade in this class without completing the
optional exercises - these are added for the benefit of students
who manage to complete the mandatory exercises and need extra
challenges. The optional exercises are typically difficult.
Also note that it is not a waste of time to work on the optional
exercises. The way we credit work on optional exercises is that
correct answers on these exercises can make up for incorrect
answers on the mandatory exercises.
Exercise 4. [Optional] Generate region specific Word
Clouds.
- Generate one Word Cloud for Jutland (Jylland) and one for
Zealand (Sjælland).
- You can either do this by splitting the already downloaded data
into two geographic bounding boxes, or call the API again, this
time with bounding boxes for the two regions.
- As before, extract all tags from the data and plot the Word
Clouds. Do you observe any differences between the popularity of
tags?
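Splitting already downloaded data into regions could be done along these lines. The two bounding boxes are rough rectangles picked for illustration only (a rectangle cannot follow the coastline, and they are not official values), and the sketch assumes every record carries both coordinates and tags, for instance because it was fetched with extras='geo,tags'.

```python
# Rough illustrative boxes (assumptions, not official values), given as
# min_lon, min_lat, max_lon, max_lat like the Flickr bbox argument.
JUTLAND_BBOX = (8.0, 54.8, 10.6, 57.8)
ZEALAND_BBOX = (10.9, 54.9, 12.7, 56.2)


def in_bbox(photo, bbox):
    """Return True if the photo's coordinates fall inside the box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    lon = float(photo["longitude"])
    lat = float(photo["latitude"])
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def split_by_region(photos, regions):
    """Map each region name to the photos located inside its box."""
    return {
        name: [photo for photo in photos if in_bbox(photo, bbox)]
        for name, bbox in regions.items()
    }
```

Each of the resulting lists can then be passed through the same tag counting as in Exercise 3 to produce one word cloud per region.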
Exercise 5. [Optional] Create a trace network for the
10 users with most uploaded pictures.
- Go through your dataset and count how many pictures belong to
each user. The 'owner' field specifies a user.
- Extract geographical data for the 10 users with most pictures
(use the API). Remember to order the pictures according to their id,
which provides an indication of time (you may also acquire time
from the API).
- Using Basemap and matplotlib, plot a network of the pictures
where the links are drawn between subsequent pictures.
- Color the traces of each user differently. What does the
network show? Can you distinguish any patterns between the various
users?
- Inspiration:
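The counting and ordering steps of Exercise 5 can be sketched as below. The function names are hypothetical, and the sketch assumes each record has 'owner', 'id', 'longitude' and 'latitude' fields. Ids are compared as integers, because sorting them as strings would place '10' before '9'.

```python
from collections import Counter


def top_owners(photos, n=10):
    """Return the n owners with the most pictures, most active first."""
    counts = Counter(photo["owner"] for photo in photos)
    return [owner for owner, _ in counts.most_common(n)]


def trace_segments(photos):
    """Order one user's photos by id and pair up subsequent locations.

    Returns a list of ((lon1, lat1), (lon2, lat2)) tuples, one per link
    in the trace.
    """
    ordered = sorted(photos, key=lambda photo: int(photo["id"]))
    points = [(float(p["longitude"]), float(p["latitude"])) for p in ordered]
    return list(zip(points, points[1:]))
```

Each segment can then be drawn as a line on the map, using one color per user.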

Reading:
There is no specific reading for today - you'll be finding the
information needed to complete the exercises by searching online.
But the assignments have useful links embedded (I recommend using
them!).