Social Data Modeling

Lecture 2
Page last edited by Vedran Sekara (vese) 12/02-2013

 

Learning Objectives (Specific)

  • Create (simple) visualizations using Python's matplotlib.
  • Download real-world data for visualization.
  • Work with (messy) real-world data to create visualizations.
  • Extend work with APIs to acquire Flickr data for later use. Also try out a convenient Python module for communicating with the API.
  • Create (advanced) geo-visualizations using Python's matplotlib & basemap.

Lecture

I'm traveling this week, so today's lecture is hosted by Vedran, who's a grad student in my lab. We have a number of interesting elements on the program.

  • Edward Tufte's rules for visualization (by guest lecturer Lasse Mølgaard, DTU Compute, Cognitive Systems Section). Reading for this part is: http://neuroscience.telenczuk.pl/wp-content/uploads/2010/10/vis_handout.pdf. (Note that this pdf also has useful notes on matplotlib).
  • Journalism in the age of data. You'll be watching the first part of a great documentary on the role of visualization in modern media (http://datajournalism.stanford.edu). I recommend you check out the remaining parts - lots of info on good tools, etc.
  • Work on exercises. This one goes without saying. The exercises are the center of the class, and as always, if you work your way through the exercises, you'll be well on your way to mastering the curriculum. Today, you'll begin to visualize some data on your own ... and continue the work on the APIs from last time; this will sharpen your ability to interact with Web APIs to find cool data sets for visualization.

Group work

Exercise 1. Play with Python's matplotlib (http://matplotlib.sourceforge.net/) and real data.

  • Go to the Guardian Data Store, http://www.guardian.co.uk/news/datablog+society/alcohol and find the data on alcohol consumption across the globe: The per capita recorded alcohol consumption (litres of pure alcohol) among adults (older than 15 years). The data accompanies the article "Boozers of the world" from 9 March 2009. Download the data to your computer, clean it, and read it into Python. This part is painful & more difficult than one might think. You'll have to decide how to do this: will you get the data ready for Python via Excel, via Google Docs, or by importing it with a Python tool? And how do you handle missing values, etc.? This will be a difficult step for most of you, but that's part of the learning experience - real data is usually messy and requires cleanup.
  • First, find the top 5 drinking countries using Python (I recommend sorting the list of countries in descending order, check out http://wiki.python.org/moin/HowTo/Sorting).
  • What number on the list is Denmark?
  • Generate a simple line-plot of the countries' consumption in descending order using the matplotlib.pyplot (henceforth abbreviated as "plt") command plot.
  • Create a barchart of the same data, using the plt command bar. What is the problem with this visualization?
  • Revise your barchart so as to emphasize the difference between the countries that drink the most and the countries that drink the least. To achieve this goal, construct a barchart of the top 15 and bottom 15 alcohol-consuming countries, with country names on the x-axis. Something like this, but not necessarily as fancy! Note: assignment continues below image.
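If you get stuck on the cleaning and sorting steps, here is a minimal sketch of one way to go about it (Python 3, standard library only). It assumes you have already exported the data to a CSV file; the column names "Country" and "Litres", the "n/a" marker, and all the rows below are made-up placeholders, not the Guardian's actual layout or figures.

```python
import csv
import io

# Made-up placeholder rows, NOT the Guardian figures. The layout
# (a "Country" and a "Litres" column, blanks and "n/a" for missing
# values) is an assumption about what your cleaned export looks like.
SAMPLE_CSV = """Country,Litres
Country A,12.5
Country B,
Country C,3.1
Country D,n/a
Country E,9.8
Country F,0.4
"""


def read_consumption(handle):
    """Return (country, litres) pairs, skipping rows without a usable number."""
    rows = []
    for record in csv.DictReader(handle):
        try:
            litres = float(record["Litres"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        rows.append((record["Country"].strip(), litres))
    return rows


def rank_of(country, ranked):
    """1-based position of a country in an already sorted list."""
    names = [name for name, _ in ranked]
    return names.index(country) + 1


data = read_consumption(io.StringIO(SAMPLE_CSV))
# Sort by the litres value, largest first.
ranked = sorted(data, key=lambda pair: pair[1], reverse=True)

print(ranked[:5])                    # the top of the list
print(rank_of("Country C", ranked))  # position of one country
```

With a real file you would pass open("your_file.csv", newline="") instead of the io.StringIO object. Dropping rows with missing values is only one choice; whether that is appropriate for your analysis is for you to decide.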

Bar chart
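A chart along these lines could be sketched as follows. This is an illustration under assumptions, not the code behind the image: the helper names top_and_bottom and plot_bars are hypothetical, and the 40 "countries" are invented placeholders. It only uses the plt commands bar, xticks, ylabel, figure, tight_layout and show.

```python
def top_and_bottom(ranked, n=15):
    """Keep the first n and last n entries of a list sorted in descending order."""
    if len(ranked) <= 2 * n:
        return list(ranked)  # too short to split, keep everything
    return list(ranked[:n]) + list(ranked[-n:])


def plot_bars(selection, ylabel="Litres of pure alcohol per capita"):
    """Draw one bar per (country, litres) pair with the names on the x-axis."""
    # Imported here so the selection helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    names = [name for name, _ in selection]
    values = [value for _, value in selection]
    positions = list(range(len(selection)))

    plt.figure(figsize=(12, 5))
    plt.bar(positions, values)
    plt.xticks(positions, names, rotation=90)  # vertical labels stay readable
    plt.ylabel(ylabel)
    plt.tight_layout()
    plt.show()


# Placeholder data: 40 invented countries with evenly decreasing values.
ranked = [("Country %02d" % i, 20.0 - 0.5 * i) for i in range(40)]
selection = top_and_bottom(ranked, n=15)
# plot_bars(selection)  # uncomment to draw the chart
```

Cutting out the middle of the list is what makes the contrast visible: with every country on the x-axis, the labels overlap and the extremes are hard to compare.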

Exercise 2. Continuation of playing around with the Flickr API

  • Instead of constructing URLs for each query (as we did last time), it is possible to greatly streamline the process of accessing the Flickr API using the Python module flickrapi. Download and install the package (for example, use easy_install to get the package).

Usage:
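import flickrapi
api_key = 'YOUR_API_KEY'  # placeholder: replace with your own Flickr API key
flickr = flickrapi.FlickrAPI(api_key, cache=True)
# Search for photos carrying these tags and ask for the reply as JSON
photos = flickr.photos_search(tags=['Monty','Python'], format='json')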
  • "photos_search" is only one of many available methods; others can be found here. Just remember to substitute the dot with an underscore when calling the method via the flickrapi package!
  • Download metadata for 2x500 pictures that are all taken in Denmark. First get 500 tagged with words related to the coastline, e.g. tags=['beach','water','coast','coastline','sea',...]. Next get metadata for 500 images tagged with words related to "inland nature", e.g. tags=['forest','nature','grass','green','landscape',...]
  • Hint: Use page and bbox as additional arguments when you search for photos. bbox specifies the geographic bounding box (find one for Denmark using this page) and page the page number of the search results (limited to 250 pictures per page). Save the data to a text file or pickle it.
  • Download the GPS locations for all the pictures using the method photos_geo_getLocation. Again, don't forget to save the data.
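Once a search returns, you still need to turn the reply into Python objects and store them. The sketch below shows one way to do that, assuming the reply arrives as raw JSON text, possibly wrapped in Flickr's default jsonFlickrApi(...) callback. The helper names are hypothetical, and the sample reply is invented (fake ids and titles) to mimic the shape of a photos.search response.

```python
import json
import pickle

PREFIX = "jsonFlickrApi("


def parse_flickr_json(raw):
    """Decode a Flickr JSON reply, with or without the JSONP wrapper."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    raw = raw.strip()
    if raw.startswith(PREFIX) and raw.endswith(")"):
        raw = raw[len(PREFIX):-1]  # strip the wrapper, keep the JSON inside
    return json.loads(raw)


def photos_on_page(raw):
    """Return the list of photo dictionaries from one photos.search reply."""
    reply = parse_flickr_json(raw)
    if reply.get("stat") != "ok":
        raise ValueError("Flickr reported an error: %r" % reply)
    return reply["photos"]["photo"]


def save_photos(photos, path):
    """Pickle the collected metadata so it only has to be downloaded once."""
    with open(path, "wb") as handle:
        pickle.dump(photos, handle)


# Invented reply in the shape photos.search returns; ids and titles are fake.
raw = ('jsonFlickrApi({"photos": {"page": 1, "pages": 1, "perpage": 250, '
       '"total": "2", "photo": [{"id": "1", "title": "beach"}, '
       '{"id": "2", "title": "coast"}]}, "stat": "ok"})')
photos = photos_on_page(raw)
print(len(photos))
```

To reach 500 pictures with 250 per page, you would call the search once with page=1 and once with page=2, extend a single list with the result of photos_on_page each time, and finish with save_photos.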

 

Exercise 3. Play with basemap (http://matplotlib.org/basemap/) to plot geographical Flickr data from last week.

  • Use basemap to create a map of Denmark (examples: http://matplotlib.org/basemap/users/examples.html)
  • Extract latitude and longitude coordinates from all the downloaded Flickr pictures (hint: the returned JSON structure is a dictionary; you can extract the coordinates by using ['photo']['location']['latitude'] and ['photo']['location']['longitude'] as keys).
  • Plot the location of each picture onto the map you've just created, start with the coast line locations. What does the distribution of points resemble?
  • Now (using a new symbol for the datapoints), plot the inland-nature points. Can you tell the difference?
  • We will be working with data from Flickr throughout the course. So for next week, download metadata for 10,000 pictures (use tags=['Denmark','Danmark']). Since Flickr has a max cap of 3600 API requests per hour, you need to build a timer into your Python script: read about the sleep function in the time module, http://docs.python.org/2/library/time.html
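The timer mentioned in the last bullet can be as simple as a loop that sleeps between requests. Here is a minimal sketch, assuming the 3600-per-hour cap quoted above (so one request per second). The throttled helper and fake_fetch are hypothetical names; fake_fetch stands in for whatever API call you actually make.

```python
import time

REQUESTS_PER_HOUR = 3600                    # limit quoted in the exercise
MIN_INTERVAL = 3600.0 / REQUESTS_PER_HOUR   # seconds between two requests


def throttled(items, fetch, min_interval=MIN_INTERVAL,
              sleep=time.sleep, clock=time.time):
    """Call fetch(item) for every item, starting calls at least min_interval apart."""
    results = []
    last_start = None
    for item in items:
        if last_start is not None:
            # Only wait for the part of the interval that has not passed yet.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(item))
    return results


# Stand-in for a real API call; replace with the request you want to make.
def fake_fetch(page):
    return "page %d" % page


pages = throttled([1, 2, 3], fake_fetch, min_interval=0.01)
print(pages)
```

Measuring the time since the previous request, rather than always sleeping a full second, keeps the script from waiting longer than necessary when the request itself is slow.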

 

For inspiration, the figure below shows an example of a map with ~18000 Flickr locations plotted.
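A map of this kind could be sketched roughly as follows. Treat it as a starting point under assumptions rather than the code behind the figure: the bounding box is only an approximate box around Denmark, the helper names are hypothetical, and the sample reply uses placeholder coordinates in the ['photo']['location'] shape described in the hint above.

```python
def extract_lat_lon(reply):
    """Return (latitude, longitude) as floats from one getLocation reply."""
    location = reply["photo"]["location"]
    return float(location["latitude"]), float(location["longitude"])


def plot_points(coast, inland):
    """Plot two lists of (lat, lon) pairs on a map of Denmark."""
    # Imported here so extract_lat_lon works without basemap installed.
    import matplotlib.pyplot as plt
    from mpl_toolkits.basemap import Basemap

    # Rough bounding box around Denmark (an approximation, adjust as needed).
    # resolution="l" is coarse; finer coastlines need the matching data files.
    m = Basemap(projection="merc", resolution="l",
                llcrnrlon=7.5, llcrnrlat=54.4,
                urcrnrlon=15.5, urcrnrlat=58.0)
    m.drawcoastlines()
    m.drawcountries()

    # One marker style per group: blue circles for coast, green triangles inland.
    for points, style in ((coast, "bo"), (inland, "g^")):
        if not points:
            continue
        lats, lons = zip(*points)
        x, y = m(list(lons), list(lats))  # basemap expects longitude first
        m.plot(x, y, style, markersize=3)
    plt.show()


# Invented reply in the shape described above; the coordinates are placeholders.
reply = {"photo": {"id": "1",
                   "location": {"latitude": "55.70", "longitude": "12.50"}}}
print(extract_lat_lon(reply))
# plot_points([extract_lat_lon(reply)], [])  # uncomment to draw the map
```

Note that Flickr delivers the coordinates as strings, so they are converted with float before plotting.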
