Wiki
Social Data Modeling

Lecture 3
Page last edited by Ole Winther (olwi) 21/02-2013

Learning objectives.

  • Practice API skills by working through additional examples.
  • Work with binning of geo-spatial data for alternatives to standard dot-visualization.
  • Begin to work with text extracted from the Flickr API. 
  • Play with simple word-clouds (The statistics underlying the word-clouds will lead towards the work on Machine-Learning coming up in the subsequent lectures).

Program

Today we'll start with a more classical lecture, where I'll discuss my take on visualization as a gateway to understanding data and generating hypotheses. We'll see why the human eye can easily do things that are difficult using mathematics.

For the exercises we're back with Flickr. We'll continue working with data from the Flickr API.

 

Exercise 1. Use hexbin (from the matplotlib package) to bin the geographical locations of the Flickr pictures and plot the result.

  • What are the advantages of binning rather than just plotting the points? Can you think of a specific situation where binning might help you see structure that you can't see by plotting simple x,y coordinates? [Hint: find inspiration here.]
  • Extract the location data from all pictures. If you did not finish the exercise last week, a fast way to obtain the geographical data is to call flickr.photos_search(tags=[tags], bbox=[coordinates for Denmark], page=#, extras='geo', format='json') and iterate over the pages. [Hint: If you can't figure out how to iterate over pages, talk to one of the instructors!]
  • With the data at hand, plot the hexbin over a basemap; remember to use the extent option for hexbin.
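The binning step can be sketched as below. This is a minimal example with plain matplotlib (the basemap layer is omitted for brevity); `lons` and `lats` are assumed to be lists of floats you extracted from the API responses, and the extent used for Denmark's bounding box is approximate.

```python
# Sketch: hexagonal binning of photo coordinates with matplotlib.
# Assumes `lons` and `lats` are lists of floats from the API responses;
# the extent (a rough bounding box for Denmark) is an approximation.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def hexbin_counts(lons, lats, extent=(8.0, 13.0, 54.5, 57.8), gridsize=30):
    """Bin (lon, lat) points into hexagons and return the per-bin counts."""
    fig, ax = plt.subplots()
    # mincnt=1 hides empty bins, so sparse areas stay blank on the map
    hb = ax.hexbin(lons, lats, gridsize=gridsize, extent=extent, mincnt=1)
    fig.colorbar(hb, ax=ax, label="photos per bin")
    fig.savefig("hexbin_denmark.png")
    plt.close(fig)
    return hb.get_array()

# Three synthetic points inside the extent; each lands in some bin.
counts = hexbin_counts([10.2, 10.21, 12.57], [56.1, 56.11, 55.68])
print(int(counts.sum()))  # 3
```

When you add the basemap, pass the same extent to both so the hexagons line up with the coastline.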

 

Exercise 2.  Download tags for all pictures taken in Denmark and tagged with 'Denmark'.

  • To get the tags, import the flickrapi module and use flickr.photos_search with arguments bbox, tags, format, extras, and page to download all tags.
    [Hint: You can use flickr.photos_search(tags='Denmark', bbox=[Coordinates for Denmark], page=k, extras='tags', format='json'). Note that this strategy is very similar to the one in exercise 1; basically, extras='geo' is replaced by extras='tags' in the query. Full documentation for flickr.photos_search is here]
  • Remember to iterate over all the pages by using the argument page=[number].
  • Again, save all the data; you will need it later in the class (when we get to the machine-learning part).
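The page-iteration strategy can be sketched as a small generator. The `search` argument below stands in for the real flickr.photos_search call; the JSON layout (photos → pages/photo) matches what the API returns with format='json' (note that the real API wraps the JSON in a jsonFlickrApi(...) callback unless you also pass nojsoncallback=1).

```python
# Sketch: page through a Flickr-style search endpoint. `search` stands in
# for flickr.photos_search; the response layout matches format='json'.
import json

def iter_photos(search, **params):
    """Yield every photo dict, requesting successive pages until exhausted."""
    page, pages = 1, 1
    while page <= pages:
        resp = json.loads(search(page=page, **params))
        pages = resp["photos"]["pages"]  # total page count reported by the API
        for photo in resp["photos"]["photo"]:
            yield photo
        page += 1

# Usage with a fake search function standing in for the real API call:
def fake_search(page, **params):
    data = {1: [{"id": "1", "tags": "denmark beach"}],
            2: [{"id": "2", "tags": "denmark castle"}]}
    return json.dumps({"photos": {"page": page, "pages": 2, "photo": data[page]}})

photos = list(iter_photos(fake_search, tags="Denmark", extras="tags"))
print(len(photos))  # 2
```

With the real API you would pass a thin wrapper around flickr.photos_search as `search` and then dump `photos` to disk for later lectures.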

 

Exercise 3. Generate a Word Cloud for the photos.

  • Extract all tags from the newly obtained dataset: loop over the files/dictionaries and add the individual tags to a list ... or save them to a file.
  • Use Wordle (or a similar tool; Python packages work too) to create a Word Cloud (disregard 'Denmark', since it appears in every tag list).
  • Which other tags are popular? 
  • Which other things can you learn about DK photos from the word-cloud?
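The counting behind the word cloud can be sketched as below. It assumes each photo dict has a 'tags' field holding a space-separated string, which is how the API returns tags with extras='tags'; the frequency table is what a word-cloud tool scales its words by.

```python
# Sketch: count tag frequencies across the downloaded photos.
# Assumes each photo dict has a space-separated 'tags' string
# (as returned with extras='tags').
from collections import Counter

def tag_counts(photos, skip=("denmark",)):
    """Return a Counter of lowercased tags, ignoring the tags in `skip`."""
    counts = Counter()
    for photo in photos:
        for tag in photo.get("tags", "").lower().split():
            if tag not in skip:
                counts[tag] += 1
    return counts

photos = [{"tags": "denmark beach sea"},
          {"tags": "Denmark beach castle"}]
counts = tag_counts(photos)
print(counts["beach"])  # 2
```

For Wordle you can simply write each tag to a text file, repeated once per occurrence, and paste the file's contents into the tool.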

A note about the optional exercises: it is not necessary to complete them! It is possible to get a perfect grade in this class without completing the optional exercises - they are added for the benefit of students who manage to complete the mandatory exercises and want extra challenges. The optional exercises are typically difficult.

Also note that working on the optional exercises is not a waste of time: correct answers on the optional exercises can make up for incorrect answers on the mandatory ones.

 

Exercise 4. [Optional] Generate region specific Word Clouds.

  • Generate one Word Cloud for Jutland (Jylland) and one for Zealand (Sjælland). 
  • You can either do this by splitting the already downloaded data into two geographic bounding boxes, or call the API again, this time with bounding boxes for the two regions.
  • As before, extract all tags from the data and plot the Word Clouds. Do you observe any differences in tag popularity between the two regions?
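The first option (splitting the already downloaded data) can be sketched as below. The two bounding boxes are rough, illustrative approximations of Jutland and Zealand, not official coordinates; each photo dict is assumed to carry 'longitude' and 'latitude' fields (extras='geo').

```python
# Sketch: split already-downloaded photos into two regions by bounding box.
# The boxes are rough approximations of Jutland and Zealand; assumes each
# photo dict has 'longitude' and 'latitude' strings (extras='geo').
JUTLAND = (8.0, 54.8, 10.9, 57.8)   # (min_lon, min_lat, max_lon, max_lat), approximate
ZEALAND = (10.9, 54.9, 12.7, 56.2)  # approximate

def in_bbox(photo, bbox):
    """True if the photo's coordinates fall inside the bounding box."""
    lon, lat = float(photo["longitude"]), float(photo["latitude"])
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

photos = [{"longitude": "10.2", "latitude": "56.1"},    # near Aarhus
          {"longitude": "12.57", "latitude": "55.68"}]  # near Copenhagen
jutland = [p for p in photos if in_bbox(p, JUTLAND)]
zealand = [p for p in photos if in_bbox(p, ZEALAND)]
print(len(jutland), len(zealand))  # 1 1
```

Once split, feed each region's photos through the same tag-counting step as in exercise 3 to get one word cloud per region.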

 

Exercise 5. [Optional] Create a trace network for the 10 users with the most uploaded pictures.

  • Go through your dataset and count how many pictures belong to each user; the 'owner' field identifies the user.
  • Extract geographical data for the 10 users with the most pictures (use the API). Remember to order the pictures according to their id, which provides an indication of time (you may also acquire the time from the API).
  • Using Basemap and matplotlib, plot a network of the pictures, with links drawn between subsequent pictures.
  • Color the traces of each user differently. What does the network show? Can you distinguish any patterns between the various users?
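The first two bullets can be sketched as below; the plotting itself is left out. Each photo dict is assumed to have 'owner' and 'id' fields, as in the search responses, and sorting by id approximates chronological order (as noted above, the id provides an indication of time).

```python
# Sketch: find the n owners with the most pictures and order each owner's
# pictures by id. Assumes each photo dict has 'owner' and 'id' fields.
from collections import Counter

def top_owner_traces(photos, n=10):
    """Map each of the top-n owners to their pictures, sorted by id."""
    counts = Counter(p["owner"] for p in photos)
    traces = {}
    for owner, _ in counts.most_common(n):
        mine = [p for p in photos if p["owner"] == owner]
        # Flickr ids grow over time, so a numeric sort approximates
        # the order in which the pictures were taken/uploaded.
        traces[owner] = sorted(mine, key=lambda p: int(p["id"]))
    return traces

photos = [{"owner": "a", "id": "30"}, {"owner": "a", "id": "10"},
          {"owner": "b", "id": "20"}]
traces = top_owner_traces(photos, n=1)
print(list(traces))                     # ['a']
print([p["id"] for p in traces["a"]])   # ['10', '30']
```

Each value in `traces` is then one user's ordered trace: draw a line segment between consecutive pictures' coordinates on the basemap, one colour per user.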

 

Reading:

There is no specific reading for today - you'll find the information needed to complete the exercises by searching online. But the assignments have useful links embedded (I recommend using them!).

Support: +45 45 25 74 43