Wiki
Social Data Modeling

Lecture 6
Page last edited by Lasse Valentini Jensen (s082732) 12/03-2013

 

Teacher

Today's session on machine learning is taught by Ole Winther.

 

Learning objectives

  • Work with a specific unsupervised learning method, non-negative matrix factorization, for finding independent features (in text also called topics) in a news corpus and in Flickr images based upon their metadata.
  • Understand the concept of matrix factorization, independent features, supervised and unsupervised learning.
  • Apply matrix multiplication.

Reading

  • Programming Collective Intelligence  (O'Reilly 2007). Toby Segaran. Chapter 10, download [here].

Program

We start with a short lecture about unsupervised learning, matrix multiplication, matrix factorization and finding independent features in data. After that we will work with the material in Chapter 10 of the textbook Programming Collective Intelligence by Toby Segaran. First we will work with an example from the textbook and after that turn to finding independent topics in Flickr data tags using non-negative matrix factorization.
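To give a feel for what the factorization does before the lecture, here is a minimal NumPy sketch of NMF with the standard multiplicative update rules. The toy data matrix and the update code are our own illustration, not the book's nmf.py, but the idea is the same: approximate a non-negative data matrix v by the product of two smaller non-negative matrices w (weights) and h (features).

```python
import numpy as np

# Toy data matrix: 4 "documents" x 5 "words" (non-negative counts).
v = np.array([[2., 0., 1., 0., 0.],
              [3., 1., 0., 0., 0.],
              [0., 0., 0., 2., 1.],
              [0., 0., 1., 3., 2.]])

def nmf(v, k, iters=200, eps=1e-12):
    """Factorize v (m x n) into w (m x k) and h (k x n), all non-negative."""
    rng = np.random.default_rng(0)
    m, n = v.shape
    w = rng.random((m, k))
    h = rng.random((k, n))
    for _ in range(iters):
        # Multiplicative updates (Lee & Seung); eps avoids division by zero.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

w, h = nmf(v, k=2)
print(np.round(w @ h, 2))  # should roughly reconstruct v
```

With k=2 the two rows of h end up as two "topics" (roughly words 0-2 vs words 3-4 in this toy data), and each row of w says how much of each topic a document contains.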

 

Exercise 1.  Make sure you have read today’s text Programming Collective Intelligence Chapter 10 pages 226-249  [can be downloaded above].  Answer the following questions in your own words (the answer to Exercise 1 should not exceed two pages).

  • Apply matrix multiplication to a small example (similar to Figure 10-2) that you create yourself. Transpose the result. Show that (AB)^T = B^T A^T holds for your example.
  • Explain what happens in Figure 10-6. Explain what a feature is.
  • Explain on a high level how the two matrices in matrix factorization are fitted. What difference do we minimize? What is the additional constraint that we use in non-negative matrix factorization? How many features should we use relative to the dimensions of the data matrix (more or fewer)? Hint: Do we risk overfitting?
  • Explain on a high level what we hope to get out of applying the algorithm to the news corpus (pages 227-229).
  • Explain on a high level what we hope to get out of applying the algorithm to the stock data (pages 243-245).
  • What are the key differences between this method (unsupervised approach) and the classifiers we have studied previously (supervised), according to the book (as discussed in the beginning of the chapter)?
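The first bullet above can be checked numerically. Below is a small made-up example (in the spirit of Figure 10-2) of matrix multiplication and the transpose identity:

```python
import numpy as np

# A small example: A is 2x3, B is 3x2, so AB is 2x2.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 4],
              [2, 5],
              [3, 6]])

AB = A @ B
print(AB)                                # [[14 32]
                                         #  [32 77]]
# Check the transpose identity (AB)^T = B^T A^T.
print(np.array_equal(AB.T, B.T @ A.T))   # → True
```

For the written answer you should of course also show the entry-by-entry calculation by hand, not only the NumPy check.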

 

Exercise 2.   In this exercise you should work your way through the real data example in the book on analyzing the news corpus (pages 227-243).

Hint: If you get NaNs or an error when running the NMF factorization on the feed words, try replacing
w=matrix(array(w)*array(wn)/array(wd))
with
w=matrix(array(w)*array(wn)/(array(wd)+0.0000000000001))
in nmf.py.
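To see why the small epsilon helps, here is a toy illustration (made-up arrays, not the actual variables from nmf.py) of the 0/0 division that produces the NaNs:

```python
import numpy as np

w  = np.array([[1.0, 1.0]])
wn = np.array([[0.0, 2.0]])
wd = np.array([[0.0, 4.0]])   # a zero in the denominator

print(w * wn / wd)            # first entry is nan (0/0)
print(w * wn / (wd + 1e-13))  # epsilon turns it into 0.0 instead
```

Once a NaN appears in w it propagates through all later updates, which is why the whole factorization breaks down.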

  • Reproduce the predictions given on pages 242 and 243. Display predictions for additional documents.
  • Interpret these results and discuss, as in the book, how the features (topics) can give additional information not present in the original documents.
  • Are all words informative? Why do we sometimes remove very common words (so-called stop words)? Will very rare words be informative?
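As a starting point for the last bullet, this is a minimal sketch of filtering word counts before building the matrix. The word list, stop-word set and threshold are made up for illustration:

```python
# Hypothetical word counts from a feed.
counts = {'the': 120, 'nmf': 3, 'topic': 8, 'and': 95, 'xyzzy': 1}

stopwords = {'the', 'and', 'of', 'a'}   # very common, carry little topic signal
min_count = 2                           # very rare words are mostly noise

filtered = {w: c for w, c in counts.items()
            if w not in stopwords and c >= min_count}
print(filtered)  # → {'nmf': 3, 'topic': 8}
```

Note that the book takes a related approach: it only keeps words that appear in more than 3 articles but in fewer than 60% of them.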

 

Exercise 3. Finding independent features in Flickr metadata. The basic idea is that you should use the same approach as in Exercise 2 on the data you got from the Flickr API.

  • Use features based on tags (or descriptions; these will need to be downloaded using flickr.photos_search and the "extras" option), represent the data in a matrix and apply the non-negative matrix factorization algorithm.
  • Interpret the features that you get. Do they make sense?
  • Advanced question: Can we use the topics we get for unsupervised classification? That is, inspect whether the documents that have a high weight in a certain topic all belong to the same category (this could be a geographical location or something else).
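The first step is turning tag lists into a photos-by-tags matrix, analogous to the article-by-word matrix in the chapter. A minimal sketch, with hypothetical tag lists standing in for real Flickr API output:

```python
import numpy as np

# Hypothetical tag lists, one per photo, as you would get from the Flickr API.
photo_tags = [
    ['copenhagen', 'harbour', 'boat'],
    ['copenhagen', 'bike', 'street'],
    ['harbour', 'boat', 'sea'],
]

# Vocabulary: every tag that appears, in a fixed order.
vocab = sorted({t for tags in photo_tags for t in tags})

# Photos-by-tags matrix of 0/1 occurrences.
v = np.array([[1 if t in tags else 0 for t in vocab]
              for tags in photo_tags], dtype=float)

print(vocab)
print(v)
# v can now be passed to the factorization code from the chapter
# to extract tag "topics" in the same way as for the news corpus.
```

With your real Flickr data the matrix will be much larger and sparser, so the word-filtering ideas from Exercise 2 (dropping tags that are too common or too rare) apply here as well.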