Teacher
Today's session on machine learning is taught by Ole Winther.
Learning objectives
- Work with a specific unsupervised learning method, non-negative
matrix factorization, for finding independent features (in text
also called topics) in a news corpus and in Flickr images based on
their metadata.
- Understand the concepts of matrix factorization, independent
features, and supervised and unsupervised learning.
- Apply matrix multiplication.
Reading
- Toby Segaran, Programming Collective Intelligence (O'Reilly,
2007), Chapter 10. Download [here].
Program
We start with a short lecture on unsupervised learning,
matrix multiplication, matrix factorization and finding independent
features in data. After that we will work with the material in
Chapter 10 of the textbook Programming Collective Intelligence by
Toby Segaran. First we work through an example from the textbook,
and then we turn to finding independent topics in Flickr tag data
using non-negative matrix factorization.
Exercise 1. Make sure you have read today's text,
Programming Collective Intelligence Chapter 10, pages 226-249
[can be downloaded above]. Answer the following questions in
your own words (the answer to Exercise 1 should not exceed two
pages).
- Apply matrix multiplication to a small example (similar to
Figure 10-2) that you create yourself. Transpose the result. Show
that (AB)^T = B^T A^T holds for your example (see the NumPy sketch
after this list).
- Explain what happens in Figure 10-6. Explain what a feature
is.
- Explain at a high level how the two matrices in matrix
factorization are fitted. Which difference is it that we minimize?
What is the additional constraint that we use in non-negative
matrix factorization? How many features should we use relative to
the dimensions of the data matrix (more or fewer)? Hint: Do we risk
overfitting? (A sketch of the update rules appears under the hint
in Exercise 2.)
- Explain at a high level what we hope to get out of applying the
algorithm to the news corpus (pages 227-229).
- Explain at a high level what we hope to get out of applying the
algorithm to the stock data (pages 243-245).
- What are the key differences between this method (an unsupervised
approach) and the classifiers we have studied previously
(supervised), according to the book (as discussed at the beginning
of the chapter)?
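For the first question, the check takes only a few lines of NumPy.
This is a minimal sketch with made-up matrices (not the ones from
Figure 10-2); any pair of compatibly shaped matrices works:

    import numpy as np

    # Two small made-up matrices with compatible shapes.
    A = np.array([[1, 2, 3],
                  [4, 5, 6]])              # 2 x 3
    B = np.array([[7, 8],
                  [9, 10],
                  [11, 12]])               # 3 x 2

    AB = np.dot(A, B)                      # matrix product, 2 x 2
    print(AB)

    # Transposing the product should give B^T A^T.
    print(np.array_equal(AB.T, np.dot(B.T, A.T)))   # prints True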
Exercise 2. In this exercise you should
work your way through the real-data example in the book on
analyzing the news corpus (pages 227-243).
Hint: If you get NaNs or an error when running the NMF
factorization on the feed words, try replacing
w=matrix(array(w)*array(wn)/array(wd))
with
w=matrix(array(w)*array(wn)/(array(wd)+0.0000000000001))
in nmf.py.
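For reference, here is a minimal sketch in plain NumPy of the
multiplicative updates (due to Lee and Seung) that the book's
nmf.py implements; the small constant eps plays the same role as
the constant added in the fix above. Variable names and defaults
are illustrative, not the book's:

    import numpy as np

    def factorize(V, k, iterations=50, eps=1e-9):
        # Approximate V (documents x words) as W (documents x k)
        # times H (k x words), keeping every entry non-negative.
        m, n = V.shape
        W = np.random.rand(m, k)
        H = np.random.rand(k, n)
        for _ in range(iterations):
            # Multiplicative updates: entries are rescaled, never
            # made negative, so non-negativity is preserved.
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        cost = np.sum((V - W @ H) ** 2)  # squared reconstruction error
        return W, H, cost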
- Reproduce the predictions given on pages 242 and 243. Display
predictions for additional documents.
- Interpret these results and discuss, as in the book, how the
features (topics) can give additional information not present in
the original document.
- Are all words informative? Why do we sometimes remove very common
words (so-called stop words)? Will very rare words be informative?
(See the filtering sketch after this list.)
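For the stop-word question, a hedged sketch of document-frequency
filtering: words that occur in almost every document behave like
stop words, and words that occur in almost none carry little
statistical weight. The thresholds below are illustrative, not the
book's exact values:

    import numpy as np
    from collections import Counter

    def make_matrix(docs, min_docs=3, max_frac=0.6):
        # docs: list of word lists, one per document. Keep words
        # appearing in at least min_docs documents but in at most
        # a fraction max_frac of all documents.
        doc_freq = Counter()
        for words in docs:
            doc_freq.update(set(words))
        n = len(docs)
        vocab = [w for w, c in doc_freq.items()
                 if c >= min_docs and c <= max_frac * n]
        index = {w: j for j, w in enumerate(vocab)}
        M = np.zeros((n, len(vocab)))
        for i, words in enumerate(docs):
            for w in words:
                if w in index:
                    M[i, index[w]] += 1  # word counts per document
        return M, vocab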
Exercise 3. Finding independent features in Flickr
metadata. The basic idea is that you should use the same approach
as in Exercise 2 on the data you got from the Flickr API.
- Use features based on tags (or descriptions; these will need to
be downloaded using flickr.photos_search and the "extras" option),
represent the data in a matrix and apply the non-negative matrix
factorization algorithm (see the sketch after this list).
- Interpret the features that you get. Do they make sense?
- Advanced question: Can we use the topics we get for unsupervised
classification? That is, inspect whether the documents that have a
high weight in a certain topic all belong to the same category
(this could be geographical location or something else).
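A hedged end-to-end sketch, assuming the tags have already been
downloaded (the download step depends on your Flickr client) and
stored as one space-separated tag string per photo; factorize and
make_matrix are the sketches given under Exercise 2:

    import numpy as np

    # photo_tags: list of (photo_id, "tag1 tag2 ...") pairs, assumed
    # to have been collected via flickr.photos_search with the
    # "extras" option. This variable is a placeholder for your data.
    docs = [tags.lower().split() for _, tags in photo_tags]

    M, vocab = make_matrix(docs)      # photos x tags count matrix
    W, H, cost = factorize(M, k=10)   # 10 features; try other values

    # Print the strongest tags for each feature (topic).
    for f in range(H.shape[0]):
        top = np.argsort(H[f])[::-1][:6]
        print('feature %d:' % f, [vocab[j] for j in top])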