Social Data Modeling

Assignment 2
Page last edited by Ole Winther (olwi) 12/03-2013

 

Formalia: Please read http://dtu.cnwiki.dk/02822/page/666/assignments-grading carefully before proceeding. That page contains information about formatting (including size restrictions), group sizes, and many other aspects of handing in the assignment. If you fail to follow these simple instructions, it will negatively affect your grade!

 

Due date and time: The assignment is due on Monday March 18th at 23:59.

 

Exercise 1a.  Make sure you have read Programming Collective Intelligence Chapter 6, pages 117-141 (excluding pages 127-131 and 138-139). Answer the following questions in your own words.

  • Explain the concept of classification. Come up with an example where one can use a classifier. What kind of features are meaningful to use in this example?
  • What are the features we use for document classification?
  • What is naïve about the naïve Bayes classifier? Is the assumption reasonable in the example you came up with above?
  • Give an example of the use of Bayes’ theorem. Explain how we calculate/set each term in the category-and-document setting using the training data and the category prior. Does the category prior have the same type of effect as the assumed probability (defined on page 122)? (A code sketch of these quantities follows this list.)
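
To make the terms concrete, here is a minimal sketch of a naive Bayes classifier in the spirit of PCI Chapter 6. The class name, the example tags, and the weight/assumed-probability defaults of 1.0/0.5 are illustrative choices for this sketch, not the book's exact code.

    from collections import defaultdict

    class TinyNaiveBayes:
        def __init__(self, weight=1.0, assumed=0.5):
            self.fc = defaultdict(lambda: defaultdict(int))  # word -> category -> count
            self.cc = defaultdict(int)                       # category -> document count
            self.weight, self.assumed = weight, assumed

        def train(self, words, cat):
            self.cc[cat] += 1
            for w in set(words):
                self.fc[w][cat] += 1

        def weighted_prob(self, w, cat):
            # P(word | category), smoothed towards the assumed probability (page 122)
            basic = self.fc[w][cat] / self.cc[cat]
            total = sum(self.fc[w].values())
            return (self.weight * self.assumed + total * basic) / (self.weight + total)

        def classify(self, words):
            ndocs = sum(self.cc.values())
            scores = {}
            for cat in self.cc:
                p = self.cc[cat] / ndocs             # the category prior P(category)
                for w in set(words):
                    p *= self.weighted_prob(w, cat)  # the "naive" independence assumption
                scores[cat] = p
            return max(scores, key=scores.get)

    nb = TinyNaiveBayes()
    nb.train(['beach', 'sand', 'sea'], 'nature')
    nb.train(['street', 'cafe', 'building'], 'city')
    print(nb.classify(['sea', 'sand']))  # -> 'nature'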

Exercise 1b.  Make sure you have read Programming Collective Intelligence Chapter 7, pages 142-166 (excluding pages 151-153). Answer the following questions in your own words.

  • The decision tree is a classifier just like the naïve Bayes classifier we studied last week. What is the main attraction of the decision tree, according to the book?
  • Describe the process of building the tree from training data. Describe how to choose which variable to split on. Describe Gini impurity and entropy (see the sketch after this list). Describe the recursive tree-building process.
  • What is overfitting? How can pruning of the tree as described in the book cure overfitting?
  • What are missing values? How does the decision tree deal with missing values? How would you deal with missing values in the naïve Bayes classifier?
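
As a reference point, here is a small sketch of the two impurity measures computed on a list of class labels; the example labels are made up for illustration.

    from math import log2
    from collections import Counter

    def gini_impurity(labels):
        # Probability that two items drawn at random carry different labels
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        # Expected number of bits needed to encode a label
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    labels = ['nature', 'nature', 'city', 'nature']
    print(gini_impurity(labels))  # 0.375
    print(entropy(labels))        # ~0.811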

Exercise 1c.  Make sure you have read Programming Collective Intelligence Chapter 10, pages 226-249. Answer the following questions in your own words.

  • Explain in your own words the principle of how the two matrices in non-negative matrix factorization are fitted (no math needed; a code sketch follows this list). Which difference do we minimize? What is the additional constraint that we use in non-negative matrix factorization? Should we use more or fewer features than the dimensions of the data matrix? Hint: do we risk overfitting?
  • Explain at a high level what we hope to get out of applying the algorithm to the news corpus (pages 227-229).
  • Explain at a high level what we hope to get out of applying the algorithm to the stock data (pages 243-245).
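
For concreteness, here is a minimal sketch of the classic multiplicative updates (Lee and Seung) for factorizing a non-negative matrix V into W times H; the matrix sizes, iteration count, and random toy data are arbitrary choices for illustration.

    import numpy as np

    def nmf(V, k, iters=200, eps=1e-9):
        """Factorize V (items x words) into W (items x features) and
        H (features x words), with all entries kept non-negative."""
        m, n = V.shape
        rng = np.random.default_rng(0)
        W, H = rng.random((m, k)), rng.random((k, n))
        for _ in range(iters):
            # Each multiplicative update keeps the factors non-negative
            # and does not increase the squared error ||V - WH||^2
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    V = np.random.default_rng(1).random((6, 8))  # toy non-negative data matrix
    W, H = nmf(V, k=2)
    print(np.linalg.norm(V - W @ H))             # reconstruction error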

Exercise 2. Naïve Bayes for classification of Flickr data. 

  • Choose the categories you want to classify. Use as categories either a few geographical locations (for example Zealand and Jutland) or types of landscape (nature, city, etc.). In the first case you can use a bounding box to validate the category; in the second case you need to manually open and inspect the photos. (A sketch of gathering such data follows this list.)
  • Use features based on tags (or descriptions). Discuss the features you have chosen. Do you expect them to be informative? Why?
  • Does the naive Bayes classifier perform as expected? How many examples from each category do you need to get stable results?
  • There exist many methods for extracting features from images, such as local color and shape features. We will not use them in this exercise. If we had them, would they be useful for the classification problem you have set up?
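
One way to gather labeled examples is sketched below using the flickrapi package; the bounding-box coordinates for Jutland and Zealand are rough guesses, API_KEY and API_SECRET are placeholders you must fill in yourself, and the helper name tagged_photos is our own. The (tags, category) pairs can be fed straight to the classifier sketched under Exercise 1a.

    import flickrapi

    API_KEY, API_SECRET = 'your-key', 'your-secret'  # placeholders
    flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET, format='parsed-json')

    BBOXES = {  # min_lon,min_lat,max_lon,max_lat -- approximate!
        'jutland': '8.0,54.8,10.9,57.8',
        'zealand': '10.9,54.9,12.7,56.1',
    }

    def tagged_photos(category, pages=2):
        """Yield (tag list, category) pairs for photos inside the box."""
        for page in range(1, pages + 1):
            rsp = flickr.photos.search(bbox=BBOXES[category], extras='tags',
                                       per_page=100, page=page)
            for photo in rsp['photos']['photo']:
                tags = photo['tags'].split()
                if tags:
                    yield tags, category

    # for tags, cat in tagged_photos('jutland'):
    #     nb.train(tags, cat)  # nb from the Exercise 1a sketch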

Exercise 3. Decision tree for classification of Flickr data.

  • Choose the categories you want to classify, following the same approach as in Exercise 2: either a few geographical locations (validated with a bounding box) or types of landscape (validated by manually opening and inspecting the photos).
  • Use the same features as in Exercise 2 to train the decision tree classifier.
  • Interpret the trees you get in the same way as in the book. 
  • Validate the trained model on test data. Discuss some cases where the model worked and some where it didn’t. Compare with the predictions you get using the naive Bayes classifier. Are they generally in agreement? (A sketch of training and validating such a tree follows this list.)
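
Below is a minimal sketch of training and validating a decision tree on tag features, here via scikit-learn's DecisionTreeClassifier rather than the book's treepredict module; the toy documents and the binary tag-presence encoding are illustrative choices.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    docs = ['beach sea sand', 'forest lake tree', 'street cafe building',
            'harbour ship sea', 'square church tower', 'field cow tree']
    labels = ['nature', 'nature', 'city', 'nature', 'city', 'nature']

    vec = CountVectorizer(binary=True)  # each tag becomes a present/absent feature
    X = vec.fit_transform(docs)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33,
                                              random_state=0)

    tree = DecisionTreeClassifier(criterion='gini', random_state=0)
    tree.fit(X_tr, y_tr)
    print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
    print('test accuracy:', tree.score(X_te, y_te))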

Exercise 4. Finding independent features in Flickr data. 

  • Use features based on tags (or descriptions; these will need to be downloaded using flickr.photos_search and the "extras" option), represent the data in a matrix, and apply the non-negative matrix factorization algorithm (see the sketch after this list).
  • Interpret the features that you get. Do they make sense?
  • Can we use the topics we get for unsupervised classification, that is, inspect whether the documents that have a high weight in a certain topic all belong to the same category (a geographical location or something else)? Can you see similarities with the decision trees in some of the feature vectors?
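
One way to set this up, sketched below, is to build a photos-by-tags count matrix with scikit-learn and factorize it with its NMF implementation; the toy tag strings stand in for the downloaded data, and k=2 features is an arbitrary choice.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['beach sea sand', 'street cafe building',
            'harbour ship sea', 'square church building']

    vec = CountVectorizer()
    V = vec.fit_transform(docs)  # photos x tags count matrix

    model = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
    W = model.fit_transform(V)   # weight of each feature in each photo
    H = model.components_        # weight of each tag in each feature

    words = vec.get_feature_names_out()
    for f, row in enumerate(H):
        top = np.argsort(row)[::-1][:3]
        print('feature %d:' % f, [words[i] for i in top])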

Exercise 5. Summing up what machine learning can be used for. 

  • Very briefly, one can say that machine learning is about learning from data. Discuss, with the applications we have worked with in these exercises in mind, the advantages of learning from data. Given that you want to solve these tasks, can you come up with alternative non-machine-learning approaches?
  • Which of the three methods we have learned about are predictive (that is for classification) and which aim at learning something about the features in the data? [Hint: A method may achieve both.] 
  • What are the key differences, according to the book (as discussed at the beginning of Chapter 10), between non-negative matrix factorization (an unsupervised approach) and the classifiers we have studied previously (supervised)?

 

 
