Social Data Modeling

Lecture 4

Page last edited by Sune Lehmann Jørgensen (sljo) 11/03-2013

Teacher

Today's lecture is on machine learning. The lecturer will be machine learning expert Ole Winther.

Learning objectives

Get a first introduction to what machine learning is and what it can do for us.
Work with a specific method, naïve Bayes, for classification of documents and Flickr images based upon their metadata.
Understand a) the concept of classification, b) how feature information is combined in the naïve Bayes classifier, c) why some features provide discriminative information and others do not, and d) how the naïve Bayes classifier is tuned using training data.

Reading

Programming Collective Intelligence (O'Reilly 2007). Toby Segaran. Chapter 6, download [here].

Program

Today’s session will start with a short lecture about machine learning and naïve Bayes classification. After that we will work with the material in Chapter 6 of the textbook Programming Collective Intelligence by Toby Segaran. First we will work through an example from the textbook, and then we will turn to a more open-ended exercise: finding various ways to divide the Flickr data using the classifier.

Exercise 1. Make sure you have read today’s text, Programming Collective Intelligence Chapter 6, pages 117-141 (excluding pages 127-131 and 138-139) [can be downloaded above]. Answer the following questions in your own words (the answer to Exercise 1 should not exceed two pages).

Explain the concept of classification. Come up with an example where one can use a classifier. What kind of features are meaningful to use in this example?
What are the features we use for document classification?
Explain the process of calculating (conditional) probabilities. You may construct an example with two features and two categories.
Why do we need to start with a reasonable guess? Explain, for example with an equation, how to combine the assumed probability with the frequency (empirical probability).
What is naïve about the naïve Bayes classifier? Is the assumption reasonable in the example you came up with above?
Give an example of the use of Bayes’ theorem. Explain how we calculate/set each term in the Category and Document setting using the training data and the category prior. Does the category prior have the same type of effect as the assumed probability?
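As a starting point for the questions above, the quantities involved can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code from the book: the class name `NaiveBayes`, its method names, and the defaults `weight=1.0` and `assumed_prob=0.5` are chosen here for illustration. The weighted probability is computed as (weight × assumed probability + total count of the feature × empirical probability) / (weight + total count), and a document's score for a category is the category prior times the product of the weighted feature probabilities.

```python
from collections import defaultdict


class NaiveBayes:
    """Tiny naive Bayes document classifier (illustrative sketch only)."""

    def __init__(self, weight=1.0, assumed_prob=0.5):
        self.weight = weight              # how strongly the initial guess counts
        self.assumed_prob = assumed_prob  # the "reasonable guess" for unseen features
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # feature -> category -> count
        self.category_counts = defaultdict(int)                      # category -> number of documents

    @staticmethod
    def features(text):
        # One feature per distinct lowercase word in the document.
        return set(text.lower().split())

    def train(self, text, category):
        for f in self.features(text):
            self.feature_counts[f][category] += 1
        self.category_counts[category] += 1

    def feature_prob(self, f, category):
        # Empirical Pr(feature | category): share of the category's documents containing f.
        n = self.category_counts.get(category, 0)
        if n == 0:
            return 0.0
        return self.feature_counts.get(f, {}).get(category, 0) / n

    def weighted_prob(self, f, category):
        # Blend the assumed probability with the empirical one; the more often
        # the feature has been seen (in any category), the less the guess matters.
        basic = self.feature_prob(f, category)
        total = sum(self.feature_counts.get(f, {}).values())
        return (self.weight * self.assumed_prob + total * basic) / (self.weight + total)

    def category_prior(self, category):
        return self.category_counts[category] / sum(self.category_counts.values())

    def score(self, text, category):
        # Pr(category) * product of Pr(feature | category). The product is the
        # "naive" step: features are treated as independent given the category.
        p = self.category_prior(category)
        for f in self.features(text):
            p *= self.weighted_prob(f, category)
        return p

    def posteriors(self, text):
        # Normalise the scores so they sum to one (Bayes' theorem up to Pr(document)).
        scores = {c: self.score(text, c) for c in self.category_counts}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()} if z > 0 else scores


# A few toy training sentences with two categories.
nb = NaiveBayes()
nb.train("nobody owns the water", "good")
nb.train("the quick rabbit jumps fences", "good")
nb.train("buy pharmaceuticals now", "bad")
nb.train("make quick money at the online casino", "bad")
nb.train("the quick brown fox jumps", "good")

print(nb.feature_prob("quick", "good"))
print(nb.weighted_prob("money", "good"))
print(nb.posteriors("quick rabbit"))
```

Comparing `feature_prob` and `weighted_prob` for a word that occurs only once shows why the initial guess is needed: without it, a single training document would push the probability all the way to 0 or 1.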

Exercise 2. In this exercise you should work your way through the real data example in the book on filtering blog feeds (pages 134-136).

Reproduce the predictions given on the top of page 136. Try out a few more examples coming up with your own categories and words. Try also multiple words as input.
Perform a sensitivity analysis on the setting of the assumed probabilities to see how much this affects the predicted probabilities. Hint: Exercise 1 on page 140 tells you where to change the assumed probabilities. What is a good strategy for setting them?
Discuss whether single word features are enough to make a good classifier. Give an example where it is not. Go through the section Improving Feature Detection and discuss whether there is something you can use from it in your specific example.
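One possible way to set up the sensitivity analysis is sketched below. It does not use the book's classifier; instead it assumes a stand-alone helper, `weighted_prob`, that implements the same kind of blend between an assumed probability and an observed frequency. The function names, the grid of values, and the example counts are all hypothetical and only meant to show the shape of the experiment.

```python
def weighted_prob(count_in_cat, docs_in_cat, total_count, weight=1.0, assumed_prob=0.5):
    """Blend an assumed probability with the empirical Pr(feature | category).

    count_in_cat: documents of the category that contain the feature
    docs_in_cat:  documents in the category
    total_count:  documents (in any category) that contain the feature
    """
    basic = count_in_cat / docs_in_cat if docs_in_cat else 0.0
    return (weight * assumed_prob + total_count * basic) / (weight + total_count)


def sweep(count_in_cat, docs_in_cat, total_count, assumed_probs, weights):
    """Return {(assumed_prob, weight): weighted probability} over a grid."""
    return {
        (ap, w): weighted_prob(count_in_cat, docs_in_cat, total_count, w, ap)
        for ap in assumed_probs
        for w in weights
    }


def spread(results):
    """Difference between the largest and smallest value in a sweep."""
    values = list(results.values())
    return max(values) - min(values)


assumed_probs = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [0.5, 1.0, 2.0, 5.0]

# Hypothetical counts: a rare word (seen once) and a common word (seen 18 times).
rare = sweep(1, 20, 1, assumed_probs, weights)
common = sweep(15, 20, 18, assumed_probs, weights)

for (ap, w), p in sorted(rare.items()):
    print(f"rare word   ap={ap:.1f} weight={w:.1f} -> {p:.3f}")
print("spread for rare word:  ", round(spread(rare), 3))
print("spread for common word:", round(spread(common), 3))
```

The same loop can be wrapped around the predictions of the real classifier; the point of the sketch is that the setting matters most for features with few observations.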

Exercise 3. Classification of Flickr data. This is a more open-ended exercise. The basic idea is that you should use the same approach as in Exercise 2 on the data you got from the Flickr API.

Choose the categories you want to classify. Use as categories either a few geographical locations (for example Zealand and Jutland) or types of landscape (nature and city, etc.). In the first case you can use a bounding box to validate the category; in the second case you need to open and inspect the photos manually.
Use features based on tags (or on descriptions, which will need to be downloaded using flickr.photos_search and the "extras" option). Discuss the features you have chosen. Do you expect them to be informative? Why?
Does the classifier perform as expected? How many examples from each category do you need to get stable results?
There exist many methods for extracting features from images such as local color and shape features. We will not use them in this exercise. If we had them, would they be useful for the classification problem you have set up?
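The labelling and feature steps for the geographical variant could look roughly like the sketch below. It rests on several assumptions that should be checked against your own data: each photo is taken to be a dict with a space-separated `tags` string and `latitude`/`longitude` fields, the helper names `region_of`, `tag_features` and `labeled_examples` are hypothetical, and the bounding boxes are rough illustrative values, not authoritative borders.

```python
from itertools import combinations

# Rough, illustrative bounding boxes as (min_lon, min_lat, max_lon, max_lat).
# These numbers are assumptions for the sketch; adjust them before real use.
REGIONS = {
    "jutland": (8.0, 54.8, 10.6, 57.8),
    "zealand": (11.0, 54.9, 12.8, 56.2),
}


def region_of(lat, lon):
    """Return the name of the first region whose box contains the point, else None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in REGIONS.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return name
    return None


def tag_features(photo, pairs=False):
    """Turn a photo's tags into a set of features.

    With pairs=True, every unordered pair of tags is added as an extra
    feature of the form "tag1+tag2" (tags sorted alphabetically).
    """
    tags = sorted(set(photo.get("tags", "").lower().split()))
    feats = set(tags)
    if pairs:
        feats |= {a + "+" + b for a, b in combinations(tags, 2)}
    return feats


def labeled_examples(photos, pairs=False):
    """Yield (features, category) for photos that fall inside one of the boxes."""
    for photo in photos:
        category = region_of(float(photo["latitude"]), float(photo["longitude"]))
        if category is not None:
            yield tag_features(photo, pairs), category


# Hypothetical photos, shaped like the assumed input format.
photos = [
    {"tags": "Copenhagen harbour boat", "latitude": "55.68", "longitude": "12.57"},
    {"tags": "aarhus beach sunset", "latitude": "56.16", "longitude": "10.20"},
    {"tags": "london bridge", "latitude": "51.50", "longitude": "-0.12"},
]

for feats, category in labeled_examples(photos):
    print(category, sorted(feats))
```

Each (features, category) pair can then be fed to the classifier from Exercise 2. Note that place-name tags will probably dominate a location classifier, which is worth discussing when you judge how informative the features are.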

Helpful links

https://github.com/cataska/programming-collective-intelligence-code
