Social Data Modeling

Lecture 5
Page last edited by Ole Winther (olwi) 07/03-2014

 

Teacher

Today's session on machine learning is taught by Ole Winther.

 

Learning objectives

  • Work with a specific method, decision trees, for classification of housing prices and Flickr images based upon their metadata.
  • Understand the concept of decision tree classification, training of decision trees, overfitting and missing values.

Reading

  • Programming Collective Intelligence  (O'Reilly 2007). Toby Segaran. Chapter 7, download [here].

Program

Today’s session and exercise will have exactly the same structure as last week’s. We start with a short lecture about machine learning and decision trees. After that we will work with the material in Chapter 7 of the textbook Programming Collective Intelligence by Toby Segaran. First we work through an example from the textbook, and then turn to a more open-ended exercise: finding various ways to divide and interpret the Flickr data using the decision tree classifier.

 

Exercise 1.  Make sure you have read today’s text, Programming Collective Intelligence Chapter 7, pages 142-166 (excluding pages 151-153) [can be downloaded above].  Answer the following questions in your own words (the answer to Exercise 1 should not exceed two pages).

  • We are given a decision tree like the one in Figure 7-1 or 7-4. Explain the process of classifying a new observation with it.
  • The decision tree is a classifier just like the naïve Bayes classifier we studied last week. What is the main attraction of the decision tree according to the book? [Hint: see both the beginning and the end of Chapter 7.]
  • Describe the process of building the tree from training data. Describe how to choose which variable to split on. Describe Gini impurity and entropy. Describe the recursive tree building process.
  • What is overfitting? How can pruning of the tree as described in the book cure overfitting?
  • What are missing values? How does the decision tree deal with missing values? How would you deal with missing values in the naïve Bayes classifier?
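The two impurity measures mentioned above can be sketched in a few lines of Python (a minimal illustration for your written answer, not the book's treepredict.py code, which works on count dictionaries instead of label lists):

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Probability that two items drawn at random (with replacement)
    from the set have different labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ['city', 'city', 'nature', 'nature', 'nature']
print(gini_impurity(labels))  # ≈ 0.48
print(entropy(labels))        # ≈ 0.971
```

Both measures are zero for a pure set (all labels equal) and largest when the labels are evenly mixed; tree building picks the split that lowers them the most.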

 

Exercise 2.   In this exercise you should work your way through the real data example in the book on modeling home prices (pages 158-161).

  • Reproduce the predictions given on the top of page 161.  

                  Note: you may hit a NoneType error when reproducing the code. getaddressdata() returns None when the data from the URL doesn't match the pattern in the try block. This can be fixed by adding an if-statement in getpricelist() that checks whether the returned data is None.

  • The house price is a continuous number and we have only learned to work with categorical outcomes. How is that problem solved in this example?
  • Try to interpret some parts of the tree. Does it give reasonable results, or perhaps sensible but surprising ones? Could some of the results be due to overfitting?
  • Advanced question: We can test the trained model’s ability to generalize to new data by holding back some data for testing. Take out, for example, 20% of the data for testing on a tree trained on the remaining 80% of the data. Comment on the test accuracy.
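The NoneType fix noted under the first bullet amounts to skipping failed lookups. A self-contained sketch of the pattern (getaddressdata here is a stand-in stub for the book's Zillow scraper, which returns None when the page doesn't match the expected pattern):

```python
# Stand-in stub for the book's getaddressdata(); the real function
# scrapes a URL and returns None when the pattern match fails.
def getaddressdata(addressline, city):
    if 'bad' in addressline:
        return None
    return ('use_code', 1000, 1950, 2, 1, 'SFR', 300000)

def getpricelist(lines):
    rows = []
    for line in lines:
        data = getaddressdata(line.strip(), 'Cambridge,MA')
        if data is None:  # skip failed lookups instead of crashing
            continue
        rows.append(data)
    return rows

print(len(getpricelist(['1 Main St', 'bad address', '2 Elm St'])))  # 2
```

In the book's zillow.py the same guard goes into getpricelist() right after the call to getaddressdata().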

 

Exercise 3. Classification of Flickr data. This is a more open-ended exercise. The basic idea is that you should use the same approach as in Exercise 2 on the data you got from the Flickr API.

  • Choose the categories you want to classify. Use as categories either a few geographical locations (for example Zealand and Jutland) or type of landscape (nature and city, etc.). In the first case you can use a bounding box to validate the category and in the second case you need to manually open and inspect the photos.
  • Use features based on tags (or descriptions, which will need to be downloaded using flickr.photos_search with the "extras" option). Discuss the features you have chosen. Do you expect them to be informative? Why?
  • Interpret the trees you get as you did in Exercise 2.
  • Advanced question: Validate the trained model on test data. Discuss some cases where the model worked and some cases where it didn’t.
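The hold-out validation asked for in the advanced questions of Exercises 2 and 3 can be sketched like this (the helper names train_test_split, accuracy and predict are our own for illustration; they are not from the book's code, and predict stands for whatever wrapper you write around the book's classify function):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold back a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]

def accuracy(predict, testrows):
    """predict maps an observation (all columns but the last) to a label;
    the last column holds the true label."""
    hits = sum(1 for row in testrows if predict(row[:-1]) == row[-1])
    return hits / len(testrows)
```

Train the tree on the first part only, then report accuracy(predict, testrows); comparing this test accuracy with the accuracy on the training rows gives a first indication of overfitting.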