Page last edited by Ole Winther (olwi) 07/03-2014
Teacher
Today's session on machine learning is taught by Ole Winther.
Learning objectives
- Work with a specific method, decision trees, for classification
of housing prices and Flickr images based upon their metadata.
- Understand the concept of decision tree classification,
training of decision trees, overfitting and missing values.
Reading
- Programming Collective Intelligence (O'Reilly, 2007) by Toby
Segaran, Chapter 7, download [here].
Program
Today’s session and exercise will have exactly the same
structure as last week’s. We start with a short lecture about
machine learning and decision trees. After that we will work with
the material in Chapter 7 of the textbook Programming Collective
Intelligence by Toby Segaran. First we will work through an example
from the textbook, and then turn to a more open-ended exercise,
finding various ways to divide and interpret the Flickr data using
the decision tree classifier.
Exercise 1. Make sure you have read today’s text
Programming Collective Intelligence, Chapter 7, pages 142-166
(excluding pages 151-153) [can be downloaded above]. Answer
the following questions in your own words (the answer to Exercise 1
should not exceed two pages).
- We are given a decision tree like the one in Figure 7-1 or 7-4.
Explain the process of classifying a new observation with it.
- The decision tree is a classifier just like the naïve Bayes
classifier we studied last week. What is the main attraction of the
decision tree according to the book? [Hint: see both the beginning
and end of Chapter 7.]
- Describe the process of building the tree from training data.
Describe how to choose which variable to split on. Describe Gini
impurity and entropy. Describe the recursive tree building
process.
- What is overfitting? How can pruning of the tree as described
in the book cure overfitting?
- What are missing values? How does the decision tree deal with
missing values? How would you deal with missing values in the naïve
Bayes classifier?
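The Gini impurity and entropy measures from the third question can be sketched in a few lines of Python (a minimal sketch operating on a plain list of class labels; the book's treepredict.py defines equivalent functions over full data rows):

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Probability that two randomly drawn items have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Expected number of bits needed to encode a randomly drawn label."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ['spam', 'spam', 'ham', 'ham']
print(gini_impurity(labels))  # 0.5 for an even two-class split
print(entropy(labels))        # 1.0 bit for an even two-class split
```

Both measures are zero for a pure set and largest for an even split, which is why the tree-building algorithm splits on the variable that reduces them the most.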
Exercise 2. In this exercise you should
work your way through the real data example in the book on modeling
home prices (pages 158-161).
- Reproduce the predictions given on the top of page 161.
Note: there is a NoneType error when reproducing the code.
getaddressdata() returns None when the data from the URL doesn't
match the pattern in the try block. This can be fixed by adding an
if-statement in getpricelist() that checks whether the data is
None.
- The house price is a continuous number and we have only learned
to work with categorical outcomes. How is that problem solved in
this example?
- Try to interpret some parts of the tree. Does it give sensible
results, or sensible but surprising ones? Could some of the results
be due to overfitting?
- Advanced question: We can test the trained model's ability to
generalize to new data by holding back some data for testing. Take
out, for example, 20% of the data for testing on a tree trained on
the remaining 80% of the data. Comment on the test accuracy.
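The None-value fix described under the first bullet can be sketched like this (hypothetical stand-ins for the book's getaddressdata() and getpricelist(); the real Zillow lookup is replaced by a stub so the guard logic is self-contained):

```python
def getaddressdata(address, city):
    """Stub for the book's Zillow lookup; returns None when the
    response does not match the expected pattern, like the real code."""
    fake_db = {('123 Main St', 'Boston'): (2, 1, 1950, 'house', 250000)}
    return fake_db.get((address, city))  # None on a failed lookup

def getpricelist(addresses):
    rows = []
    for address, city in addresses:
        data = getaddressdata(address, city)
        if data is None:   # the fix: skip entries the lookup could not parse
            continue
        rows.append(list(data))
    return rows

rows = getpricelist([('123 Main St', 'Boston'), ('no such place', 'Nowhere')])
print(rows)  # only the row for which data was actually returned
```

Without the `if data is None` guard, the second address would make the code crash when it tries to index into the None value.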
Exercise 3. Classification of Flickr data. This is a
more open-ended exercise. The basic idea is that you should use the
same approach as in Exercise 2 on the data you got from the Flickr
API.
- Choose the categories you want to classify. Use as categories
either a few geographical locations (for example Zealand and
Jutland) or type of landscape (nature and city, etc.). In the first
case you can use a bounding box to validate the category and in the
second case you need to manually open and inspect the photos.
- Use features based on tags (or descriptions; these will need to
be downloaded using flickr.photos_search and the "extras" option).
Discuss the features you have chosen. Do you expect them to be
informative? Why?
- Interpret the trees you get as you did in Exercise 2.
- Advanced question: Validate the trained model on test data.
Discuss some cases where the model worked and some cases where it
didn't.
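The hold-out validation asked for in the advanced questions can be sketched as follows (a minimal sketch; buildtree and classify below are toy stand-ins with the same call interface as the book's treepredict functions, which you would use instead on the real data):

```python
import random

def split_data(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into a training and a test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def accuracy(classify, tree, testset):
    """Fraction of test rows whose prediction matches the last column."""
    correct = sum(1 for row in testset if classify(row[:-1], tree) == row[-1])
    return correct / len(testset)

# Toy stand-ins for treepredict.buildtree / treepredict.classify:
# "train" by remembering the majority label, "classify" by returning it.
def buildtree(rows):
    labels = [row[-1] for row in rows]
    return max(set(labels), key=labels.count)

def classify(observation, tree):
    return tree

data = [[x, 'big' if x > 5 else 'small'] for x in range(10)]
train, test = split_data(data, test_fraction=0.2)
tree = buildtree(train)
print(accuracy(classify, tree, test))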
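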
This page will be permanently deleted and cannot be recovered. Are you sure?
|