02806 Course Wiki

Assignment 2
Page last edited by Sune Lehmann Jørgensen (sljo) 24/03-2014

Formalia: Please read http://dtu.cnwiki.dk/02806/page/1159/assignments-project carefully before proceeding. That page contains information about formatting (including restrictions on group size, etc.) and many other aspects of handing in the assignment. If you fail to follow these simple instructions, it will negatively impact your grade!

 

Due date: The due date is March 24th 23:59. Hand in via CampusNet upload of relevant .ipynb files, etc.

 

Assignment 2

 

Exercise 1. Learn about Naive Bayes

  • Read PCI Chapter 6 pp 117-127 carefully.
  • Work through the book's little spam-filtering example (include it in your notebook). Create the docclass.py file on your own, based on the code in the book. Place docclass.py in the same directory as your IPython notebook so that you can import its classes/functions into the Notebook.
  • Explain what happens when you run sampletrain(cl) on p 121.
  • Express in your own words what a conditional probability is. In the spam-example, if a word has conditional probability = 0.5 of being "good", what does that mean about that word in the training data?
  • What is the "assumed probability" on page 122?
  • How is the prior probability of a category (e.g. good/bad) defined?
  • Explain Bayes' Theorem in your own words.
  • In what sense is Naive Bayes "naive"?
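
To make the probability questions above concrete, here is a minimal sketch in the spirit of the book's classifier. The word counts and category names are invented for illustration, and the two functions mirror the roles of fprob/weightedprob from PCI rather than reproducing the book's code exactly:

```python
# Toy training counts, invented for illustration: how many documents of
# each category contain each word, and how many documents each category has.
feature_counts = {
    'money': {'good': 1, 'bad': 3},
    'python': {'good': 4, 'bad': 0},
}
category_counts = {'good': 4, 'bad': 4}

def fprob(word, category):
    """Conditional probability Pr(word | category): the fraction of the
    category's documents that contain the word."""
    if category_counts[category] == 0:
        return 0.0
    return feature_counts.get(word, {}).get(category, 0) / category_counts[category]

def weighted_prob(word, category, weight=1.0, ap=0.5):
    """The book's 'assumed probability': start every word at ap and move
    toward the observed frequency as more evidence accumulates."""
    basic = fprob(word, category)
    totals = sum(feature_counts.get(word, {}).values())
    return (weight * ap + totals * basic) / (weight + totals)
```

In these terms, a word with fprob(word, 'good') == 0.5 simply appears in half of the "good" training documents, and a word never seen in training gets weighted_prob == 0.5 because only the assumed probability contributes.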

Exercise 2.   Apply Naive Bayes to a real dataset. (You will also secretly generate your own word list)

  • Download some tweets [link here]. Each tweet in this sample has a happy or sad emoticon. Happy emoticons [":-)" or ":)"] belong to the class "happy" and sad emoticons [":-(" or ":("] belong to the class "sad". What is the probability of a happy vs. a sad tweet? How does this probability relate to PCI Chapter 6?
  • Use the code you generated in Exercise 1 (above) to train your classifier to separate "happy" from "sad" tweets. You can use the in operator to do this, so the expression ":-)" in tweet_text will evaluate to True if the tweet contains the string ":-)". You may decide what to do with tweets that contain both happy and sad emoticons (there is no right answer here, but justify your choice).
  • The getwords(doc) function (PCI p 118) should get rid of emoticons for learning, but you may want to first remove Twitter usernames (everything that starts with @, e.g. @suneman). Should you get rid of web pages (explain your answer)? Train the classifier on a random 50% of the tweets and see how well you do on classifying the remaining 50%.
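
One possible way to clean the tweets and split the data is sketched below. The regular expressions and the 50/50 split helper are illustrative choices, not the only correct ones:

```python
import random
import re

def getwords(doc):
    # Strip usernames (@...), URLs and emoticons before tokenizing, so the
    # classifier cannot simply memorize the labels it was trained from.
    doc = re.sub(r'@\w+', '', doc)          # Twitter usernames
    doc = re.sub(r'https?://\S+', '', doc)  # web addresses
    doc = re.sub(r':-?[()]', '', doc)       # :-) :) :-( :(
    words = [w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20]
    return dict.fromkeys(words, 1)

def split_tweets(tweets, fraction=0.5, seed=42):
    # Shuffle (with a fixed seed for reproducibility) and split into a
    # training part and a test part.
    rng = random.Random(seed)
    shuffled = tweets[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]
```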

Exercise 3.  Make sure you have read Programming Collective Intelligence Chapter 7, pages 142-166 (excluding pages 151-153). Answer the following questions in your own words (your answer to this exercise should not exceed two pages).

  • We are given a decision tree like the one in Figure 7-1 or 7-4. Explain the process of classifying a new observation with it.
  • The decision tree is a classifier just like the naïve Bayes classifier we studied last week. What is the main attraction of the decision tree according to the book? [Hint: see both the beginning and the end of Chapter 7.]
  • Describe the process of building the tree from training data. Describe how to choose which variable to split on. Describe Gini impurity and entropy. Describe the recursive tree building process.
  • What is overfitting? How can pruning of the tree as described in the book cure overfitting?
  • What are missing values? How does the decision tree deal with missing values? How would you deal with missing values in the naïve Bayes classifier?
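
As a sanity check on the two impurity measures, here is a small standalone sketch that operates on plain label lists rather than the book's row format:

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    # Probability that two items drawn at random (with replacement)
    # have different labels: 1 - sum(p_i^2).
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Expected number of bits needed to encode a label: -sum(p_i * log2(p_i)).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```

A pure node scores 0 under both measures, while a 50/50 node scores 0.5 (Gini) and 1.0 (entropy); the tree builder chooses the split that reduces the chosen measure the most.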

Exercise 4.  In this exercise you should work your way through the real-data example in the book on modeling home prices (pages 158-161). The code from the book can be found [here]. There is an error in treepredict.py in the prune function (page 155): delta=entropy(tb+fb)-(entropy(tb)+entropy(fb)/2) should be delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2.

  • Reproduce the predictions given at the top of page 161. Note: at the top of the tree in Figure 7-6 the root node should read 3:3.0 (not 3:3:0). This refers to the number of bathrooms.

There is a NoneType error when reproducing the code: getaddressdata() can return None. This can be fixed by adding an if-statement in getpricelist() to check whether the data is None.

           Use a try except block to handle invalid data

           If your API key does not work, try the one from the book

           The API is very slow, so it is a good idea to download the results into a file once, and then read from the file the rest of the times.
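
Putting the three hints together, a defensive caching wrapper might look like the sketch below. Here fetch stands in for the book's getaddressdata, and the cache-file name and address format are illustrative assumptions:

```python
import json
import os

def getpricelist(addresses, fetch, cachefile='zillow_cache.json'):
    # fetch is e.g. zillow.getaddressdata from the book. The API is slow,
    # so cache JSON-serializable results on disk and only call fetch for
    # addresses we have not seen before.
    cache = {}
    if os.path.exists(cachefile):
        with open(cachefile) as f:
            cache = json.load(f)
    rows = []
    for line in addresses:
        parts = line.strip().split(',')
        if len(parts) < 2:
            continue                     # malformed input line
        key = line.strip()
        if key not in cache:
            try:
                cache[key] = fetch(parts[0], parts[1])
            except Exception:
                cache[key] = None        # invalid data from the API
        if cache[key] is not None:       # skip unresolved addresses
            rows.append(cache[key])
    with open(cachefile, 'w') as f:
        json.dump(cache, f)
    return rows
```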

  • The house price is a continuous number and we have only learned to work with categorical outcomes. How is that problem solved in this example?
  • Try to interpret some parts of the tree. Does it give sensible/reasonable and/or sensible-but-surprising results? Could some of the results be due to overfitting?
  • We can test the trained model's ability to generalize to new data by holding back some data for testing. Take out, for example, 20% of the data for testing on a tree trained on the remaining 80% of the data. Comment on the test accuracy (hint: you can use http://en.wikipedia.org/wiki/Mean_squared_error).
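
For the hold-out evaluation, a minimal sketch of the split and the mean squared error follows; the prediction function itself is whatever your trained treepredict code exposes:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    # Hold back a fraction of the rows for testing, after shuffling with a
    # fixed seed so the split is reproducible.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]   # (train, test)

def mean_squared_error(y_true, y_pred):
    # Average squared difference between actual and predicted prices.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```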

Exercise 5.  Make sure you have read Programming Collective Intelligence Chapter 4 pages 54-56 and 69-73 [can be downloaded from the link above]. You will need this information when you implement the PageRank algorithm.

  • Download the Wikispeedia data and load the link structure of the graph.
  • Implement the PageRank algorithm.

The pagerank implemented in the book uses a SQLite database (here is the source from GitHub). You will have to implement your own version that uses the links from the Wikispeedia dataset.
Here is some pseudocode for the algorithm:

for each link a -> b in wikispeedia
    add b to the outgoing links for a
    add a to the incoming links for b

for n iterations
    for each url
        update the pagerank for the url according to the formula


To speed things up, use dictionaries for storing incoming and outgoing links.
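
Following the pseudocode, a self-contained sketch of the iteration might look like this. The damping factor 0.85 and the initial rank of 1 follow the book's version; simply skipping pages with no outgoing links when summing is one possible choice for dangling nodes:

```python
def pagerank(links, iterations=20, damping=0.85):
    """links: dict mapping each page to the set of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Build incoming-link and out-degree dictionaries for fast lookups.
    incoming = {p: set() for p in pages}
    outdegree = {p: len(links.get(p, ())) for p in pages}
    for a, targets in links.items():
        for b in targets:
            incoming[b].add(a)
    pr = {p: 1.0 for p in pages}  # every page starts with rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            # pr(p) = (1 - d) + d * sum over linkers q of pr(q) / outdegree(q)
            new[p] = (1 - damping) + damping * sum(
                pr[q] / outdegree[q] for q in incoming[p] if outdegree[q] > 0)
        pr = new
    return pr
```

For the Wikispeedia graph, links would be built from the dataset's link file, with one entry per source article.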

  • List the top 100 and bottom 100 PageRank articles and comment on the results.
  • Compare your results to those of some of the other groups. Are the overall results the same or nearly the same? Identify factors that contribute to a potential difference.

Exercise 6.  Now we will work with the actual user surf path data from Wikispeedia. 

  • Write code to load the user path data.
  • Calculate the number of visits to each page.
  • List the top 100 and bottom 100 articles in terms of page visits and list their frequencies.
  • Make a short description of the set-up of the Wikispeedia experiments.
  • Do the lists from PageRank and the Wikispeedia page visits look similar?
  • Identify confounding factors in the Wikispeedia data that make the results different. It could, for example, be how the start and end points have been selected.
  • PageRank calculates visit frequencies for a random surfer who takes an infinite number of steps. Discuss how that might differ from how the users in the Wikispeedia experiment choose their surf strategy.
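
For loading the path data, here is a counting sketch under the assumption (check the dataset's README) that each data line of paths_finished.tsv is tab-separated with the path itself in the fourth column, semicolon-separated, and with '<' marking a back click:

```python
from collections import Counter

def count_visits(lines):
    # Assumed format: hashedIp <TAB> timestamp <TAB> duration <TAB> path <TAB> rating,
    # where path is 'Article1;Article2;<;Article3' and '<' means "go back".
    visits = Counter()
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue                      # skip comments and blank lines
        path = line.rstrip('\n').split('\t')[3]
        for article in path.split(';'):
            if article != '<':            # back clicks are not page visits
                visits[article] += 1
    return visits
```

visits.most_common(100) then gives the top 100 articles, and the bottom 100 can be read off the tail of the same sorted list.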