Page last edited by Sune Lehmann Jørgensen (sljo) 24/03-2014
Formalia: Please read http://dtu.cnwiki.dk/02806/page/1159/assignments-project carefully before proceeding. This
page contains information about formatting (including restrictions
on group size, etc), and many other aspects of handing in the
assignment. If you fail to
follow these simple instructions, it will negatively impact your
grade!
Due date: March 24th,
23:59. Hand in via CampusNet upload of the relevant .ipynb files,
etc.
Assignment 1
Exercise 1. Learn about Naive Bayes
- Read PCI Chapter 6 pp 117-127 carefully.
- Work through the book's little spam filtering example (include
it in your notebook). Create the docclass.py file on your own, based
on the code in the book. The docclass.py file should be in
the same directory as your IPython notebook; that will allow
you to import the classes/functions into your Notebook.
- Explain what happens when you
run sampletrain(cl) on p. 121.
- Express in your own words what a conditional probability is. In
the spam-example, if a word has conditional probability = 0.5 of
being "good", what does that mean about that word in the training
data?
- What is the "assumed probability" on page 122?
- How is the prior probability of a category (e.g. good/bad)
defined?
- Explain Bayes' Theorem in your own words.
- In what sense is Naive Bayes "naive"?
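To make the conditional-probability questions above concrete, here is a minimal sketch of the word-count bookkeeping behind the book's fprob(); the variable and function names loosely mirror PCI's, but this is only an illustration, not a substitute for writing your own docclass.py.

```python
# Minimal sketch of the counting behind PCI's classifier (illustration only).
from collections import defaultdict

fc = defaultdict(lambda: defaultdict(int))  # fc[word][category] -> doc count
cc = defaultdict(int)                       # cc[category] -> total doc count

def train(words, cat):
    cc[cat] += 1
    for w in set(words):
        fc[w][cat] += 1

def fprob(word, cat):
    # Conditional probability Pr(word | category): the fraction of
    # documents in `cat` that contain `word`.
    if cc[cat] == 0:
        return 0.0
    return fc[word][cat] / cc[cat]

train(['nobody', 'owns', 'the', 'water'], 'good')
train(['the', 'quick', 'rabbit', 'jumps'], 'good')
train(['buy', 'pharmaceuticals', 'now'], 'bad')
print(fprob('quick', 'good'))  # 0.5: 'quick' appears in 1 of 2 good docs
```

Note how a conditional probability of 0.5 means exactly that the word occurred in half of the training documents of that category.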
Exercise 2. Apply Naive Bayes to a real dataset. (You
will also secretly generate your own word list)
- Download some tweets. [link
here] Each tweet in this sample has an
emoticon happy/sad. Happy emoticons [":-)" or ":)"] belong to the
class "happy" and sad emoticons [":-(" or ":("] belong to the class
"sad". What is the probability of a happy vs a sad tweet? How does
this probability relate to PCI chapter 6?
- Use the code you generated in exercise 1 (above) to train your
classifier to separate "happy" from "sad" tweets. You can use
the in operator to do this. So the
expression ":-)" in tweet_text will
evaluate to True if the tweet
contains the string ":-)". You may
decide what to do with tweets that contain both happy and sad
emoticons (no right answer here, but justify your
choice).
- The getwords(doc) function (PCI p
118) should get rid of emoticons for learning, but you may want to
first remove twitter usernames (everything that starts with @, e.g.
@suneman). Should you get rid of web-pages (explain your answer)?
Train the classifier on a random 50% of the tweets and see how well
you do on classifying the remaining 50%.
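The labeling, cleaning, and 50/50 split described above can be sketched as follows; the tweets list and the exact emoticon handling are assumptions, so adapt them to the downloaded data. Here a tweet containing both emoticon types is simply dropped, which is one of several defensible choices.

```python
import random
import re

# Placeholder tweets; replace with the downloaded dataset.
tweets = ["@suneman great day :)", "so tired :(", "rain again :-("]

def label(text):
    happy = ':)' in text or ':-)' in text
    sad = ':(' in text or ':-(' in text
    if happy and sad:
        return None  # both emoticon types: our choice is to drop the tweet
    return 'happy' if happy else ('sad' if sad else None)

def clean(text):
    text = re.sub(r'@\w+', '', text)                  # strip usernames
    text = text.replace(':-)', '').replace(':)', '')  # strip emoticons
    text = text.replace(':-(', '').replace(':(', '')
    return text

labeled = [(clean(t), label(t)) for t in tweets if label(t)]
random.shuffle(labeled)           # random 50/50 train/test split
half = len(labeled) // 2
train_set, test_set = labeled[:half], labeled[half:]
```

Stripping the emoticons before training matters: otherwise the classifier just learns the labels themselves.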
Exercise 3. Make sure you have read
Programming Collective Intelligence Chapter 7 pages
142-166 (excluding pages 151-153). Answer the following
questions in your own words (the answer to this exercise should not
exceed two pages).
- We are given a decision tree like the one in Figure 7-1 or 7-4.
Explain the process of classifying a new observation with it.
- The decision tree is a classifier just like the naïve Bayes
classifier we studied last week. What is the main attraction about
the decision tree according to the book? [Hint: see both beginning
and end of Chapter 7.]
- Describe the process of building the tree from training data.
Describe how to choose which variable to split on. Describe Gini
impurity and entropy. Describe the recursive tree building
process.
- What is overfitting? How can pruning of the tree as described
in the book cure overfitting?
- What are missing values? How does the decision tree deal with
missing values? How would you deal with missing values in the naïve
Bayes classifier?
Exercise 4. In this exercise you should work your
way through the real data example in the book on modeling home
prices (pages 158-161). The code from the book can be found [
here]. There is an error in treepredict.py in the prune
function (page 155):
delta=entropy(tb+fb)-(entropy(tb)+entropy(fb)/2) should
be delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2.
- Reproduce the predictions given on the top of page 161.
Note: at the top of the tree in figure 7-6 the root node
should read 3:3.0
(not 3:3:0). This refers to the number
of bathrooms.
There is a NoneType error when reproducing the code:
getaddressdata() can return None. This can be fixed by adding
an if-statement in getpricelist() to check whether the data is
None.
- Use a try/except block to handle invalid data.
- If your API key does not work, try the one from the book.
- The API is very slow, so it is a good idea to download the results
to a file once, and then read from the file on subsequent runs.
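The None guard and try/except suggested above might look like this. Note that getaddressdata() here is only a stand-in for the book's Zillow lookup, with a fabricated return value so the sketch runs on its own; adapt the names to your copy of the chapter's code.

```python
# Stand-in for the book's Zillow lookup: returns None when the API
# lookup fails (fabricated data, for illustration only).
def getaddressdata(addr, city):
    return None if addr == 'bad address' else (2, 1500, 3, 250000)

def getpricelist(addresses):
    rows = []
    for addr, city in addresses:
        try:
            data = getaddressdata(addr, city)
        except Exception:
            continue  # invalid/malformed data from the API: skip the row
        if data is None:
            continue  # lookup failed: skip the row
        rows.append(data)
    return rows

print(getpricelist([('6 Washington', 'Cambridge'), ('bad address', 'Boston')]))
```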
- The house price is a continuous number and we have only learned
to work with categorical outcomes. How is that problem solved in
this example?
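If you get stuck on this question, note that the book swaps in a different impurity score for numerical outcomes. A minimal version of that score, assuming the outcome sits in the last column of each row, is:

```python
# Variance of the outcome column: the impurity score PCI uses for
# numerical outcomes instead of entropy/Gini (column layout assumed).
def variance(rows):
    if len(rows) == 0:
        return 0.0
    prices = [float(row[-1]) for row in rows]
    mean = sum(prices) / len(prices)
    return sum((p - mean) ** 2 for p in prices) / len(prices)

print(variance([[1, 100.0], [1, 200.0]]))  # 2500.0
```

A low variance in a node means the prices there are close together, which is what a good split should produce.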
- Try to interpret some parts of the tree. Are the results
sensible/reasonable, and/or sensible but surprising? Could
some of the results be a result of overfitting?
- We can test the trained model’s ability to generalize to new
data by holding back some data for testing. For example, hold out
20% of the data for testing on a tree trained on the remaining 80%
of the data. Comment on the test accuracy (hint: you can
use http://en.wikipedia.org/wiki/Mean_squared_error).
Exercise 5. Make sure you have read
Programming Collective Intelligence Chapter 4 pages 54-56
and 69-73 [can be downloaded from the link above]. You will need
this information when you implement the PageRank algorithm.
- Download the Wikispeedia data and load the link structure of
the graph.
- Implement the PageRank algorithm.
The pagerank implemented in the book uses a sqlite db (
here is the source from github). You will have to implement
your own version that uses the links from the wikispeedia
dataset.
Here is some pseudocode for the algorithm:
for each link a -> b in wikispeedia
add b to the outgoing links for a
add a to the incoming links for b
for n iterations
for each url
update the pagerank for
the url according to the formula
To speed up things, use dictionaries for storing incoming and
outgoing links
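Putting the pseudocode together, a runnable sketch on a toy link set might look like this. Swap the placeholder links list for the Wikispeedia link structure; the damping factor 0.85 and minimum value 0.15 follow the book's formula.

```python
from collections import defaultdict

# Placeholder links; replace with pairs parsed from the Wikispeedia data.
links = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'A')]

# Dictionaries of incoming/outgoing links, as suggested above.
outgoing = defaultdict(set)
incoming = defaultdict(set)
for a, b in links:
    outgoing[a].add(b)
    incoming[b].add(a)

pages = set(outgoing) | set(incoming)
pr = {p: 1.0 for p in pages}  # every page starts at 1.0

for _ in range(20):  # n iterations
    new = {}
    for url in pages:
        # PR(url) = 0.15 + 0.85 * sum over pages p linking to url
        #           of PR(p) / (number of outgoing links of p)
        s = sum(pr[p] / len(outgoing[p]) for p in incoming[url])
        new[url] = 0.15 + 0.85 * s
    pr = new

print(sorted(pr.items(), key=lambda kv: -kv[1]))
```

In this toy graph, B is linked only from A and ends up with the lowest score, as expected.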
- List the top 100 and bottom 100 PageRank articles and comment
on the results.
- Compare your results to those of some of the other groups. Are
the overall results the same or nearly the same? Identify factors
that contribute to a potential difference.
Exercise 6. Now we will work with the actual
user surf path data from Wikispeedia.
- Write code to load the user path data.
- Calculate the number of visits to each page.
- List the top 100 and bottom 100 articles in terms of page
visits and list their frequencies.
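Counting visits can be sketched as below. The ';'-separated path format and the '<' back-click marker are assumptions about the Wikispeedia files, so check them against the actual data before relying on them.

```python
from collections import Counter

# Placeholder paths; replace with paths loaded from the Wikispeedia data.
paths = [
    'A;B;C',
    'A;C',
    'B;<;C',
]

visits = Counter()
for path in paths:
    for page in path.split(';'):
        if page != '<':      # skip back-click markers
            visits[page] += 1

print(visits.most_common(2))  # the two most-visited pages
```

Counter.most_common() then gives you the top of the list directly; slice from the other end for the bottom 100.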
- Write a short description of the set-up of the Wikispeedia
experiments.
- Do the lists from PageRank and Wikispeedia page visits look
similar?
- Identify confounding factors in the Wikispeedia data that make
the results different. For example, it could be how the start and end
points have been selected.
- PageRank calculates visit frequencies for a random surfer
who surfs for an infinite number of steps. Discuss how that might
differ from how the users in the Wikispeedia experiment choose
their surfing strategy.