Page last edited by Ole Winther (olwi) 18/03-2014
Teacher
Today's session on machine learning is taught by Ole Winther.
Learning objectives
- Understand in detail the PageRank algorithm for ranking nodes
in a graph based upon the links in the graph.
- Implement the PageRank algorithm and investigate to what degree
actual user surf paths agree with the predictions of PageRank.
Reading
- Programming Collective
Intelligence (O'Reilly 2007). Toby
Segaran. Chapter 4, download [here].
Program
In today’s session and exercise we will continue with machine
learning. We start with a short lecture about ranking pages in
graphs using the link structure. After that we will work with the
material in Chapter 4 on Google’s PageRank algorithm. This
topic ties in well with the coming weeks’ focus on graphs.
We will work with the Wikispeedia dataset,
which contains a condensed version of Wikipedia with 4604
articles. Exercise 1 is about calculating the PageRank of the graph,
and in Exercise 2 we will use actual user data to investigate how
users actually surf.
Exercise 1. Make sure you have read today’s text,
Programming Collective Intelligence Chapter 4, pages 54-56 and 69-73
[can be downloaded from the link above]. You will need this
information when you implement the PageRank algorithm.
- Download the Wikispeedia data and load the link structure of
the graph.
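A possible way to load the link structure in Python is sketched below. It assumes the dataset's links.tsv file, where each data line is a tab-separated, URL-encoded pair "source<TAB>target" and comment lines start with '#' (check the README that ships with the data); the function name is our own choice.

```python
from collections import defaultdict
from urllib.parse import unquote

def load_links(path="links.tsv"):
    """Read Wikispeedia link pairs into outgoing/incoming adjacency dicts."""
    outgoing = defaultdict(set)  # article -> set of articles it links to
    incoming = defaultdict(set)  # article -> set of articles linking to it
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and comment lines
            a, b = line.split("\t")
            a, b = unquote(a), unquote(b)  # article names are URL-encoded
            outgoing[a].add(b)
            incoming[b].add(a)
    return outgoing, incoming
```

Keeping both directions around pays off later: the PageRank update needs the incoming links of each page and the out-degree of each linking page.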
- Implement the PageRank algorithm.
The PageRank implementation in the book uses an SQLite database (
here is the source from GitHub). You will have to implement
your own version that uses the links from the Wikispeedia
dataset.
Here is some pseudocode for the algorithm:
for each link a -> b in Wikispeedia:
    add b to the outgoing links for a
    add a to the incoming links for b
for n iterations:
    for each url:
        update the PageRank for the url according to the formula
To speed things up, use dictionaries for storing the incoming and
outgoing links.
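The iteration above could be sketched in Python as follows. The dictionaries map each article to the set of articles it links to (outgoing) and the set of articles linking to it (incoming); the update formula 0.15 + 0.85 * (sum of contributions) follows the book's convention with damping factor d = 0.85, while the function and variable names are our own.

```python
def pagerank(outgoing, incoming, n_iter=20, d=0.85):
    """Iteratively compute PageRank over dict-of-sets adjacency."""
    urls = set(outgoing) | set(incoming)
    ranks = {u: 1.0 for u in urls}  # the book starts every page at 1.0
    for _ in range(n_iter):
        new_ranks = {}
        for u in urls:
            # each page v linking to u contributes its rank divided
            # by its number of outgoing links
            s = sum(ranks[v] / len(outgoing[v]) for v in incoming.get(u, ()))
            new_ranks[u] = (1 - d) + d * s
        ranks = new_ranks
    return ranks
```

Sorting `ranks.items()` by value then gives the top-100 and bottom-100 lists asked for below.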
- List the top 100 and bottom 100 PageRank articles and comment
on the results.
- Compare your results to those of some of the other groups. Are
the overall results the same or nearly the same? Identify factors
that contribute to a potential difference.
Exercise 2. Now we will work with the actual user
surf path data from Wikispeedia.
- Write code to load the user path data.
- Calculate the number of visits to each page.
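A sketch for the two steps above, assuming the dataset's paths_finished.tsv file where the click path is the fourth tab-separated column, articles in a path are separated by ';', and '<' marks a back click (verify the column layout against the dataset's README); the function name is our own.

```python
from collections import Counter
from urllib.parse import unquote

def count_visits(path_file="paths_finished.tsv"):
    """Count how often each article occurs in the user click paths."""
    visits = Counter()
    with open(path_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and comment lines
            fields = line.split("\t")
            path = fields[3]  # assumed: 4th column holds the click path
            for article in path.split(";"):
                if article != "<":  # '<' is a back click, not a new page
                    visits[unquote(article)] += 1
    return visits
```

`visits.most_common(100)` then gives the top-100 list with frequencies directly.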
- List the top 100 and bottom 100 articles in terms of page
visits and list their frequencies.
- Give a short description of the set-up of the Wikispeedia
experiments.
- Do the lists from PageRank and Wikispeedia page visits look
similar?
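One simple way to make the comparison concrete is the overlap between the two top-k lists; this helper is our own suggestion, not part of the exercise text.

```python
def top_overlap(ranks, visits, k=100):
    """Fraction of articles shared between the top-k of two scorings."""
    top_r = {a for a, _ in sorted(ranks.items(), key=lambda x: -x[1])[:k]}
    top_v = {a for a, _ in sorted(visits.items(), key=lambda x: -x[1])[:k]}
    return len(top_r & top_v) / k
```

An overlap near 1 means the random-surfer model predicts the popular pages well; a low overlap points at the confounding factors discussed next.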
- Identify confounding factors in the Wikispeedia data that make
the results different. One example could be how the start and end
points have been selected.
- PageRank calculates visit frequencies for a random surfer
that surfs an infinite number of steps. Discuss how that might
differ from how the users in the Wikispeedia experiment select
their surf strategy.