Page last edited by Ole Winther (olwi) 18/03-2014
Teacher
Today's session on machine learning is taught by Ole Winther.
Learning objectives
- Understand in detail the PageRank algorithm for ranking nodes
in a graph based upon the links in the graph.
- Implement the PageRank algorithm and investigate to what degree
actual user surf paths agree with the predictions of PageRank.
Reading
- Programming Collective
Intelligence (O'Reilly 2007). Toby
Segaran. Chapter 4, download [here].
Program
In today’s session and exercise we will continue with machine
learning. We start with a short lecture about ranking pages in
graphs using the link structure. After that we will work with the
material in Chapter 4 on Google’s PageRank algorithm. This
topic ties in well with the coming weeks’ focus on graphs.
We will work with the Wikispeedia dataset,
which contains a condensed version of Wikipedia with 4604
articles. Exercise 1 is about calculating the PageRank of the graph,
and in Exercise 2 we will use actual user data to investigate how
users actually surf.
Exercise 1. Make sure you have read today’s text,
Programming Collective Intelligence Chapter 4, pages 54-56 and 69-73
[can be downloaded from the link above]. You will need this
information when you implement the PageRank algorithm.
- Download the Wikispeedia data and load the link structure of
the graph.
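A possible way to load the link structure in Python is sketched below. It assumes the dataset's links.tsv file, where each data line is a tab-separated, URL-encoded pair "source<TAB>target" and comment lines start with '#' (check the README that ships with the data); the function name is our own choice.

```python
from collections import defaultdict
from urllib.parse import unquote

def load_links(path="links.tsv"):
    """Read Wikispeedia link pairs into outgoing/incoming adjacency dicts."""
    outgoing = defaultdict(set)  # article -> set of articles it links to
    incoming = defaultdict(set)  # article -> set of articles linking to it
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and comment lines
            a, b = line.split("\t")
            a, b = unquote(a), unquote(b)  # article names are URL-encoded
            outgoing[a].add(b)
            incoming[b].add(a)
    return outgoing, incoming
```

Keeping both directions around pays off later: the PageRank update needs the incoming links of each page and the out-degree of each linking page.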
- Implement the PageRank algorithm.
The PageRank implementation in the book uses an SQLite database (
here is the source from GitHub). You will have to implement
your own version that uses the links from the Wikispeedia
dataset.
Here is some pseudocode for the algorithm:
for each link a -> b in Wikispeedia:
    add b to the outgoing links for a
    add a to the incoming links for b
for n iterations:
    for each url:
        update the PageRank for the url according to the formula
To speed things up, use dictionaries for storing the incoming and
outgoing links.
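The iteration above could be sketched in Python as follows. The dictionaries map each article to the set of articles it links to (outgoing) and the set of articles linking to it (incoming); the update formula 0.15 + 0.85 * (sum of contributions) follows the book's convention with damping factor d = 0.85, while the function and variable names are our own.

```python
def pagerank(outgoing, incoming, n_iter=20, d=0.85):
    """Iteratively compute PageRank over dict-of-sets adjacency."""
    urls = set(outgoing) | set(incoming)
    ranks = {u: 1.0 for u in urls}  # the book starts every page at 1.0
    for _ in range(n_iter):
        new_ranks = {}
        for u in urls:
            # each page v linking to u contributes its rank divided
            # by its number of outgoing links
            s = sum(ranks[v] / len(outgoing[v]) for v in incoming.get(u, ()))
            new_ranks[u] = (1 - d) + d * s
        ranks = new_ranks
    return ranks
```

Sorting `ranks.items()` by value then gives the top-100 and bottom-100 lists asked for below.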
- List the top 100 and bottom 100 PageRank articles and comment
on the results.
- Compare your results to those of some of the other groups. Are
the overall results the same or nearly the same? Identify factors
that contribute to a potential difference.
Exercise 2. Now we will work with the actual user
surf path data from Wikispeedia.
- Write code to load the user path data.
- Calculate the number of visits to each page.
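A sketch for the two steps above, assuming the dataset's paths_finished.tsv file where the click path is the fourth tab-separated column, articles in a path are separated by ';', and '<' marks a back click (verify the column layout against the dataset's README); the function name is our own.

```python
from collections import Counter
from urllib.parse import unquote

def count_visits(path_file="paths_finished.tsv"):
    """Count how often each article occurs in the user click paths."""
    visits = Counter()
    with open(path_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and comment lines
            fields = line.split("\t")
            path = fields[3]  # assumed: 4th column holds the click path
            for article in path.split(";"):
                if article != "<":  # '<' is a back click, not a new page
                    visits[unquote(article)] += 1
    return visits
```

`visits.most_common(100)` then gives the top-100 list with frequencies directly.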
- List the top 100 and bottom 100 articles in terms of page
visits and list their frequencies.
- Give a short description of the set-up of the Wikispeedia
experiments.
- Do the lists from PageRank and Wikispeedia page visits look
similar?
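One simple way to make the comparison concrete is the overlap between the two top-k lists; this helper is our own suggestion, not part of the exercise text.

```python
def top_overlap(ranks, visits, k=100):
    """Fraction of articles shared between the top-k of two scorings."""
    top_r = {a for a, _ in sorted(ranks.items(), key=lambda x: -x[1])[:k]}
    top_v = {a for a, _ in sorted(visits.items(), key=lambda x: -x[1])[:k]}
    return len(top_r & top_v) / k
```

An overlap near 1 means the random-surfer model predicts the popular pages well; a low overlap points at the confounding factors discussed next.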
- Identify confounding factors in the Wikispeedia data that make
the results different. One example could be how the start and end
points have been selected.
- PageRank calculates visit frequencies for a random surfer
that surfs an infinite number of steps. Discuss how that might
differ from how the users in the Wikispeedia experiment select
their surf strategy.