Social Graphs and Interactions

Lecture 5
Page last edited by David Kofoed Wind (dawi) 25/09-2014

Overview

This week is about using machine learning to make your bot a good bot. A good bot retweets interesting content and posts interesting tweets. To solve these two tasks we need classification methods and good features to train on. This week we will work on the retweet part, and next week on the tweets. Additionally, we will work with the rest of chapter 6 in the NLPP book.

 

Reading

  • NLPP1e sections 6.2 and 6.4-6.8.
  • Slides from today’s lecture available from File sharing on CampusNet.

Exercises

 

NLPP1e

  1. Describe in your own words the generative approach to classification. Then describe the discriminative approach, and contrast the two.
  2. Make a small example on paper (say 2 features, 3 training examples and 1 test example) and go through the training and testing steps for the two classifiers we have learned about in chapter 6. 
  3. In the exercises last week you answered Choosing the Right Features in relation to what makes a “good” tweet. (We defined a good tweet as one having more than 10 retweets.) Take the list of features you came up with and divide it into three categories: features that have to do with properties of the user who posted the tweet, features that have to do with the word content, and features that have to do with other content.
  4. Which features do you think will be most informative, and how would you go about encoding them?
  5. What is the danger of using very high-dimensional features?
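For exercise 2, the same small worked example can be checked in code with NLTK's built-in classifiers (the classifiers covered in chapter 6). A minimal sketch with 2 features, 3 training examples and 1 test example; the feature names and values are purely illustrative:

```python
import nltk

# Toy training set: 2 features, 3 labelled examples (values are made up)
train = [
    ({"has_link": True,  "n_hashtags": 2}, "good"),
    ({"has_link": False, "n_hashtags": 0}, "bad"),
    ({"has_link": True,  "n_hashtags": 1}, "good"),
]
# One test example with its true label
test_features, test_label = {"has_link": False, "n_hashtags": 1}, "bad"

# Train a naive Bayes classifier and classify the test example
nb = nltk.NaiveBayesClassifier.train(train)
predicted = nb.classify(test_features)
print(predicted)

# The maximum entropy classifier is trained the same way, e.g.:
# me = nltk.MaxentClassifier.train(train, trace=0, max_iter=10)
```

Working through the same three training examples by hand (counting feature frequencies per label for naive Bayes) and comparing against the classifier's output is a good sanity check.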

Twitter

  • The ambitious goal of today is to build a classification system that our bot will use to decide what to retweet. Please document & explain your choices for each of the following steps:
  1. You have already selected a personality for your bot, so it is natural that at least some of the retweets your bot makes will fit that personality. Select some hashtags that fit this personality.
  2. Collect tweets that match one or more of these hashtags. These tweets will be used as a training set when we want to predict good and bad tweets. We will use last week’s definition of good as more than 10 retweets. The tweets we collect must be old enough that they have had time to more or less prove their potential. The training set should contain at least 1000 tweets.
  3. Extract the following features for each tweet: the number of followers of the user who posted the tweet, the age of the user account, the number of links in the tweet, the number of words in the tweet, the number of hashtags in the tweet.
  4. Discuss why we do not use the number of retweets as a feature.
  5. Do you think a classifier trained on these features will be good at classifying tweets only from within the topic defined by the hashtags?  
  6. Split the training set randomly into 50% for training and 50% for testing. Train both a naïve Bayes classifier and a maximum entropy classifier. Report training and test performance. Assuming independent errors, the binomial distribution gives the standard deviation of the error rate as sqrt(e*(1-e)/Ntest), where e is the error rate and Ntest is the number of test examples. Do you think that the difference in error you observe is significant?
  7. Now we have a system to select tweets to retweet, and we can run A/B testing to see whether it works in practice. You will retweet twice per day, selecting from the tweets you received in the previous 24 hours. One tweet is the tweet predicted to have the highest probability of being good, and the second is the tweet with the highest number of retweets. (Make a rule for the case where these two are the same.) You should set up a system to track how many of your followers retweet your retweets of the two types. Again, think about the timing of retweeting.
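Step 3 can be sketched as a feature-extraction function. The sketch below assumes each collected tweet is a dict in the Twitter REST API v1.1 JSON format (fields like user.followers_count, entities.urls, retweet_count); if your collection script stores tweets differently, adapt the field names accordingly:

```python
from datetime import datetime, timezone

def tweet_features(tweet):
    """Extract the five features from a tweet dict in Twitter's
    v1.1 JSON format (field names assumed from that API)."""
    created = datetime.strptime(tweet["user"]["created_at"],
                                "%a %b %d %H:%M:%S %z %Y")
    return {
        "followers":        tweet["user"]["followers_count"],
        "account_age_days": (datetime.now(timezone.utc) - created).days,
        "n_links":          len(tweet["entities"]["urls"]),
        "n_words":          len(tweet["text"].split()),
        "n_hashtags":       len(tweet["entities"]["hashtags"]),
    }

def label(tweet):
    # Last week's definition: a "good" tweet has more than 10 retweets
    return "good" if tweet["retweet_count"] > 10 else "bad"
```

Pairing `tweet_features(t)` with `label(t)` for each collected tweet gives the `(features, label)` tuples that NLTK's classifiers expect as a training set.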
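For the significance question in step 6, the binomial standard deviation formula from the exercise is easy to evaluate directly. The error rates below are illustrative numbers, not real results:

```python
from math import sqrt

def error_stddev(e, n_test):
    """Standard deviation of the error rate, assuming independent
    errors: sqrt(e * (1 - e) / Ntest)."""
    return sqrt(e * (1 - e) / n_test)

# Illustrative numbers: 500 test tweets, naive Bayes error 0.30
# versus maximum entropy error 0.26.
e_nb, e_me, n_test = 0.30, 0.26, 500
sigma = error_stddev(e_nb, n_test)

# A rough rule of thumb: the difference is unlikely to be noise
# if it exceeds about two standard deviations.
print(abs(e_nb - e_me), 2 * sigma)
```

With these made-up numbers the observed difference (0.04) is of the same order as two standard deviations, so one could not confidently call it significant; with a larger test set the same difference would be.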