Page last edited by David Kofoed Wind (dawi) 25/09-2014
Overview
This week is about using machine learning to make your bot a
good bot. A good bot retweets interesting content and makes
interesting tweets. So in order to solve these two tasks we need
classification methods and good features to train on. This week we
will work on the retweet part and next week on the tweets.
Additionally, we will work with the rest of chapter 6 in the NLPP
book.
Reading
- NLPP1e chapter 6.2 and 6.4-6.8.
- Slides from today’s lecture available from File sharing on
CampusNet.
Exercises
NLPP1e
- Describe in your own words the generative approach to classification. Then describe the discriminative approach, and contrast the two.
- Make a small example on paper (say 2
features, 3 training examples and 1 test example) and go through
the training and testing steps for the two classifiers we have
learned about in chapter 6.
- In the exercises last week you answered Choosing the Right Features in relation to what makes a “good” tweet. (We defined a good tweet as one having more than 10 retweets.) Take the list of features you came up with and divide it into three categories: features that concern the user who posted the tweet, features that concern the word content, and features that concern other content.
- What do you think will be most
informative and how would you go about encoding these
features?
- What is the danger of using very high
dimensional features?
Twitter
- The ambitious goal of today is to build a classification system
that our bot will use to decide what to retweet. Please document
& explain your choices for each of the following steps:
- You have already selected a personality for your bot, so it is natural that at least some of the retweets your bot makes will fit that personality. Select some hashtags that fit this personality.
- Collect tweets that match one or more
of these hashtags. These tweets will be used as a training set when
we want to predict good and bad tweets. We will use last week’s
definition of good as more than 10 retweets. The tweets we collect
must be old enough that they have had time to more or less prove
their potential. The training set should contain at least 1000
tweets.
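The labelling step can be sketched as below. This assumes the tweets have already been collected (e.g. via the Search API) into plain dicts; the field names and the minimum-age threshold are our own assumptions for the sketch, not part of the exercise.

```python
from datetime import datetime, timedelta

# Sketch of labelling a collected tweet set. Field names are simplified
# stand-ins for the real API fields, and MIN_AGE is an assumed threshold.
MIN_AGE = timedelta(days=2)   # "old enough to prove its potential"
GOOD_THRESHOLD = 10           # last week's definition of a good tweet

def label_tweets(tweets, now):
    """Keep tweets older than MIN_AGE and label them good/bad."""
    labelled = []
    for tweet in tweets:
        if now - tweet["created_at"] < MIN_AGE:
            continue  # too fresh: the retweet count may still grow
        label = "good" if tweet["retweet_count"] > GOOD_THRESHOLD else "bad"
        labelled.append((tweet, label))
    return labelled

now = datetime(2014, 9, 25, 12, 0)
sample = [
    {"created_at": datetime(2014, 9, 20), "retweet_count": 15},
    {"created_at": datetime(2014, 9, 20), "retweet_count": 3},
    {"created_at": datetime(2014, 9, 25, 11, 0), "retweet_count": 40},
]
labelled = label_tweets(sample, now)  # third tweet is skipped as too fresh
```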
- Extract the following features for
each tweet: the number of followers of the user who posted the
tweet, the age of the user account, the number of links in the
tweet, the number of words in the tweet, the number of hashtags in
the tweet.
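A sketch of this feature extraction step, assuming each tweet is a flat dict. The real Twitter API nests the user fields inside a user object, so the field names here are simplified stand-ins, and the link/hashtag patterns are rough approximations.

```python
import re
from datetime import datetime

# Sketch of the five features from the exercise, on a simplified tweet dict.
def extract_features(tweet, now):
    text = tweet["text"]
    return {
        "followers": tweet["user_followers"],
        "account_age_days": (now - tweet["user_created_at"]).days,
        "links": len(re.findall(r"https?://\S+", text)),
        "words": len(text.split()),
        "hashtags": len(re.findall(r"#\w+", text)),
    }

sample = {
    "text": "Check this out http://t.co/abc #python #nlp",
    "user_followers": 120,
    "user_created_at": datetime(2010, 1, 1),
}
features = extract_features(sample, datetime(2014, 9, 25))
```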
- Discuss why we do not use the number of retweets as a feature.
- Do you think a classifier trained on
these features will be good at classifying tweets only from within
the topic defined by the hashtags?
- Split the training set randomly into
50% for training and 50% for testing. Train both a naïve Bayes
classifier and a maximum entropy classifier. Report training and
test performance. Assuming independent errors, the binomial distribution gives the standard deviation of the error as sqrt(e*(1-e)/Ntest), where e is the error rate and Ntest is the number of test examples. Do you think the difference in error you observe is significant?
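The significance check can be sketched like this. The error rates below are hypothetical placeholders; substitute the numbers your two classifiers actually produce.

```python
import math

# Binomial standard deviation of an error rate e on n_test examples,
# as given in the exercise: sqrt(e*(1-e)/Ntest).
def error_std(e, n_test):
    return math.sqrt(e * (1 - e) / n_test)

n_test = 500
e_nb, e_maxent = 0.30, 0.27          # hypothetical test error rates

sd_nb = error_std(e_nb, n_test)
sd_maxent = error_std(e_maxent, n_test)

# Rough rule of thumb: only treat the gap as meaningful if it exceeds
# a couple of combined standard deviations.
difference = abs(e_nb - e_maxent)
combined_sd = math.sqrt(sd_nb ** 2 + sd_maxent ** 2)
significant = difference > 2 * combined_sd
```

Comparing the gap against roughly two combined standard deviations is an informal rule of thumb, not a formal hypothesis test; with these placeholder numbers the 3-point gap is within the noise.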
- Now we
have a system to select tweets to retweet and we can run A/B
testing to see whether it works in practice. You will retweet twice
per day, selecting from the tweets you received in the previous 24
hours. One tweet is the tweet you receive that is predicted to have
the highest probability of being good and the second is the tweet
with the highest number of retweets. (Make a rule for the case
where these two are the same.) You should set up a system to track how many of your followers retweet your retweets of each of the two types. Again, think about the timing of your retweeting.
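The daily selection step can be sketched as below. Here `probability_good` stands in for whatever score your classifier outputs, and the fallback tie rule is one possible choice, not prescribed by the exercise.

```python
# Sketch of the daily A/B pick from the last 24 hours' received tweets.
def pick_retweets(tweets):
    """Return (model_pick, baseline_pick).

    Tie rule (our choice): if both picks are the same tweet, the baseline
    slot falls back to the tweet with the second-highest retweet count.
    """
    by_prob = sorted(tweets, key=lambda t: t["probability_good"], reverse=True)
    by_retweets = sorted(tweets, key=lambda t: t["retweet_count"], reverse=True)
    model_pick = by_prob[0]
    baseline_pick = by_retweets[0]
    if baseline_pick is model_pick and len(by_retweets) > 1:
        baseline_pick = by_retweets[1]
    return model_pick, baseline_pick

last_24h = [
    {"id": 1, "probability_good": 0.9, "retweet_count": 50},
    {"id": 2, "probability_good": 0.2, "retweet_count": 10},
]
model_pick, baseline_pick = pick_retweets(last_24h)
```

In this sample both rules would pick tweet 1, so the tie rule kicks in and the baseline slot falls back to tweet 2.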