Machine Learning Project 3: Tweet Classifier

Introduction

 
This chapter demonstrates a tweet classifier based on the Naive Bayes algorithm.
 

Tweet Classifier

 
Let us take tweets from Twitter and build a classifier from them. The classifier will tell whether a given tweet falls under the category "Politics" or "Sports".
 
A basic example of tweet data, classified based on the text it contains, is shown below (Table 1).
 
Tweet Id Text Category
 294051752079159296  99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99?  Sports
 291019672701255681  On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl  Politics
 305581742104932352  Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM  Politics
 291520568396759041  10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. The final spot on the line #BBL02  Sports
 
I have the below "training" data from Twitter's feed (Table 2).
 
Tweet Id Category Text
 306624404287275009  Sports  99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99?
 306481199130505216  Sports  Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch
 304353716117590016  Politics  @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad
 304844614517547008  Politics  Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no CCTV/blame game and innocents dead.
 
Below is the unclassified tweet test data (Table 3).
 
Tweet Id Text
 301733794770190336  RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyond ...
 301576909517619200  RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo
 256056214880919553  What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs
 300248062209691648  Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr
 
I will classify the test data of Table 3 by using the Naive Bayes algorithm, trained on the data of Table 2, with the help of Python 3 code. Let us first extract the important words from the tweet text.
import string

def extract_tweet_words(tweet_words):
    # Keep only runs of letters and digits of length >= 2, lower-cased
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            # Any other character (punctuation, symbols, ...) ends the current word
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
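As a quick sanity check, the extractor can be run on a tokenized sentence. A minimal sketch, using a fragment of the Table 1 text, would look like this:

sample = "10 off the over, 10 required! Captain Faulkner to bowl the last over".split()
print(extract_tweet_words(sample))
# ['10', 'off', 'the', 'over', '10', 'required', 'captain', 'faulkner', 'to', 'bowl', 'the', 'last', 'over']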
Get the training data from the tweets.
def get_tweet_training_data():
    # Each line of training.txt holds a tweet id, its label, and the tweet text
    f = open('training.txt', 'r')
    training_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])
    f.close()
    return training_data
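The code above assumes training.txt is whitespace-separated, with the tweet id first, the label second, and the raw tweet text after that. Two such lines, built from Table 2, might look like this:

306481199130505216 Sports Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch
304353716117590016 Politics @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad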
Get the test data that will be classified.
def get_tweet_test_data():
    # Each line of test.txt holds a tweet id followed by the (unlabelled) tweet text
    f = open('test.txt', 'r')
    validation_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split(' ')
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        # The empty string is a placeholder for the label to be predicted
        validation_data.append([tweet_id, '', tweet_words])
    f.close()
    return validation_data
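test.txt is assumed to follow the same layout without the label column, for example (from Table 3):

256056214880919553 What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs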
Get a list of words in the training data.
def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
Get the probability of each word in the training data.
def get_tweet_word_prob(training_data, label = None):
    words = get_words(training_data)
    freq = {}

    # Start every count at 1 (add-one smoothing) so no word ends up with zero probability
    for word in words:
        freq[word] = 1

    # Count word occurrences, restricted to one label if given
    total_count = 0
    for data in training_data:
        if data[1] == label or label == None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    # Turn counts into (smoothed) relative frequencies
    prob = {}
    for word in freq.keys():
        prob[word] = freq[word]*1.0/total_count

    return prob
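To see what this returns, here is a toy run with made-up data; the exact dictionary order may vary:

# Toy data, purely for illustration
toy = [['1', 'Sports', ['goal', 'match']],
       ['2', 'Politics', ['vote', 'match']]]
print(get_tweet_word_prob(toy, 'Sports'))
# {'goal': 1.0, 'match': 1.0, 'vote': 0.5} -- 'vote' never occurs in a Sports tweet,
# yet it still gets a non-zero value because every count starts at 1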
Get the probability of a given label.
def get_tweet_label_count(training_data, label):
    # Fraction of training tweets that carry the given label (the class prior)
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count*1.0/total_count
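For instance, with the four training tweets of Table 2, two of which are Sports and two Politics, both priors come out as 2/4 = 0.5.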
Apply the Naive Bayes model as below.
def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        # Start each score from the class prior
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        # Multiply in the word probabilities; words unseen in training are skipped
        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]
            else:
                continue

        # Assign the class with the larger score
        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
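To see the decision rule at work, consider a made-up example: equal priors of 0.5, and a test tweet containing only the words "match" and "vote".

# All numbers below are invented purely to illustrate the decision rule
# P('match'|Sports) = 0.8, P('match'|Politics) = 0.2
# P('vote'|Sports)  = 0.1, P('vote'|Politics)  = 0.7
score_sports   = 0.5 * 0.8 * 0.1   # 0.04
score_politics = 0.5 * 0.2 * 0.7   # 0.07
# score_politics > score_sports, so this tweet would be labelled 'Politics'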
Write the labelled test data to an output file like below.
def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))
    f_out.close()
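Each line of test_labelled_output.txt will then contain a tweet id and its predicted category, for example:

301733794770190336 Politics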
Read the training and test data like below.
training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
Get the probability of each word.
word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
Get the probability of each label.
sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
Normalize for stop words.
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
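This step damps very common words such as "the" or "to": dividing each class-conditional word probability by the word's overall probability means a word that is roughly equally frequent in both classes contributes a factor close to 1 to either score, so it no longer dominates the decision. A rough sketch with made-up numbers:

# Made-up numbers: 'the' is common everywhere, 'cricket' is Sports-heavy
# P('the'|Sports) = 0.05,     P('the')     = 0.05  -> ratio 1.0 (little influence)
# P('cricket'|Sports) = 0.02, P('cricket') = 0.01  -> ratio 2.0 (pushes towards Sports)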
Label the test data and print it.
test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
An example of the output of this algorithm is shown below.
 
Tweet Id Category
 301733794770190336  Politics
 301576909517619200  Politics
 305057161682227200  Sports
 286543227178328066  Politics
 

Conclusion

 
In this chapter, you learned how to build a tweet classifier using the Naive Bayes algorithm.
Author
Gul Md Ershad