Classify Twitter's Tweets Based On Naive Bayes Algorithm

Introduction

 
Naive Bayes is one of the most interesting classification algorithms in Machine Learning. It is a probabilistic method and is a very successful algorithm for classifying text documents, though any kind of object can be classified once a probabilistic model is specified. The algorithm is based on Bayes' theorem and is, strictly speaking, not a single algorithm but a family of algorithms. It comes under the category of supervised learning: it predicts labels for new data based on a training dataset.
 
Earlier, I explained the K-Nearest Neighbour algorithm. Conceptually, K-NN is based on the Euclidean distance formula, whereas Naive Bayes is based on the concept of probability.
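Before building the classifier, it helps to see Bayes' theorem in action once with made-up numbers. The snippet below computes P(Sports | word) from a hypothetical word likelihood, class prior, and overall word frequency (all three numbers are invented purely for illustration):

```python
# Bayes' theorem: P(Sports | word) = P(word | Sports) * P(Sports) / P(word)
# All numbers below are hypothetical, for illustration only.
p_word_given_sports = 0.10  # e.g. "match" appears in 10% of Sports tweets
p_sports = 0.50             # prior: half of all tweets are Sports
p_word = 0.06               # "match" appears in 6% of all tweets

p_sports_given_word = p_word_given_sports * p_sports / p_word
print(round(p_sports_given_word, 3))  # 0.833
```

So even a single word can shift the probability of a class well above its prior; the Naive Bayes classifier simply multiplies such evidence across all the words in a tweet.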
 

Explanation

 
Let us take Twitter's tweets and build a classifier based on the given tweets. This classifier will tell whether a tweet falls under the category "Politics" or "Sports".
 
The tweet data will be classified based on the text each tweet contains; Table 1 shows a basic example.
 
Tweet Id | Text | Category
294051752079159296 | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99? | Sports
291019672701255681 | On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl | Politics
305581742104932352 | Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM | Politics
291520568396759041 | 10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. The final spot on the line #BBL02 | Sports
 
I have the below training data from Twitter's feed (Table 2).
 
Tweet Id | Category | Text
306624404287275009 | Sports | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99?
306481199130505216 | Sports | Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch
304353716117590016 | Politics | @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad
304844614517547008 | Politics | Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no CCTV/blame game and innocents dead.
 
Below is the tweet test data that is yet to be classified (Table 3).
 
Tweet Id | Text
301733794770190336 | RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyond ...
301576909517619200 | RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo
256056214880919553 | What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs
300248062209691648 | Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr
 
I will classify or categorize the test data of Table 3 by using the Naive Bayes algorithm, implemented in Python 3 below. First, let us extract the significant words from the tweeted sentences.
import string

def extract_tweet_words(tweet_words):
    # Split each token on non-alphanumeric characters and keep
    # lowercased runs of two or more letters/digits.
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
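As a quick sanity check, the same tokenization can be expressed more compactly with a regular expression. This is a sketch of an equivalent alternative, not the article's function: it keeps alphanumeric runs of two or more characters and lowercases them.

```python
import re

def extract_words_regex(tokens):
    # Equivalent to the character-by-character loop above: keep
    # alphanumeric runs of length >= 2, lowercased.
    words = []
    for token in tokens:
        words.extend(run.lower() for run in re.findall(r'[A-Za-z0-9]{2,}', token))
    return words

print(extract_words_regex(['99', 'days', 'to', 'go', '#ct13!']))
# ['99', 'days', 'to', 'go', 'ct13']
```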
Get the training data from the tweets file.

def get_tweet_training_data():
    f = open('training.txt', 'r')
    training_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])
    f.close()
    return training_data
Get the test data from the tweets that will be classified.

def get_tweet_test_data():
    f = open('test.txt', 'r')
    validation_data = []
    for l in f.readlines():
        l = l.strip()
        tweet_details = l.split(' ')
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        validation_data.append([tweet_id, '', tweet_words])
    f.close()
    return validation_data
Get the list of distinct words in the training data.

def get_words(training_data):
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
Get the probability of each word in the training data. Note that every count starts at 1 (Laplace smoothing), so a word that never occurs in a category still gets a small non-zero probability.

def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}
    for word in words:
        freq[word] = 1

    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count
    return prob
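On a toy training set, the counting scheme looks like this. The sketch below (a condensed re-statement of the same logic, with invented tweets) shows how the smoothing keeps unseen words non-zero:

```python
def word_prob(training_data, label=None):
    # Same scheme as get_tweet_word_prob: every word's count starts at 1
    # (Laplace smoothing) so unseen words never get probability zero.
    words = {w for _, _, tokens in training_data for w in tokens}
    freq = {w: 1 for w in words}
    total = 0
    for _, tweet_label, tokens in training_data:
        if label is None or tweet_label == label:
            total += len(tokens)
            for w in tokens:
                freq[w] += 1
    return {w: freq[w] / total for w in freq}

toy = [['1', 'Sports', ['match', 'over']],
       ['2', 'Politics', ['vote']]]
print(word_prob(toy, 'Sports')['match'])  # 1.0  (count 2 / 2 Sports tokens)
print(word_prob(toy, 'Sports')['vote'])   # 0.5  (smoothed count 1 / 2)
```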
Get the probability of a given label.

def get_tweet_label_count(training_data, label):
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count
Apply the Naive Bayes model as below.

def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]

        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
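One practical caveat: multiplying many small probabilities can underflow to zero for long tweets. A common remedy (my own variant, not part of the original code) is to sum log-probabilities instead; the comparison between the two classes is unchanged because the logarithm is monotonic:

```python
import math

def label_with_logs(test_data, sports_word_prob, politics_word_prob,
                    sports_prob, politics_prob):
    # Same decision rule as multiplying raw probabilities, but summing
    # logarithms instead, which avoids floating-point underflow.
    labels = []
    for tweet_id, _, words in test_data:
        log_sports = math.log(sports_prob)
        log_politics = math.log(politics_prob)
        for word in words:
            if word in sports_word_prob and word in politics_word_prob:
                log_sports += math.log(sports_word_prob[word])
                log_politics += math.log(politics_word_prob[word])
        label = 'Sports' if log_sports >= log_politics else 'Politics'
        labels.append([tweet_id, label, log_sports, log_politics])
    return labels

# A toy run with hypothetical probabilities:
print(label_with_logs([['42', '', ['match', 'match']]],
                      {'match': 0.5}, {'match': 0.1}, 0.5, 0.5)[0][1])  # Sports
```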
Print the labelled or categorized test data as below.

def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))
    f_out.close()
Read the training and test data as below.

training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
Get the probability of each word.

word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
Get the probability of each label.

sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
Normalize for stop words by dividing each class-conditional probability by the word's overall probability, so that words common to both categories contribute little to the final decision.

for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
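To see why this damps stop words, consider two hypothetical words (all frequencies below are invented for illustration). A stop word is about as frequent inside one class as it is overall, so its ratio lands near 1 and barely moves the product; a discriminative word keeps a ratio well above 1 for its class:

```python
# Hypothetical word frequencies, for illustration only.
overall = {'the': 0.050, 'wicket': 0.002}    # frequency across all tweets
in_sports = {'the': 0.052, 'wicket': 0.010}  # frequency within Sports tweets

for word in overall:
    ratio = in_sports[word] / overall[word]
    print(word, round(ratio, 2))
# the 1.04
# wicket 5.0
```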
Label the test data and print it.

test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
The output of this algorithm will look something like below.
 
Tweet Id | Category
301733794770190336 | Politics
301576909517619200 | Politics
305057161682227200 | Sports
286543227178328066 | Politics
                       
I have attached the complete Python code along with the test data, the training data, and the categorized/labelled output data. You can also generate the output yourself by running this Python code.
                       
Prerequisites for running this code:
1. Python 3.5
2. Jupyter Notebook (recommended)

                      Conclusion

                       
The Naive Bayes algorithm is based on probability and, as this example shows, it is well suited to labelling text data such as tweets.