# Classify Twitter's Tweets Based On Naive Bayes Algorithm

## Introduction

The Naive Bayes classification algorithm is a very interesting Machine Learning algorithm. It is a probabilistic method that has proven very successful for learning to classify text documents, although any kind of object can be classified given a probabilistic model specification. It is based on Bayes' theorem, and it is not a single algorithm but a family of algorithms. It falls under the category of supervised learning: it predicts labels for new data based on a training dataset.
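Concretely, for a class $C$ and a tweet containing words $w_1, \dots, w_n$, Bayes' theorem combined with the "naive" assumption that words are conditionally independent gives:

$$P(C \mid w_1, \dots, w_n) \;\propto\; P(C) \prod_{i=1}^{n} P(w_i \mid C)$$

The classifier computes this score for each class and picks the class with the higher value.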

Earlier, I explained the K-Nearest Neighbour (K-NN) algorithm. Conceptually, K-NN is based on the Euclidean distance formula, whereas Naive Bayes is based on the concept of probability.

### Explanation

Let us take Twitter tweets and build a classifier from them. This classifier will tell whether a given tweet falls under the category of Politics or of Sports.

A basic example of the tweet data to be classified, based on the text it contains, is shown in Table 1.

| Tweet Id | Text | Category |
|---|---|---|
| 294051752079159296 | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99? | Sports |
| 291019672701255681 | On Jan 10, PM #Abe received a courtesy call from Mr. Yoshihiro Murai, Governor of Miyagi Prefecture. http://t.co/EsyP40Gl | Politics |
| 305581742104932352 | Video of last week's hot topics: #2pack, #Draghi, pensions & #drug tests. @Europarltv video http://t.co/9GVBa315vM | Politics |
| 291520568396759041 | 10 off the over, 10 required! Captain Faulkner to bowl the last over, in close discussion with veteran Warne. The final spot on the line #BBL02 | Sports |

I have the following training data from Twitter's feed (Table 2).

| Tweet Id | Category | Text |
|---|---|---|
| 306624404287275009 | Sports | 99 days to go until the start of #ct13. Did you know Chris Gayle is the only player in the event's history to be dismissed for 99? |
| 306481199130505216 | Sports | Tonight's Scottish First Division match between Dumbarton and Raith Rovers has been postponed due to a frozen pitch |
| 304353716117590016 | Politics | @GSANetwork raises awareness & stands up to stop #LGBT #bullying in school & online. http://t.co/FWIG5vvVmi @glaad |
| 304844614517547008 | Politics | Blasts Deja Vu. How many times have we been in this *exact* moment? Failed or ignored intel/no CCTV/blame game and innocents dead. |

Below is the unclassified tweet test data (Table 3).

| Tweet Id | Text |
|---|---|
| 301733794770190336 | RT @aliwilgus: @tweetsoutloud How serious is NASA's commitment to the SLS and Orion programs, and the future of human space flight beyond ... |
| 301576909517619200 | RT @FardigJudith: This line in the President's State of the Union Address spoke to me. Check it out & share your #SOTU #CitizenRespo |
| 256056214880919553 | What is your favorite place to play badminton? Do you have a specific club in mind? Give them a shoutout! #badminton #clubs |
| 300248062209691648 | Sam wins the first game v Safarova #FedCup #AusvCze http://t.co/yjyZLnjr |

I will classify the test data of Table 3 by applying the Naive Bayes algorithm in Python 3 code, trained on the data of Table 2. Let us start by extracting the important words from the tweeted sentences:
```python
import string

def extract_tweet_words(tweet_words):
    """Extract lowercase alphanumeric words of length >= 2 from a list of tokens."""
    words = []
    alpha_lower = string.ascii_lowercase
    alpha_upper = string.ascii_uppercase
    numbers = [str(n) for n in range(10)]
    for word in tweet_words:
        cur_word = ''
        for c in word:
            # A non-alphanumeric character ends the current word.
            if (c not in alpha_lower) and (c not in alpha_upper) and (c not in numbers):
                if len(cur_word) >= 2:
                    words.append(cur_word.lower())
                cur_word = ''
                continue
            cur_word += c
        if len(cur_word) >= 2:
            words.append(cur_word.lower())
    return words
```
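A quick sanity check of the extractor on the first tweet of Table 1 (a hypothetical snippet, not part of the original script): hashtag symbols are stripped and words shorter than two characters are dropped.

```python
tokens = "99 days to go until the start of #ct13".split()
print(extract_tweet_words(tokens))
# ['99', 'days', 'to', 'go', 'until', 'the', 'start', 'ct13']
```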
Get the training data from the tweets file, where each line holds a tweet id, its label, and the tweet text:
```python
def get_tweet_training_data():
    # Each line of training.txt: '<tweet_id> <label> <text ...>'
    f = open('training.txt', 'r')
    training_data = []
    for l in f:
        l = l.strip()
        tweet_details = l.split()
        tweet_id = tweet_details[0]
        tweet_label = tweet_details[1]
        tweet_words = extract_tweet_words(tweet_details[2:])
        training_data.append([tweet_id, tweet_label, tweet_words])
    f.close()
    return training_data
```
Get the test data, i.e. the tweets that will be classified:
```python
def get_tweet_test_data():
    # Each line of test.txt: '<tweet_id> <text ...>'; the label slot stays empty.
    f = open('test.txt', 'r')
    validation_data = []
    for l in f:
        l = l.strip()
        tweet_details = l.split(' ')
        tweet_id = tweet_details[0]
        tweet_words = extract_tweet_words(tweet_details[1:])
        validation_data.append([tweet_id, '', tweet_words])
    f.close()
    return validation_data
```
Get the list of unique words in the training data:
```python
def get_words(training_data):
    # Collect the deduplicated vocabulary across all training tweets.
    words = []
    for data in training_data:
        words.extend(data[2])
    return list(set(words))
```
Get the probability of each word in the tweet training data, optionally restricted to one label:
```python
def get_tweet_word_prob(training_data, label=None):
    words = get_words(training_data)
    freq = {}

    # Start every count at 1 (Laplace smoothing) so no word ends up
    # with zero probability.
    for word in words:
        freq[word] = 1

    total_count = 0
    for data in training_data:
        if data[1] == label or label is None:
            total_count += len(data[2])
            for word in data[2]:
                freq[word] += 1

    prob = {}
    for word in freq.keys():
        prob[word] = freq[word] * 1.0 / total_count

    return prob
```
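A tiny illustration of the smoothing with made-up data (the `toy` list is hypothetical, not from the article's files): a word never seen under a label still gets a small non-zero probability, so a single unseen word cannot zero out a whole product later on.

```python
toy = [['1', 'Sports', ['goal', 'match']],
       ['2', 'Politics', ['vote']]]
print(get_tweet_word_prob(toy, 'Sports'))
# e.g. {'goal': 1.0, 'match': 1.0, 'vote': 0.5} (key order may vary);
# 'vote' never appears in a Sports tweet yet keeps probability 0.5, not 0.
```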
Get the prior probability of a given label:
```python
def get_tweet_label_count(training_data, label):
    # Return the fraction of training tweets carrying the given label
    # (the class prior).
    count = 0
    total_count = 0
    for data in training_data:
        total_count += 1
        if data[1] == label:
            count += 1
    return count * 1.0 / total_count
```
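For instance, with the four training tweets of Table 2 (two labelled Sports, two Politics), both priors come out to 0.5. A toy check with hypothetical data:

```python
toy = [['1', 'Sports', []], ['2', 'Sports', []],
       ['3', 'Politics', []], ['4', 'Politics', []]]
print(get_tweet_label_count(toy, 'Sports'))    # 0.5
print(get_tweet_label_count(toy, 'Politics'))  # 0.5
```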
Apply the Naive Bayes model as below:
```python
def label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob):
    labels = []
    for data in test_data:
        # Start each score from the class prior.
        data_prob_sports = sports_prob
        data_prob_politics = politics_prob

        # Multiply in the probability of every word seen in training;
        # unseen words are simply skipped.
        for word in data[2]:
            if word in sports_word_prob:
                data_prob_sports *= sports_word_prob[word]
                data_prob_politics *= politics_word_prob[word]

        if data_prob_sports >= data_prob_politics:
            labels.append([data[0], 'Sports', data_prob_sports, data_prob_politics])
        else:
            labels.append([data[0], 'Politics', data_prob_sports, data_prob_politics])

    return labels
```
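One caveat: multiplying many small probabilities can underflow to 0.0 on long inputs. A common remedy, sketched here as a hypothetical variant rather than part of the original script, is to sum log-probabilities instead; the decision rule is unchanged because the logarithm is monotonic.

```python
import math

def label_tweet_data_log(test_data, sports_word_prob, politics_word_prob,
                         sports_prob, politics_prob):
    # Same decision rule as label_tweet_data, but computed in log space
    # to avoid floating-point underflow.
    labels = []
    for data in test_data:
        log_sports = math.log(sports_prob)
        log_politics = math.log(politics_prob)
        for word in data[2]:
            if word in sports_word_prob:
                log_sports += math.log(sports_word_prob[word])
                log_politics += math.log(politics_word_prob[word])
        label = 'Sports' if log_sports >= log_politics else 'Politics'
        labels.append([data[0], label, log_sports, log_politics])
    return labels
```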
Print the labelled/categorized test data as below:
```python
def print_labelled_data(labels):
    f_out = open('test_labelled_output.txt', 'w')
    for [tweet_id, label, prob_sports, prob_politics] in labels:
        f_out.write('%s %s\n' % (tweet_id, label))
    f_out.close()
```
Read the training and test data:
```python
training_data = get_tweet_training_data()
test_data = get_tweet_test_data()
```
Get the probability of each word.
```python
word_prob = get_tweet_word_prob(training_data)
sports_word_prob = get_tweet_word_prob(training_data, 'Sports')
politics_word_prob = get_tweet_word_prob(training_data, 'Politics')
```
Get the probability of each label.
```python
sports_prob = get_tweet_label_count(training_data, 'Sports')
politics_prob = get_tweet_label_count(training_data, 'Politics')
```
Normalize for stop words. Dividing each class-conditional word probability by the word's overall probability down-weights words that appear about equally often in both classes (such as stop words), so the classification is driven by distinctive words:
```python
for (word, prob) in word_prob.items():
    sports_word_prob[word] /= prob
    politics_word_prob[word] /= prob
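The reason this works: after the division, each word contributes the ratio below rather than its raw class-conditional probability, and by Bayes' theorem that ratio measures how much the word shifts belief toward a class. A stop word that is equally common in both classes contributes a factor close to 1 to both scores.

$$\frac{P(w \mid C)}{P(w)} = \frac{P(C \mid w)}{P(C)}$$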
Label the test data and print it.
```python
test_labels = label_tweet_data(test_data, sports_word_prob, politics_word_prob, sports_prob, politics_prob)
print_labelled_data(test_labels)
```
The output of this algorithm will look something like the below.

| Tweet Id | Category |
|---|---|
| 301733794770190336 | Politics |
| 301576909517619200 | Politics |
| 305057161682227200 | Sports |
| 286543227178328066 | Politics |

I have attached the complete Python code, along with the test data, training data, and the categorized/labelled output data. You can also generate the output data yourself by running this Machine Learning code in Python.

Prerequisites for running this code:
1. Python 3.5
2. Jupyter Notebook (optional, but convenient)

## Conclusion

The Naive Bayes algorithm is based on probability, and it is very well suited to labelling text data such as these tweets.