# Machine Learning: Naive Bayes

## Introduction

In the previous chapter, we studied decision trees.

In this chapter, we will study Naive Bayes.

Note: a concept is easier to understand when you can relate it to yourself or your own life, so try to connect each idea to everyday human experience.

## Key Terms

1. Coin
A coin has two sides, head and tail. If an experiment involves more than one coin, the coins are considered distinct unless otherwise stated.

2. Die
A die has six faces marked 1, 2, 3, 4, 5, and 6. If we have more than one die, all dice are considered distinct unless otherwise stated.

3. Playing Cards
A pack of playing cards has 52 cards. There are 4 suits (spade, heart, diamond, and club), each having 13 cards. There are two colors, red (heart and diamond) and black (spade and club), each having 26 cards. Among the 13 cards of each suit there are 3 face cards, namely king, queen, and jack, so there are 12 face cards in all. There are also 16 honor cards, 4 of each suit, namely ace, king, queen, and jack.

### Types of Experiments

1. Deterministic Experiment: An experiment that produces the same result or outcome whenever it is repeated under identical conditions is known as a deterministic experiment.

2. Probabilistic/Random Experiment: An experiment that, when repeated under identical conditions, does not produce the same outcome every time, but whose outcome in any trial is one of several possible outcomes, is called a random experiment.

### Important Definitions

(i) Trial
When a random experiment is repeated under identical conditions, each repetition is called a trial.
(ii) Sample Space
The set of all possible outcomes of an experiment is called the sample space of the experiment, and it is denoted by S.
(iii) Event
A subset of the sample space associated with a random experiment is called an event or case.
(iv) Sample Points
The outcomes of an experiment are called sample points.
(v) Certain Event
An event that must occur, whatever the outcome, is called a certain or sure event.
(vi) Impossible Event
An event that cannot occur in a particular random experiment is called an impossible event.
(vii) Elementary Event
An event containing only one sample point is called an elementary or indecomposable event.
(viii) Favorable Event
Let S be the sample space associated with a random experiment and let E ⊂ S. Then the elementary events belonging to E are known as the events favorable to E.
(ix) Compound Events
An event containing more than one sample point is called a compound or decomposable event.

### Probability

If there are n elementary events associated with a random experiment and m of them are favorable to an event A, then the probability of occurrence of A, denoted by P(A), is given by

P(A) = m / n = (number of favorable cases) / (total number of possible cases)

This classical definition assumes that the n elementary events are equally likely.
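
The definition above can be tried out on a small sample space. The following is a minimal sketch, assuming a single roll of a fair die and an illustrative event "the roll is even"; the helper name `probability` is hypothetical and not part of any library.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (all outcomes equally likely)
sample_space = {1, 2, 3, 4, 5, 6}

# Event A: the roll is an even number
event_a = {x for x in sample_space if x % 2 == 0}


def probability(event, space):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event & space), len(space))


p_a = probability(event_a, sample_space)
print(p_a)  # 1/2
```

Using `Fraction` keeps m / n exact, which makes it easy to compare results with hand calculations.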

### Types of Events

(i) Equally Likely Events
The given events are said to be equally likely if none of them is expected to occur in preference to the other.

(ii) Mutually Exclusive Events
A set of events is said to be mutually exclusive if the happening of one excludes the happening of the other. If A and B are mutually exclusive, then P(A ∩ B) = 0

(iii) Exhaustive Events
A set of events is said to be exhaustive if the performance of the experiment always results in the occurrence of at least one of them. If E1, E2, …, En are exhaustive events, then E1 ∪ E2 ∪ … ∪ En = S, i.e., P(E1 ∪ E2 ∪ E3 ∪ … ∪ En) = 1

(iv) Independent Events
Two events A and B associated with a random experiment are independent if the probability of occurrence or non-occurrence of A is not affected by the occurrence or non-occurrence of B, i.e., P(A ∩ B) = P(A) P(B)

### The complement of an Event

Let A be an event in a sample space S. The complement of A is the set of all sample points of the space other than the sample points in A. It is denoted by A’ or Ā, and A’ = {n : n ∈ S, n ∉ A}
(i) A ∪ A’ = S
(ii) A ∩ A’ = φ
(iii) (A’)’ = A

### Partition of a Sample Space

The events A1, A2,…., An represent a partition of the sample space S, if they are pairwise disjoint, exhaustive and have non-zero probabilities. i.e.,
(i) Ai ∩ Aj = φ; i ≠ j; i,j= 1,2, …. ,n
(ii) A1 ∪ A2 ∪ … ∪ An = S
(iii) P(Ai) > 0, ∀ i = 1,2, …. ,n

### Important Results on Probability

(i) If a set of events A1, A2, …, An are mutually exclusive, then
Ai ∩ Aj = φ for i ≠ j (and hence A1 ∩ A2 ∩ A3 ∩ … ∩ An = φ), and
P(A1 ∪ A2 ∪ A3 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)

(ii) If a set of events A1, A2,…., An are exhaustive, then P(A1 ∪ A2 ∪ … ∪ An) = 1

(iii) The probability of an impossible event is 0, i.e., P(A) = 0 if A is an impossible event.

(iv) The probability of a sure event is 1, i.e., P(S) = 1. For any event A, 0 ≤ P(A) ≤ 1.

(v) Odds in favour of A = P(A) / P(A’)

(vi) Odds against A = P(A’) / P(A)

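(vii) Addition theorem
(a) For two events A and B
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
(b) For three events A, B and C
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(B ∩ C) – P(A ∩ C) + P(A ∩ B ∩ C)
(c) For n events A1, A2, …, An
P(A1 ∪ A2 ∪ … ∪ An) = Σ P(Ai) – Σ P(Ai ∩ Aj) + Σ P(Ai ∩ Aj ∩ Ak) – … + (–1)^(n–1) P(A1 ∩ A2 ∩ … ∩ An), where the sums run over i, over i < j, over i < j < k, and so on.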

(viii) If A and B are two events, then P(A ∩ B) ≤ P(A) ≤ P(A ∪ B) ≤ P(A) + P(B)

(ix) If A and B are two events associated with a random experiment, then
(a) P(A’ ∩ B) = P(B) – P(A ∩ B)
(b) P(A ∩ B’) = P(A) – P(A ∩ B)
(c) P[(A ∩ B’) ∪ (A’ ∩ B)] = P(A) + P(B) – 2P(A ∩ B)
(d) P(A’ ∩ B’) = 1 – P(A ∪ B)
(e) P(A’ ∪ B’) = 1 – P(A ∩ B)
(f) P(A) = P(A ∩ B) + P(A ∩ B’)
(g) P(B) = P(A ∩ B) + P(B ∩ A’)

(x)
(a) P (exactly one of A, B occurs) = P(A) + P(B) – 2P(A ∩ B) = P(A ∪ B) – P(A ∩ B)
(b) P(neither A nor B) = P(A’ ∩ B’) = 1 – P(A ∪ B)

(xi) If A, B and C are three events, then
(a) P(exactly one of A, B, C occurs) = P(A) + P(B) + P(C) – 2P(A ∩ B) – 2P(B ∩ C) – 2P(A ∩ C) + 3P(A ∩ B ∩ C)
(b) P(at least two of A, B, C occur) = P(A ∩ B) + P(B ∩ C) + P(C ∩ A) – 2P(A ∩ B ∩ C)
(c) P(exactly two of A, B, C occur) = P(A ∩ B) + P(B ∩ C) + P(A ∩ C) – 3P(A ∩ B ∩ C)

(xii)
(a) P(A ∪ B) = P(A) + P(B), if A and B are mutually exclusive events.
(b) P(A ∪ B ∪ C) = P(A) + P(B) + P(C), if A, B and C are mutually exclusive events.

(xiii) P(A’) = 1 – P(A)

(xiv) P(A ∪ A’) = P(S) = 1, P(φ) = 0

(xv) P(A ∩ B) = P(A) × P(B), if A and B are independent events.

(xvi) If A1, A2, …, An are independent events associated with a random experiment, then the probability of occurrence of at least one of them
= P(A1 ∪ A2 ∪ … ∪ An)
= 1 – P(A1’ ∩ A2’ ∩ … ∩ An’)
= 1 – P(A1’) P(A2’) … P(An’)

(xvii) If B ⊆ A, then P(A ∩ B’) = P(A) – P(B)
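
Several of the results in this list can be checked by brute-force enumeration. The sketch below assumes a roll of two distinct fair dice and two illustrative events chosen only for this check; `P`, `A`, and `B` are hypothetical names. In the comments, A' stands for the complement of A.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two distinct fair dice (36 outcomes)
space = set(product(range(1, 7), repeat=2))


def P(event):
    """Classical probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(space))


A = {o for o in space if o[0] % 2 == 0}     # first die shows an even number
B = {o for o in space if o[0] + o[1] >= 9}  # the sum is at least 9
not_A = space - A
not_B = space - B

# Addition theorem for two events
assert P(A | B) == P(A) + P(B) - P(A & B)
# Complement rule: P(A') = 1 - P(A)
assert P(not_A) == 1 - P(A)
# P(A' ∩ B) = P(B) - P(A ∩ B)
assert P(not_A & B) == P(B) - P(A & B)
# P(neither A nor B) = P(A' ∩ B') = 1 - P(A ∪ B)
assert P(not_A & not_B) == 1 - P(A | B)
# P(exactly one of A, B occurs) = P(A) + P(B) - 2 P(A ∩ B)
assert P(A ^ B) == P(A) + P(B) - 2 * P(A & B)

print(P(A | B))  # 11/18
```

Python's set operators map directly onto the notation: `|` is union, `&` is intersection, `-` is set difference, and `^` is the symmetric difference ("exactly one").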

### Conditional Probability

Let A and B be two events associated with a random experiment. The probability of occurrence of event A under the condition that B has already occurred, with P(B) ≠ 0, is called the conditional probability of A given B,
i.e., P(A / B) = P(A ∩ B) / P(B)

If A has already occurred and P(A) ≠ 0, then
P(B / A) = P(A ∩ B) / P(A)

Also, P(A / B) + P(A’ / B) = 1
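
As a quick illustration of the definition, here is a minimal sketch that assumes two distinct fair dice and two events picked only for this example; the helper `P_given` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two distinct fair dice
space = set(product(range(1, 7), repeat=2))


def P(event):
    """Classical probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(space))


def P_given(a, b):
    """Conditional probability P(A / B) = P(A ∩ B) / P(B), for P(B) != 0."""
    if not b:
        raise ValueError("the conditioning event must have non-zero probability")
    return P(a & b) / P(b)


A = {o for o in space if o[0] + o[1] == 8}  # the sum is 8
B = {o for o in space if o[0] % 2 == 0}     # the first die is even

print(P_given(A, B))          # 1/6
print(P_given(space - A, B))  # 5/6, so the two values add up to 1
```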

### Multiplication Theorem on Probability

(i) If A and B are two events associated with a random experiment, then
P(A ∩ B) = P(A) P(B / A), if P(A) ≠ 0
OR
P(A ∩ B) = P(B) P(A / B), if P(B) ≠ 0

(ii) If A1, A2, …, An are n events associated with a random experiment, then
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2 / A1) P(A3 / (A1 ∩ A2)) … P(An / (A1 ∩ A2 ∩ A3 ∩ … ∩ An–1))

### Total Probability

Let S be the sample space and let E1, E2,…., En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En then

P(A) = P(E1)P(A / E1) + P(E2)P(A / E2) + … + P(En) P(A / En)

### Bayes’ Theorem

Let S be the sample space and let E1, E2, …, En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En, then the probability of occurrence of Ei, given that A has occurred, is

P(Ei / A) = P(Ei) P(A / Ei) / [P(E1) P(A / E1) + P(E2) P(A / E2) + … + P(En) P(A / En)], i = 1, 2, …, n

where,
1. P(Ei), i = 1, 2, …, n are known as the prior probabilities
2. P(A / Ei), i = 1, 2, …, n are called the likelihood probabilities
3. P(Ei / A), i = 1, 2, …, n are called the posterior probabilities

OR, for two events,

P(A ∣ B) = P(B ∣ A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.
1. P(A ∣ B) is a conditional probability: the likelihood of event A occurring given that B is true.
2. P(B ∣ A) is also a conditional probability: the likelihood of event B occurring given that A is true.
3. P(A) and P(B) are the probabilities of observing A and B independently of each other; these are known as the marginal probabilities.
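
To see how priors, likelihoods, and posteriors fit together, here is a minimal sketch. The scenario (three machines E1, E2, E3 and an event A meaning "the item is defective") and all of its numbers are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical priors P(Ei): share of items produced by each machine
priors = {"E1": Fraction(5, 10), "E2": Fraction(3, 10), "E3": Fraction(2, 10)}

# Hypothetical likelihoods P(A / Ei): defect rate of each machine
likelihoods = {"E1": Fraction(1, 100), "E2": Fraction(2, 100), "E3": Fraction(3, 100)}

# Total probability: P(A) = sum over i of P(Ei) P(A / Ei)
p_a = sum(priors[e] * likelihoods[e] for e in priors)

# Bayes' theorem: P(Ei / A) = P(Ei) P(A / Ei) / P(A)
posteriors = {e: priors[e] * likelihoods[e] / p_a for e in priors}

print(p_a)               # 17/1000
print(posteriors["E1"])  # 5/17
```

Because E1, E2, and E3 are mutually exclusive and exhaustive, the posteriors add up to 1, which is a handy sanity check.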

### Random Variable

Let U or S be a sample space associated with a given random experiment. A real-valued function X defined on U or S, i.e.,

X: U → R

is called a random variable.

There are two types of random variables.

(i) Discrete Random Variable
If the range of the real function X: U → R is a finite or countably infinite set of real numbers, X is called a discrete random variable. E.g., in tossing two coins, S = {HH, HT, TH, TT}; let X denote the number of heads, then X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0.

(ii) Continuous Random Variable
If the range of X is an interval (a, b) of R, then X is called a continuous random variable.

### Probability Distribution

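If a random variable X takes values X1, X2, …, Xn with respective probabilities P1, P2, …, Pn, then

| X | X1 | X2 | … | Xn |
|---|---|---|---|---|
| P(X) | P1 | P2 | … | Pn |

is known as the probability distribution of X. In other words, a probability distribution gives the values of the random variable along with the corresponding probabilities, where each Pi ≥ 0 and P1 + P2 + … + Pn = 1.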
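
A probability distribution can be built directly from a sample space. The sketch below assumes two fair coins and takes X to be the number of heads; the names `X` and `distribution` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for tossing two distinct fair coins
space = list(product("HT", repeat=2))  # HH, HT, TH, TT


def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count("H")


# Count how many outcomes map to each value of X
counts = Counter(X(o) for o in space)

# Probability distribution: value of X -> probability
distribution = {x: Fraction(c, len(space)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```

The probabilities are 1/4, 1/2, and 1/4 for 0, 1, and 2 heads, and they add up to 1.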

### Mathematical Expectation/Mean

If X is a discrete random variable which assumes values X1, X2, …, Xn with respective probabilities P1, P2, …, Pn, then the mean X̄ of X is defined as
E(X) = X̄ = P1X1 + P2X2 + … + PnXn = Σ PiXi (sum over i = 1, …, n)

Important Results

(i) Variance V(X) = σx² = E(X²) – (E(X))²
where E(X²) = Σ xi² P(xi) (sum over i = 1, …, n)

(ii) Standard Deviation σx = √V(X) = √(E(X²) – (E(X))²)

(iii) If Y = aX + b, then
(a) E(Y) = E(aX + b) = aE(X) + b
(b) σy² = V(Y) = a²V(X) = a²σx²
(c) σy = √V(Y) = |a|σx

(iv) If Z = aX² + bX + c, then
E(Z) = E(aX² + bX + c) = aE(X²) + bE(X) + c
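
These results can be verified numerically. The following sketch assumes X is the number of heads in two fair coin tosses and uses arbitrary constants a = 3 and b = 2; the helper names `expectation` and `variance` are hypothetical.

```python
from fractions import Fraction

# Distribution of X = number of heads in two fair coin tosses
dist_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def expectation(dist, g=lambda v: v):
    """E(g(X)) = sum of g(x_i) * P(x_i) over all values x_i."""
    return sum(p * g(v) for v, p in dist.items())


def variance(dist):
    """V(X) = E(X^2) - (E(X))^2."""
    return expectation(dist, lambda v: v * v) - expectation(dist) ** 2


# Linear transformation Y = aX + b (a and b chosen arbitrarily)
a, b = 3, 2
dist_y = {a * v + b: p for v, p in dist_x.items()}

# E(Y) = a E(X) + b and V(Y) = a^2 V(X)
assert expectation(dist_y) == a * expectation(dist_x) + b
assert variance(dist_y) == a ** 2 * variance(dist_x)

print(expectation(dist_x), variance(dist_x))  # 1 1/2
print(expectation(dist_y), variance(dist_y))  # 5 9/2
```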

## Bayes' Theorem Explained Using an Example

Let us try to understand the above formula through an example:

Question

Talking about C-SharpCorner: the chance that a visitor re-visits the website is 60%, and the chance that a visitor likes a particular article is 75%. Among visitors who like the article, the chance of coming back is 75%. We need to find the probability that a person liked the article given that he/she does not re-visit the website.

Solution

A: A person re-visits the website
B: A person likes an article

So, P(A) = 0.6, P(A') = 0.4
P(B) = 0.75, P(B') = 0.25
P(A|B) = 0.75, P(A'|B) = 0.25

By Bayes' theorem,
P(B|A') = P(A'|B) * P(B) / P(A')
= (0.25 * 0.75) / 0.4
= 0.46875

So, from the above calculation, about 47% of the people who do not re-visit the website nevertheless liked the article.
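
The arithmetic in the solution can be reproduced in a few lines. This is only a check of the numbers used above; the variable names are illustrative. Note that the quantity the formula evaluates is P(B|A'), the probability that a person liked the article given that he/she did not re-visit.

```python
# Given quantities from the example
p_a = 0.6           # P(A): a person re-visits the website
p_b = 0.75          # P(B): a person likes the article
p_a_given_b = 0.75  # P(A|B): re-visits, given that the article was liked

# Complements
p_not_a = 1 - p_a                  # P(A')
p_not_a_given_b = 1 - p_a_given_b  # P(A'|B)

# Bayes' theorem: P(B|A') = P(A'|B) * P(B) / P(A')
p_b_given_not_a = p_not_a_given_b * p_b / p_not_a

print(round(p_b_given_not_a, 5))  # 0.46875
```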

## When is Naive Bayes Classifier Used?

1. Real-time prediction
Naive Bayes is fast to train and to evaluate, and its probabilities can be updated as new data arrives, so it is well suited to real-time predictions.

2. Multi-class prediction
A Naive Bayes algorithm can predict the probability of each class of a multi-class target variable.

3. Recommendation system
A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation system. Such a system uses data mining and machine learning techniques to filter information the user has not seen before and predict whether the user would appreciate a given resource.

4. Text classification/ Sentiment Analysis/ Spam Filtering
Because it handles multi-class problems well and relies on the independence assumption, the Naive Bayes algorithm often achieves a high success rate in text classification. It is therefore widely used in sentiment analysis and spam filtering.

## Difference between Bayes and Naive Bayes Algorithm

The naive Bayes classifier is an approximation to the Bayes classifier, in which we assume that the features are conditionally independent given the class instead of modeling their full conditional distribution given the class.

A Bayes classifier is best interpreted as a decision rule. Suppose we seek to estimate the class of ("classify") an observation given a vector of features. Denote the class by C and the vector of features by (F1, F2, …, Fk). Given a probability model underlying the data (that is, given the joint distribution of (C, F1, F2, …, Fk)), the Bayes classification function chooses a class by maximizing the probability of the class given the observed features:

argmax_c P(C = c ∣ F1 = f1, …, Fk = fk)
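
The step from the Bayes decision rule to the naive Bayes rule can be written out as a short derivation. It uses only Bayes' theorem and the conditional-independence assumption stated above; the symbol ĉ for the predicted class is introduced here for illustration.

```latex
\begin{aligned}
\hat{c} &= \arg\max_{c}\; P(C=c \mid F_1=f_1,\dots,F_k=f_k) \\
        &= \arg\max_{c}\; \frac{P(C=c)\,P(F_1=f_1,\dots,F_k=f_k \mid C=c)}{P(F_1=f_1,\dots,F_k=f_k)} \\
        &= \arg\max_{c}\; P(C=c)\,P(F_1=f_1,\dots,F_k=f_k \mid C=c) \\
        &\approx \arg\max_{c}\; P(C=c)\prod_{i=1}^{k} P(F_i=f_i \mid C=c)
\end{aligned}
```

The denominator is dropped in the third line because it does not depend on c. The last line is where the naive assumption enters: the joint likelihood is replaced by a product of per-feature likelihoods.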

## Assumptions of Naive Bayes Algorithm

1. All the features are independent of one another given the class; that is, there are no dependencies between any of the features.
2. Each feature is given equal importance (equal weight).

The following points help when preparing data for Naive Bayes:

1. Categorical Inputs
Naive Bayes assumes label attributes, such as binary, categorical, or nominal.
2. Gaussian Inputs
If the input variables are real-valued, a Gaussian distribution is assumed. In this case, the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g., values that are more than 3 or 4 standard deviations from the mean).
3. Classification Problems
Naive Bayes works best with binary and multiclass classification.
4. Log Probabilities
The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. We should use a log transform of the probabilities to avoid numerical underflow.
5. Kernel Functions
Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used, such as a variety of kernel density functions.
6. Update Probabilities
When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
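
The point about log probabilities is easy to demonstrate. The sketch below assumes 100 per-feature likelihoods that are all equal to 1e-5, a made-up value chosen only to force underflow.

```python
import math

# Hypothetical per-feature likelihoods: 100 small probabilities
likelihoods = [1e-5] * 100

# Naive product: the true value 1e-500 is far below the smallest
# positive double, so the running product underflows to 0.0
product = 1.0
for p in likelihoods:
    product *= p

# Sum of logs: the same quantity on a log scale stays finite
log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # 0.0
print(log_sum)  # approximately -1151.29
```

Since the logarithm is monotonic, comparing classes by their summed log probabilities picks the same class as comparing the products would, without losing everything to underflow.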

## Types of Naive Bayes Algorithm

### 1. Multinomial Naive Bayes

This is mostly used for document classification problems, i.e., deciding whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
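
A minimal sketch of the idea follows. The four-document corpus is made up, and the add-one (Laplace) smoothing is an assumption of this sketch rather than something described above; it simply avoids zero probabilities for words unseen in a class.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training corpus: (document, category)
train = [
    ("goal match team win", "sports"),
    ("team player score goal", "sports"),
    ("election vote party", "politics"),
    ("party leader vote win", "politics"),
]

class_docs = Counter()              # number of documents per class
word_counts = defaultdict(Counter)  # word frequencies per class
vocabulary = set()

for text, label in train:
    class_docs[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocabulary.add(word)


def predict(text):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label in class_docs:
        # log prior: fraction of training documents in this class
        score = math.log(class_docs[label] / sum(class_docs.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # add-one (Laplace) smoothing: an assumption of this sketch
            count = word_counts[label][word] + 1
            score += math.log(count / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("team goal"))   # sports
print(predict("vote party"))  # politics
```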

### 2. Bernoulli Naive Bayes

This is similar to multinomial naive Bayes, but the predictors are boolean variables. The features used to predict the class take only the values yes or no, for example, whether a word occurs in the text or not.

### 3. Gaussian Naive Bayes

When the predictors take continuous values rather than discrete ones, we assume that these values are sampled from a Gaussian distribution.

### 4. Semi-supervised parameter estimation

Given a way to train a naive Bayes classifier from labeled data, it is possible to construct a semi-supervised training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop.

## What is Naive Bayes?

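Let us understand Naive Bayes through an example. The table below summarizes a training set of 1000 fruits by type and by three features:

| Type | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | Total |
|---|---|---|---|---|---|---|---|
| Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500 |
| Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300 |
| Other | 100 | 100 | 150 | 50 | 50 | 150 | 200 |
| Total | 500 | 500 | 650 | 350 | 800 | 200 | 1000 |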

So the objective of the classifier is to predict whether a given fruit is a ‘Banana’, an ‘Orange’, or ‘Other’ when only the 3 features (long, sweet, and yellow) are known.

To predict this for a fruit that is long, sweet, and yellow, we need three kinds of probabilities. Let's start.

1. We first calculate the "prior" probabilities for each class of fruit
P[Y=Banana] = 500/1000 = 0.5
P[Y=Orange] = 300/1000 = 0.3
P[Y=Other] = 200/1000 = 0.2

2. We then compute the probability of the evidence, which goes in the denominator
P[x1=Long] = 500/1000 = 0.5
P[x2=Sweet] = 650/1000 = 0.65
P[x3=Yellow] = 800/1000 = 0.8

3. Now we calculate the likelihood of the evidence, which goes in the numerator
P[x1=Long | Y=Banana] = 400/500 = 0.8
P[x2=Sweet | Y=Banana] = 350/500 = 0.7
P[x3=Yellow | Y=Banana] = 450/500 = 0.9

4. At the end, we substitute all the values into the Naive Bayes formula.

a. P(Banana | Long, Sweet and Yellow)
= ((P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana)) * P(Banana)) / (P(Long) * P(Sweet) * P(Yellow))
= ((0.8 * 0.7 * 0.9) * 0.5) / (0.5 * 0.65 * 0.8)
= 0.97

b. P(Orange | Long, Sweet and Yellow)
= ((P(Long | Orange) * P(Sweet | Orange) * P(Yellow | Orange)) * P(Orange)) / (P(Long) * P(Sweet) * P(Yellow))
= 0, because P(Long | Orange) = 0/300 = 0

c. P(Other | Long, Sweet and Yellow)
= ((P(Long | Other) * P(Sweet | Other) * P(Yellow | Other)) * P(Other)) / (P(Long) * P(Sweet) * P(Yellow))
= ((0.5 * 0.75 * 0.25) * 0.2) / (0.5 * 0.65 * 0.8)
= 0.07

(The three values do not sum exactly to 1 because the denominator is itself approximated under the independence assumption.)

So, from the Naive Bayes classifier, we predict that the fruit is a Banana.
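
The hand calculation above can be reproduced with a short script. This is a minimal sketch that hard-codes the counts from the fruit table; the dictionary layout and variable names are illustrative choices.

```python
# Counts from the fruit table (only the columns needed for the query)
counts = {
    "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Orange": {"Long": 0, "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other": {"Long": 100, "Sweet": 150, "Yellow": 50, "Total": 200},
}
features = ["Long", "Sweet", "Yellow"]
n = sum(c["Total"] for c in counts.values())  # 1000 fruits in total

# Evidence: P(Long) * P(Sweet) * P(Yellow), treated as independent
evidence = 1.0
for f in features:
    evidence *= sum(c[f] for c in counts.values()) / n

scores = {}
for fruit, c in counts.items():
    prior = c["Total"] / n               # P(Y = fruit)
    likelihood = 1.0
    for f in features:
        likelihood *= c[f] / c["Total"]  # P(feature | Y = fruit)
    scores[fruit] = prior * likelihood / evidence

prediction = max(scores, key=scores.get)
for fruit, score in scores.items():
    print(fruit, round(score, 2))
print("Predicted:", prediction)
```

Since the denominator is the same for every class, it could be skipped when only the predicted class is needed.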

## Python Implementation of Naive Bayes

Let's take the example of the Iris dataset. You can import it directly from the sklearn dataset repository. Feel free to use any dataset; there are some very good datasets available on Kaggle and with Google Colab.

### 1. Using functions

```python
from csv import reader
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print('[%s] => %d' % (value, i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# Calculate the (sample) standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)

# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    # drop the summary of the class column
    del(summaries[-1])
    return summaries

# Split dataset by class then calculate statistics for each class
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    # the row count of each class is stored in every column summary
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        # start from the prior P(class)
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

# Make a prediction with Naive Bayes on Iris Dataset
filename = 'iris.csv'
dataset = load_csv(filename)
# convert the feature columns (all but the last) to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# fit model
model = summarize_by_class(dataset)
# define a new record
row = [5.7,2.9,4.2,1.3]
# predict the label
label = predict(model, row)
print('Data=%s, Predicted: %s' % (row, label))
```

The output that I am getting is shown below. The label-to-integer mapping can differ between runs, because it is built from a set.

[Iris-versicolor] => 0
[Iris-setosa] => 1
[Iris-virginica] => 2

Data=[5.7, 2.9, 4.2, 1.3], Predicted: 0

### 2. Using Sklearn

```python
from sklearn import datasets

# Load the wine dataset
wine = datasets.load_wine()

# print the names of the 13 features
print("Features: ", wine.feature_names)

# print the label type of wine (class_0, class_1, class_2)
print("Labels: ", wine.target_names)

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into the training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=109)

# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian classifier
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = gnb.predict(X_test)

print("y_pred: ", y_pred)

# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

The output that I got is:

Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

Labels: ['class_0' 'class_1' 'class_2']

y_pred: [0 0 1 2 0 1 0 0 1 0 2 2 2 2 0 1 1 0 0 1 2 1 0 2 0 0 1 2 0 1 2 1 1 0 1 1 0 2 2 0 2 1 0 0 0 2 2 0 1 1 2 0 0 2]

Accuracy: 0.9074074074074074

### 3. Using TensorFlow

```python
# Note: this script uses the TensorFlow 1.x API (tf.Session, tf.distributions).
from matplotlib import pyplot as plt
from sklearn import datasets
import numpy as np
import tensorflow as tf


class TFNaiveBayesClassifier:
    dist = None

    def fit(self, X, y):
        # Separate training points by class (nb_classes * nb_samples * nb_features)
        unique_y = np.unique(y)
        points_by_class = np.array([
            [x for x, t in zip(X, y) if t == c]
            for c in unique_y])

        # Estimate mean and variance for each class / feature
        # shape: nb_classes * nb_features
        mean, var = tf.nn.moments(tf.constant(points_by_class), axes=[1])

        # Create a 3x2 univariate normal distribution with the
        # known mean and variance
        self.dist = tf.distributions.Normal(loc=mean, scale=tf.sqrt(var))

    def predict(self, X):
        assert self.dist is not None
        nb_classes, nb_features = map(int, self.dist.scale.shape)

        # Conditional probabilities log P(x|c) with shape
        # (nb_samples, nb_classes)
        cond_probs = tf.reduce_sum(
            self.dist.log_prob(
                tf.reshape(
                    tf.tile(X, [1, nb_classes]), [-1, nb_classes, nb_features])),
            axis=2)

        # uniform priors
        priors = np.log(np.array([1. / nb_classes] * nb_classes))

        # posterior log probability, log P(c) + log P(x|c)
        joint_likelihood = tf.add(priors, cond_probs)

        # normalize to get (log)-probabilities
        norm_factor = tf.reduce_logsumexp(
            joint_likelihood, axis=1, keepdims=True)
        log_prob = joint_likelihood - norm_factor
        # exp to get the actual probabilities
        return tf.exp(log_prob)


if __name__ == '__main__':
    iris = datasets.load_iris()
    # Only take the first two features
    X = iris.data[:, :2]
    y = iris.target

    tf_nb = TFNaiveBayesClassifier()
    tf_nb.fit(X, y)

    # Create a regular grid and classify each point
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 30),
                         np.linspace(y_min, y_max, 30))
    s = tf.Session()
    Z = s.run(tf_nb.predict(np.c_[xx.ravel(), yy.ravel()]))
    # Extract probabilities of class 2 and 3
    Z1 = Z[:, 1].reshape(xx.shape)
    Z2 = Z[:, 2].reshape(xx.shape)

    # Plot
    fig = plt.figure(figsize=(5, 3.75))
    ax = fig.add_subplot(111)

    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
               edgecolor='k')
    # Swap signs to make the contour dashed (MPL default)
    ax.contour(xx, yy, -Z1, [-0.5], colors='k')
    ax.contour(xx, yy, -Z2, [-0.5], colors='k')

    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_title('TensorFlow decision boundary')
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())

    plt.tight_layout()
    fig.savefig('tf_iris.png', bbox_inches='tight')
```

The output that I got is a plot of the decision boundaries, saved as `tf_iris.png`.