# Implementation of Decision Trees In Python

## Decision Tree using Python

In the previous article, we studied Multiple Linear Regression. One thing that I believe is that if we can correlate anything with us or our lives, there are greater chances of understanding the concept. So I will try to explain everything by relating it to humans.

## What is Decision Tree?

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Decision Trees (DTs) are a non-parametric supervised learning method used for both classification and regression. Decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules, and the fitter the model.

The decision tree builds classification or regression models in the form of a tree structure, hence called CART (Classification and Regression Trees). It breaks down a data set into smaller and smaller subsets building along an associated decision tree at the same time. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. The leaf node represents a classification or decision. The topmost decision node in a tree which corresponds to the best predictor called the root node. Decision trees can handle both categorical and numerical data.

## When is Decision Tree Used?

1. When the user has an objective he is trying to achieve: max profit, optimize cost
2. When there are several courses of action
3. There is a calculated measure of the benefit of the various alternatives
4. When there are events beyond the control of the decision-maker i.e environment factor.
5. Uncertainty concerning which outcome will actually happen

## Assumptions of Decision Tree

1. In the beginning, the whole training set is considered as the root.
2. Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
3. Records are distributed recursively on the basis of attribute values.
4. Order to placing attributes as root or internal node of the tree is done by using some statistical approach.

## Key Terms

1. Root Node - It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
2. Splitting - It is a process of dividing a node into two or more sub-nodes.
3. Decision Node - When a sub-node splits into further sub-nodes, then it is called a decision node.
4. Leaf/ Terminal Node - Nodes do not split is called Leaf or Terminal node.
5. Pruning - When we remove the sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting.
6. Branch / Sub-Tree - A subsection of the entire tree is called branch or sub-tree.
7. Parent and Child Node - A node, which is divided into sub-nodes is called the parent node of sub-nodes whereas sub-nodes are the child of the parent node.
8. Entropy - A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). ID 3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero and if the sample is equally divided it has an entropy of one.
9. Information Gain - The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest information gain (i.e., the most homogeneous branches).
10. Internal Node - An internal node (also known as an inner node, inode for short, or branch node) is any node of a tree that has child nodes
11. Branch - The lines connecting elements are called "branches".

## How to Make a Decision Tree?

Step 1
Calculate the entropy of the target.

Step 2
The dataset is then split into different attributes. The entropy for each branch is calculated. Then it is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain or decrease in entropy.

Step 3
Choose attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.

1. Decision trees require less effort for data preparation as compared to other algorithms during pre-processing.
2. Do not require normalization of data.
3. Do not require scaling of data as well.
4. Missing values in the data also do NOT affect the process of building a decision tree to any considerable extent.
5. A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

1. A small change in the data can cause a large change in the structure of the decision tree causing instability.
2. For a Decision tree sometimes calculation can go far more complex compared to other algorithms.
3. It often involves a higher time to train the model.
4. Its training is relatively expensive because the complexity and time taken are more.
5. Decision Tree algorithm is inadequate for applying regression and predicting continuous values.

## Types of Decision Tree

1. Categorical Variable Decision Tree
Decision Tree which has a categorical target variable then it called a categorical variable decision tree. For Example, I say Will Bangladesh be able to beat India?, so I can have the decision parameter as YES or NO.
2. Continuous Variable Decision Tree
Decision Tree has a continuous target variable then it is called Continuous Variable Decision Tree. For Example, I say What is the percentage of chances that India will win? so here we have a varying value.
3. ID3 (Iterative Dichotomiser 3)
4. C4.5 (successor of ID3)
5. CART (Classification And Regression Tree)
6. CHAID (CHi-squared Automatic Interaction Detector)
7. MARS (Multivariate Adaptive Regression Splines)
8. Conditional Inference Trees.

## What is the importance of a Decision Tree?

1. Simple to easy, use, understand and explain
2. They do feature selection and variable screening
3. They require very little efforts to prepare and produce
4. Can live with nonlinear relationships and parameters also
5. Can use conditional probability-based reasoning or Bayes theorem
6. They provide strategic answers to uncertain situations

## Regression Tree vs Classification Tree

 Regression Tree Classification Tree Dependent Variable Used when the dependent variable is continuous Used when the dependent variable is categorical Unseen Data Makes prediction with the mean value Makes prediction with mode values Predictor Space divides into a distinct and non-overlapping region divides into a distinct and non-overlapping region Approach Followed Top-Down Greedy Top-Down Greedy

## Entropy and Information Gain Calculations

Entropy is something is very important when we have to find the Information Gain, which in turn places a vital role in selecting the next attribute.

Example 1

Entropy([29+,35-]) = -29/64*log2 (29/64) – 35/64*log2 (35/64)
​                              = 0.99
Entropy([21+,5-]) = -21/26*log2 (21/26) - 5/26*log2 (5/26)
= 0.71
Entropy([8+,30-]) =  -8/38*log2 (8/38) - 30/38 *log2 (30/38)
= 0.74
Information Gain([29+,35-]),A1) = 0.99 - (26/64)*0.71 - (38/64)*0.74
=-0.15

Example 2

Entropy([29+,35-]) = -29/64*log2 (29/64) – 35/64*log2 (35/64)
​                              = 0.99
Entropy([21+,5-]) = -18/51*log2 (18/51) - 33/51*log2 (33/51)
= 0.94
Entropy([8+,30-]) =  -11/13*log2 (11/13) - 2/13 *log2 (2/13)
= 0.62
Information Gain([29+,35-]),A1) = 0.99 - (52/64)*0.94 - (13/64)*0.62
=0.1

So we can see that the information gain is bigger incase of example 2(IG=0.1) as compared to example 1(IG=-0.15)

## Decision Tree Example

Till now we studied theory, now let's try out some hands-on. I will take a demo dataset and will construct a decision tree based upon that dataset.

I am taking the following dataset, where we make a decision on whether we will play or not based upon the given environmental conditions.

 Day Outlook Temperature Humidity Wind Play Tennis D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Weak Yes D8 Sunny Mild High Weak No D9 Sunny Cool Normal Weak Yes D10 Rain Mild Normal Strong Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Overcast Hot Normal Weak Yes D14 Rain Mild High Strong No

### Calculation of Gini Index for "Outlook" feature

1. We can see that Outlook has 5 instances (5/14) as "Sunny", 4 instances (4/14) as "Overcast" and 5 instances as "Rain".
2. Outlook = Sunny, Play Tennis = Yes, we have 2 instances (2/5)
Outlook = Sunny, Play Tennis = No, we have 3 instances (3/5)
Hence, the Gini Index comes out to be:
= 1 - ((2/5)^2+(3/5)^2)
= 0.48
3. Outlook = Overcast, Play Tennis = Yes, we have 4 instances (4/4)
Outlook = Overcast, Play Tennis = No, we have 0 instances (0/4)
Hence, the Gini Index comes out to be:
= 1 - ((4/4)^2+(0/4)^2)
= 0
4. Outlook = Rain, Play Tennis = Yes, we have 3 instances (3/5)
Outlook = Rain, Play Tennis = No, we have 2 instances (2/5)
Hence, the Gini Index comes out to be:
= 1 - ((3/5)^2+(2/5)^2)
= 0.48
5. So the final Gini Index is:
= (5/14)*0.48 + (4/14)*0 + (5/14)*0.48
= 0.34

### Calculations of Gini Index of "Temperature" feature

1. We can see that, Temperature has 4 instances (4/14) as "Hot", 4 instances (4/14) as "Cool", 6 instances (6/14) as "Mild"
2. Temperature = Hot, Play Tennis = Yes, we have 2 instances (2/4)
Temperature = Hot, Play Tennis = No, we have 3 instances (2/4)
Hence, the Gini Index comes out to be:
= 1 - ((2/4)^2+(2/4)^2)
= 0.5
3. Temperature = Cool, Play Tennis = Yes, we have 3 instances (3/4)
Temperature = Cool, Play Tennis = No, we have 1 instances (1/4)
Hence, the Gini Index comes out to be:
= 1 - ((3/4)^2+(1/4)^2)
= 0.375
4. Temperature = Mild, Play Tennis = Yes, we have 4 instances (4/6)
Temperature = Mild, Play Tennis = No, we have 2 instances (2/6)
Hence, the Gini Index comes out to be:
= 1 - ((4/6)^2+(2/6)^2)
= 0.44
5. So the final Gini Index is:
= (4/14)*0.5 + (4/14)*0.375 + (6/14)*0.44
= 0.44

### Calculations of Gini Index of "Humidity" feature

1. We can see that, Humidity has 7 instances (7/14) as "High", 7 instances (7/14) as "Normal"
2. Humidity = High, Play Tennis = Yes, we have 3 instances (3/7)
Humidity = High, Play Tennis = No, we have 4 instances (4/7)
Hence, the Gini Index comes out to be:
= 1 - ((3/7)^2+(4/7)^2)
= 0.49
3. Humidity = Normal, Play Tennis = Yes, we have 6 instances (6/7)
Humidity = Normal, Play Tennis = No, we have 1 instance (1/7)
Hence, the Gini Index comes out to be:
= 1 - ((6/7)^2+(1/7)^2)
= 0.24
4. So the final Gini Index is:
= (7/14)*0.49 + (7/14)*0.24
= 0.365

### Calculations of Gini Index of "Wind" feature

1. We can see that, Wind has 8 instances (8/14) as "Weak", 6 instances (6/14) as "Strong"
2. Wind = Strong, Play Tennis = Yes, we have 3 instances (3/6)
Wind = Strong, Play Tennis = No, we have 4 instances (3/6)
Hence, the Gini Index comes out to be:
= 1 - ((3/6)^2+(3/6)^2)
= 0.5
3. Wind = Weak, Play Tennis = Yes, we have 6 instances (6/8)
Wind = Weak, Play Tennis = No, we have 2 instances (2/8)
Hence, the Gini Index comes out to be:
= 1 - ((6/8)^2+(2/8)^2)
= 0.375
4. So the final Gini Index is:
= (8/14)*0.375 + (6/14)*0.5
= 0.43

Since the Gini Index of Outlook is the smallest, we choose "outlook"

### Decision Tree using Disjunction of Conjunctions (OR of AND Operations)

Let us look at a rough decision tree, to look at how exactly the decision tree will look like

From the above trees we aren't able to clearly infer anything, so let me show you how exactly the Decision Tree will look for the given dataset and then we will try to understand, what exactly that tree means and how to use the tree to our benefit.

You may be wondering why did we take "Outlook" as root node", it is because of the Gini Index value, calculated earlier

Following is the Decision Tree with the Weights which can be used to calculate entropy and information gain.

Now let's try to understand the tree,

R1: If (Outlook=Sunny) _| (Humidity=High) Then PlayTennis=No ​
The above statement means that if Outlook is Sunny and Humidity is high, then we do not play

R2: If (Outlook=Sunny) _| (Humidity=Normal) Then PlayTennis=Yes​
The above statement means that if Outlook is Sunny and Humidity is normal, then we play

R3: If (Outlook=Overcast) Then PlayTennis=Yes ​
The above statement means that if Outlook is Overcast, then we play

R4: If (Outlook=Rain) -|  (Wind=Strong) Then PlayTennis=Yes
The above statement means that if Outlook is Rain or Wild is Strong, then we play

## Python Implementation of Decision Tree

Let's take the example of the IRIS dataset, you can directly import it from the sklearn dataset repository. Feel free to use any dataset, there some very good datasets available on kaggle and with Google Colab.

The dataset that I will be using here is UCI Zoo Dataset, you can download it from the article

### 1. Using NumPy

1. import pandas as pd
2. import numpy as np
3. from pprint import pprint
4. #Import the dataset and define the feature as well as the target datasets / columns#
6.                       names=['animal_name','hair','feathers','eggs','milk',
7.                                                    'airbone','aquatic','predator','toothed','backbone',
8.                                                   'breathes','venomous','fins','legs','tail','domestic','catsize','class',])#Import all columns omitting the fist which consists the names of the animals
9. #We drop the animal names since this is not a good feature to split the data on
10. dataset=dataset.drop('animal_name',axis=1)
11.
12. def entropy(target_col):
13.
14.     elements,counts = np.unique(target_col,return_counts = True)
15.     entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts)) for i in range(len(elements))])
16.     return entropy
17.
18. def InfoGain(data,split_attribute_name,target_name="class"):
19.
20.     #Calculate the entropy of the total dataset
21.     total_entropy = entropy(data[target_name])
22.
23.     ##Calculate the entropy of the dataset
24.
25.     #Calculate the values and the corresponding counts for the split attribute
26.     vals,counts= np.unique(data[split_attribute_name],return_counts=True)
27.
28.     #Calculate the weighted entropy
29.     Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])
30.
31.     #Calculate the information gain
32.     Information_Gain = total_entropy - Weighted_Entropy
33.     return Information_Gain
34.
35. def ID3(data,originaldata,features,target_attribute_name="class",parent_node_class = None):
36.
37.     #Define the stopping criteria --> If one of this is satisfied, we want to return a leaf node#
38.
39.     #If all target_values have the same value, return this value
40.     if len(np.unique(data[target_attribute_name])) <= 1:
41.         return np.unique(data[target_attribute_name])[0]
42.
43.     #If the dataset is empty, return the mode target feature value in the original dataset
44.     elif len(data)==0:
45.         return np.unique(originaldata[target_attribute_name])[np.argmax(np.unique(originaldata[target_attribute_name],return_counts=True)[1])]
46.
47.     #If the feature space is empty, return the mode target feature value of the direct parent node --> Note that
48.     #the direct parent node is that node which has called the current run of the ID3 algorithm and hence
49.     #the mode target feature value is stored in the parent_node_class variable.
50.
51.     elif len(features) ==0:
52.         return parent_node_class
53.
54.     #If none of the above holds true, grow the tree!
55.
56.     else:
57.         #Set the default value for this node --> The mode target feature value of the current node
58.         parent_node_class = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],return_counts=True)[1])]
59.
60.         #Select the feature which best splits the dataset
61.         item_values = [InfoGain(data,feature,target_attribute_name) for feature in features] #Return the information gain values for the features in the dataset
62.         best_feature_index = np.argmax(item_values)
63.         best_feature = features[best_feature_index]
64.
65.         #Create the tree structure. The root gets the name of the feature (best_feature) with the maximum information
66.         #gain in the first run
67.         tree = {best_feature:{}}
68.
69.
70.         #Remove the feature with the best inforamtion gain from the feature space
71.         features = [i for i in features if i != best_feature]
72.
73.         #Grow a branch under the root node for each possible value of the root node feature
74.
75.         for value in np.unique(data[best_feature]):
76.             value = value
77.             #Split the dataset along the value of the feature with the largest information gain and therwith create sub_datasets
78.             sub_data = data.where(data[best_feature] == value).dropna()
79.
80.             #Call the ID3 algorithm for each of those sub_datasets with the new parameters --> Here the recursion comes in!
81.             subtree = ID3(sub_data,dataset,features,target_attribute_name,parent_node_class)
82.
83.             #Add the sub tree, grown from the sub_dataset to the tree under the root node
84.             tree[best_feature][value] = subtree
85.
86.         return(tree)
87.
88. def predict(query,tree,default = 1):
89.
90.     for key in list(query.keys()):
91.         if key in list(tree.keys()):
92.
93.             try:
94.                 result = tree[key][query[key]]
95.             except:
96.                 return default
97.
98.             result = tree[key][query[key]]
99.
100.             if isinstance(result,dict):
101.                 return predict(query,result)
102.             else:
103.                 return result
104.
105. def train_test_split(dataset):
106.     training_data = dataset.iloc[:80].reset_index(drop=True)#We drop the index respectively relabel the index
107.     #starting form 0, because we do not want to run into errors regarding the row labels / indexes
108.     testing_data = dataset.iloc[80:].reset_index(drop=True)
109.     return training_data,testing_data
110.
111. training_data = train_test_split(dataset)[0]
112. testing_data = train_test_split(dataset)[1]
113.
114. def test(data,tree):
115.     #Create new query instances by simply removing the target feature column from the original dataset and
116.     #convert it to a dictionary
117.     queries = data.iloc[:,:-1].to_dict(orient = "records")
118.
119.     #Create a empty DataFrame in whose columns the prediction of the tree are stored
120.     predicted = pd.DataFrame(columns=["predicted"])
121.
122.     #Calculate the prediction accuracy
123.     for i in range(len(data)):
124.         predicted.loc[i,"predicted"] = predict(queries[i],tree,1.0)
125.     print('The prediction accuracy is: ',(np.sum(predicted["predicted"] == data["class"])/len(data))*100,'%')
126.
127. tree = ID3(training_data,training_data,training_data.columns[:-1])
128. pprint(tree)
129. test(testing_data,tree)
the output that I got is

{'legs': {0: {'fins': {0.0: {'toothed': {0.0: 7.0, 1.0: 3.0}},
1.0: {'eggs': {0.0: 1.0, 1.0: 4.0}}}},
2: {'hair': {0.0: 2.0, 1.0: 1.0}},
4: {'hair': {0.0: {'toothed': {0.0: 7.0, 1.0: 5.0}}, 1.0: 1.0}},
6: {'aquatic': {0.0: 6.0, 1.0: 7.0}}, 8: 7.0}}

The prediction accuracy is: 85.71428571428571 %

### 2. Using SKLearn

1. #Import the DecisionTreeClassifier
2. from sklearn.tree import DecisionTreeClassifier
3.
4. #Import the dataset
6.                       names=['animal_name','hair','feathers','eggs','milk',
7.                                                    'airbone','aquatic','predator','toothed','backbone',
8.                                                   'breathes','venomous','fins','legs','tail','domestic','catsize','class',])
9. #We drop the animal names since this is not a good feature to split the data on
10. dataset=dataset.drop('animal_name',axis=1)
11.
12. train_features = dataset.iloc[:80,:-1]
13. test_features = dataset.iloc[80:,:-1]
14. train_targets = dataset.iloc[:80,-1]
15. test_targets = dataset.iloc[80:,-1]
16.
17. tree = DecisionTreeClassifier(criterion = 'entropy').fit(train_features,train_targets)
18.
19. prediction = tree.predict(test_features)
20.
21. print("The prediction accuracy is: ",tree.score(test_features,test_targets)*100,"%"
The output that I got is:

The prediction accuracy is: 80.95238095238095 %

### 3. Using TensorFlow

1. import numpy as np
2. import pandas as pd
3. import tensorflow as tf
4. from functools import reduce
5. import matplotlib.pyplot as plt
6. %matplotlib inline
7.
8. np.random.seed(1943)
9. tf.set_random_seed(1943)
10.
11. iris = [((5.13.51.40.2), (100)),
12.         ((4.93.01.40.2), (100)),
13.         ((4.73.21.30.2), (100)),
14.         ((4.63.11.50.2), (100)),
15.         ((5.03.61.40.2), (100)),
16.         ((5.43.91.70.4), (100)),
17.         ((4.63.41.40.3), (100)),
18.         ((5.03.41.50.2), (100)),
19.         ((4.42.91.40.2), (100)),
20.         ((4.93.11.50.1), (100)),
21.         ((5.43.71.50.2), (100)),
22.         ((4.83.41.60.2), (100)),
23.         ((4.83.01.40.1), (100)),
24.         ((4.33.01.10.1), (100)),
25.         ((5.84.01.20.2), (100)),
26.         ((5.74.41.50.4), (100)),
27.         ((5.43.91.30.4), (100)),
28.         ((5.13.51.40.3), (100)),
29.         ((5.73.81.70.3), (100)),
30.         ((5.13.81.50.3), (100)),
31.         ((5.43.41.70.2), (100)),
32.         ((5.13.71.50.4), (100)),
33.         ((4.63.61.00.2), (100)),
34.         ((5.13.31.70.5), (100)),
35.         ((4.83.41.90.2), (100)),
36.         ((5.03.01.60.2), (100)),
37.         ((5.03.41.60.4), (100)),
38.         ((5.23.51.50.2), (100)),
39.         ((5.23.41.40.2), (100)),
40.         ((4.73.21.60.2), (100)),
41.         ((4.83.11.60.2), (100)),
42.         ((5.43.41.50.4), (100)),
43.         ((5.24.11.50.1), (100)),
44.         ((5.54.21.40.2), (100)),
45.         ((4.93.11.50.1), (100)),
46.         ((5.03.21.20.2), (100)),
47.         ((5.53.51.30.2), (100)),
48.         ((4.93.11.50.1), (100)),
49.         ((4.43.01.30.2), (100)),
50.         ((5.13.41.50.2), (100)),
51.         ((5.03.51.30.3), (100)),
52.         ((4.52.31.30.3), (100)),
53.         ((4.43.21.30.2), (100)),
54.         ((5.03.51.60.6), (100)),
55.         ((5.13.81.90.4), (100)),
56.         ((4.83.01.40.3), (100)),
57.         ((5.13.81.60.2), (100)),
58.         ((4.63.21.40.2), (100)),
59.         ((5.33.71.50.2), (100)),
60.         ((5.03.31.40.2), (100)),
61.         ((7.03.24.71.4), (010)),
62.         ((6.43.24.51.5), (010)),
63.         ((6.93.14.91.5), (010)),
64.         ((5.52.34.01.3), (010)),
65.         ((6.52.84.61.5), (010)),
66.         ((5.72.84.51.3), (010)),
67.         ((6.33.34.71.6), (010)),
68.         ((4.92.43.31.0), (010)),
69.         ((6.62.94.61.3), (010)),
70.         ((5.22.73.91.4), (010)),
71.         ((5.02.03.51.0), (010)),
72.         ((5.93.04.21.5), (010)),
73.         ((6.02.24.01.0), (010)),
74.         ((6.12.94.71.4), (010)),
75.         ((5.62.93.61.3), (010)),
76.         ((6.73.14.41.4), (010)),
77.         ((5.63.04.51.5), (010)),
78.         ((5.82.74.11.0), (010)),
79.         ((6.22.24.51.5), (010)),
80.         ((5.62.53.91.1), (010)),
81.         ((5.93.24.81.8), (010)),
82.         ((6.12.84.01.3), (010)),
83.         ((6.32.54.91.5), (010)),
84.         ((6.12.84.71.2), (010)),
85.         ((6.42.94.31.3), (010)),
86.         ((6.63.04.41.4), (010)),
87.         ((6.82.84.81.4), (010)),
88.         ((6.73.05.01.7), (010)),
89.         ((6.02.94.51.5), (010)),
90.         ((5.72.63.51.0), (010)),
91.         ((5.52.43.81.1), (010)),
92.         ((5.52.43.71.0), (010)),
93.         ((5.82.73.91.2), (010)),
94.         ((6.02.75.11.6), (010)),
95.         ((5.43.04.51.5), (010)),
96.         ((6.03.44.51.6), (010)),
97.         ((6.73.14.71.5), (010)),
98.         ((6.32.34.41.3), (010)),
99.         ((5.63.04.11.3), (010)),
100.         ((5.52.54.01.3), (010)),
101.         ((5.52.64.41.2), (010)),
102.         ((6.13.04.61.4), (010)),
103.         ((5.82.64.01.2), (010)),
104.         ((5.02.33.31.0), (010)),
105.         ((5.62.74.21.3), (010)),
106.         ((5.73.04.21.2), (010)),
107.         ((5.72.94.21.3), (010)),
108.         ((6.22.94.31.3), (010)),
109.         ((5.12.53.01.1), (010)),
110.         ((5.72.84.11.3), (010)),
111.         ((6.33.36.02.5), (001)),
112.         ((5.82.75.11.9), (001)),
113.         ((7.13.05.92.1), (001)),
114.         ((6.32.95.61.8), (001)),
115.         ((6.53.05.82.2), (001)),
116.         ((7.63.06.62.1), (001)),
117.         ((4.92.54.51.7), (001)),
118.         ((7.32.96.31.8), (001)),
119.         ((6.72.55.81.8), (001)),
120.         ((7.23.66.12.5), (001)),
121.         ((6.53.25.12.0), (001)),
122.         ((6.42.75.31.9), (001)),
123.         ((6.83.05.52.1), (001)),
124.         ((5.72.55.02.0), (001)),
125.         ((5.82.85.12.4), (001)),
126.         ((6.43.25.32.3), (001)),
127.         ((6.53.05.51.8), (001)),
128.         ((7.73.86.72.2), (001)),
129.         ((7.72.66.92.3), (001)),
130.         ((6.02.25.01.5), (001)),
131.         ((6.93.25.72.3), (001)),
132.         ((5.62.84.92.0), (001)),
133.         ((7.72.86.72.0), (001)),
134.         ((6.32.74.91.8), (001)),
135.         ((6.73.35.72.1), (001)),
136.         ((7.23.26.01.8), (001)),
137.         ((6.22.84.81.8), (001)),
138.         ((6.13.04.91.8), (001)),
139.         ((6.42.85.62.1), (001)),
140.         ((7.23.05.81.6), (001)),
141.         ((7.42.86.11.9), (001)),
142.         ((7.93.86.42.0), (001)),
143.         ((6.42.85.62.2), (001)),
144.         ((6.32.85.11.5), (001)),
145.         ((6.12.65.61.4), (001)),
146.         ((7.73.06.12.3), (001)),
147.         ((6.33.45.62.4), (001)),
148.         ((6.43.15.51.8), (001)),
149.         ((6.03.04.81.8), (001)),
150.         ((6.93.15.42.1), (001)),
151.         ((6.73.15.62.4), (001)),
152.         ((6.93.15.12.3), (001)),
153.         ((5.82.75.11.9), (001)),
154.         ((6.83.25.92.3), (001)),
155.         ((6.73.35.72.5), (001)),
156.         ((6.73.05.22.3), (001)),
157.         ((6.32.55.01.9), (001)),
158.         ((6.53.05.22.0), (001)),
159.         ((6.23.45.42.3), (001)),
160.         ((5.93.05.11.8), (001))]
161.
162. feature = np.vstack([np.array(i[0]) for i in iris])
163. label = np.vstack([np.array(i[1]) for i in iris])
164.
165. x = feature[:, 2:4]  # use "Petal length" and "Petal width" only
166. y = label
167. d = x.shape[1]
168.
169. def tf_kron_prod(a, b):
170.     res = tf.einsum('ij,ik->ijk', a, b)
171.     res = tf.reshape(res, [-1, tf.reduce_prod(res.shape[1:])])
172.     return res
173.
174.
175. def tf_bin(x, cut_points, temperature=0.1):
176.     # x is a N-by-1 matrix (column vector)
177.     # cut_points is a D-dim vector (D is the number of cut-points)
178.     # this function produces a N-by-(D+1) matrix, each row has only one element being one and the rest are all zeros
179.     D = cut_points.get_shape().as_list()[0]
180.     W = tf.reshape(tf.linspace(1.0, D + 1.0, D + 1), [1, -1])
181.     cut_points = tf.contrib.framework.sort(cut_points)  # make sure cut_points is monotonically increasing
182.     b = tf.cumsum(tf.concat([tf.constant(0.0, shape=[1]), -cut_points], 0))
183.     h = tf.matmul(x, W) + b
184.     res = tf.nn.softmax(h / temperature)
185.     return res
186.
187.
188. def nn_decision_tree(x, cut_points_list, leaf_score, temperature=0.1):
189.     # cut_points_list contains the cut_points for each dimension of feature
190.     leaf = reduce(tf_kron_prod,
191.                   map(lambda z: tf_bin(x[:, z[0]:z[0] + 1], z[1], temperature), enumerate(cut_points_list)))
192.     return tf.matmul(leaf, leaf_score)
193.
194. num_cut = [11]  # "Petal length" and "Petal width"
195. num_leaf = np.prod(np.array(num_cut) + 1)
196. num_class = 3
197.
198. sess = tf.InteractiveSession()
199. x_ph = tf.placeholder(tf.float32, [None, d])
200. y_ph = tf.placeholder(tf.float32, [None, num_class])
201. cut_points_list = [tf.Variable(tf.random_uniform([i])) for i in num_cut]
202. leaf_score = tf.Variable(tf.random_uniform([num_leaf, num_class]))
203. y_pred = nn_decision_tree(x_ph, cut_points_list, leaf_score, temperature=0.1)
204. loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(logits=y_pred, onehot_labels=y_ph))
206. train_step = opt.minimize(loss)
207. sess.run(tf.global_variables_initializer())
208. for i in range(1000):
209.     _, loss_e = sess.run([train_step, loss], feed_dict={x_ph: x, y_ph: y})
210.     if i % 200 == 0:
211.         print(loss_e)
212. print('error rate %.2f' % (1 - np.mean(np.argmax(y_pred.eval(feed_dict={x_ph: x}), axis=1) == np.argmax(y, axis=1))))
213. sample_x0 = np.repeat(np.linspace(0, np.max(x[:,0]), 100), 100).reshape(-1,1)
214. sample_x1 = np.tile(np.linspace(0, np.max(x[:,1]), 100).reshape(-1,1), [100,1])
215. sample_x = np.hstack([sample_x0, sample_x1])
216. sample_label = np.argmax(y_pred.eval(feed_dict={x_ph: sample_x}), axis=1)
217. plt.figure(figsize=(8,8))
218. plt.scatter(x[:,0],
219.             x[:,1],
220.             c=np.argmax(y, axis=1),
221.             marker='o',
222.             s=50,
223.             cmap='summer',
224.             edgecolors='black')
225. plt.scatter(sample_x0.flatten(),
226.             sample_x1.flatten(),
227.             c=sample_label.flatten(),
228.             marker='D',
229.             s=20,
230.             cmap='summer',
231.             edgecolors='none',
232.             alpha=0.33
The output that I got is

1.1529802
0.32481167
0.1562062
0.15097024
0.118773825
error rate 0.04

## Conclusion

In this article, we studied what is decision tree, when is decision tree used, assumptions of decision tree, key terms, how to make decision tree, advantages of decision tree, disadvantages of decision tree, types of decision tree, what is the importance of decision tree, regression tree vs classification tree, entropy, information gain and gini index calculations, decision tree example, python implementation of decision tree using sklearn, numpy, and TensorFlow. Hope you were able to understand each and everything. For any doubts, please comment on your query.

In the next article, we will learn about Naive Bayes.

Congratulations!!! You have climbed your next step in becoming a successful ML Engineer.

Next Article In this Series >> Naive Bayes

C# Corner
MVP Program Director