Implementation of Decision Trees In Python

Learn basics of decisions trees and their roles in computer algorithms and how decision trees are used in Python and machine learning.

Introduction

 
In the previous article, we studied Multiple Linear Regression. One thing that I believe is that if we can correlate anything with us or our lives, there are greater chances of understanding the concept. So I will try to explain everything by relating it to humans.
 

What is Decision Tree?

 
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
 
Decision Trees (DTs) are a non-parametric supervised learning method used for both classification and regression. Decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules, and the fitter the model.
 
The decision tree builds classification or regression models in the form of a tree structure, hence called CART (Classification and Regression Trees). It breaks down a data set into smaller and smaller subsets building along an associated decision tree at the same time. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. The leaf node represents a classification or decision. The topmost decision node in a tree which corresponds to the best predictor called root node. Decision trees can handle both categorical and numerical data.
   

When is Decision Tree Used?

  1. When the user has an objective he is trying to achieve: max profit, optimize cost 
  2. When there are several courses of action
  3. There is a calculated measure of the benefit of the various alternatives
  4. When there are events beyond the control of the decision-maker i.e environment factor.
  5. Uncertainty concerning which outcome will actually happen 

Assumptions of Decision Tree

  1. In the beginning, the whole training set is considered as the root.
  2. Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model. 
  3. Records are distributed recursively on the basis of attribute values. 
  4. Order to placing attributes as root or internal node of the tree is done by using some statistical approach.   

Key Terms 

  1. Root Node - It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  2. Splitting - It is a process of dividing a node into two or more sub-nodes.
  3. Decision Node - When a sub-node splits into further sub-nodes, then it is called a decision node.
  4. Leaf/ Terminal Node - Nodes do not split is called Leaf or Terminal node.
  5. Pruning - When we remove the sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting.
  6. Branch / Sub-Tree - A subsection of the entire tree is called branch or sub-tree.
  7. Parent and Child Node - A node, which is divided into sub-nodes is called the parent node of sub-nodes whereas sub-nodes are the child of the parent node.
  8. Entropy - A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). ID 3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero and if the sample is equally divided it has an entropy of one.
  9. Information Gain - The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest information gain (i.e., the most homogeneous branches).
  10. Internal Node - An internal node (also known as an inner node, inode for short, or branch node) is any node of a tree that has child nodes 
  11. Branch - The lines connecting elements are called "branches".

How to Make a Decision Tree? 

 
Step 1
Calculate the entropy of the target.
 
Step 2
The dataset is then split into different attributes. The entropy for each branch is calculated. Then it is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain or decrease in entropy.
 
Step 3
Choose attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.
 

Advantage/Features of Decision Tree 

  1. Decision trees require less effort for data preparation as compared to other algorithms during pre-processing.
  2. Do not require normalization of data.
  3. Do not require scaling of data as well.
  4. Missing values in the data also do NOT affect the process of building a decision tree to any considerable extent.
  5. A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Disadvantages/Shortcomings of Decision Tree 

  1. A small change in the data can cause a large change in the structure of the decision tree causing instability.
  2. For a Decision tree sometimes calculation can go far more complex compared to other algorithms.
  3. It often involves a higher time to train the model.
  4. Its training is relatively expensive because the complexity and time taken are more.
  5. Decision Tree algorithm is inadequate for applying regression and predicting continuous values.

Types of Decision Tree

  1. Categorical Variable Decision Tree
    Decision Tree which has a categorical target variable then it called a categorical variable decision tree. For Example, I say Will Bangladesh be able to beat India?, so I can have the decision parameter as YES or NO.
  2. Continuous Variable Decision Tree 
    Decision Tree has a continuous target variable then it is called Continuous Variable Decision Tree. For Example, I say What is the percentage of chances that India will win? so here we have a varying value.
  3. ID3 (Iterative Dichotomiser 3)
  4. C4.5 (successor of ID3)
  5. CART (Classification And Regression Tree)
  6. CHAID (CHi-squared Automatic Interaction Detector)
  7. MARS (Multivariate Adaptive Regression Splines)
  8. Conditional Inference Trees.

What is the importance of a Decision Tree?

  1. Simple to easy, use, understand and explain
  2. They do feature selection and variable screening
  3. They require very little efforts to prepare and produce
  4. Can live with nonlinear relationships and parameters also
  5. Can use conditional probability-based reasoning or Bayes theorem
  6. They provide strategic answers to uncertain situations

Regression Tree vs Classification Tree

 
  Regression Tree Classification Tree
Dependent Variable Used when the dependent variable is continuous Used when the dependent variable is categorical
Unseen Data Makes prediction with the mean value  Makes prediction with mode values
Predictor Space  divides into a distinct and non-overlapping region divides into a distinct and non-overlapping region 
Approach Followed Top-Down Greedy Top-Down Greedy
 

Entropy and Information Gain Calculations

 
Entropy is something is very important when we have to find the Information Gain, which in turn places a vital role in selecting the next attribute.
 
Following is the formula we will use during this article,
 
entropy 
ig 
 
Example 1
 
example1 
 
Entropy([29+,35-]) = -29/64*log2 (29/64) – 35/64*log2 (35/64)
​                              = 0.99
Entropy([21+,5-]) = -21/26*log2 (21/26) - 5/26*log2 (5/26)
                             = 0.71
Entropy([8+,30-]) =  -8/38*log2 (8/38) - 30/38 *log2 (30/38)
                             = 0.74
Information Gain([29+,35-]),A1) = 0.99 - (26/64)*0.71 - (38/64)*0.74
                           =-0.15
 
Example 2
 
example2 
 
 
Entropy([29+,35-]) = -29/64*log2 (29/64) – 35/64*log2 (35/64)
​                              = 0.99
Entropy([21+,5-]) = -18/51*log2 (18/51) - 33/51*log2 (33/51)
                             = 0.94
Entropy([8+,30-]) =  -11/13*log2 (11/13) - 2/13 *log2 (2/13)
                             = 0.62
Information Gain([29+,35-]),A1) = 0.99 - (52/64)*0.94 - (13/64)*0.62
                           =0.1
 
So we can see that the information gain is bigger incase of example 2(IG=0.1) as compared to example 1(IG=-0.15)
 

Decision Tree Example

 
Till now we studied theory, now let's try out some hands-on. I will take a demo dataset and will construct a decision tree based upon that dataset.
 
I am taking the following dataset, where we make a decision on whether we will play or not based upon the given environmental conditions.
 
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
 

Calculation of Gini Index for "Outlook" feature

 
1. We can see that Outlook has 5 instances (5/14) as "Sunny", 4 instances (4/14) as "Overcast" and 5 instances as "Rain".
2. Outlook = Sunny, Play Tennis = Yes, we have 2 instances (2/5)
Outlook = Sunny, Play Tennis = No, we have 3 instances (3/5)
Hence, the Gini Index comes out to be:
              = 1 - ((2/5)^2+(3/5)^2)
              = 0.48
3. Outlook = Overcast, Play Tennis = Yes, we have 4 instances (4/4)
Outlook = Overcast, Play Tennis = No, we have 0 instances (0/4)
Hence, the Gini Index comes out to be:
               = 1 - ((4/4)^2+(0/4)^2)
               = 0
4. Outlook = Rain, Play Tennis = Yes, we have 3 instances (3/5)
Outlook = Rain, Play Tennis = No, we have 2 instances (2/5)
Hence, the Gini Index comes out to be:
               = 1 - ((3/5)^2+(2/5)^2)
               = 0.48
5. So the final Gini Index is:
               = (5/14)*0.48 + (4/14)*0 + (5/14)*0.48
               = 0.34
 

Calculations of Gini Index of "Temperature" feature

 
1. We can see that, Temperature has 4 instances (4/14) as "Hot", 4 instances (4/14) as "Cool", 6 instances (6/14) as "Mild"
2. Temperature = Hot, Play Tennis = Yes, we have 2 instances (2/4)
Temperature = Hot, Play Tennis = No, we have 3 instances (2/4)
Hence, the Gini Index comes out to be:
              = 1 - ((2/4)^2+(2/4)^2)
              = 0.5
3. Temperature = Cool, Play Tennis = Yes, we have 3 instances (3/4)
Temperature = Cool, Play Tennis = No, we have 1 instances (1/4)
Hence, the Gini Index comes out to be:
               = 1 - ((3/4)^2+(1/4)^2)
               = 0.375
4. Temperature = Mild, Play Tennis = Yes, we have 4 instances (4/6)
Temperature = Mild, Play Tennis = No, we have 2 instances (2/6)
Hence, the Gini Index comes out to be:
               = 1 - ((4/6)^2+(2/6)^2)
               = 0.44
5. So the final Gini Index is:
               = (4/14)*0.5 + (4/14)*0.375 + (6/14)*0.44
               = 0.44
 

Calculations of Gini Index of "Humidity" feature

 
1. We can see that, Humidity has 7 instances (7/14) as "High", 7 instances (7/14) as "Normal"
2. Humidity = High, Play Tennis = Yes, we have 3 instances (3/7)
Humidity = High, Play Tennis = No, we have 4 instances (4/7)
Hence, the Gini Index comes out to be:
              = 1 - ((3/7)^2+(4/7)^2)
              = 0.49
3. Humidity = Normal, Play Tennis = Yes, we have 6 instances (6/7)
Humidity = Normal, Play Tennis = No, we have 1 instance (1/7)
Hence, the Gini Index comes out to be:
               = 1 - ((6/7)^2+(1/7)^2)
               = 0.24
4. So the final Gini Index is:
               = (7/14)*0.49 + (7/14)*0.24
               = 0.365
 

Calculations of Gini Index of "Wind" feature

 
1. We can see that, Wind has 8 instances (8/14) as "Weak", 6 instances (6/14) as "Strong"
2. Wind = Strong, Play Tennis = Yes, we have 3 instances (3/6)
Wind = Strong, Play Tennis = No, we have 4 instances (3/6)
Hence, the Gini Index comes out to be:
              = 1 - ((3/6)^2+(3/6)^2)
              = 0.5
3. Wind = Weak, Play Tennis = Yes, we have 6 instances (6/8)
Wind = Weak, Play Tennis = No, we have 2 instances (2/8)
Hence, the Gini Index comes out to be:
               = 1 - ((6/8)^2+(2/8)^2)
               = 0.375
4. So the final Gini Index is:
               = (8/14)*0.375 + (6/14)*0.5
               = 0.43
 
Since the Gini Index of Outlook is the smallest, we choose "outlook"
 

Decision Tree using Conjunction (AND Operation)

 
conjunction 

Decision Tree using Disjunction (OR Operations)

 
disjunction 
 

Decision Tree using XOR

 
xor 
 

Decision Tree using Disjunction of Conjunctions (OR of AND Operations)

 
disjunctionsofconjunctions
 
Let us look at a rough decision tree, to look at how exactly the decision tree will look like
 
decisionTree
 
From the above trees we aren't able to clearly infer anything, so let me show you how exactly the Decision Tree will look for the given dataset and then we will try to understand, what exactly that tree means and how to use the tree to our benefit.
 
finalDecisionTree 
 
You may be wondering why did we take "Outlook" as root node", it is because of the Gini Index value, calculated earlier
 
Following is the Decision Tree with the Weights which can be used to calculate entropy and information gain.
 
dt1
 
 
Now let's try to understand the tree,
 
R1: If (Outlook=Sunny) _| (Humidity=High) Then PlayTennis=No ​
The above statement means that if Outlook is Sunny and Humidity is high, then we do not play
 
R2: If (Outlook=Sunny) _| (Humidity=Normal) Then PlayTennis=Yes​
The above statement means that if Outlook is Sunny and Humidity is normal, then we play
 
R3: If (Outlook=Overcast) Then PlayTennis=Yes ​
The above statement means that if Outlook is Overcast, then we play
 
R4: If (Outlook=Rain) -|  (Wind=Strong) Then PlayTennis=Yes
The above statement means that if Outlook is Rain or Wild is Strong, then we play
 

Python Implementation of Decision Tree

 
Let's take the example of the IRIS dataset, you can directly import it from the sklearn dataset repository. Feel free to use any dataset, there some very good datasets available on kaggle and with Google Colab.
 
Before we start with this, it is highly recommended you read the following tutorials
  1. Python Pandas
  2. Python Numpy
  3. Python Scikit Learn
  4. Python MatPlotLib
  5. Python Seaborn
  6. Python Tensorflow
The dataset that I will be using here is UCI Zoo Dataset, you can download it from the article
 

1. Using NumPy

  1. import pandas as pd  
  2. import numpy as np  
  3. from pprint import pprint  
  4. #Import the dataset and define the feature as well as the target datasets / columns#  
  5. dataset = pd.read_csv('zoo.csv',  
  6.                       names=['animal_name','hair','feathers','eggs','milk',  
  7.                                                    'airbone','aquatic','predator','toothed','backbone',  
  8.                                                   'breathes','venomous','fins','legs','tail','domestic','catsize','class',])#Import all columns omitting the fist which consists the names of the animals  
  9. #We drop the animal names since this is not a good feature to split the data on  
  10. dataset=dataset.drop('animal_name',axis=1)  
  11.   
  12. def entropy(target_col):  
  13.     
  14.     elements,counts = np.unique(target_col,return_counts = True)  
  15.     entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts)) for i in range(len(elements))])  
  16.     return entropy  
  17.   
  18. def InfoGain(data,split_attribute_name,target_name="class"):  
  19.          
  20.     #Calculate the entropy of the total dataset  
  21.     total_entropy = entropy(data[target_name])  
  22.       
  23.     ##Calculate the entropy of the dataset  
  24.       
  25.     #Calculate the values and the corresponding counts for the split attribute   
  26.     vals,counts= np.unique(data[split_attribute_name],return_counts=True)  
  27.       
  28.     #Calculate the weighted entropy  
  29.     Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])  
  30.       
  31.     #Calculate the information gain  
  32.     Information_Gain = total_entropy - Weighted_Entropy  
  33.     return Information_Gain  
  34.   
  35. def ID3(data,originaldata,features,target_attribute_name="class",parent_node_class = None):  
  36.   
  37.     #Define the stopping criteria --> If one of this is satisfied, we want to return a leaf node#  
  38.       
  39.     #If all target_values have the same value, return this value  
  40.     if len(np.unique(data[target_attribute_name])) <= 1:  
  41.         return np.unique(data[target_attribute_name])[0]  
  42.       
  43.     #If the dataset is empty, return the mode target feature value in the original dataset  
  44.     elif len(data)==0:  
  45.         return np.unique(originaldata[target_attribute_name])[np.argmax(np.unique(originaldata[target_attribute_name],return_counts=True)[1])]  
  46.       
  47.     #If the feature space is empty, return the mode target feature value of the direct parent node --> Note that  
  48.     #the direct parent node is that node which has called the current run of the ID3 algorithm and hence  
  49.     #the mode target feature value is stored in the parent_node_class variable.  
  50.       
  51.     elif len(features) ==0:  
  52.         return parent_node_class  
  53.       
  54.     #If none of the above holds true, grow the tree!  
  55.       
  56.     else:  
  57.         #Set the default value for this node --> The mode target feature value of the current node  
  58.         parent_node_class = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],return_counts=True)[1])]  
  59.           
  60.         #Select the feature which best splits the dataset  
  61.         item_values = [InfoGain(data,feature,target_attribute_name) for feature in features] #Return the information gain values for the features in the dataset  
  62.         best_feature_index = np.argmax(item_values)  
  63.         best_feature = features[best_feature_index]  
  64.           
  65.         #Create the tree structure. The root gets the name of the feature (best_feature) with the maximum information  
  66.         #gain in the first run  
  67.         tree = {best_feature:{}}  
  68.           
  69.           
  70.         #Remove the feature with the best inforamtion gain from the feature space  
  71.         features = [i for i in features if i != best_feature]  
  72.           
  73.         #Grow a branch under the root node for each possible value of the root node feature  
  74.           
  75.         for value in np.unique(data[best_feature]):  
  76.             value = value  
  77.             #Split the dataset along the value of the feature with the largest information gain and therwith create sub_datasets  
  78.             sub_data = data.where(data[best_feature] == value).dropna()  
  79.               
  80.             #Call the ID3 algorithm for each of those sub_datasets with the new parameters --> Here the recursion comes in!  
  81.             subtree = ID3(sub_data,dataset,features,target_attribute_name,parent_node_class)  
  82.               
  83.             #Add the sub tree, grown from the sub_dataset to the tree under the root node  
  84.             tree[best_feature][value] = subtree  
  85.               
  86.         return(tree)      
  87.                   
  88. def predict(query,tree,default = 1):  
  89.     
  90.     for key in list(query.keys()):  
  91.         if key in list(tree.keys()):  
  92.             
  93.             try:  
  94.                 result = tree[key][query[key]]   
  95.             except:  
  96.                 return default  
  97.     
  98.             result = tree[key][query[key]]  
  99.             
  100.             if isinstance(result,dict):  
  101.                 return predict(query,result)  
  102.             else:  
  103.                 return result  
  104.   
  105. def train_test_split(dataset):  
  106.     training_data = dataset.iloc[:80].reset_index(drop=True)#We drop the index respectively relabel the index  
  107.     #starting form 0, because we do not want to run into errors regarding the row labels / indexes  
  108.     testing_data = dataset.iloc[80:].reset_index(drop=True)  
  109.     return training_data,testing_data  
  110.   
  111. training_data = train_test_split(dataset)[0]  
  112. testing_data = train_test_split(dataset)[1]   
  113.   
  114. def test(data,tree):  
  115.     #Create new query instances by simply removing the target feature column from the original dataset and   
  116.     #convert it to a dictionary  
  117.     queries = data.iloc[:,:-1].to_dict(orient = "records")  
  118.       
  119.     #Create a empty DataFrame in whose columns the prediction of the tree are stored  
  120.     predicted = pd.DataFrame(columns=["predicted"])   
  121.       
  122.     #Calculate the prediction accuracy  
  123.     for i in range(len(data)):  
  124.         predicted.loc[i,"predicted"] = predict(queries[i],tree,1.0)   
  125.     print('The prediction accuracy is: ',(np.sum(predicted["predicted"] == data["class"])/len(data))*100,'%')  
  126.       
  127. tree = ID3(training_data,training_data,training_data.columns[:-1])  
  128. pprint(tree)  
  129. test(testing_data,tree) 
the output that I got is 
 
{'legs': {0: {'fins': {0.0: {'toothed': {0.0: 7.0, 1.0: 3.0}},
                              1.0: {'eggs': {0.0: 1.0, 1.0: 4.0}}}},
              2: {'hair': {0.0: 2.0, 1.0: 1.0}},
             4: {'hair': {0.0: {'toothed': {0.0: 7.0, 1.0: 5.0}}, 1.0: 1.0}},
             6: {'aquatic': {0.0: 6.0, 1.0: 7.0}}, 8: 7.0}}
 
The prediction accuracy is: 85.71428571428571 %
 

2. Using SKLearn

  1. #Import the DecisionTreeClassifier  
  2. from sklearn.tree import DecisionTreeClassifier  
  3.   
  4. #Import the dataset   
  5. dataset = pd.read_csv('zoo.csv',    
  6.                       names=['animal_name','hair','feathers','eggs','milk',    
  7.                                                    'airbone','aquatic','predator','toothed','backbone',    
  8.                                                   'breathes','venomous','fins','legs','tail','domestic','catsize','class',])  
  9. #We drop the animal names since this is not a good feature to split the data on  
  10. dataset=dataset.drop('animal_name',axis=1)  
  11.   
  12. train_features = dataset.iloc[:80,:-1]  
  13. test_features = dataset.iloc[80:,:-1]  
  14. train_targets = dataset.iloc[:80,-1]  
  15. test_targets = dataset.iloc[80:,-1]  
  16.   
  17. tree = DecisionTreeClassifier(criterion = 'entropy').fit(train_features,train_targets)  
  18.   
  19. prediction = tree.predict(test_features)  
  20.   
  21. print("The prediction accuracy is: ",tree.score(test_features,test_targets)*100,"%"
The output that I got is:
 
The prediction accuracy is: 80.95238095238095 %  
 

3. Using TensorFlow

  1. import numpy as np  
  2. import pandas as pd  
  3. import tensorflow as tf  
  4. from functools import reduce  
  5. import matplotlib.pyplot as plt  
  6. %matplotlib inline  
  7.   
  8. np.random.seed(1943)  
  9. tf.set_random_seed(1943)  
  10.   
  11. iris = [((5.13.51.40.2), (100)),  
  12.         ((4.93.01.40.2), (100)),  
  13.         ((4.73.21.30.2), (100)),  
  14.         ((4.63.11.50.2), (100)),  
  15.         ((5.03.61.40.2), (100)),  
  16.         ((5.43.91.70.4), (100)),  
  17.         ((4.63.41.40.3), (100)),  
  18.         ((5.03.41.50.2), (100)),  
  19.         ((4.42.91.40.2), (100)),  
  20.         ((4.93.11.50.1), (100)),  
  21.         ((5.43.71.50.2), (100)),  
  22.         ((4.83.41.60.2), (100)),  
  23.         ((4.83.01.40.1), (100)),  
  24.         ((4.33.01.10.1), (100)),  
  25.         ((5.84.01.20.2), (100)),  
  26.         ((5.74.41.50.4), (100)),  
  27.         ((5.43.91.30.4), (100)),  
  28.         ((5.13.51.40.3), (100)),  
  29.         ((5.73.81.70.3), (100)),  
  30.         ((5.13.81.50.3), (100)),  
  31.         ((5.43.41.70.2), (100)),  
  32.         ((5.13.71.50.4), (100)),  
  33.         ((4.63.61.00.2), (100)),  
  34.         ((5.13.31.70.5), (100)),  
  35.         ((4.83.41.90.2), (100)),  
  36.         ((5.03.01.60.2), (100)),  
  37.         ((5.03.41.60.4), (100)),  
  38.         ((5.23.51.50.2), (100)),  
  39.         ((5.23.41.40.2), (100)),  
  40.         ((4.73.21.60.2), (100)),  
  41.         ((4.83.11.60.2), (100)),  
  42.         ((5.43.41.50.4), (100)),  
  43.         ((5.24.11.50.1), (100)),  
  44.         ((5.54.21.40.2), (100)),  
  45.         ((4.93.11.50.1), (100)),  
  46.         ((5.03.21.20.2), (100)),  
  47.         ((5.53.51.30.2), (100)),  
  48.         ((4.93.11.50.1), (100)),  
  49.         ((4.43.01.30.2), (100)),  
  50.         ((5.13.41.50.2), (100)),  
  51.         ((5.03.51.30.3), (100)),  
  52.         ((4.52.31.30.3), (100)),  
  53.         ((4.43.21.30.2), (100)),  
  54.         ((5.03.51.60.6), (100)),  
  55.         ((5.13.81.90.4), (100)),  
  56.         ((4.83.01.40.3), (100)),  
  57.         ((5.13.81.60.2), (100)),  
  58.         ((4.63.21.40.2), (100)),  
  59.         ((5.33.71.50.2), (100)),  
  60.         ((5.03.31.40.2), (100)),  
  61.         ((7.03.24.71.4), (010)),  
  62.         ((6.43.24.51.5), (010)),  
  63.         ((6.93.14.91.5), (010)),  
  64.         ((5.52.34.01.3), (010)),  
  65.         ((6.52.84.61.5), (010)),  
  66.         ((5.72.84.51.3), (010)),  
  67.         ((6.33.34.71.6), (010)),  
  68.         ((4.92.43.31.0), (010)),  
  69.         ((6.62.94.61.3), (010)),  
  70.         ((5.22.73.91.4), (010)),  
  71.         ((5.02.03.51.0), (010)),  
  72.         ((5.93.04.21.5), (010)),  
  73.         ((6.02.24.01.0), (010)),  
  74.         ((6.12.94.71.4), (010)),  
  75.         ((5.62.93.61.3), (010)),  
  76.         ((6.73.14.41.4), (010)),  
  77.         ((5.63.04.51.5), (010)),  
  78.         ((5.82.74.11.0), (010)),  
  79.         ((6.22.24.51.5), (010)),  
  80.         ((5.62.53.91.1), (010)),  
  81.         ((5.93.24.81.8), (010)),  
  82.         ((6.12.84.01.3), (010)),  
  83.         ((6.32.54.91.5), (010)),  
  84.         ((6.12.84.71.2), (010)),  
  85.         ((6.42.94.31.3), (010)),  
  86.         ((6.63.04.41.4), (010)),  
  87.         ((6.82.84.81.4), (010)),  
  88.         ((6.73.05.01.7), (010)),  
  89.         ((6.02.94.51.5), (010)),  
  90.         ((5.72.63.51.0), (010)),  
  91.         ((5.52.43.81.1), (010)),  
  92.         ((5.52.43.71.0), (010)),  
  93.         ((5.82.73.91.2), (010)),  
  94.         ((6.02.75.11.6), (010)),  
  95.         ((5.43.04.51.5), (010)),  
  96.         ((6.03.44.51.6), (010)),  
  97.         ((6.73.14.71.5), (010)),  
  98.         ((6.32.34.41.3), (010)),  
  99.         ((5.63.04.11.3), (010)),  
  100.         ((5.52.54.01.3), (010)),  
  101.         ((5.52.64.41.2), (010)),  
  102.         ((6.13.04.61.4), (010)),  
  103.         ((5.82.64.01.2), (010)),  
  104.         ((5.02.33.31.0), (010)),  
  105.         ((5.62.74.21.3), (010)),  
  106.         ((5.73.04.21.2), (010)),  
  107.         ((5.72.94.21.3), (010)),  
  108.         ((6.22.94.31.3), (010)),  
  109.         ((5.12.53.01.1), (010)),  
  110.         ((5.72.84.11.3), (010)),  
  111.         ((6.33.36.02.5), (001)),  
  112.         ((5.82.75.11.9), (001)),  
  113.         ((7.13.05.92.1), (001)),  
  114.         ((6.32.95.61.8), (001)),  
  115.         ((6.53.05.82.2), (001)),  
  116.         ((7.63.06.62.1), (001)),  
  117.         ((4.92.54.51.7), (001)),  
  118.         ((7.32.96.31.8), (001)),  
  119.         ((6.72.55.81.8), (001)),  
  120.         ((7.23.66.12.5), (001)),  
  121.         ((6.53.25.12.0), (001)),  
  122.         ((6.42.75.31.9), (001)),  
  123.         ((6.83.05.52.1), (001)),  
  124.         ((5.72.55.02.0), (001)),  
  125.         ((5.82.85.12.4), (001)),  
  126.         ((6.43.25.32.3), (001)),  
  127.         ((6.53.05.51.8), (001)),  
  128.         ((7.73.86.72.2), (001)),  
  129.         ((7.72.66.92.3), (001)),  
  130.         ((6.02.25.01.5), (001)),  
  131.         ((6.93.25.72.3), (001)),  
  132.         ((5.62.84.92.0), (001)),  
  133.         ((7.72.86.72.0), (001)),  
  134.         ((6.32.74.91.8), (001)),  
  135.         ((6.73.35.72.1), (001)),  
  136.         ((7.23.26.01.8), (001)),  
  137.         ((6.22.84.81.8), (001)),  
  138.         ((6.13.04.91.8), (001)),  
  139.         ((6.42.85.62.1), (001)),  
  140.         ((7.23.05.81.6), (001)),  
  141.         ((7.42.86.11.9), (001)),  
  142.         ((7.93.86.42.0), (001)),  
  143.         ((6.42.85.62.2), (001)),  
  144.         ((6.32.85.11.5), (001)),  
  145.         ((6.12.65.61.4), (001)),  
  146.         ((7.73.06.12.3), (001)),  
  147.         ((6.33.45.62.4), (001)),  
  148.         ((6.43.15.51.8), (001)),  
  149.         ((6.03.04.81.8), (001)),  
  150.         ((6.93.15.42.1), (001)),  
  151.         ((6.73.15.62.4), (001)),  
  152.         ((6.93.15.12.3), (001)),  
  153.         ((5.82.75.11.9), (001)),  
  154.         ((6.83.25.92.3), (001)),  
  155.         ((6.73.35.72.5), (001)),  
  156.         ((6.73.05.22.3), (001)),  
  157.         ((6.32.55.01.9), (001)),  
  158.         ((6.53.05.22.0), (001)),  
  159.         ((6.23.45.42.3), (001)),  
  160.         ((5.93.05.11.8), (001))]  
  161.   
  162. feature = np.vstack([np.array(i[0]) for i in iris])  
  163. label = np.vstack([np.array(i[1]) for i in iris])  
  164.   
  165. x = feature[:, 2:4]  # use "Petal length" and "Petal width" only  
  166. y = label  
  167. d = x.shape[1]  
  168.   
  169. def tf_kron_prod(a, b):  
  170.     res = tf.einsum('ij,ik->ijk', a, b)  
  171.     res = tf.reshape(res, [-1, tf.reduce_prod(res.shape[1:])])  
  172.     return res  
  173.   
  174.   
  175. def tf_bin(x, cut_points, temperature=0.1):  
  176.     # x is a N-by-1 matrix (column vector)  
  177.     # cut_points is a D-dim vector (D is the number of cut-points)  
  178.     # this function produces a N-by-(D+1) matrix, each row has only one element being one and the rest are all zeros  
  179.     D = cut_points.get_shape().as_list()[0]  
  180.     W = tf.reshape(tf.linspace(1.0, D + 1.0, D + 1), [1, -1])  
  181.     cut_points = tf.contrib.framework.sort(cut_points)  # make sure cut_points is monotonically increasing  
  182.     b = tf.cumsum(tf.concat([tf.constant(0.0, shape=[1]), -cut_points], 0))  
  183.     h = tf.matmul(x, W) + b  
  184.     res = tf.nn.softmax(h / temperature)  
  185.     return res  
  186.   
  187.   
  188. def nn_decision_tree(x, cut_points_list, leaf_score, temperature=0.1):  
  189.     # cut_points_list contains the cut_points for each dimension of feature  
  190.     leaf = reduce(tf_kron_prod,  
  191.                   map(lambda z: tf_bin(x[:, z[0]:z[0] + 1], z[1], temperature), enumerate(cut_points_list)))  
  192.     return tf.matmul(leaf, leaf_score)  
  193.   
  194. num_cut = [11]  # "Petal length" and "Petal width"  
  195. num_leaf = np.prod(np.array(num_cut) + 1)  
  196. num_class = 3  
  197.   
  198. sess = tf.InteractiveSession()  
  199. x_ph = tf.placeholder(tf.float32, [None, d])  
  200. y_ph = tf.placeholder(tf.float32, [None, num_class])  
  201. cut_points_list = [tf.Variable(tf.random_uniform([i])) for i in num_cut]  
  202. leaf_score = tf.Variable(tf.random_uniform([num_leaf, num_class]))  
  203. y_pred = nn_decision_tree(x_ph, cut_points_list, leaf_score, temperature=0.1)  
  204. loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(logits=y_pred, onehot_labels=y_ph))  
  205. opt = tf.train.AdamOptimizer(0.1)  
  206. train_step = opt.minimize(loss)  
  207. sess.run(tf.global_variables_initializer())  
  208. for i in range(1000):  
  209.     _, loss_e = sess.run([train_step, loss], feed_dict={x_ph: x, y_ph: y})  
  210.     if i % 200 == 0:  
  211.         print(loss_e)  
  212. print('error rate %.2f' % (1 - np.mean(np.argmax(y_pred.eval(feed_dict={x_ph: x}), axis=1) == np.argmax(y, axis=1))))  
  213. sample_x0 = np.repeat(np.linspace(0, np.max(x[:,0]), 100), 100).reshape(-1,1)  
  214. sample_x1 = np.tile(np.linspace(0, np.max(x[:,1]), 100).reshape(-1,1), [100,1])  
  215. sample_x = np.hstack([sample_x0, sample_x1])  
  216. sample_label = np.argmax(y_pred.eval(feed_dict={x_ph: sample_x}), axis=1)  
  217. plt.figure(figsize=(8,8))  
  218. plt.scatter(x[:,0],   
  219.             x[:,1],   
  220.             c=np.argmax(y, axis=1),   
  221.             marker='o',  
  222.             s=50,  
  223.             cmap='summer',   
  224.             edgecolors='black')  
  225. plt.scatter(sample_x0.flatten(),   
  226.             sample_x1.flatten(),   
  227.             c=sample_label.flatten(),   
  228.             marker='D',  
  229.             s=20,  
  230.             cmap='summer',   
  231.             edgecolors='none',  
  232.             alpha=0.33
The output that I got is
 
1.1529802
0.32481167
0.1562062
0.15097024
0.118773825
error rate 0.04  
 
tf_image 
 

Conclusion

 
In this article, we studied what is decision tree, when is decision tree used, assumptions of decision tree, key terms, how to make decision tree, advantages of decision tree, disadvantages of decision tree, types of decision tree, what is the importance of decision tree, regression tree vs classification tree, entropy, information gain and gini index calculations, decision tree example, python implementation of decision tree using sklearn, numpy, and TensorFlow. Hope you were able to understand each and everything. For any doubts, please comment on your query.
 
In the next article, we will learn about Naive Bayes.
 
Congratulations!!! You have climbed your next step in becoming a successful ML Engineer.
 
Next Article In this Series >> Naive Bayes