Two-Class Boosted Decision Tree

Overview

The Two-Class Boosted Decision Tree module creates a machine learning model based on the boosted decision trees algorithm. A boosted decision tree is an ensemble learning method in which the second tree corrects the errors of the first tree, the third tree corrects the errors of the first and second trees, and so forth. Predictions are made by the entire ensemble of trees together.

  • Boosted decision trees are among the easiest methods with which to get top performance on a wide variety of machine learning tasks.
  • The model selects the optimal tree using an arbitrary differentiable loss function.
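The core idea, each new tree fit to the errors left by the ensemble so far, can be sketched with a toy booster. This is an illustrative sketch using one-split "stumps" on a tiny regression problem, not the module's actual implementation:

```python
# Minimal sketch of boosting: each round fits a one-split "stump" to the
# residual errors of the ensemble so far, then adds a damped version of
# that stump's predictions to the running total.

def fit_stump(x, residual):
    """Find the single threshold split that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]

learning_rate = 0.5
pred = [sum(y) / len(y)] * len(x)     # start from the mean prediction
for _ in range(30):                   # each round corrects remaining error
    residual = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residual)
    pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

After a few dozen rounds the remaining error is close to zero, because every new stump targets exactly what the ensemble so far still gets wrong.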

How to Configure a Boosted Tree Model

Step 1

Add the Two-Class Boosted Decision Tree module to the experiment.

Step 2

Specify how you want the model to be trained by setting the Create trainer mode option.

  • Single Parameter
    If you know how you want to configure the model, you can provide a specific set of values as arguments.

  • Parameter Range
    If you are not sure of the best parameters, you can find the optimal parameters by specifying multiple values and using the Tune Model Hyperparameters module to find the optimal configuration. The trainer will iterate over multiple combinations of the settings you provided and determine the combination of values that produces the best model.
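What a parameter sweep does under the hood can be sketched as a plain grid search: try every combination from the ranges you supply and keep the settings with the best validation score. Here `train_and_score` is a hypothetical stand-in for training the module and scoring it on validation data:

```python
# Sketch of a parameter-range sweep: iterate over every combination of
# the supplied settings and keep the best-scoring one.
from itertools import product

def train_and_score(learning_rate, num_trees):
    # Hypothetical scoring function; a real sweep would train a model and
    # measure validation accuracy. Here we fake a smooth score surface
    # that peaks at learning_rate=0.2, num_trees=100.
    return 1.0 - abs(learning_rate - 0.2) - abs(num_trees - 100) / 1000

param_grid = {
    "learning_rate": [0.05, 0.1, 0.2, 0.4],
    "num_trees": [20, 100, 500],
}

best_score, best_params = float("-inf"), None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = train_and_score(**params)
    if score > best_score:
        best_score, best_params = score, params
```

The Tune Model Hyperparameters module performs this kind of iteration for you, including random sampling of the ranges rather than only an exhaustive grid.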

Step 3

For the maximum number of leaves per tree, indicate the maximum number of terminal nodes (leaves) that can be created in any tree.

  • By increasing this value, you potentially increase the size of the tree and get better precision, at the risk of overfitting and longer training time.

Step 4

For the minimum number of samples per leaf node, indicate the number of cases required to create any terminal node (leaf) in a tree.

  • By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data would have to contain at least 5 cases that meet the same conditions.
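The two leaf settings from Steps 3 and 4 act as gates on tree growth. A minimal sketch of that gating logic (the function and its defaults are illustrative, not the module's internals):

```python
# Toy illustration of the two leaf constraints: a candidate split is only
# accepted if both children meet the minimum sample count, and growth
# stops once the tree reaches its maximum leaf budget.

def can_split(n_left, n_right, current_leaves,
              min_samples_leaf=5, max_leaf_nodes=20):
    # Splitting one leaf into two adds exactly one leaf overall.
    if current_leaves + 1 > max_leaf_nodes:
        return False
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf

# With a threshold of 1, even a single case can form a leaf:
assert can_split(1, 99, current_leaves=2, min_samples_leaf=1)
# Raising the threshold to 5 rejects that same split:
assert not can_split(1, 99, current_leaves=2, min_samples_leaf=5)
# The leaf budget caps tree size regardless of sample counts:
assert not can_split(50, 50, current_leaves=20, max_leaf_nodes=20)
```

Raising the minimum samples per leaf makes rules more conservative (less overfitting); raising the maximum leaves allows finer-grained rules (more capacity, more risk).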

Step 5

For Learning rate, type a number between 0 and 1 that defines the step size while learning.

  • The learning rate determines how fast or slow the learner converges on the optimal solution. If the step size is too big, you might overshoot the optimal solution. If the step size is too small, training takes longer to converge on the best solution.
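The step-size intuition can be seen directly with gradient descent on a simple one-dimensional function, here f(w) = (w - 3)^2 with its optimum at w = 3. This is an illustrative analogy for the learning-rate tradeoff, not the boosting update itself:

```python
# Step-size intuition: a moderate rate converges to the optimum, too small
# a rate barely moves in the same number of steps, and too large a rate
# overshoots further on every step and diverges.

def run_descent(rate, steps=50, w0=0.0):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)^2
        w -= rate * grad
    return w

good = run_descent(0.1)     # settles very close to the optimum w = 3
slow = run_descent(0.001)   # same number of steps, still far from 3
big = run_descent(1.1)      # overshoots and diverges
```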

Step 6

For Number of trees constructed, indicate the total number of decision trees to create in the ensemble. By creating more decision trees, you can potentially get better coverage, but training time will increase.

This value also controls the number of trees displayed when visualizing the trained model. If you want to see or print a single tree, you can set the value to 1; however, this means that only one tree will be produced (the tree with the initial set of parameters) and no further iterations will be performed.
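Why a larger ensemble can give better coverage is easiest to see with a voting simulation. Boosted trees are not independent voters, but the same intuition applies: many weak learners that are each somewhat better than chance combine into a much stronger predictor. A toy simulation, not the module itself:

```python
# If each "tree" is a weak voter that is right 60% of the time, the
# majority vote of a larger ensemble is right far more often.
import random

random.seed(0)

def majority_vote_accuracy(num_trees, trials=2000, p_correct=0.6):
    wins = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(num_trees) if random.random() < p_correct)
        if votes * 2 > num_trees:   # strict majority votes correctly
            wins += 1
    return wins / trials

small = majority_vote_accuracy(1)     # about 0.6, a single weak tree
large = majority_vote_accuracy(101)   # well above 0.9 for the ensemble
```

The cost is linear in the number of trees: both training time and scoring time grow with the ensemble size.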

Step 7

For Random number seed, you can type a non-negative integer to use as the random seed value. Specifying a seed ensures reproducibility across runs that have the same data and parameters.

The random seed is set by default to 0, which means the initial seed value is obtained from the system clock.
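The reproducibility guarantee comes from how pseudo-random generators work: the same seed always produces the same sequence of draws. A small stdlib sketch of the principle:

```python
# Reproducibility from a fixed seed: two runs with the same seed (and the
# same data and parameters) make identical pseudo-random draws, so they
# produce identical models. An unseeded/clock-seeded run differs each time.
import random

def draw_sample(seed, n=5):
    rng = random.Random(seed)           # isolated generator per run
    return [rng.randint(0, 999) for _ in range(n)]

run1 = draw_sample(seed=42)
run2 = draw_sample(seed=42)
assert run1 == run2                     # same seed -> same sequence
```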

Step 8

Select the Allow unknown categorical levels option to create a group for unknown values in the training and validation sets.

If you deselect this option, the model can accept only the values that are contained in the training data. If you select it, the model might be less precise for known values, but it can provide better predictions for new (unknown) values.
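The mechanism behind this option can be sketched as a categorical encoder that reserves one extra bucket for values never seen during training. The encoder below is a hypothetical illustration, not the module's actual encoding:

```python
# Sketch of "allow unknown categorical levels": map categories seen in
# training to indices and reserve one extra bucket for anything unseen at
# scoring time, instead of failing on new values.

def build_encoder(training_values, allow_unknown=True):
    levels = {v: i for i, v in enumerate(sorted(set(training_values)))}
    unknown_index = len(levels) if allow_unknown else None

    def encode(value):
        if value in levels:
            return levels[value]
        if unknown_index is None:
            raise ValueError(f"unseen category: {value!r}")
        return unknown_index        # grouped "unknown" bucket
    return encode

encode = build_encoder(["red", "blue", "red", "green"])
known = encode("red")
unseen = encode("purple")           # lands in the reserved unknown bucket
```

With the option deselected, the equivalent of `allow_unknown=False` applies, and an unseen value at scoring time is an error rather than a grouped level.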

Step 9

Train the model.

  • If you set Create trainer mode to Single Parameter, connect a tagged dataset and the Train Model module.
  • If you set Create trainer mode to Parameter Range, connect a tagged dataset and train the model by using the Tune Model Hyperparameters module.

Experiment with an Example

  • Let's start with a blank experiment.

Click New -> Dataset -> From Local File. You will see an upload screen where you can specify the upload file properties, such as the location of the file, the name for the new dataset (we will use Adult.data.csv), the type of file (Generic CSV file with a header), and an optional description for the new dataset.

Once the new dataset has loaded, you will see the Azure ML Studio visual designer screen.


Click on Saved Datasets -> My Datasets -> diabetes_readmit_dataset_cleaned and drag it onto the workflow as the dataset.

Right-click on the Dataset -> Dataset -> Visualize.


In the visualization you can explore more of the data available in the dataset and its different features.

Click on Data Transformation -> Sample and Split -> Split Data. Drag it to the workflow.

Right-click on the Split Data module.

Click on Initialize Model -> Classification -> Two-Class Boosted Decision Tree. Drag it to the workflow.

Click on Machine Learning -> Train -> Train Model. Drag it to the workflow.

Right-click on the Train Model module.

Click on the Launch column selector and add the readmitted_dttm column.

Click on Machine Learning -> Score -> Score Model. Drag it to the workflow.

Right-click on the Score Model and select Visualize.



Click on Machine Learning -> Evaluate -> Evaluate Model. Drag it to the workflow.

Click on the Save button in the bottom pane to save the work we have done so far.

Right-click on the Evaluate Model and select Visualize.


This is an ideal dataset, and the Two-Class Boosted Decision Tree algorithm classifies the classes with an accuracy of 100 percent.

Accuracy = (TP+TN)/(TP+TN+FP+FN) = 1
Type 1 Error = FP/(FP+TN) = 0
Type 2 Error = FN/(FN+TP) = 0
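The three formulas above can be computed directly from the confusion-matrix counts. The specific counts below are illustrative, not taken from the experiment:

```python
# Accuracy, Type 1 error (false-positive rate), and Type 2 error
# (false-negative rate) from confusion-matrix counts.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    type1 = fp / (fp + tn)    # false positives among actual negatives
    type2 = fn / (fn + tp)    # false negatives among actual positives
    return accuracy, type1, type2

# Perfect classifier, matching the ideal result reported above:
acc, t1, t2 = metrics(tp=50, tn=50, fp=0, fn=0)   # (1.0, 0.0, 0.0)

# A more typical imperfect classifier:
metrics(tp=40, tn=45, fp=5, fn=10)                # (0.85, 0.1, 0.2)
```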