Mastering Gradient Boosting for Regression

Gradient Boosting is a popular and powerful machine learning algorithm that has gained significant attention in the field of predictive modeling. It has proven to be highly effective in various domains, including regression, classification, and ranking tasks.

In this article, we will explore the fundamentals of Gradient Boosting, and understand how it works for regression.

Consider the following dataset as an illustrative example:

[Image: example dataset]

Before delving into the intricacies of how the algorithm functions, let’s familiarize ourselves with some key terms used in this context.

Decision Tree: A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It is a flowchart-like model where each internal node represents a feature or attribute, each branch represents a decision based on that feature, and each leaf node represents the outcome or prediction.
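As a minimal, hypothetical sketch of this idea in Python (assuming scikit-learn is available; the feature and target values below are made up purely for illustration), a regression tree can be built and queried like this:

from sklearn.tree import DecisionTreeRegressor

# Made-up data: one feature (height in metres) and a numeric target (weight in kg)
heights = [[1.4], [1.5], [1.6], [1.7], [1.8]]
weights = [52, 58, 65, 74, 83]

# A shallow tree keeps the flowchart small and readable
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(heights, weights)

# The prediction is the value stored in whichever leaf the new sample falls into
print(tree.predict([[1.65]]))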

[Image: sample decision tree]

In the image provided, the “Female” node serves as a decision point with two branches, one for “yes” and one for “no.” If a new data point satisfies the condition of being female, it follows the “yes” branch; otherwise, it follows the “no” branch.

Residuals: Residuals in the context of statistics and machine learning refer to the differences between the observed values and the predicted values. In regression analysis, the residuals represent the deviation or error of the actual data points from the fitted regression line.

To calculate residuals, you subtract the predicted value (obtained from a regression model) from the corresponding observed value. The residual can be positive or negative, indicating whether the observed value is higher or lower than the predicted value.

Suppose we have a sequence of values: 20, 35, 66, 15, and 62 in that order. The average of these values is calculated as 39.6. If we subtract this average from the first data point, which is 20, the resulting residual would be -19.6.
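The same calculation, written as a quick sketch in plain Python:

values = [20, 35, 66, 15, 62]

# Here the "prediction" is simply the average of the observed values
average = sum(values) / len(values)        # 39.6

# Residual = observed value - predicted value
residuals = [v - average for v in values]
print(round(residuals[0], 1))              # -19.6 for the first data point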

Gradient descent: Gradient descent is an optimization algorithm commonly used in machine learning and mathematical optimization. It is used to find the minimum (or maximum) of a function by iteratively updating the parameters or variables based on the gradient of the function.

The basic idea of gradient descent is to start with an initial guess for the parameters and then repeatedly adjust them in the direction opposite to the gradient, taking small steps until the function reaches (or gets close to) its minimum.
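As a toy illustration of the idea (the function f(x) = (x - 3)² and the starting point are chosen arbitrarily here), the loop below repeatedly steps against the gradient until it settles near the minimum:

# Minimise f(x) = (x - 3)**2, whose gradient is f'(x) = 2 * (x - 3)
learning_rate = 0.1
x = 0.0                                  # initial guess

for _ in range(100):
    gradient = 2 * (x - 3)               # slope of f at the current x
    x = x - learning_rate * gradient     # step in the opposite direction of the gradient

print(round(x, 4))                       # approaches the minimum at x = 3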

[Image: gradient descent illustration]

Based on the given example, let’s explore how this algorithm works:

Step 1. Calculate the average weight. The weight attribute is used here because we intend to train the model to predict weight.

[Image: dataset with the average weight calculated]

The average of the weight attribute is 71.2.

Step 2. Find the residual of each data point.

To find the residual of each data point, we subtract the average from the weight of each data point. The result is as follows:

[Image: dataset with the residual of each data point]
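Steps 1 and 2 can be expressed in a few lines of Python. The weight values below are hypothetical stand-ins for the table in the figure, chosen only so that the numbers line up with the 71.2 average and the 16.8 and 4.8 residuals quoted in this article:

# Hypothetical weight column (the real table lives in the figures above)
weights = [88, 76, 56, 73, 77, 57]

# Step 1: the initial prediction is simply the average weight
average_weight = sum(weights) / len(weights)
print(round(average_weight, 1))                              # 71.2

# Step 2: residual = observed weight - average weight
residuals = [round(w - average_weight, 1) for w in weights]
print(residuals)                                             # [16.8, 4.8, ...]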

Step 3. Make a decision tree.

In this example, the root node splits on the gender “Female.” The tree branches into two derived nodes: one based on the condition “Height < 1.6” and the other based on the condition “Color not Blue.” These derived nodes serve as additional decision points that further split the data based on the specified criteria.

[Image: decision tree with the conditions “Female,” “Height < 1.6,” and “Color not Blue”]

Let’s get the residuals based on the above conditions.

[Image: residuals grouped under each condition]

Next, for each leaf that ends up with multiple residuals, we calculate their average.

[Image: average of the residuals in each leaf]

By replacing the multiple residuals with their average value, the decision tree is transformed into the following structure.

[Image: decision tree with averaged residuals in its leaves]

Now let’s run the first data point of the dataset through this tree.

[Image: first data point traced down the tree]

For the first data point, the conditions are Gender = Male and Color = Blue. Following these conditions down the decision tree, we reach a residual value of 16.8. When we add this residual to the average weight calculated in Step 1 (71.2), we obtain the exact weight of that data point, which is 88. However, this poses a problem: the model hasn’t learned the underlying patterns; it has memorized the training data. This issue is commonly referred to as the Overfitting Problem, where the model becomes too specific to the training data and performs poorly on new, unseen data.

What is the Overfitting Problem?

Overfitting is a common condition in machine learning where a model performs exceptionally well on the training data but fails to generalize well to new, unseen data. In other words, the model “memorizes” the training examples instead of learning the underlying patterns or relationships.

To overcome this problem, we scale down the algorithm’s updates using a learning rate.

Learning Rate

The learning rate is a hyperparameter that controls the step size or the rate at which a model learns during the optimization process, particularly in gradient-based optimization algorithms like gradient descent. 

The learning rate typically ranges from 0 to 1, although it is not limited to this range and can be any positive value.

In this example, we will consider the learning rate as 0.1, denoted by the symbol α (alpha).

Now let’s apply α = 0.1 in the formula: New prediction = Average + (α x residual of that data point).

=> 71.2 + (0.1 x 16.8) // For the first data point

=> 72.9 // This is the new prediction value

=> 71.2 + (0.1 x 4.8) // For the second data point

=> 71.7 // This is the new prediction value

Repeat this step for all the residuals in the dataset to obtain the new prediction value of every data point.
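The same update, written as a small Python sketch using the rounded numbers quoted above:

learning_rate = 0.1                 # alpha, as chosen above
average_weight = 71.2               # from Step 1
residuals = [16.8, 4.8]             # first two residuals from Step 2

# New prediction = Average + (alpha x residual of that data point)
new_predictions = [average_weight + learning_rate * r for r in residuals]
print([round(p, 2) for p in new_predictions])   # [72.88, 71.68] -> 72.9 and 71.7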

Step 4. Find the updated residuals.

By subtracting the new prediction value from the weights in the dataset, we can obtain the updated residuals.

[Image: dataset with updated residuals]

With these updated residuals, we need to construct a new decision tree using the same condition as before.

[Image: new decision tree built on the updated residuals]

Next, for each leaf that ends up with multiple updated residuals, we calculate their average.

[Image: average of the updated residuals in each leaf]

By replacing the multiple residuals with their average value, the decision tree is transformed into the following structure.

[Image: new decision tree with averaged residuals in its leaves]

Now let’s run the first data point through this new tree.

[Image: first data point traced down the new tree]

For the first data point, the conditions are again Gender = Male and Color = Blue. Following these conditions down the new decision tree, we obtain an updated residual value of 15.1. Now let’s substitute the values into the formula below:

New prediction = Average (calculated in Step 1) + (α x residual of that data point) + (α x updated residual of that data point)

=> 71.2 + (0.1 x 16.8) + (0.1 x 15.1) // For the first data point

=> 74.4 // This is the new prediction value

Previously, we obtained a value of 72.9, and now with the updated residuals, we have a value of 74.4. Gradually, our algorithm is progressing towards the observed value of 88.
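Plugging the article’s numbers into this formula, as a quick check in Python:

learning_rate = 0.1
average_weight = 71.2        # from Step 1
first_residual = 16.8        # residual used by the first tree
updated_residual = 15.1      # updated residual from the second tree

# Average + (alpha x residual) + (alpha x updated residual)
prediction = average_weight + learning_rate * first_residual + learning_rate * updated_residual
print(round(prediction, 1))  # 74.4, moving closer to the observed weight of 88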

By iteratively repeating these steps, we keep adding trees, and the predictions move steadily closer to the observed values.

So this is how the Gradient Boosting algorithm for regression works.
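Putting all the steps together, the sketch below implements the whole procedure described above with scikit-learn’s DecisionTreeRegressor. The feature encoding and values are hypothetical (the article’s actual table is in the figures), and in practice the same idea is available ready-made as sklearn.ensemble.GradientBoostingRegressor:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features (height, is_male, is_blue) and weights for six people
X = np.array([[1.6, 1, 1],
              [1.6, 0, 0],
              [1.5, 0, 1],
              [1.8, 1, 0],
              [1.5, 0, 0],
              [1.4, 1, 1]])
y = np.array([88, 76, 56, 73, 77, 57])

learning_rate = 0.1
n_trees = 100

# Step 1: start every prediction at the average weight
prediction = np.full(len(y), y.mean())

trees = []
for _ in range(n_trees):
    # Step 2: residuals of the current predictions
    residuals = y - prediction
    # Step 3: fit a small regression tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    trees.append(tree)
    # Step 4: nudge the predictions by a fraction (the learning rate) of the tree's output
    prediction = prediction + learning_rate * tree.predict(X)

print(np.round(prediction, 1))   # gradually approaches the observed weights

To score a new data point, you would start from the same average and add learning_rate times the prediction of each stored tree, exactly as in the formula above.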
