Regularization in Machine Learning

In this article, we’ll learn about regularization in machine learning: what it is and why it is so important. We’ll be briefly introduced to implicit regularization and explicit regularization, and thereafter get to know L1 and L2 regularization along with Lasso and Ridge regression.

Overfitting

When a machine learning model is trained, it may focus so heavily on the training set that it attempts to cover every data point, effectively constraining itself to the training data alone. Such a model becomes incapable of performing well on other, unseen data. This phenomenon is known as overfitting.


Source: Wikipedia

As we can see in the image, the blue dots and red dots signify two different prediction classes. The green line shows a decision boundary that tries to cover every red dot exactly; this is what an overfit model looks like. A better model, which would give more accurate and precise predictions on real, unseen inputs, is the smoother black curve: it follows the overall pattern of the data rather than every individual point.
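The same effect can be reproduced in a few lines of code. The following is a minimal sketch using scikit-learn (the article does not name a library, so this particular API is an assumption): a very flexible polynomial model fits the training points better than a simpler one, but typically generalizes worse to fresh data from the same curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying curve.
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 30)

# A very flexible model (degree 15) vs a simpler one (degree 3).
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
simple = make_pipeline(PolynomialFeatures(3), LinearRegression()).fit(X, y)

# The flexible model always fits the training data at least as well...
print(overfit.score(X, y) >= simple.score(X, y))  # True

# ...but on fresh data from the same curve it tends to do worse.
X_new = rng.uniform(0, 1, 100).reshape(-1, 1)
y_new = np.sin(2 * np.pi * X_new).ravel() + rng.normal(0, 0.3, 100)
print("simple:", simple.score(X_new, y_new), "overfit:", overfit.score(X_new, y_new))
```

The degrees 15 and 3 are arbitrary illustrative choices; the point is only that extra flexibility always helps on the training set and often hurts elsewhere.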

Regularization


Source: Wikipedia

Regularization is the process of simplifying the resulting solution. It is mainly used to prevent overfitting and to obtain results for ill-posed problems. Regularization can be categorized in many ways; one common delineation subdivides it into implicit regularization and explicit regularization. By fitting functions appropriately on the training data, regularization helps minimize errors and decreases the possibility of overfitting.

Implicit Regularization

Implicit regularization is arguably the most important form of regularization in modern machine learning. It covers techniques such as discarding outliers, early stopping, and the use of robust loss functions. Implicit regularization also arises in ensemble methods such as gradient-boosted trees, random forests, and XGBoost, and in the stochastic gradient descent used to train deep neural networks.

Explicit Regularization

When we add terms explicitly to an optimization problem, we call it explicit regularization. These terms may take the form of constraints, penalties, or priors. By imposing a cost on the optimization function, such penalty and regularization terms help make the optimal solution stand out as unique. Explicit regularization is commonly used for ill-posed optimization problems.

What is Regularization in Machine Learning?

In machine learning, overfitting is a common outcome that reduces the accuracy and performance of models, and it mainly arises from increased complexity. Regularization is a method to overcome this issue: it controls the complexity of the model by penalizing its higher-order terms, thereby reducing the influence of unimportant features. This mitigates the loss and reduces the complexity of the model, producing one that obtains better results on real-world data and scenarios beyond the data used in training and testing.


Source: ResearchGate

L1 Regularization

Commonly known as regularization for sparsity, L1 regularization, as its name suggests, produces sparse weight vectors, i.e., vectors that consist mostly of zeros. High-dimensional feature spaces with many uninformative features can become monumentally complex to handle. L1 regularization comes to the rescue by forcing the weights of uninformative features to zero: at each iteration it subtracts a small, fixed amount from the weight, which eventually drives the weight exactly to zero.

To sum it up in a sentence, L1 regularization penalizes the sum of the absolute values of the weights, zeroing out the uninformative features.
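This zeroing behaviour is easy to observe in practice. The sketch below (using scikit-learn's `Lasso`, which implements the L1 penalty; the data, the `alpha` value, and the feature counts are all illustrative assumptions) builds a dataset with 5 informative features and 15 pure-noise features, then checks that the L1 penalty drives most of the noise weights exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# 5 informative features plus 15 pure-noise features.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_w + rng.normal(0, 0.5, 200)

# alpha is the strength of the L1 penalty.
lasso = Lasso(alpha=0.5).fit(X, y)

# Most of the 15 uninformative weights end up exactly 0.0, not just small.
print(np.sum(lasso.coef_[5:] == 0))
```

Note that the weights are exactly zero, not merely tiny; this is the defining property of the L1 penalty.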

L2 Regularization

Commonly known as regularization for simplicity, L2 regularization, like L1, forces the weight values to decrease, but it never drives a weight exactly to zero. This is because, unlike L1 regularization, L2 regularization removes a small percentage of the weight at each iteration, which never brings the weight all the way to zero. The concept of L2 regularization comes from the theory of model complexity: when model complexity is treated as a function of the weights, the complexity contributed by a feature is taken to be proportional to the magnitude of its weight. L2 regularization therefore penalizes the square of each weight.

Furthermore, the regularization rate, i.e., lambda, is a parameter that can be used to tune the strength of L2 regularization.
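The effect of lambda can be seen directly: larger values shrink the weights more. A minimal sketch, assuming scikit-learn's `Ridge` (where lambda is called `alpha`) and synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.1, 100)

# Larger lambda (scikit-learn calls it `alpha`) shrinks the weights more,
# so the norm of the coefficient vector decreases as alpha grows.
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.linalg.norm(coef))
```

The norm of the ridge solution is non-increasing in lambda, so the three printed norms are in decreasing order.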

L1 vs. L2 Regularization

There are some key differences between L1 and L2 regularization. The L1 penalty is the sum of the absolute values of the weights, whereas the L2 penalty squares the weights and then sums them. Unlike L1, L2 never drives the weights to zero; they tend toward zero but never reach it. As mentioned above, L1 regularization is also called regularization for sparsity: it can have multiple sparse solutions, whereas L2 regularization has a single, non-sparse solution. Furthermore, L1 regularization performs built-in feature selection and is robust to outliers, whereas L2 regularization supports no feature selection and is not robust to outliers, mainly due to its squared weight term.
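The update-rule intuition from the earlier sections (L1 subtracts a fixed amount; L2 removes a percentage) can be made concrete in a few lines of NumPy. This is a simplified sketch of one shrinkage step, not a full optimizer:

```python
import numpy as np

w = np.array([0.8, 0.05, -0.3])

def l1_step(w, lam):
    # Soft-thresholding: subtract a fixed amount lam from each weight's
    # magnitude; weights smaller than lam land exactly on zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_step(w, lam):
    # Proportional shrinkage: remove a fixed percentage of each weight;
    # a nonzero weight never reaches zero this way.
    return w * (1.0 - lam)

print(l1_step(w, 0.1))  # the middle weight becomes exactly 0.0
print(l2_step(w, 0.1))  # every weight shrinks, but none is zero
```

This is why L1 produces sparse solutions while L2 only makes weights small.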

Lasso Regression

The regression model that uses the L1 regularization technique is known as Lasso regression. Lasso regression is mainly used for obtaining highly accurate predictions, and it is sometimes argued to outperform its counterpart, Ridge. In the lasso regression process, shrinkage occurs: the coefficient values are shrunk toward a central point, such as the mean. The penalty used in this process is called the lasso penalty, and it can reduce the value of a coefficient all the way to zero.


Source: ResearchGate

Ridge Regression

The regression model that uses the L2 regularization technique is known as Ridge regression. It is also known as Tikhonov regularization. When data exhibit multicollinearity, ridge regression can be used to tune the model: under multicollinearity, the least-squares estimates remain unbiased but their variance is huge, which causes an extreme difference between actual and predicted values.
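The multicollinearity problem can be demonstrated with two almost perfectly correlated features. In this sketch (synthetic data and the `alpha` value are illustrative assumptions, using scikit-learn), ordinary least squares produces wildly large coefficients, while ridge keeps them small and stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two almost perfectly collinear features.
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 1e-4, 100)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(0, 0.1, 100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS coefficients blow up under multicollinearity (high variance);
# the L2 penalty keeps the ridge coefficients small.
print("OLS norm:  ", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
```

Both models predict similarly well here, but the ridge coefficients are far more stable, which is exactly the variance reduction described above.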

Conclusion 

Thus, in this article we learned about regularization and its importance in machine learning, and were briefly introduced to its different types, i.e., L1 and L2 regularization, as well as Lasso and Ridge regression.
