Challenges In Machine Learning (Training And Validation) - Part Three

Introduction

 
In the previous two articles of this series, Challenges In Machine Learning - Part One and Challenges In Machine Learning - Part Two, we learned about the challenges a machine learning project faces with respect to bad algorithms and bad data. In this article, we'll learn how to split and use our data so we can measure, and get the best out of, a machine learning model's accuracy.
 

Testing and Validating

 
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put the model in production and monitor how well it performs. This works, but if the model is horribly bad, the users will complain, which is not the best idea. A better option is to split the data into two sets: the training set and the test set. As these names suggest, we train our model using the training set and then test it using the test set. The error rate on the test data is known as the generalization error (also called the out-of-sample error), and by evaluating the model on the test set, we get an estimate of this error. This value tells us how well the model will perform on new instances, that is, instances it has never seen before. If the training error is low (i.e., the model makes few mistakes on the training set) but the generalization error is high, it means that the model is overfitting the training data.
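The train/test split described above can be sketched as follows. This is a minimal illustration on synthetic data; the dataset, the linear model, and the 80/20 split ratio are all assumptions made for the example, not part of the article.

```python
# Illustrative sketch: estimating the generalization error with a held-out
# test set. The data here is synthetic (a quadratic curve plus noise), so a
# straight-line model underfits and the errors are easy to interpret.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.5, size=200)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
# The test-set error is our estimate of the generalization error.
test_error = mean_squared_error(y_test, model.predict(X_test))

print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```

A large gap between the two printed errors (training much lower than test) is the overfitting signature the paragraph above describes.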
 
Evaluating the model is pretty easy: all you need to do is use the test set. Now suppose you are hesitating between two models (say, a linear model and a polynomial model): how can you decide? One way is to train both and compare how well they generalize using the test set. Now suppose that the linear model generalizes better than the polynomial model, but you want to apply some regularization to avoid overfitting. The question then arises: how do you choose the value of the regularization hyperparameter? One possibility is to train 100 different models using 100 different values for this hyperparameter.
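A sketch of that 100-value search might look like this. Ridge regression and the particular range of regularization values are assumptions chosen for illustration. Note that this loop selects the hyperparameter using the test set, which is precisely the trap the next paragraphs warn about.

```python
# Hedged sketch: train 100 Ridge models, one per candidate regularization
# value (alpha), and pick the one with the lowest test-set error. Tuning on
# the test set like this leads to the problem discussed below.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.3, size=150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

alphas = np.logspace(-3, 2, 100)  # 100 candidate hyperparameter values
errors = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    errors.append(mean_squared_error(y_test, model.predict(X_test)))

best_alpha = alphas[int(np.argmin(errors))]
print(f"best alpha on this particular test set: {best_alpha:.4f}")
```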
 
Suppose you find the hyperparameter value that produces the model with the lowest generalization error, for instance a 5% error. Expecting the same performance, you launch this model into production, but for some reason it does not perform as well as expected and produces 15% errors.
 

What just happened?

 
The issue is that we measured the generalization error many times on the test set, and we adapted the model and hyperparameters to produce the best model for that particular set. This means the model is once again unlikely to perform as well on new data. A common solution to this problem is to hold out a second set of data, known as the validation set.
 
In most cases, we train various models with various hyperparameters using the training set, select the model and hyperparameters that perform best on the validation set, and finally run one single final test against the test set to get an estimate of the generalization error. However, to avoid “wasting” too much training data in validation sets, a common technique is to use cross-validation.
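The train/validation/test workflow just described can be sketched as below. The split ratios, the Ridge model, and the small list of candidate hyperparameter values are assumptions for the example.

```python
# Sketch of a train / validation / test workflow: choose the hyperparameter
# on the validation set, then measure the generalization error exactly once
# on the untouched test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 0.4, size=300)

# First carve off the test set, then split the rest into train/validation
# (60% train, 20% validation, 20% test overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

best_alpha, best_err = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_alpha, best_err = alpha, err

# One final evaluation on the test set with the chosen hyperparameter,
# after retraining on train + validation data.
final = Ridge(alpha=best_alpha).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val])
)
test_error = mean_squared_error(y_test, final.predict(X_test))
print(f"chosen alpha: {best_alpha}, final test error: {test_error:.3f}")
```

The key point is that the test set is touched exactly once, after all model and hyperparameter decisions have been made on the validation set.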
 
In cross-validation, the training set is split into complementary subsets, and each model is trained on a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been carefully chosen, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
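A minimal cross-validation sketch is shown below. The choice of 5 folds and a Ridge model is an assumption; scikit-learn's `cross_val_score` handles the splitting into complementary subsets for us.

```python
# Minimal cross-validation sketch: with cv=5, the training data is split
# into 5 complementary folds; each fold serves once as the validation part
# while the model is trained on the remaining 4 folds.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=100)

scores = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error"
)
mean_error = -scores.mean()  # average validation error across the 5 folds
print(f"cross-validated MSE: {mean_error:.3f}")
```

Averaging the error over all folds gives a more stable estimate than a single validation split, at the cost of training the model several times.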
 

Summary

 
This brings us to the end of the three-part series Challenges in Machine Learning. To learn more about machine learning from scratch, you can always go ahead and read First Guide To Machine Learning.