XGBoost 101

In the previous article, we were introduced to XGBoost and learned the reasons for its wide acceptance in machine learning competitions, along with what makes it such a strong performer of an algorithm. In this article, we'll learn how to install XGBoost in Anaconda and how to use it in Amazon SageMaker. We'll also learn about the different types of problems that can be addressed using XGBoost and about the benefits of using XGBoost in Amazon SageMaker. To learn about the fundamentals of XGBoost, read the previous article, XGBoost – The Choice of Most Champions.

Types of Problems Addressed by XGBoost  

XGBoost is well suited to a range of different types of machine learning problems, with major strengths in Classification, Regression, and Ranking. Let us discuss each of them in more detail.

Classification

Classification focuses on taking input values and organizing them into two or more categories. Let us take fraud detection as an instance. The goal of a fraud detection system is to take the information about a transaction and determine whether the transaction is fraudulent or not. Given a dataset of past transactions, i.e., the transaction history, the XGBoost algorithm can learn a function that maps input transaction data to the probability that the transaction is fraudulent.

Regression

In contrast to classification, where inputs are mapped to a discrete number of classes, with regression the output is a number. A well-known example of a regression problem is the house price prediction model. Here, historical data about houses is given along with their selling prices and numerous other key attributes. Using the XGBoost algorithm, a function can be learned that predicts the selling price of a new house from its attributes.

Ranking

Ranking is a process that deals with ordering documents by relevance. XGBoost works extremely well with problems related to ranking. A good analogy is ranking different videos on YouTube. Data on search results, watch time, and clicks on recommendations can be used to train an XGBoost model that assigns relevance scores to different videos for each user. This helps power the recommendation engine, serving videos matched to the user's taste. TikTok and Spotify use a similar approach, and even e-commerce websites such as Amazon and eBay use ranking extensively at their core.

The XGBoost library can be used easily on a local machine as well as in the cloud through services such as Amazon SageMaker. Let us learn how to install XGBoost in Anaconda.

Anaconda

Anaconda is a free, easy-to-install distribution for scientific computing. It combines a package manager and an environment manager with an enormous collection of over 720 open-source packages, and offers free community support for the R and Python programming languages. It supports Windows, Linux, and macOS, and also ships with Jupyter Notebook.

How to Install XGBoost in Anaconda 

First of all, install the Anaconda environment on your system from the official website. There is a free Individual Edition that one can use for learning purposes.

After the installation, open the Anaconda Prompt terminal and check for updates.

conda update --all

After this, you can use the following command to install the xgboost package for Python in the Anaconda environment.

conda install -c anaconda py-xgboost

With XGBoost now set up, you can use its modules by importing xgboost in your Jupyter Notebook. The package supports several programming languages, including Python, R, C++, Scala, and Java, and can run on a single machine as well as on Spark, Hadoop, DataFlow, and Flink.

Using XGBoost in Amazon SageMaker 

Amazon SageMaker enables data scientists and machine learning engineers to access XGBoost in basically two ways: as a built-in algorithm and as a framework. Using XGBoost as a framework offers more flexibility than the built-in algorithm: training scripts can be customized, which enables even more advanced scenarios such as k-fold cross-validation.

Using XGBoost as Framework 

XGBoost can be used as a framework to run customized training scripts with ease. The following example shows the SageMaker Python SDK providing the XGBoost API as a framework, similar to what it provides for PyTorch and TensorFlow.

import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import s3_input, Session

# initialize the hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "verbosity": "1",
    "objective": "reg:linear",
    "subsample": "0.7",
    "num_round": "50"}

# output path is set to the S3 bucket where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker XGBoost estimator
# entry_point specifies the XGBoost training script
estimator = XGBoost(entry_point="your_xgboost_abalone_script.py",
                    framework_version='1.2-2',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# paths and content type are defined for the training and validation datasets
content_type = "libsvm"
train_input = s3_input("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = s3_input("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

Benefits of implementing XGBoost via Amazon SageMaker

There are numerous benefits to implementing XGBoost through Amazon SageMaker. The following key advantages make a strong case for using XGBoost through SageMaker.

Scalability and Distributed System

A massive amount of data can be used to train XGBoost in Amazon SageMaker across numerous machines. It is as easy as setting the number of machines and the size to which one wants to scale out; everything else needed for distribution and scalability is taken care of by Amazon SageMaker.

Fragmentation

Data can be partitioned in the Amazon S3 bucket for training, which allows each partition to be downloaded to an individual node, in contrast to downloading the whole dataset to one single node, which can create a bottleneck. Moreover, the time to download the dataset is also reduced, which speeds up the training process.

A/B Testing

Numerous XGBoost models can be run simultaneously, each with a different weight for inference. This A/B testing, supported natively by Amazon SageMaker, helps customers determine the best model among the numerous ones tested for their use case.

Deployment and Managed Hosting for Models

Once a model has been trained with XGBoost, a single API call is all that is needed to deploy it to production. The Amazon SageMaker hosting environment is fully managed and supports auto-scaling, which reduces the operational cost of running the hosting environment and makes it economically attractive.

Instance Weighted Training for XGBoost

Weights can be assigned to individual data points, also referred to as instances, while training XGBoost on Amazon SageMaker. By simply assigning weight values to instances, their relative importance during model training can be easily differentiated.

Conclusion

In this article, we learned about the different types of problems addressed by XGBoost, such as Classification, Regression, and Ranking. We also learned how to install XGBoost in a local environment through Anaconda and how to use it on Amazon SageMaker via the Python SDK. Moreover, we learned about the benefits of using XGBoost through Amazon SageMaker. This will help us in the long run as we continually update the models in our production environment.