Azure Synapse Analytics - Machine Learning

In this article, we’ll learn about the machine learning capabilities provided by Apache Spark in Azure Synapse Analytics, continuing our look at the machine learning process for big data with a focus on Apache Spark in Azure Synapse. You can learn more about the Azure offerings of Apache Spark in the previous article, Apache Spark. This article is part of the Azure Synapse Analytics Articles Series. You can check out the other articles in the series from the following links.

  1. Azure Synapse Analytics     
  2. Azure Synapse Analytics - Create Dedicated SQL Pool     
  3. Azure Synapse Analytics - Creating Firewall at Server-level     
  4. Azure Synapse Analytics - Connect, Query and Delete Data Warehouse SQL Pool     
  5. Azure Synapse Analytics – Load Dataset to Warehouse from Azure Blob Storage     
  6. Azure Synapse Analytics - Best Practices to Load Data into SQL Pool Data Warehouse    
  7. Azure Synapse Analytics – Restore Point   
  8. Azure Synapse Analytics – Exploring Query Editor  
  9. Azure Synapse Analytics – Automation Task  
  10. Azure Synapse Analytics – Machine Learning 

Azure Synapse Analytics 

Azure Synapse is a limitless enterprise analytics service that brings together data warehousing and big data analytics to deliver insights from data. Data can be queried using either dedicated resources or a serverless architecture, which scales as the size of the data grows. You can learn more about it in the previous article, Azure Synapse Analytics.

Apache Spark   


Source: Microsoft 

Developed at the AMPLab of the University of California, Berkeley, Apache Spark is an analytics engine dedicated to large-scale data processing. It provides fault tolerance and data parallelism through a cluster-programming interface, and its parallel processing framework with in-memory support can significantly boost the performance of big-data analytics applications.

Apache Spark supports in-memory cluster computing: a Spark job loads and caches data into memory, which can then be queried repeatedly. Compared to disk-based systems such as Hadoop, which relies on the Hadoop Distributed File System (HDFS), in-memory computing delivers much faster processing. Moreover, Spark's integration with Scala lets distributed datasets be manipulated like local collections.

Machine Learning with Apache Spark in Azure Synapse Analytics

Apache Spark in Azure Synapse Analytics enables machine learning on big data, making it possible to derive high-value insights from fast-moving structured as well as unstructured data. Let us dive into the entire machine learning workflow for big data, from data analysis to model training and deployment.


Source: Microsoft Announcement on TechCrunch

Synapse Runtime

Azure Synapse provides a specifically curated environment for machine learning and data science called the Synapse Runtime. The Synapse Runtime offers a wide range of open-source builds and libraries along with the Azure Machine Learning SDK, including numerous external libraries such as PyTorch, TensorFlow, Scikit-learn, and XGBoost.

Data Pipeline

To build data pipelines that access data and transform it into a format usable for machine learning, we need a tool that supports data ingestion and orchestration. Azure Data Factory solves these problems with a powerful set of tools that ensure no step of the ingestion and orchestration pipeline is missed.

Supported Machine Learning Libraries

A variety of built-in and third-party machine learning libraries are supported for Apache Spark in Azure Synapse Analytics. First, let us talk about the built-in libraries.

MLlib and SparkML

MLlib is Spark's machine learning library, which makes machine learning easy and scalable. It supports numerous machine learning algorithms, from classification and regression to clustering and collaborative filtering. Spark ML is the newer package, introduced in Spark 1.2, which provides high-level APIs that help machine learning engineers create and tune machine learning pipelines. For the iterative algorithms used in graph computation and machine learning, Spark's in-memory computation capability makes it a great choice.
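
To make the pipeline idea concrete, here is a minimal Spark ML sketch: toy data and column names are purely illustrative, and the pipeline chains a feature transformer with a logistic regression estimator.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# In a Synapse notebook, `spark` is already provided; this makes the sketch standalone.
spark = SparkSession.builder.getOrCreate()

# Toy training data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 0.5, 0), (2.0, 1.5, 0), (3.0, 3.5, 1), (4.0, 4.5, 1)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# The Pipeline chains feature engineering and the estimator into one tunable unit.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("f1", "f2", "prediction").show()
```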

Now, let us dive into some open-source libraries which can be used for Apache Spark in Azure Synapse. The Spark pool in Azure Synapse Analytics is pre-loaded with different machine learning libraries.

TensorFlow and PyTorch

TensorFlow and PyTorch are among the most powerful machine learning libraries, focused especially on deep learning. By setting the number of executors on an Apache Spark pool in Azure Synapse Analytics to zero, these libraries can be used to build single-machine models at minimal cost.
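
As an illustration, here is a small single-machine PyTorch training loop; with the pool's executor count set to zero, code like this runs on the driver node only. The data and model are toy stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 4)                         # toy inputs
y = (X.sum(dim=1, keepdim=True) > 0).float()   # toy binary labels

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# A few gradient-descent steps on the driver node.
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")
```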

Scikit-learn

For classical machine learning algorithms, scikit-learn is one of the most popular and widely used libraries. It supports both supervised and unsupervised machine learning algorithms for data analysis and data mining.
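
A small supervised-learning example with scikit-learn, using one of its bundled toy datasets:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A classical supervised model trained and evaluated on held-out data.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```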

XGBoost

XGBoost is a well-regarded machine learning library that has gained mass appeal over the years since its inception. It is an optimized library that trains gradient-boosted decision tree ensembles. You can learn more about it in my previous articles, XGBoost - The Choice Of Most Champions and XGBoost 101.
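
A quick sketch of training gradient-boosted trees through XGBoost's scikit-learn-style API; the dataset and hyperparameters are just for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gradient-boosted decision trees: each tree corrects the previous ones' errors.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```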

Machine Learning Workflow

One needs to follow a proper workflow in order to attain the goals of a machine learning process. Here is the workflow for machine learning on big data with Apache Spark in Azure Synapse Analytics.

Data Exploration and Analysis

During data exploration, the built-in Synapse Notebook chart options make it extremely convenient to visualize the data. Furthermore, with access to Matplotlib and Seaborn and integration with Power BI and Synapse SQL, data exploration, preparation, and analysis are very easy in Azure Synapse Analytics.
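
A brief sketch of both options, assuming it runs inside a Synapse notebook, where the `spark` session and the `display()` chart helper are provided by the environment:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data for illustration only.
df = spark.createDataFrame(
    [("A", 10), ("B", 25), ("A", 15), ("B", 30)], ["category", "value"]
)

display(df)  # Synapse notebook built-in: table view with a chart toggle

# For custom visuals, convert a (small) sample to pandas first.
pdf = df.toPandas()
sns.barplot(data=pdf, x="category", y="value")
plt.show()
```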

Feature Engineering

The Synapse Runtime allows the use of numerous libraries for feature engineering, and we can choose which to work with depending on the size of the dataset. For small datasets, third-party libraries such as Scikit-learn, NumPy, and Pandas can be used; for large datasets, Koalas, MLlib, and Spark SQL can be used.
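
For example, the same simple transform can be expressed at both scales; the column name here is illustrative, and `spark` is the session a Synapse notebook provides.

```python
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

# Small data: pandas/NumPy on the driver node.
small = pd.DataFrame({"price": [100.0, 250.0, 400.0]})
small["log_price"] = np.log(small["price"])

# Large data: the same feature computed with distributed Spark SQL functions.
large = spark.createDataFrame(small[["price"]])
large = large.withColumn("log_price", F.log(F.col("price")))
large.show()
```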

Model Training

Numerous options for training machine learning models are available with Apache Spark in Azure Synapse Analytics, including the built-in Apache Spark MLlib library and integration with Azure Machine Learning. Moreover, tons of other open-source libraries are supported too.

Using the Apache Spark pools and tools such as .NET for Apache Spark, Scala, and PySpark (Python), we can easily train our models in Azure Synapse Analytics. We can also use popular libraries like Scikit-learn for this step, and we can easily manage numerous third-party libraries for training our models.

Furthermore, Automated ML is also available now. With Automated ML, we can conveniently train a set of machine learning models and then let the user choose the best one based on specific metrics. I’ve discussed Automated ML in more detail in my previous article, Auto ML. Moreover, the integration between Azure Synapse Notebooks and Azure Machine Learning is seamless, which allows the user to leverage Automated ML in Synapse without working through Azure Active Directory (AD) authentication: by simply pointing at the Azure Machine Learning Workspace, users can run Automated ML without entering any credentials. This allows developers, machine learning engineers, analysts, and data scientists to rapidly build highly scalable, efficient, and productive machine learning models while maintaining model quality. A sketch of submitting such a run is shown after the image below.


Source: Microsoft
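
As a hedged sketch of what submitting an Automated ML run from a notebook can look like with the Azure Machine Learning SDK (v1): the workspace details, experiment name, and training table are all placeholders, and in Synapse a linked Azure Machine Learning workspace can supply the authentication.

```python
import pandas as pd
from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig

# Toy training table; in practice this would be a registered dataset.
train_df = pd.DataFrame(
    {"f1": [1, 2, 3, 4], "f2": [0.5, 1.5, 3.5, 4.5], "label": [0, 0, 1, 1]}
)

ws = Workspace.get(
    name="my-aml-workspace",              # placeholder workspace name
    subscription_id="<subscription-id>",  # placeholder
    resource_group="<resource-group>",    # placeholder
)

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_df,
    label_column_name="label",
    primary_metric="AUC_weighted",
    iterations=5,
)

# Automated ML trains a set of candidate models and surfaces the best one.
run = Experiment(ws, "synapse-automl-demo").submit(automl_config)
run.wait_for_completion(show_output=True)
best_run, fitted_model = run.get_output()
```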

Model Development Tracking

The lifecycle of our machine learning experiments can now be easily managed with an open-source library called MLflow. The metrics and model artifacts of our training runs can be logged and tracked using the MLflow Tracking component. This allows us, as machine learning engineers, to track the progress of our model development and to identify cases of overfitting or situations where changes are required, for example when new data sources arrive.
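
A minimal MLflow Tracking sketch: one run logs a parameter, a metric, and the fitted model artifact, so runs can later be compared side by side. The model and dataset are stand-ins.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)                          # hyperparameter
    mlflow.log_metric("train_accuracy", model.score(X, y))     # training metric
    mlflow.sklearn.log_model(model, "model")                   # model artifact
```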

Model Evaluation and Scoring

Model scoring, also known as inferencing, is the vital phase where we use the model to make predictions. With MLlib or SparkML, we can use the native Spark methods to perform scoring directly on a Spark DataFrame. For small datasets, the inference methods native to the model's library can be used. For large datasets, other open-source libraries can be wrapped in a Spark UDF to scale out inference across the cluster.
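
Here is a sketch of the scale-out pattern: a model trained on the driver with scikit-learn is wrapped in a pandas UDF so each executor scores its own partition. The tiny model and column names are illustrative, and `spark` is assumed to be the notebook's session.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from sklearn.linear_model import LinearRegression

# Train a small model on the driver (a stand-in for any fitted estimator).
local_model = LinearRegression().fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])

@pandas_udf("double")
def predict_udf(x: pd.Series) -> pd.Series:
    # Spark ships the closed-over model to executors; each scores its partition.
    return pd.Series(local_model.predict(x.to_numpy().reshape(-1, 1)))

sdf = spark.createDataFrame([(1.5,), (2.5,)], ["x"])
sdf.withColumn("prediction", predict_udf("x")).show()
```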

Deployment

Azure Synapse makes it especially easy to take models, trained either in Azure Synapse Analytics or outside it, and use them for batch scoring. Azure Synapse provides two main ways to run batch scoring.

First, since we are working with Apache Spark, we can leverage the Apache Spark pool provided in Azure Synapse Analytics itself to perform batch scoring of machine learning models, running the scoring with whichever libraries were used to train the models.
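
For instance, a persisted Spark ML pipeline can be reloaded on the pool and applied to new data in bulk; the storage paths below are placeholders for your own data lake locations.

```python
from pyspark.ml import PipelineModel

# Placeholder paths; point these at your own storage account and containers.
model = PipelineModel.load("abfss://models@mydatalake.dfs.core.windows.net/demo_pipeline")
batch = spark.read.parquet("abfss://data@mydatalake.dfs.core.windows.net/new_records")

# Score the whole batch and persist the predictions back to the lake.
scored = model.transform(batch)
scored.write.mode("overwrite").parquet(
    "abfss://data@mydatalake.dfs.core.windows.net/scored_records"
)
```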

The other approach is to use the function provided in the Synapse SQL pool to run predictions at the very location where the data is stored. The T-SQL PREDICT function provides this capability; it is scalable and powerful, and it makes it possible to enrich our data without moving it out of the data warehouse itself. ONNX models from the Azure Machine Learning model registry can be deployed in Synapse Studio for batch scoring with PREDICT in Synapse SQL pools.

Conclusion

Thus, in this article, we learned about the machine learning process for big data using Apache Spark in Azure Synapse Analytics. We discussed the steps of the machine learning workflow from data exploration to model training, scoring, and deployment.
