How to Set Up Azure Synapse Analytics with Private Endpoint

Introduction

In this article, we will discuss how to set up Azure Synapse Analytics with a private endpoint. Azure Synapse Analytics enables data analysts, data engineers, data scientists and BI analysts to create data products in one tool. Data and network security are very important for this PaaS service, and we need to ensure that all data traffic stays on the private network and away from the public internet.

Before we dive into the setup, let's review the three main capabilities of Synapse Analytics:

  1. Serverless SQL Pool: This component is created by default. It allows you to query the data in the data lake using SQL, without the need to program in Spark. Because it is serverless, there is no infrastructure to manage, and you are charged only for the data processed by the queries you run.

  2. Dedicated SQL Pool: This SQL pool is for the enterprise data warehouse. This component allows data to be processed and stored directly in the database. Generally, this pool is kept running all the time but can be paused manually, if required.

  3. Apache Spark Pool: This component uses Apache Spark to let analysts read and transform data in the data lake. It can be used for data engineering and machine learning tasks.

Challenges

A common request the data team receives is for access to the data in the data platform. As a result, many data engineers focus on developing data pipelines that move data from the data lake into a SQL service to provide that access.

Azure Synapse Analytics simplifies this process by providing an all-in-one service where business and technical teams can work together. To provide a safe and secure environment for all the teams, we need to set up a private endpoint to keep data traffic private and protect against data exfiltration. The pros and cons are:

Pros

  1. A private endpoint forces all network traffic between the pools and the storage account to go through the private network. This keeps that traffic off the public internet.

  2. A managed resource group is used to store all the virtual machine and networking resources for the serverless pools. This allows the Data Platform admin team to set up monitoring or cost management. The resources in this resource group are managed by Azure Synapse Analytics, so no maintenance is required.

Cons

  1. Using a private endpoint adds cost. Both ingress and egress traffic are charged.

  2. Synapse Analytics requires "Storage Blob Data Contributor" access to a data lake. This access grants Synapse both Read and Write access to the whole data lake. For data security, we created a separate data lake dedicated to this service.
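To make the cost point above concrete, here is a rough sketch of how a monthly estimate could be put together. It assumes the bill has an hourly endpoint charge plus per-GB charges for inbound and outbound data; the function name and every rate in the example are placeholders, not real Azure prices, so check the current pricing page before relying on any number.

```python
def private_endpoint_monthly_cost(hours, hourly_rate,
                                  inbound_gb, inbound_rate,
                                  outbound_gb, outbound_rate):
    """Estimate the monthly cost of one private endpoint.

    All rates are supplied by the caller; none are built in,
    because actual prices vary by region and change over time.
    """
    endpoint_charge = hours * hourly_rate      # charge for the endpoint existing
    inbound_charge = inbound_gb * inbound_rate     # ingress data processed
    outbound_charge = outbound_gb * outbound_rate  # egress data processed
    return endpoint_charge + inbound_charge + outbound_charge


# Placeholder figures for illustration only (NOT real prices).
estimate = private_endpoint_monthly_cost(
    hours=730, hourly_rate=0.01,
    inbound_gb=100, inbound_rate=0.01,
    outbound_gb=50, outbound_rate=0.01,
)
print(f"Estimated monthly cost: {estimate:.2f}")
```

Keeping the rates as explicit arguments makes it easy to plug in the figures for your own region and traffic volumes.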

Tutorial

In this tutorial, we will create a new Azure Synapse Analytics workspace. A new resource group is created to store the Synapse workspace and the dedicated data lake.

  1. Set Up Data Lake

    • We will skip the data lake creation as we have an in-depth tutorial in our previous article.

  2. Create Synapse workspace - Basics

    1. Under Azure Synapse Analytics, click on "Create Synapse workspace".
    2. Select the resource group we created in advance.
    3. We want to use the Managed resource group option. If no name is provided, Azure will create a resource group with a generated name.
    4. Give the workspace a name and select the region.
    5. Select the dedicated data lake (sbdpsynapselake) for this Synapse workspace. A container is created using the workspace name. This container should be used only by the Synapse workspace. We will create a new container to store our data.
    6. Click "Next: Security" to continue.

  3. Create Synapse workspace - Security
    1. Select both local and Azure AD authentication. We are using this option because we want to have a local database user in case Azure AD becomes unavailable.
    2. Make sure "Allow network access to Data lake" is checked. This grants the Synapse workspace access to the data lake for the Serverless and Dedicated SQL Pools.
    3. Click "Next: Networking" to continue.

  4. Create Synapse workspace - Networking

    1. Enable "Managed virtual network" and "Create managed private endpoint to primary storage account". This option gives the Synapse workspace access to the data lake for the Spark pool and integration pipelines. The storage account owner must approve the private endpoint request.
    2. Disable "Allow outbound data traffic only to approved targets". Enabling this option prevents data exfiltration, but it also stops Synapse from connecting to external data sources or public Python package repositories such as the Python Package Index (PyPI).
    3. Enable "Public network access to workspace endpoints". This option allows users to connect to the Synapse workspace through the public internet. If a private connection is available through VPN or ExpressRoute, this option should be disabled.
    4. Click "Next: Tags" to continue.

  5. Create Synapse workspace - Tags

    1. Create the tags for cost tracking.
    2. Click "Next: Review + create" and continue to create the Synapse workspace.
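For readers who prefer to script the deployment, the portal choices in steps 2 to 5 can be sketched as an ARM template resource. The Python below only builds and prints the JSON; it does not deploy anything. The property names and API version reflect my reading of the `Microsoft.Synapse/workspaces` schema and should be verified against the ARM template reference. The workspace name, region, managed resource group name, admin login and tag are hypothetical; only the storage account name (sbdpsynapselake) comes from this tutorial. The SQL admin password and the storage account resource ID are deliberately left out and would need to be supplied securely in a real template.

```python
import json


def synapse_workspace_resource(name, location, storage_account, filesystem,
                               managed_rg, sql_admin_login, tags):
    """Build an ARM resource dict mirroring the portal settings above."""
    return {
        "type": "Microsoft.Synapse/workspaces",
        "apiVersion": "2021-06-01",  # assumption: verify the current version
        "name": name,
        "location": location,
        "identity": {"type": "SystemAssigned"},
        "tags": tags,  # step 5: tags for cost tracking
        "properties": {
            # Step 2: managed resource group and primary data lake
            "managedResourceGroupName": managed_rg,
            "defaultDataLakeStorage": {
                "accountUrl": f"https://{storage_account}.dfs.core.windows.net",
                "filesystem": filesystem,
                # Step 4.1: managed private endpoint to primary storage
                "createManagedPrivateEndpoint": True,
            },
            # Step 3.1: keep local (SQL) authentication alongside Azure AD
            "sqlAdministratorLogin": sql_admin_login,
            "azureADOnlyAuthentication": False,
            # Step 4.1: managed virtual network
            "managedVirtualNetwork": "default",
            # Step 4.2: outbound traffic NOT restricted to approved targets
            "managedVirtualNetworkSettings": {"preventDataExfiltration": False},
            # Step 4.3: public access to workspace endpoints stays on
            "publicNetworkAccess": "Enabled",
        },
    }


# Hypothetical values, except the storage account name from the tutorial.
workspace = synapse_workspace_resource(
    name="example-synapse-ws",
    location="eastus",
    storage_account="sbdpsynapselake",
    filesystem="example-synapse-ws",
    managed_rg="example-synapse-managed-rg",
    sql_admin_login="sqladminuser",
    tags={"costCenter": "data-platform"},
)
print(json.dumps(workspace, indent=2))
```

If a VPN or ExpressRoute connection is available, the only change needed in this sketch would be setting `publicNetworkAccess` to `"Disabled"`.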

  6. Approve Synapse Private endpoint

    1. Now that the Synapse deployment is complete, we need to approve the Private endpoint for the primary storage account.
    2. Navigate to the storage account, click on "Networking" under "Security + networking".
    3. Select the "Private endpoint connections" tab.
    4. Select the checkbox for the Pending connection from Synapse and click Approve.
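The approval can also be scripted. The sketch below assumes you have already fetched the storage account's private endpoint connections as JSON (for example with `az network private-endpoint-connection list`), and that each entry carries its state under `properties.privateLinkServiceConnectionState.status`, which is how I understand the output to be shaped. It picks out the pending connections and builds, but does not run, the matching `az network private-endpoint-connection approve` commands. The connection IDs, resource group and description text are hypothetical.

```python
import shlex


def pending_connection_ids(connections):
    """Return the IDs of private endpoint connections awaiting approval."""
    pending = []
    for conn in connections:
        state = conn.get("properties", {}).get(
            "privateLinkServiceConnectionState", {})
        if state.get("status") == "Pending":
            pending.append(conn["id"])
    return pending


def approve_command(connection_id, description="Approved for Synapse workspace"):
    """Build the Azure CLI argument list that approves one connection."""
    return ["az", "network", "private-endpoint-connection", "approve",
            "--id", connection_id, "--description", description]


# Hypothetical sample data shaped like the CLI's JSON output.
BASE = ("/subscriptions/00000000-0000-0000-0000-000000000000"
        "/resourceGroups/rg-example/providers/Microsoft.Storage"
        "/storageAccounts/sbdpsynapselake/privateEndpointConnections/")
sample_connections = [
    {"id": BASE + "conn-approved",
     "properties": {"privateLinkServiceConnectionState": {"status": "Approved"}}},
    {"id": BASE + "conn-pending",
     "properties": {"privateLinkServiceConnectionState": {"status": "Pending"}}},
]

commands = [approve_command(cid)
            for cid in pending_connection_ids(sample_connections)]
for cmd in commands:
    # shlex.join quotes the description so the line can be pasted into a shell
    print(shlex.join(cmd))
```

Printing the commands rather than executing them leaves a review step before anything is approved; passing each list to `subprocess.run` would be the natural next step once the output looks right.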

  7. Before we go into the Synapse workspace, let's upload our sample file.
    1. Download the NYCTripSmall.parquet file (link).
    2. Navigate to the Storage browser.
    3. Create our new container "data".
    4. Upload the sample NYCTripSmall.parquet file from our desktop to the "data" container.

  8. Navigate to Synapse workspace and click on "Open Synapse Studio".

  9. In Synapse Studio, the "Data" menu shows all the data we have connected to, including our primary data lake.

    1. Click on the "Data" menu.
    2. Select the "Linked" tab then select the "data" container.
    3. We can see our NYCTripSmall.parquet file.

  10. A Serverless SQL Pool is automatically created when we create the Synapse workspace. Let's test this out to validate everything is working.

    1. Select "NYCTripSmall.parquet".
    2. Click on "New SQL script" then "Select TOP 100 rows".
    3. A new tab is created with "SQL script 1".

  11. Click on the "SQL script 1" tab and we can see the auto-generated code. Let's click "Run". We can see the result of our query. It took 9 seconds to complete.
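For reference, the auto-generated script is typically a `SELECT TOP 100` over `OPENROWSET` pointing at the file's URL in the data lake. The small helper below reconstructs that kind of query as a string so it can be reused for other files. The helper name is hypothetical, the exact layout Synapse Studio generates may differ, and the URL assumes the file sits at the root of the "data" container in the sbdpsynapselake account.

```python
def top_rows_query(storage_account, container, path, top=100):
    """Build a serverless SQL query that previews a Parquet file in the lake."""
    url = f"https://{storage_account}.dfs.core.windows.net/{container}/{path}"
    return (
        "SELECT\n"
        f"    TOP {int(top)} *\n"
        "FROM\n"
        "    OPENROWSET(\n"
        f"        BULK '{url}',\n"
        "        FORMAT = 'PARQUET'\n"
        "    ) AS [result]"
    )


query = top_rows_query("sbdpsynapselake", "data", "NYCTripSmall.parquet")
print(query)
```

The printed text can be pasted into a new SQL script in Synapse Studio and run against the built-in serverless pool.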

Summary

Azure Synapse Analytics provides a secure environment for different teams to work together. With a private endpoint, we can keep our data traffic on the private network. If additional security measures are required, we can disable access from the public internet altogether by using a VPN tunnel or ExpressRoute.

In addition to having SQL and Spark together under one service, we can also upload new data through the "Data" menu. All these components combine to provide a single workspace where all the teams can work together efficiently.

Happy Learning!