How To Set Up Azure Data Factory With Managed Virtual Network

Introduction

In this article, we will discuss how to set up Azure Data Factory with a Managed Virtual Network. Azure Data Factory (ADF) is a code-free Extract-Load-Transform (ELT) and orchestration service. This PaaS offering enables data engineers to create and monitor data pipelines that handle data ingestion and data transformation. To keep all data movement secure, Azure Data Factory provides an option to run the compute in a dedicated virtual network for your instance of Data Factory.

Challenges

In our previous article, we created a private endpoint for our data lake. Now, our data engineers need to start developing data pipelines to generate insights for the business teams. The data team has decided to use Azure Data Factory, and we need to configure it to be safe and secure.

The primary reasons to deploy the Data Factory with a Managed Virtual Network are to keep network traffic private and to protect against data exfiltration. The pros and cons of the Managed Virtual Network are:

Pros

  1. This network is managed by Azure and does not require additional overhead from the Data Platform admin team.
  2. A private endpoint can be created within the Managed Virtual Network. This limits network traffic to the private network.

Cons

  1. Testing connections to other services and debugging both require the compute to be started. Traditionally, standby compute is available from Azure, which results in a faster response time.
  2. Since the compute is dedicated to the network, it requires more planning and pipeline design to ensure the compute can be used by multiple processes.
  3. Since we are using a Private Endpoint, additional costs are incurred for ingress and egress.

Create a new Azure Data Factory

In this tutorial, we will create a new Azure Data Factory for a tutorial project called "project1". To better organize our projects, a new resource group 'sandbox-dataPlatform-project1' is created, and we will create the new Data Factory under this resource group.

Create Data Factory - Basics

Select the resource group 'sandbox-dataPlatform-project1' and enter the Data Factory name 'sb-dp-adf-project1', then click 'Next: Git configuration >' to continue.

Create Data Factory - Git configuration

We are going to skip the Git setup in this tutorial, as we will cover this topic in another article. For an actual project, I highly recommend you set up this integration now.

After selecting 'Configure Git later', click 'Next: Networking >' to continue.

Create Data Factory - Networking

In Data Factory, the compute and the orchestrator are called the Integration Runtime (IR). An AutoResolveIntegrationRuntime is created as part of the Data Factory deployment. 'AutoResolve' means this integration runtime automatically detects the best region to use based on the destination location.

We want to enable 'Managed Virtual network' and connect via 'Private endpoint'. This ensures our integration runtime runs in a separate virtual network and uses private IP addresses. Click 'Next: Advanced >' to continue.
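
If you prefer to script the deployment instead of clicking through the portal, the same setup can be sketched with a recent version of the azure-mgmt-datafactory Python SDK. This is a minimal sketch, not an exact equivalent of the wizard: the subscription ID and region are placeholder assumptions you should replace, and note that ADF requires the managed virtual network resource to be named 'default'.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    Factory,
    FactoryIdentity,
    ManagedVirtualNetwork,
    ManagedVirtualNetworkResource,
)

# Assumptions: replace the subscription ID and region with your own values.
subscription_id = "<your-subscription-id>"
resource_group = "sandbox-dataPlatform-project1"
factory_name = "sb-dp-adf-project1"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create the Data Factory, including the cost-management tag from the
# 'Tags' step of the wizard. The system-assigned identity is needed later
# for the role assignment on the data lake.
client.factories.create_or_update(
    resource_group,
    factory_name,
    Factory(
        location="eastus",  # assumption: pick your own region
        identity=FactoryIdentity(type="SystemAssigned"),
        tags={"cost-center": "project1"},
    ),
)

# Enable the Managed Virtual Network. ADF requires this resource to be
# named 'default'.
client.managed_virtual_networks.create_or_update(
    resource_group,
    factory_name,
    "default",
    ManagedVirtualNetworkResource(properties=ManagedVirtualNetwork()),
)
```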

Create Data Factory - Advanced

In this tab, we can specify a custom encryption key. In our case, we will let Microsoft manage the key, so no change is required. Click 'Next: Tags >' to continue.

Create Data Factory - Tags

Let's create the tags required for cost management. This Data Factory is created for a specific project, so we will assign the cost-center tag to 'project1'. Click 'Next: Review + create >' to confirm the settings and create our Data Factory.

Azure Data Factory Studio

The 'project1' Data Factory is created; next, we need to set up a private endpoint to our data lake.

To do this, we need to open the Azure Data Factory Studio:

  1. Navigate to the 'project1' Data Factory.
  2. Select 'Overview' and click on the 'Open Azure Data Factory Studio' link.
  3. The ADF Studio will open up in a separate browser tab.

Create an Integration runtime with Interactive Authoring

Unfortunately, we cannot use the Integration Runtime created with our Data Factory, because it does not allow us to enable 'Interactive Authoring' when the Managed Virtual Network is enabled.

To work around this, we need to create a new Azure Integration Runtime with Interactive Authoring enabled:

  1. In Azure Data Factory Studio, click on 'Manage' then 'Integration runtimes' under the 'Connections' section.
  2. Click on '+ New', a new panel is displayed on the right.
  3. In the 'Integration runtime setup' panel, select 'Azure, Self-Hosted', then click 'Continue'.
  4. Select 'Azure' under the 'Network environment', then click 'Continue'.
  5. Provide the Integration runtime name, select 'Enable' under 'Virtual network configuration', then click 'Create' to continue.
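
Continuing the Python sketch from earlier, the same managed VNet integration runtime can also be provisioned through the SDK. Keep in mind that 'Interactive Authoring' itself is toggled on demand from the Studio UI; this sketch only creates the runtime, and the name 'ir-project1-managed-vnet' is a placeholder.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    ManagedVirtualNetworkReference,
)

# An Azure IR that runs inside the managed virtual network created above.
managed_vnet_ir = ManagedIntegrationRuntime(
    description="Azure IR inside the managed virtual network",
    managed_virtual_network=ManagedVirtualNetworkReference(reference_name="default"),
)

client.integration_runtimes.create_or_update(
    resource_group,
    factory_name,
    "ir-project1-managed-vnet",  # placeholder name
    IntegrationRuntimeResource(properties=managed_vnet_ir),
)
```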



Azure Data Factory Studio - Managed private endpoint

After the new Integration Runtime is created, we will setup the Managed private endpoint with our data lake.

  1. In Azure Data Factory Studio, click on 'Manage' then click on 'Managed private endpoints' under the Security section.
  2. Click '+ New', select 'Azure Data Lake Storage Gen2' and click 'Continue'.
  3. Provide the endpoint name, select our data lake, then click 'Create' to create the Managed private endpoint.
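
The managed private endpoint can also be requested with the SDK, continuing the same sketch. The storage account resource ID and the endpoint name below are placeholders; 'dfs' is the sub-resource (group ID) for the ADLS Gen2 endpoint.

```python
from azure.mgmt.datafactory.models import (
    ManagedPrivateEndpoint,
    ManagedPrivateEndpointResource,
)

# Resource ID of the data lake storage account (placeholder values).
storage_id = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/sandbox-dataPlatform-project1"
    "/providers/Microsoft.Storage/storageAccounts/<your-data-lake>"
)

client.managed_private_endpoints.create_or_update(
    resource_group,
    factory_name,
    "default",               # the managed virtual network name
    "pe-project1-datalake",  # placeholder endpoint name
    ManagedPrivateEndpointResource(
        properties=ManagedPrivateEndpoint(
            private_link_resource_id=storage_id,
            group_id="dfs",  # 'dfs' targets the ADLS Gen2 endpoint
        )
    ),
)
```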



Data Lake - Approve private endpoint

For security reasons, we have to manually approve the new Private Endpoint connection. This is done to prevent any unauthorized connections.

  1. Navigate to the data lake, under 'Security + networking', click on 'Networking'.
  2. Click on the 'Private endpoint connections' tab.
  3. Select the connection with the private endpoint from the 'project1' Data Factory and click 'Approve'.
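
If you want to script the approval as well, a sketch using the azure-mgmt-storage SDK could look like the following. It reuses the credential and names from the earlier sketches and simply approves pending connections, so in practice you should confirm each connection really originates from the 'project1' Data Factory before approving it.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    PrivateEndpointConnection,
    PrivateLinkServiceConnectionState,
)

storage_client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
account_name = "<your-data-lake>"  # placeholder

# Approve pending connections on the account. In practice, inspect
# conn.private_endpoint.id first to confirm it comes from 'project1' ADF.
for conn in storage_client.private_endpoint_connections.list(resource_group, account_name):
    if conn.private_link_service_connection_state.status == "Pending":
        storage_client.private_endpoint_connections.put(
            resource_group,
            account_name,
            conn.name,
            PrivateEndpointConnection(
                private_link_service_connection_state=PrivateLinkServiceConnectionState(
                    status="Approved",
                    description="Approved for the project1 Data Factory",
                )
            ),
        )
```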

Grant ADF access to the Data Lake

To keep this setup simple, I assigned the project1 Data Factory the 'Storage Blob Data Contributor' role on our data lake. This should only be done for testing purposes; in a real project, we should assign the permissions using access control lists (ACLs) instead.

  1. Navigate to the data lake, click on 'Access Control (IAM)'.
  2. Click on '+ Add', then 'Add role assignment'.
  3. Select 'Storage Blob Data Contributor', then click 'Next'.
  4. Select 'Managed identity' for Assign access to, then click on '+ Select members' under Members.
  5. Select the 'project1' Data Factory, click 'Select' to continue.
  6. Click 'Review + assign'.
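
The equivalent role assignment can be sketched with the azure-mgmt-authorization SDK, reusing names from the earlier snippets. The GUID below is the well-known role definition ID for 'Storage Blob Data Contributor', the scope is the data lake resource ID from earlier, and the principal is the factory's system-assigned managed identity.

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Well-known role definition ID for 'Storage Blob Data Contributor'.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

# The factory's system-assigned managed identity.
principal_id = client.factories.get(resource_group, factory_name).identity.principal_id

auth_client.role_assignments.create(
    scope=storage_id,                        # the data lake resource ID from earlier
    role_assignment_name=str(uuid.uuid4()),  # role assignment names must be GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=principal_id,
        principal_type="ServicePrincipal",   # managed identities assign as service principals
    ),
)
```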

Testing - Data Lake connection

Finally, we can test our data lake connection using a Linked service. A Linked service is similar to a connection string: it stores the server name, authentication method, username, and password information.

  1. In Azure Data Factory Studio, click on 'Manage' then 'Linked services' under 'Connections' section.
  2. Click on '+ New', then select 'Azure Data Lake Storage Gen2'. Click 'Continue' to proceed to the next step.
  3. Provide the Linked service name, select the integration runtime we just created, select 'Managed Identity' for the Authentication method, and select the data lake.
  4. Click on 'Test connection'.
  5. If the connection is successful, it will show a green checkmark with 'Connection successful'.
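
For completeness, here is a sketch of the same linked service through the SDK, continuing the earlier snippets. Leaving out all credential properties should make the linked service fall back to the factory's system-assigned managed identity; the URL and names are placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    IntegrationRuntimeReference,
    LinkedServiceResource,
)

# ADLS Gen2 linked service routed through the managed VNet integration
# runtime created earlier. No credential properties are set, so the
# factory's managed identity is used for authentication.
linked_service = AzureBlobFSLinkedService(
    url="https://<your-data-lake>.dfs.core.windows.net",  # placeholder URL
    connect_via=IntegrationRuntimeReference(reference_name="ir-project1-managed-vnet"),
)

client.linked_services.create_or_update(
    resource_group,
    factory_name,
    "ls_project1_datalake",  # placeholder name
    LinkedServiceResource(properties=linked_service),
)
```

Note that the 'Test connection' button itself is a Studio feature, so even after creating the linked service this way, the connection is still verified from the Studio as described in the steps above.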



Summary

Azure Data Factory with a Managed Virtual Network provides secure connections to many Azure services, such as Data Lake, Synapse, and Key Vault, with only a few steps. With Private endpoints, we are able to keep all network traffic on the Microsoft backbone and restrict it from the public internet.

We can expand this solution to access on-premises databases without the use of the Self-hosted integration runtime gateway.

Happy Learning!