How To Set Up Virtual Network Integration On Azure Data Lake Storage

Introduction

In this article, we will learn how to apply network controls, through Virtual network integration, to the Azure Data Lake Storage account created in the previous article.

By setting up Virtual network integration, we allow services on a chosen Virtual network and subnet to connect to and access the data lake without adding an IP exception to the firewall. It is important to note that the connection between the Virtual network and the data lake still goes through the data lake's public endpoint and IP address.
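
To make the idea concrete, here is a toy model of the access decision described above. It is only a sketch for illustration, not Azure's actual implementation: the `StorageFirewall` class, its field names, and the subnet identifier `data-platform-vnet/vm` are all hypothetical.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class StorageFirewall:
    """Toy model of a storage firewall (illustrative only, not Azure's implementation)."""
    allowed_ip_ranges: list = field(default_factory=list)   # IP exceptions, e.g. ["203.0.113.0/24"]
    allowed_subnet_ids: set = field(default_factory=set)    # hypothetical subnet identifiers

    def is_allowed(self, source_ip, source_subnet_id=None):
        # A request from an allowed subnet passes without any IP exception.
        if source_subnet_id is not None and source_subnet_id in self.allowed_subnet_ids:
            return True
        # Otherwise the request must match an entry in the IP exception list.
        addr = ip_address(source_ip)
        return any(addr in ip_network(cidr) for cidr in self.allowed_ip_ranges)


# Assumed setup: one allowed subnet, no IP exceptions at all.
firewall = StorageFirewall(allowed_subnet_ids={"data-platform-vnet/vm"})
print(firewall.is_allowed("10.0.1.4", "data-platform-vnet/vm"))  # True: VM on the allowed subnet
print(firewall.is_allowed("198.51.100.7"))                        # False: home IP with no exception
```

In this model, a VM on the allowed subnet gets in even though the IP exception list is empty, which is the behaviour we want for our admin machines.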

Challenges

In the last article, we created our data lake storage with the firewall enabled. To manage and configure the data lake, we had to add the admin IP addresses to the firewall exceptions. This is a security concern because many of us are working from home and our IP addresses can change. Furthermore, Data Engineers and Data Scientists also need access to the data lake. We need a solution that does not require modifying the firewall exception list.

We can achieve this by implementing Virtual network integration with our data lake. To keep things simple, we will use Virtual Machines (VMs) provided by our infrastructure team in Azure to access and manage our data lake.

Tutorial

Before we dive into the tutorial, let's recap what we achieved in the last article:

  • Created Azure Data Lake Storage through the Azure Portal.
  • Enabled infrastructure encryption, encryption-at-rest, and TLS 1.2.
  • Enabled the firewall and allowed trusted Azure services to access the data lake.
  • Enabled Soft delete for containers and blobs.
  • Disabled Blob public access.
  • Created custom tags for cost tracking.

Virtual network integration

  1. The infrastructure team has provisioned the Data platform admin team with virtual machines to support Cloud services. The VMs are created on the same subnet within the Virtual Network (VNet). We need to ensure the subnet has the 'Microsoft.Storage' service endpoint enabled:

    1. Select the Virtual network and the Subnet 'vm'. This will bring up a sub-panel with the subnet information.
    2. Under Service Endpoints, click on the 'Services' dropdown to expand the list.
    3. Check 'Microsoft.Storage' if it is not already checked, then click 'Save'.
    4. Wait until the notification confirms the Service Endpoint is added before proceeding to the next step.

  2. Now we can add the Virtual Network and Subnet to our data lake:

    1. Navigate to our data lake, under 'Security & networking', select 'Networking'.
    2. Under the 'Firewalls and virtual networks' tab, under 'Virtual networks', click on '+ Add existing virtual network'. This will bring up the 'Add networks' side panel.
    3. Select the Virtual Network and Subnet where the Virtual machines are created (with the service endpoint enabled), then click 'Ok' to confirm the selection.
    4. To finalize our settings, click 'Save' under the 'Firewalls and virtual networks' tab. 

  3. Now we can log in to our VM to verify that our settings are working:

    1. From the VM, log in to the Azure portal and navigate to our data lake.
    2. Click on the 'Storage Explorer (preview)' menu item. This will open the Cloud Storage Explorer.
    3. Expand the containers to ensure you do not get an error. In my case, I created some containers to show that it opens correctly.

  4. Lastly, we can repeat the steps in the tutorial to add multiple Virtual networks and Subnets to our data lake.
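
If we need to repeat steps 1 and 2 for several subnets, the portal clicks can also be scripted with the Azure CLI. Below is a minimal sketch that only builds and prints the commands by default. The resource group, VNet, and storage account names are placeholders I made up, and it assumes the VNet and the data lake live in the same resource group (otherwise `--subnet` would need the full subnet resource ID). Also note that `--service-endpoints` takes the full list of endpoints for the subnet, so check whether other endpoints are already enabled there before running it for real.

```python
import shlex
import subprocess


def build_commands(resource_group, vnet, subnets, storage_account):
    """Return the Azure CLI commands for each subnet, as argument lists."""
    commands = []
    for subnet in subnets:
        # Step 1: enable the 'Microsoft.Storage' service endpoint on the subnet.
        commands.append([
            "az", "network", "vnet", "subnet", "update",
            "--resource-group", resource_group,
            "--vnet-name", vnet,
            "--name", subnet,
            "--service-endpoints", "Microsoft.Storage",
        ])
        # Step 2: add the subnet to the data lake's virtual network rules.
        commands.append([
            "az", "storage", "account", "network-rule", "add",
            "--resource-group", resource_group,
            "--account-name", storage_account,
            "--vnet-name", vnet,
            "--subnet", subnet,
        ])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder names (assumptions); only the subnet 'vm' comes from this tutorial.
commands = build_commands("rg-data-platform", "vnet-data-platform", ["vm"], "mydatalake")
run(commands)  # dry run: prints the commands without calling Azure
```

Passing more subnet names in the list generates one pair of commands per subnet. Calling `run(commands, dry_run=False)` would execute them, which requires the Azure CLI to be installed and logged in.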

Summary

In this tutorial, we enhanced our security by removing the need to update the firewall rules whenever our IP addresses change. We achieved this by implementing Virtual network integration between the data lake and the subnet that hosts our virtual machines in Azure.

Happy Learning!