How To Set Up Git Source Control In Azure Data Factory

Introduction

In this article, we will explore how to connect your git repository with Azure Data Factory. This will allow us to track changes and manage our code throughout the development lifecycle.

Challenges

In our earlier article, we successfully deployed our Azure Data Factory pipelines. All the source code is stored inside Data Factory, and any saved change goes live immediately. The engineering team needs a development area where they can save partially developed code, and a testing area that is separate from the live system.

We can achieve this by enabling git integration in Azure Data Factory. An added advantage is that our source code is kept in a repository, where code changes can be tracked and reviewed.

Let's go over some concepts first:

  1. Live mode: All saved changes are 'live'. All the source code is stored in Data Factory and there is no source control repository. This mode does not allow partially developed code to be saved, because every change is live. It is not recommended for a development environment.
  2. Feature branch: A git branch created for development. The pipelines in this branch can only be executed through debug runs, not triggers. After development is complete, the code needs to be merged into the Collaboration branch.
  3. Collaboration branch: A git branch that stores the JSON code. All the files in this branch are ready for testing.
  4. Publish branch: A git branch that holds the ARM templates generated from the Collaboration branch when it is published, which represent the live code. By default, this is adf_publish.

Creating git repo and branches

Before we can dive into the git integration in ADF, let's create our git repo. This can be done in Azure DevOps or GitHub. I have created the following:

  1. Repo name: adf-repo.
  2. Collaboration branch: dev.
  3. Publish branch: adf_publish_dev.

Creating 'dev'-specific collaboration and publish branches gives us the flexibility to add UAT and production environments in the future.
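
If you prefer to create the two branches from a local clone rather than through the Azure DevOps or GitHub web interface, the commands can be scripted. The following is a minimal sketch, assuming git is installed, the repo has already been cloned and contains at least one commit, and the remote is named 'origin'. The helper names branch_commands and create_branches are made up for illustration.

```python
import subprocess


def branch_commands(branches, remote="origin"):
    """Build the git commands that create each branch and push it to the remote."""
    commands = []
    for branch in branches:
        # Create the branch locally from the current commit and switch to it.
        commands.append(["git", "checkout", "-b", branch])
        # Push it and set the upstream so later pushes need no arguments.
        commands.append(["git", "push", "--set-upstream", remote, branch])
    return commands


def create_branches(repo_dir, branches, remote="origin"):
    """Run the commands inside an existing local clone (assumed to exist)."""
    for command in branch_commands(branches, remote):
        subprocess.run(command, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Branch names are the ones used in this article.
    # This only prints the commands; call create_branches() to run them.
    for command in branch_commands(["dev", "adf_publish_dev"]):
        print(" ".join(command))
```

Keeping the command list separate from the code that runs it makes it easy to review exactly what will be executed before touching the repo.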

Integrating git with ADF

In this example, I will be using Azure DevOps. If you are using GitHub, the steps will be similar.

  1. Open Azure Data Factory Studio, select 'Manage'. Under 'Source control', click on 'Git configuration'.

  2. For Repository type, select 'Azure DevOps Git' and select the Azure Active Directory tenant name. Click 'Continue'. 

  3. Provide the Azure DevOps information then click 'Apply':

    • Select the DevOps project name and repository name.
    • Collaboration branch, select 'dev'.
    • Publish branch, select 'adf_publish_dev'.
    • Since we already have pipelines in our Data Factory, we want to import all the source code into the repo. We do this by checking 'Import existing resources to repository' and selecting 'dev' as the branch.
  4. Select the working branch 'dev', then click 'Save'.

  5. Let's check the Azure DevOps repository to see if our source code has been imported. Make sure the 'adf-repo' repository and the 'dev' branch are selected.

  6. Let's go back into Azure Data Factory Studio. Click on 'Author'.

    • We can see the 'dev' branch is selected by default.
    • For any new development, we want to work in a feature branch. We can do this by clicking on 'New branch'. This will allow us to save partially developed work. It is important to remember that after testing the pipeline, we need to use 'Create pull request' to merge the code back into the 'dev' branch.
    • Finally, we can 'Publish' all our changes. This generates ARM templates from the pipelines in the 'dev' branch, saves them to the 'adf_publish_dev' branch, and makes the new changes 'live'.
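
The choices made in step 3 can also be written down as configuration, which is handy for review or for repeating the setup on another factory. The sketch below assumes the property names of the factory's ARM repoConfiguration block for Azure DevOps (type FactoryVSTSConfiguration) and the publish_config.json file that Data Factory reads from the root folder of the collaboration branch. The organization and project names ('my-org', 'my-project') are placeholders, not values from this article.

```python
import json


def repo_configuration(account, project, repository, collaboration_branch,
                       root_folder="/"):
    """Return a dict shaped like the factory's repoConfiguration for Azure DevOps."""
    return {
        "type": "FactoryVSTSConfiguration",  # Azure DevOps Git
        "accountName": account,              # Azure DevOps organization (placeholder)
        "projectName": project,              # Azure DevOps project (placeholder)
        "repositoryName": repository,
        "collaborationBranch": collaboration_branch,
        "rootFolder": root_folder,
    }


def publish_config(publish_branch):
    """Return the content of publish_config.json for a custom publish branch."""
    return {"publishBranch": publish_branch}


if __name__ == "__main__":
    # Repo and branch names are the ones used in this article.
    config = repo_configuration("my-org", "my-project", "adf-repo", "dev")
    print(json.dumps(config, indent=2))
    print(json.dumps(publish_config("adf_publish_dev"), indent=2))
```

Treat this as a record of the settings rather than a replacement for the steps above. Check the property names against the current Data Factory documentation before using them in a deployment template.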

Summary

By enabling git integration in Azure Data Factory, we gain the ability to:

  • Use source control to manage our source code.
  • Develop in a feature branch and save partially developed pipelines without impacting the 'live' system.

This approach enables us to take advantage of continuous integration and delivery (CI/CD). I have provided a link in the References section if you are interested in adding CI/CD to your solution.
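
As a starting point for a release pipeline, it can help to see what a publish actually produced. Publishing writes an ARM template to the publish branch (typically a file named ARMTemplateForFactory.json), and the sketch below counts the resources in such a template by type. The load_template path argument and the sample template are assumptions for illustration, not output from a real factory.

```python
import json
from collections import Counter


def count_resources_by_type(template):
    """Count the entries of an ARM template's 'resources' list by their 'type'."""
    return Counter(
        resource.get("type", "unknown")
        for resource in template.get("resources", [])
    )


def load_template(path):
    """Load an ARM template JSON file from a local checkout of the publish branch."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Hypothetical, hand-written sample standing in for a published template.
    sample = {
        "resources": [
            {"type": "Microsoft.DataFactory/factories/pipelines"},
            {"type": "Microsoft.DataFactory/factories/pipelines"},
            {"type": "Microsoft.DataFactory/factories/datasets"},
        ]
    }
    for resource_type, count in sorted(count_resources_by_type(sample).items()):
        print(resource_type, count)
```

Comparing these counts before and after a publish is a quick sanity check that the expected pipelines and datasets made it into the template.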

Happy Learning.

References