Basic Components in Azure Data Factory

Azure Data Factory

Azure Data Factory (ADF) is a data orchestration and integration service provided by Microsoft Azure that supports more than 90 built-in connectors.

If you are planning to migrate your project to Azure, this is one of the most critical services to incorporate into the architecture.

It also supports migrating SSIS packages deployed on-premises to Azure, as well as complex transformations using Data Flows.


This article focuses on the key components that help you successfully build an ADF pipeline:

1. Pipeline

A pipeline is a workflow that you create to group and run all of your processing steps. You can schedule these workflows to execute automatically at the times you want.
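
For illustration, here is a minimal sketch of creating and running a pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and pipeline names are placeholders, and the single Wait activity is only there so the pipeline has something to execute; real pipelines hold the activities described in the next section.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# Placeholder names: replace with your own subscription, resource group and factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# A pipeline is an ordered collection of activities; a Wait activity keeps this example minimal.
pipeline = PipelineResource(activities=[WaitActivity(name="WaitOneMinute", wait_time_in_seconds=60)])
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)

# Pipelines can also be run on demand; scheduled runs are handled by triggers (covered later).
run = adf_client.pipelines.create_run(rg_name, df_name, "DemoPipeline", parameters={})
print(run.run_id)
```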

2. Activities

All the tasks that run inside a pipeline are called activities, and there is a different activity type for each purpose. If you want to copy data from a source to a sink, you use the Copy activity; similarly, to execute a stored procedure in SQL, you use the Stored Procedure activity. I will write a detailed article explaining each ADF activity in the future.
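
Continuing the sketch above, this is roughly how the Copy activity for the Oracle-to-CSV example could be defined with the same SDK. The dataset names are hypothetical and are created in the Dataset section below.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, OracleSource, DelimitedTextSink, PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# Copy activity: reads from the Oracle dataset and writes to the CSV dataset
# (both dataset names are hypothetical and defined in the Dataset section below).
copy_activity = CopyActivity(
    name="CopyOracleToCsv",
    inputs=[DatasetReference(type="DatasetReference", reference_name="OracleTableDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CsvOutputDataset")],
    source=OracleSource(),
    sink=DelimitedTextSink(),
)

# Activities only execute as part of a pipeline.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyOracleToCsvPipeline", pipeline)
```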

3. Dataset

A dataset is a named view of your data that is passed to activities. For example, if you want to copy data from an Oracle table to a CSV file, you create two datasets, one of Oracle type and one of CSV type, and pass them to the Copy activity as its input and output.
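
A rough sketch of the two datasets for that example is shown below. The linked service names are placeholders (they are created in the next section), and the property names mirror the Oracle table and DelimitedText dataset JSON definitions, so they may vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, OracleTableDataset, DelimitedTextDataset,
    AzureBlobFSLocation, LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# Source dataset: a view over a table in the Oracle database.
oracle_ds = DatasetResource(properties=OracleTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="OracleLinkedService"),
    table_name="SALES.ORDERS",  # hypothetical schema.table
))
adf_client.datasets.create_or_update(rg_name, df_name, "OracleTableDataset", oracle_ds)

# Sink dataset: a CSV file in an Azure Data Lake Storage Gen2 container.
csv_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AdlsLinkedService"),
    location=AzureBlobFSLocation(file_system="raw", folder_path="orders", file_name="orders.csv"),
    column_delimiter=",",
    first_row_as_header=True,
))
adf_client.datasets.create_or_update(rg_name, df_name, "CsvOutputDataset", csv_ds)
```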

4. Linked Service

Continuing the previous example of copying data from an Oracle database table to a CSV file, a linked service holds the connection information: one for the Oracle database and one for the storage where the file will be saved, for instance an Azure Data Lake Storage container.
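
Here is one way the two linked services could be defined, again as a sketch with placeholder connection details; in a real project the secrets would normally come from Azure Key Vault or a managed identity rather than being hard-coded, and the exact property names may differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, OracleLinkedService, AzureBlobFSLinkedService, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# Connection to the on-premises Oracle database (placeholder connection string).
oracle_ls = LinkedServiceResource(properties=OracleLinkedService(
    connection_string=SecureString(
        value="Host=<onprem-host>;Port=1521;Sid=<sid>;User Id=<user>;Password=<password>"),
))
adf_client.linked_services.create_or_update(rg_name, df_name, "OracleLinkedService", oracle_ls)

# Connection to the Azure Data Lake Storage Gen2 account that will hold the CSV file.
adls_ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://<storage-account>.dfs.core.windows.net",
    account_key=SecureString(value="<account-key>"),
))
adf_client.linked_services.create_or_update(rg_name, df_name, "AdlsLinkedService", adls_ls)
```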

5. Integration Runtime

An integration runtime is the compute infrastructure that acts as the communication bridge between ADF and your data stores. In the Oracle table copy example, Oracle is an on-premises database while the Azure Data Lake Storage container is in the cloud. To let on-premises and cloud communicate, you need a gateway on the on-premises side that is linked to the data factory in question, and that is exactly what an Integration Runtime (IR) provides. There are three types of IRs: the Azure IR (the default auto-resolve runtime), the Self-Hosted IR, and the Azure-SSIS IR.
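
As a sketch, registering a self-hosted IR and retrieving the key that the on-premises installer asks for could look like this; the IR name and description are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import IntegrationRuntimeResource, SelfHostedIntegrationRuntime

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# Register a self-hosted IR in the factory; the runtime itself is installed on an on-premises machine.
ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
    description="Gateway for the on-premises Oracle database"))
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "OnPremSelfHostedIR", ir)

# This key is pasted into the self-hosted IR installer on the on-premises machine; the Oracle
# linked service then points at the IR through its connectVia property.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremSelfHostedIR")
print(keys.auth_key1)
```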

6. Trigger

Once pipeline development is complete, the pipeline can be scheduled to run at a set frequency as per your requirements, and triggers are what let you do that. There are several types of triggers available, of which the schedule trigger is the simplest and most commonly used. We will learn about these triggers in detail in future articles.
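
As a final sketch, a daily schedule trigger attached to the earlier pipeline could be created like this; the trigger and pipeline names are placeholders, and in older SDK versions the start call is triggers.start rather than triggers.begin_start.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "demo-rg", "demo-adf"

# Run the pipeline once a day, starting a few minutes from now.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.now(timezone.utc) + timedelta(minutes=15),
    time_zone="UTC",
)
trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(type="PipelineReference", reference_name="DemoPipeline"),
        parameters={},
    )],
))
adf_client.triggers.create_or_update(rg_name, df_name, "DailyTrigger", trigger)

# Triggers are created in a stopped state and must be started before they fire.
adf_client.triggers.begin_start(rg_name, df_name, "DailyTrigger").result()
```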