Parameterize Pipelines And Datasets In Azure Data Factory With Demo

Introduction

In continuation of our previous article, we will look at how we can use parameterization in datasets and pipelines. We will also implement a pipeline with a simple copy activity to see how and where we can apply parameters in Azure Data Factory.

Consider a scenario where you want to run numerous pipelines with many datasets for your project. Managing a separate linked service for every server name, database, and set of credentials quickly becomes hard to maintain and tedious. What if there were a way to use a single linked service or pipeline for all of these requirements? Sounds good, right? That is exactly what parameterization does. It lets us supply the server name, database name, credentials, and so on dynamically when the pipeline runs, so we can reuse one resource instead of creating a new one for each request.

Parameterization in Datasets

Let’s look at a demo to get familiar with parameterization in datasets. In my previous article, I discussed how to use parameterization in linked services. There, I created a parameter for the database name, which means I can supply different database names dynamically at run time instead of creating a separate linked service for each requirement.
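As a quick recap of that setup, a parameterized Azure SQL linked service definition looks roughly like the sketch below. The linked service name, server name, and connection string are placeholders, not the exact ones from the previous article, and the authentication settings are omitted for brevity.

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "dbName": { "type": "String" }
        },
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().dbName};"
        }
    }
}
```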

Now go to Azure Data Factory Studio and create a new dataset as shown below.

I have a SQL database in my Azure subscription which I will use for this demo. You can pick any database of your choice from the given list.

After creating the dataset, I select the linked service I created earlier and then add two parameters through the Parameters tab: dbName and tableName. Remember that dbName is the parameter we created on the linked service earlier, so it is referenced here, and tableName is added alongside it to drive the table the dataset points at. The default values can be left blank because we will pass the values dynamically when running the pipeline.
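For reference, the JSON behind such a dataset looks roughly like the sketch below; the dataset and linked service names are placeholders I chose for illustration. Notice how the dataset's dbName parameter is passed through to the linked service's dbName parameter, and how tableName drives the table the dataset points at.

```json
{
    "name": "ParameterizedSqlDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLinkedService",
            "type": "LinkedServiceReference",
            "parameters": {
                "dbName": {
                    "value": "@dataset().dbName",
                    "type": "Expression"
                }
            }
        },
        "parameters": {
            "dbName": { "type": "String" },
            "tableName": { "type": "String" }
        },
        "typeProperties": {
            "tableName": {
                "value": "@dataset().tableName",
                "type": "Expression"
            }
        }
    }
}
```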

Once this is done, you can reference this dataset anywhere in a pipeline and pass values to it at run time.
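For example, wherever an activity needs this dataset, the dataset reference carries the parameter values; the literal values below are just placeholders to show the shape.

```json
{
    "referenceName": "ParameterizedSqlDataset",
    "type": "DatasetReference",
    "parameters": {
        "dbName": "DemoDB",
        "tableName": "dbo.Employee"
    }
}
```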

Parameterization in Pipelines

The demo task we are looking at today is to copy records from one table to another in a SQL database. We will create a new pipeline and then drag the ‘Copy data’ activity from ‘Move & transform’ onto the canvas. Once you click the Copy data activity, you will see multiple tabs for configuring the source, sink (destination), settings, and so on.

Of all the tabs provided here, the ones we will work with are Source and Sink. Naturally, we have to select the dataset we created in the previous section in the ‘Source’ tab. As soon as we select the source dataset, the parameters we defined on it are listed out, dbName and tableName in this case.

A similar setup applies to the destination, the ‘Sink’ tab. Since this is a demo, I am copying data from one table to another in the same database; you could instead make the sink another SQL database or any other destination of your choice. We can see the Sink tab lists out the same dataset parameters to fill in.

By this time, if you are wondering what values to give the parameters for both source and sink, let me explain. Since I am copying records from one table to another in the same database, it would be very simple to hardcode tableName and dbName, but that would restrict the pipeline to those hardcoded values and prevent reuse with any other table or database. We can overcome this by creating parameters at the pipeline level.

I have created three pipeline parameters: SourceDBPipeline and SourceTable (mapped to the dataset's dbName and tableName) for the source, and DestTable (the destination table the records will be copied to) for the sink. Now that all the parameters have been created and set, how are we going to use them? Where are these parameters supplied, and at which step do we actually enter the database and table names? Here comes the final part…
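Putting this together, the pipeline JSON would look roughly like the sketch below. The pipeline, activity, and dataset names are placeholders of my own choosing, but the sketch shows how the three pipeline parameters are declared and then passed into the dataset parameters of the copy activity's source and sink through @pipeline().parameters expressions, instead of the literal values shown earlier.

```json
{
    "name": "CopyEmployeePipeline",
    "properties": {
        "parameters": {
            "SourceDBPipeline": { "type": "String" },
            "SourceTable": { "type": "String" },
            "DestTable": { "type": "String" }
        },
        "activities": [
            {
                "name": "CopyEmployeeData",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "ParameterizedSqlDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "dbName": "@pipeline().parameters.SourceDBPipeline",
                            "tableName": "@pipeline().parameters.SourceTable"
                        }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "ParameterizedSqlDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "dbName": "@pipeline().parameters.SourceDBPipeline",
                            "tableName": "@pipeline().parameters.DestTable"
                        }
                    }
                ],
                "typeProperties": {
                    "source": { "type": "AzureSqlSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```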

Trigger

I have discussed triggers in one of my previous blogs: a trigger is the scheduler or mechanism by which we run our pipeline. Here we are going to supply the parameter values at the moment we trigger this pipeline.

We can see the Trigger button at the top center, above the pipeline window. Just click it and create a new trigger.

Once you have entered the details and clicked OK, you will land on a new window called ‘Trigger Run Parameters’, where we enter the database and table names as per our requirement each time we run the pipeline.
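Under the hood, the values entered in that window end up in the trigger definition, roughly like the sketch below. The trigger name, schedule, and parameter values here are illustrative assumptions, not the exact ones from this demo.

```json
{
    "name": "CopyEmployeeTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2022-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyEmployeePipeline",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "SourceDBPipeline": "DemoDB",
                    "SourceTable": "dbo.Employee",
                    "DestTable": "dbo.Employee2"
                }
            }
        ]
    }
}
```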

After entering the details, click OK and then publish the pipeline to run it. Remember to debug it before you run it.

Demo

I have created a SQL database table called Employee and filled it with a few dummy rows just for this demo. This will act as the source table, and its records will be copied over to the new table by the pipeline.

I have also created another, empty table, Employee2, as the destination table. Let's trigger the pipeline and see whether the records are copied over.

We can see the pipeline succeeded.

We can see the records have been copied to the new table, Employee2.

Summary

In this article, we saw a demo of how end-to-end parameterization can be implemented on both datasets and pipelines in a practical scenario. I hope this is helpful. We will look at more Azure data topics in the coming weeks.
