Azure Data Factory

In this article, we’ll learn about datasets, the JSON format they are defined in, and how they are used in Azure Data Factory pipelines. The article includes a sample dataset definition with its properties described, covers the types of datasets and the data stores supported by Data Factory, lists the tools for creating datasets, and compares the two versions of Data Factory in tabular format. The naming rules for Data Factory artifacts are also discussed in detail, followed by a brief introduction to CI/CD for Azure Data Factory.

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task; the activities in the pipeline define the actions to be performed on the data.
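
As a rough sketch, a pipeline is itself defined in JSON, with its activities listed under the properties element. The placeholders below stand in for real names and activity definitions and are not taken from a specific factory:

{
    "name": "<name of pipeline>",
    "properties": {
        "description": "<optional description of the pipeline>",
        "activities": [
            "<one or more activity definitions>"
        ]
    }
}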

A dataset can be understood as a named view of data; it points to or references the data that the activities use as inputs and outputs. Datasets identify data within different data stores, for instance tables, files, folders, and documents. As an example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity reads the data.

Before creating a dataset, a linked service must be created in order to link the data store to the data factory. Linked services define the connection information that Data Factory needs to connect to external resources. For example, an Azure Storage account is linked to the data factory through an Azure Storage linked service; the input blobs to be processed reside in that storage account, and the Azure Blob dataset represents the container and folder that contain them.
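
As an illustration, a minimal Azure Blob Storage linked service definition might look like the sketch below. The name matches the linked service referenced by the sample dataset later in this article, while the connection string values are placeholders:

{
    "name": "AzureBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<storage account key>"
        }
    }
}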

The relationship between dataset, activity, pipeline, and linked service in Data Factory is shown in the diagram below.

Activity

An activity refers to a task that is performed on the data. Activities are used inside Azure Data Factory (ADF) pipelines; an ADF pipeline is basically a group of one or more activities. For instance, an ADF pipeline that performs ETL combines multiple activities to extract data, transform it, and load it into a data warehouse. Examples of activities are Hive, Stored Procedure, Copy, MapReduce, and so on.

Hive – The Hive activity is an HDInsight activity that executes Hive queries on a Linux- or Windows-based HDInsight cluster and is used to analyze and process structured data.

Stored Proc – The Stored Procedure activity in a Data Factory pipeline invokes a SQL Server stored procedure. Azure SQL Database, Azure Synapse Analytics, and SQL Server Database are some of the data stores where this activity can be used.

Copy – The Copy activity copies data from a source location to a destination location. Numerous data store locations such as NoSQL stores, file systems, Azure Storage, and Azure databases are supported.
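
To make this concrete, below is a hedged sketch of a Copy activity definition as it would appear inside a pipeline's activities list. It references an input and an output dataset by name; the output dataset name AzureSqlOutput and the sink type are illustrative assumptions, not taken from this article's setup:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "DelimitedTextInput",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "AzureSqlOutput",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}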

Dataset in Data Factory 

The JSON below shows how a dataset is defined in Data Factory.

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: DelimitedText, AzureSqlTable etc...>",
        "linkedServiceName": {
            "referenceName": "<name of linked service>",
            "type": "LinkedServiceReference",
        },
        "schema": [],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>",
        }
    }
}

The properties in the JSON above are described in the table below.

Property  Description  Required
name  The name of the dataset.  Yes
type  The type of the dataset. One of the types supported by Data Factory (for example, DelimitedText or AzureSqlTable) must be specified.  Yes
schema  The schema of the dataset, representing the physical data type and shape of the data.  No
typeProperties  The type-specific properties; these differ for each dataset type.  Yes


Types of Datasets 

Azure Data Factory supports many different types of datasets, depending on the data stores that are used. The Connector overview lists the data stores supported by Data Factory.

The following JSON defines a Delimited Text dataset, with the type property set to DelimitedText.

{
    "name": "DelimitedTextInput",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "input.log",
                "folderPath": "inputdata",
                "container": "adfgetstarted"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}
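
A dataset of a different type carries different typeProperties. As a hedged illustration (the linked service, schema, and table names below are placeholders, not part of this article's setup), an Azure SQL Database table dataset might look like this:

{
    "name": "AzureSqlOutput",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabaseLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "MyTable"
        },
        "schema": []
    }
}

Here the typeProperties identify a table rather than a file location, while the rest of the structure follows the template shown earlier.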

The data stores supported by Data Factory are listed below.

Category  Data Stores
Azure  Azure Blob Storage, Azure Data Lake Storage Gen1, Azure SQL Database, Azure Synapse Analytics, Azure Cosmos DB (SQL API), Azure Table Storage, Azure Cognitive Search Index
Databases  SQL Server, MySQL, PostgreSQL, Oracle, Amazon Redshift, DB2, SAP Business Warehouse, SAP HANA, Sybase, Teradata
NoSQL  Cassandra, MongoDB
File  FTP, File System, Amazon S3, HDFS, SFTP
Others  Generic HTTP, Generic OData, Generic ODBC, Web Table (table from HTML), Salesforce

Creating Datasets 

Datasets can be created using tools or SDKs such as the Azure portal, PowerShell, the REST API, the .NET API, and Azure Resource Manager templates.
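
For instance, when datasets are deployed through an Azure Resource Manager template, each dataset appears as a resource of type Microsoft.DataFactory/factories/datasets. The sketch below is illustrative and reuses the DelimitedText example from earlier; the factoryName parameter and the dependsOn entry depend on the actual template being deployed:

{
    "type": "Microsoft.DataFactory/factories/datasets",
    "apiVersion": "2018-06-01",
    "name": "[concat(parameters('factoryName'), '/DelimitedTextInput')]",
    "dependsOn": [
        "<reference to the AzureBlobStorage linked service resource>"
    ],
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "<type specific property>": "<value>"
        },
        "schema": []
    }
}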

Moreover, there are a few differences between the Version 1 datasets and the current version datasets of Data Factory.

Data Factory Version 1 Datasets  Data Factory Current Version Datasets
The external property is used.  The external property is replaced by triggers.
The policy and availability properties are used.  The pipeline start time depends on triggers.
Scoped datasets are supported.  Scoped datasets are deprecated.


Naming Rules 

The various naming rules for Data Factory artifacts are listed in the table below.

Name  Uniqueness of Name  Checks for Validation
Data Factory  Case-insensitive; e.g., Df and DF refer to the same data factory. Names are unique across Microsoft Azure.  Object names must start with a letter or a number and can contain only letters, numbers, and the dash (-) character. Every dash must be immediately preceded and followed by a letter or a number; consecutive dashes are not permitted. Names must be 3 to 63 characters long.
Linked Services/Datasets/Pipelines/Data Flows  Case-insensitive; e.g., ls and LS refer to the same linked service, dataset, or pipeline. Names are unique only within a data factory.  Object names must start with a letter. The characters “.”, “+”, “?”, “/”, “<”, “>”, “*”, “%”, “&”, “:”, and “\” are not allowed. The dash (-) character is not allowed either.
Integration Runtime  Case-insensitive; e.g., IR and ir refer to the same integration runtime.  Can contain letters, numbers, and the dash (-) character. The first and last characters must be a letter or a number. Every dash must be immediately preceded and followed by a letter or a number; consecutive dashes are not permitted.
Data Flow Transformations  Case-insensitive; e.g., tr and TR refer to the same Data Flow transformation. Names are unique only within a data flow.  Can contain only letters and numbers. The first character must be a letter.
Resource Group  Case-insensitive; e.g., RG and rg refer to the same resource group. Names are unique across Microsoft Azure.
Pipeline Parameters and Variables  Case-insensitive; e.g., PV and pv refer to the same parameter or variable.  For backward-compatibility reasons, the validation check on parameter names and variable names is limited to uniqueness.


CI/CD in Azure Data Factory

Continuous integration (CI) is the practice of automatically testing every change made to the codebase as early as possible. Continuous delivery (CD) follows the testing that happens during continuous integration and pushes the changes to a staging or production system.

In Azure Data Factory, CI/CD refers to moving Data Factory pipelines from one environment, such as development, test, or production, to another. Azure Data Factory uses Azure Resource Manager templates to store the configuration of its various ADF entities, such as pipelines, datasets, and data flows.

There are two suggested methods for promoting a data factory to another environment (a sample Resource Manager parameters file is sketched after this list):

  • Manually uploading a Resource Manager template using the Data Factory UX integration with Azure Resource Manager
  • Automated deployment using the Data Factory integration with Azure Pipelines
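
With either method, environment-specific values, such as the factory name and linked service connection strings, are supplied per environment through a Resource Manager parameters file. The sketch below is illustrative only; the exact parameter names depend on what the generated template for a given factory exposes:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "<name of the target data factory>"
        },
        "AzureBlobStorage_connectionString": {
            "value": "<connection string for the target environment>"
        }
    }
}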

Conclusion

Thus, in this article, we learned about datasets in Azure Data Factory, starting with an overview of Data Factory and its pipelines, followed by a sample dataset definition and its JSON properties. We then looked at the types of datasets, a sample DelimitedText dataset, and the differences between the dataset versions of Data Factory. Finally, we covered the naming rules for Data Factory artifacts, their uniqueness and validation checks, and a brief introduction to CI/CD in Azure Data Factory.