Azure Data Factory

Ojash Shrestha

In this article, we’ll learn about datasets, the JSON format in which they are defined, and how they are used in Azure Data Factory pipelines. The article includes a sample dataset definition with its properties described, covers the types of datasets, and lists the data stores that Data Factory supports. It also lists the tools for creating datasets and compares the Data Factory versions in tabular form. Finally, it discusses the naming rules in detail and gives a brief introduction to CI/CD for Azure Data Factory.

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task; the activities in a pipeline define the actions to be performed on the data.

A dataset is a named view of data that refers to the data to be used by activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. Take Azure Blob storage as an example: an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity reads the data.

Before creating a dataset, you must create a linked service to link the data store to the data factory. Linked services define the connection information that Data Factory needs to connect to external resources. For example, an Azure Storage linked service links a storage account to the data factory, and an Azure Blob dataset represents the container and folder within that storage account that hold the input blobs to be processed.
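
As a rough sketch of how these references chain together, the Python snippet below models a pipeline whose activity points at datasets by name, and datasets that point at a linked service by name. All names (AzureStorageLinkedService, InputBlobDataset, OutputDataset, SamplePipeline) are hypothetical, and the definitions are trimmed down to the reference fields only; this is not how Data Factory itself validates a factory.

```python
# Hypothetical, trimmed-down definitions keyed by name; real ones carry more properties.
linked_services = {
    "AzureStorageLinkedService": {"type": "AzureBlobStorage"},
}

datasets = {
    "InputBlobDataset": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference",
        },
    },
}

pipeline = {
    "name": "SamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyInputBlobs",
                "type": "Copy",
                "inputs": [{"referenceName": "InputBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputDataset", "type": "DatasetReference"}],
            }
        ]
    },
}


def unresolved_references(pipeline, datasets, linked_services):
    """Return (kind, name) pairs for references that do not resolve to a definition."""
    missing = []
    for activity in pipeline["properties"]["activities"]:
        for ref in activity.get("inputs", []) + activity.get("outputs", []):
            name = ref["referenceName"]
            dataset = datasets.get(name)
            if dataset is None:
                missing.append(("dataset", name))
                continue
            ls_name = dataset["linkedServiceName"]["referenceName"]
            if ls_name not in linked_services:
                missing.append(("linked service", ls_name))
    return missing


print(unresolved_references(pipeline, datasets, linked_services))
```

Here the output dataset has not been defined yet, so the check reports it. This mirrors the ordering described above: the linked service comes first, then the dataset that refers to it, then the activity that uses the dataset.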

The diagram below shows the relationship between datasets, activities, pipelines, and linked services in Data Factory.

Activity

An activity is a task performed on the data. Activities are used inside Azure Data Factory (ADF) pipelines; an ADF pipeline is essentially a group of one or more activities. For instance, an ADF pipeline that performs ETL combines multiple activities, such as extracting data, transforming it, and loading it into a data warehouse. Examples of activities include Hive, Stored Procedure, Copy, and MapReduce.

Hive – The Hive activity is an HDInsight activity that executes Hive queries on a Windows-based or Linux-based HDInsight cluster. It is used to process and analyze structured data.

Stored Proc – The Stored Procedure activity invokes a stored procedure from a Data Factory pipeline. It can be used with data stores such as Azure SQL Database, Azure Synapse Analytics, and SQL Server Database.

Copy – The Copy activity copies data from a source location to a destination location. It supports numerous data stores, including NoSQL stores, files, Azure Storage, and Azure databases.

Dataset in Data Factory

A dataset in Data Factory is defined in the following JSON format.

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: DelimitedText, AzureSqlTable etc...>",
        "linkedServiceName": {
            "referenceName": "<name of linked service>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>"
        }
    }
}

The properties in the above JSON are described in the table below.

Property	Description	Required
name	The name of the dataset.	Yes
type	The type of the dataset. It must be one of the types supported by Data Factory.	Yes
schema	The schema of the dataset, which represents the physical data type and shape.	No
typeProperties	Properties specific to the dataset type; each type has its own set.	Yes
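
To make the "Required" column concrete, here is a minimal Python sketch that reports which of the required properties from the table are missing in a dataset definition. The function name missing_required and the sample definition are illustrative assumptions; the check covers only the properties listed in the table and is not Data Factory's own validation.

```python
import json


def missing_required(dataset_json: str) -> list[str]:
    """Return the required properties (per the table above) absent from a dataset definition."""
    dataset = json.loads(dataset_json)  # strict JSON: trailing commas raise an error
    missing = []
    if "name" not in dataset:
        missing.append("name")
    properties = dataset.get("properties", {})
    for key in ("type", "typeProperties"):
        if key not in properties:
            missing.append(f"properties.{key}")
    return missing


# Hypothetical definition that omits typeProperties on purpose.
sample = """
{
    "name": "DelimitedTextInput",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "schema": []
    }
}
"""

print(missing_required(sample))
```

Because schema is optional, leaving it out would not be reported; only name, type, and typeProperties are checked.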

Types of Datasets

Azure Data Factory supports many different types of datasets, depending on the data stores you use. The Connector overview lists the data stores that Data Factory supports.

In the following JSON, the dataset type is set to DelimitedText for a delimited text dataset.

{
    "name": "DelimitedTextInput",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "input.log",
                "folderPath": "inputdata",
                "container": "adfgetstarted"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}
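
To illustrate what columnDelimiter, quoteChar, and escapeChar mean in practice, the sketch below feeds the same three values to Python's standard csv module. This is only an approximation for intuition: Data Factory's own parser may treat edge cases differently, and the helper name parse_lines and the sample line are assumptions.

```python
import csv

# The three format settings from the dataset's typeProperties above.
TYPE_PROPERTIES = {
    "columnDelimiter": ",",
    "escapeChar": "\\",
    "quoteChar": "\"",
}


def parse_lines(lines, type_properties):
    """Split delimited text lines into columns using the dataset's format settings."""
    reader = csv.reader(
        lines,
        delimiter=type_properties["columnDelimiter"],
        quotechar=type_properties["quoteChar"],
        escapechar=type_properties["escapeChar"],
    )
    return list(reader)


# Hypothetical input line: a quoted field containing the delimiter,
# and an unquoted field with an escaped delimiter.
sample_line = r'1,"Smith, John",a\,b'

for row in parse_lines([sample_line], TYPE_PROPERTIES):
    print(row)
```

The quoted field keeps its embedded comma, and the escape character stops the comma in the last field from being treated as a column boundary, so the line yields three columns.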

The data stores supported by Data Factory are listed below.

Category	Data Store
Azure	Azure Blob Storage, Azure Data Lake Storage Gen1, Azure SQL Database, Azure Synapse Analytics, Azure Cosmos DB (SQL API), Azure Table Storage, Azure Cognitive Search Index
Databases	SQL Server, MySQL, PostgreSQL, Oracle, Amazon Redshift, DB2, SAP Business Warehouse, SAP HANA, Sybase, Teradata
NoSQL	Cassandra, MongoDB
File	FTP, File System, Amazon S3, HDFS, SFTP
Others	Generic HTTP, Generic OData, Generic ODBC, Web Table (table from HTML), Salesforce

Creating Datasets

Datasets can be created using tools or SDKs such as the Azure portal, Azure Resource Manager templates, PowerShell, the REST API, and the .NET API.

There are also a few differences between Data Factory version 1 datasets and current version datasets.

Data Factory Version 1 Datasets	Data Factory Current Version Datasets
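The external property is used.	The external property is replaced by triggers.
The policy and availability properties are used.	These properties are not used; the start time of a pipeline depends on triggers.
Scoped datasets (datasets defined within a pipeline) are supported.	Scoped datasets are deprecated.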

Naming Rules

The various naming rules for Data Factory artifacts are listed in the table below.

Name	Uniqueness of Name	Checks for Validation
Data Factory	Case-insensitive, e.g. df and DF refer to the same data factory. Names are unique across the Azure platform.	Each data factory is tied to exactly one Azure subscription. Names must start with a letter or a number and can contain only letters, numbers, and the dash (-) character. Every dash must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted. Names must be 3-63 characters long.
Linked services/Datasets/Pipelines/Data Flows	Case-insensitive, e.g. ls and LS refer to the same linked service, dataset, or pipeline. Names are unique only within a data factory.	Names must start with a letter. The following characters are not allowed: “.”, “+”, “?”, “/”, “<”, “>”, “*”, “%”, “&”, “:”, “\”. The dash (-) character is not allowed either.
Integration Runtime	Case-insensitive, e.g. IR and ir refer to the same integration runtime.	Names can contain only letters, numbers, and the dash (-) character. The first and last characters must be a letter or a number. Every dash must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted.
Data Flow Transformations	Case-insensitive; names that differ only in case refer to the same transformation. Names are unique only within a data flow.	Names can contain only letters and numbers. The first character must be a letter.
Resource Group	Case-insensitive, e.g. RG and rg refer to the same resource group. Names are unique across the Azure platform.
Pipeline Parameters and Variables	Case-insensitive, e.g. PV and pv refer to the same pipeline parameter or variable.	For backward compatibility reasons, validation of parameter and variable names is limited to uniqueness.
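
The validation checks in the table translate fairly directly into code. The following Python sketch encodes them as small predicates; the function names are hypothetical, and treating "letter" as an ASCII letter (A-Z, a-z) is an assumption made for the example, not something the table states.

```python
import re

# One or more alphanumeric runs joined by single dashes: no leading, trailing,
# or consecutive dashes. ASCII-only letters are an assumption of this sketch.
_DASHED_NAME = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
_TRANSFORMATION_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")
_FORBIDDEN_ENTITY_CHARS = set('.+?/<>*%&:\\-')


def is_valid_factory_name(name: str) -> bool:
    """Data factory: 3-63 characters, alphanumerics and single interior dashes."""
    return 3 <= len(name) <= 63 and _DASHED_NAME.fullmatch(name) is not None


def is_valid_integration_runtime_name(name: str) -> bool:
    """Integration runtime: same character rules, no length limit given in the table."""
    return _DASHED_NAME.fullmatch(name) is not None


def is_valid_entity_name(name: str) -> bool:
    """Linked service, dataset, pipeline, or data flow: starts with a letter, no forbidden characters."""
    starts_with_letter = re.match(r"[A-Za-z]", name) is not None
    return starts_with_letter and not (_FORBIDDEN_ENTITY_CHARS & set(name))


def is_valid_transformation_name(name: str) -> bool:
    """Data flow transformation: letters and numbers only, starting with a letter."""
    return _TRANSFORMATION_NAME.fullmatch(name) is not None


def is_same_name(first: str, second: str) -> bool:
    """All names in the table are case-insensitive."""
    return first.lower() == second.lower()


print(is_valid_factory_name("my-data-factory"))   # dashes between alphanumerics
print(is_valid_factory_name("my--factory"))       # consecutive dashes
print(is_valid_entity_name("Input-Dataset"))      # dash not allowed for datasets
print(is_same_name("LS", "ls"))
```

A check like this is useful mainly as an early lint step before deployment; the service remains the final authority on which names it accepts.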

CI/CD in Azure Data Factory

Continuous integration is the practice of automatically testing every change made to the codebase as early as possible. Continuous delivery follows the testing that occurs during continuous integration and pushes the changes to a staging or production system.

In Azure Data Factory, CI/CD means moving Data Factory pipelines from one environment (development, test, or production) to another. Azure Data Factory uses Azure Resource Manager templates to store the configuration of the various ADF entities, such as pipelines, datasets, and data flows.

There are two suggested methods for promoting a data factory to another environment:

Manually uploading a Resource Manager template using the Data Factory UX integration with Azure Resource Manager
Automating the deployment using the Data Factory integration with Azure Pipelines
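
Either method ends up deploying the same Resource Manager template with different parameter values per environment. As a minimal sketch of that idea, the Python snippet below builds a Resource Manager parameter file for a target environment by overriding a set of base values. The parameter names (factoryName, AzureBlobStorage_connectionString), the factory names, and the placeholder values are assumptions for illustration; a real template defines its own parameters, and secrets would normally not be stored in plain text.

```python
import json

# Hypothetical parameter names and values, for illustration only.
BASE_PARAMETERS = {
    "factoryName": "adf-sample-dev",
    "AzureBlobStorage_connectionString": "<dev connection string>",
}

ENVIRONMENT_OVERRIDES = {
    "test": {
        "factoryName": "adf-sample-test",
        "AzureBlobStorage_connectionString": "<test connection string>",
    },
    "prod": {
        "factoryName": "adf-sample-prod",
        "AzureBlobStorage_connectionString": "<prod connection string>",
    },
}


def build_parameter_file(environment: str) -> dict:
    """Return a Resource Manager parameter file body for the given environment."""
    values = {**BASE_PARAMETERS, **ENVIRONMENT_OVERRIDES.get(environment, {})}
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


print(json.dumps(build_parameter_file("test"), indent=2))
```

Keeping the template fixed and varying only the parameter file is what lets the same pipelines, datasets, and data flows be promoted unchanged from development to test to production.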

Conclusion

In this article, we learned about datasets in Azure Data Factory, along with pipelines, activities, and linked services, and walked through a sample dataset definition and the properties of its JSON. We then covered the types of datasets with a sample DelimitedText dataset, and the differences between version 1 and current version datasets. We also looked at the naming rules for Data Factory artifacts, including name uniqueness and the validation checks. Finally, we took a brief look at CI/CD in Azure Data Factory.