ETL Using Azure Data Factory

Azure Data Factory is a service that has been part of the Azure ecosystem for a while, but with version 2 of the service, Microsoft is reclaiming the data integration space. In simple terms, Data Factory can be described as SSIS in the cloud (this does not do full justice to SSIS, which is a far more mature tool than Data Factory, but the idea is the same).

As Microsoft leans further into a cloud-first strategy, it is moving most of its flagship products to the cloud, such as Office, SQL Server, and now SSIS in the form of Data Factory.

In this short demo, I will show you how you can use Data Factory to process JSON files and load the data into SQL Azure.

  1. Log in to the Azure portal and create a new Data Factory.

  2. Click Author & Monitor; this will open the Data Factory user interface.

  3. Select the Author menu item to open the designer.

  4. Create a source and a destination dataset. In our demo, the source is a JSON file that we will place in Blob storage, and the destination is a SQL Azure table.

  5. I have created a simple user.json file with sample user data, as sketched below, and uploaded it to Blob storage.

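    A file of roughly this shape works for the walkthrough; the field names (id, name, email, country) and values are illustrative placeholders rather than the exact contents of my original file:

        [
          {
            "id": 1,
            "name": "John Doe",
            "email": "john.doe@example.com",
            "country": "USA"
          },
          {
            "id": 2,
            "name": "Jane Smith",
            "email": "jane.smith@example.com",
            "country": "UK"
          }
        ]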

  6. I have also created a table in the SQL Azure database with fields matching those in the JSON file.

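    For reference, a table along these lines matches the sample file above; the table and column names are again placeholders, so adjust them to your own schema:

        -- Illustrative target table matching the sample user.json above
        CREATE TABLE dbo.[User]
        (
            Id      INT           NOT NULL PRIMARY KEY,
            Name    NVARCHAR(100) NOT NULL,
            Email   NVARCHAR(256) NULL,
            Country NVARCHAR(50)  NULL
        );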

  7. Now, let us create the source dataset. From the long list of dataset options, we will select Azure Blob Storage, as our user.json file is stored there.

  8. For the linked service, we will have to provide the details to connect to the Blob storage account where the user.json file is placed.

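    Behind the UI, the linked service is stored as a JSON definition. A sketch of an Azure Blob Storage linked service looks roughly like this; the name and connection string are placeholders, and the exact properties can vary by Data Factory version:

        {
          "name": "AzureBlobStorageLinkedService",
          "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
              "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<key>"
            }
          }
        }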

  9. Set up the Connection tab. The column names and JSON path expressions will be auto-generated.

  10. Go to the Schema tab and click Import Schema. This exposes the schema to the Copy activity in the pipeline we will create later.
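
    Under the covers, the source dataset configured in steps 7-10 is also stored as JSON. A sketch of a dataset pointing at user.json in a blob container might look like this; the names, container, and exact type properties are illustrative and depend on the Data Factory version:

        {
          "name": "UserJsonSource",
          "properties": {
            "linkedServiceName": {
              "referenceName": "AzureBlobStorageLinkedService",
              "type": "LinkedServiceReference"
            },
            "type": "Json",
            "typeProperties": {
              "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "user.json"
              }
            }
          }
        }
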
  11. Next, we will create the destination dataset. In the same way as the source dataset, we will select Azure SQL Database to set up the destination dataset.

  12. Configure the destination dataset; i.e., set the connection details so that Data Factory can insert the data extracted from the JSON file into the SQL Azure table.

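    The connection details entered here end up in an Azure SQL Database linked service, which in JSON form looks roughly like the sketch below. The server, database, and credentials are placeholders; in practice the password is better kept in Azure Key Vault:

        {
          "name": "AzureSqlDatabaseLinkedService",
          "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
              "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True;"
            }
          }
        }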

  13. On the Connection tab, select the target table in the database that Data Factory will push data into. Then go to the Schema tab to import the table schema. The Copy activity will use this to map the schema we imported in step 10 to the table schema.
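
    The destination dataset itself is another small JSON definition, something along these lines, using the illustrative table from step 6; the names and exact type properties again vary by version:

        {
          "name": "UserSqlSink",
          "properties": {
            "linkedServiceName": {
              "referenceName": "AzureSqlDatabaseLinkedService",
              "type": "LinkedServiceReference"
            },
            "type": "AzureSqlTable",
            "typeProperties": {
              "schema": "dbo",
              "table": "User"
            }
          }
        }
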
  14. Create a new pipeline with a Copy activity.

  15. Configure the source for the Copy activity.

  16. Configure the destination (sink) for the Copy activity.

  17. Configure the mapping for the Copy activity.

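    Taken together, steps 14-17 correspond to a pipeline definition roughly like the following sketch, with the source, sink, and column mappings of the Copy activity spelled out. The dataset names and mappings match the illustrative examples above; the exact source/sink types and mapping syntax depend on the Data Factory version:

        {
          "name": "CopyUserJsonToSqlPipeline",
          "properties": {
            "activities": [
              {
                "name": "CopyUserData",
                "type": "Copy",
                "inputs": [ { "referenceName": "UserJsonSource", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "UserSqlSink", "type": "DatasetReference" } ],
                "typeProperties": {
                  "source": {
                    "type": "JsonSource",
                    "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true }
                  },
                  "sink": { "type": "AzureSqlSink", "writeBatchSize": 10000 },
                  "translator": {
                    "type": "TabularTranslator",
                    "mappings": [
                      { "source": { "path": "$['id']" },      "sink": { "name": "Id" } },
                      { "source": { "path": "$['name']" },    "sink": { "name": "Name" } },
                      { "source": { "path": "$['email']" },   "sink": { "name": "Email" } },
                      { "source": { "path": "$['country']" }, "sink": { "name": "Country" } }
                    ]
                  }
                }
              }
            ]
          }
        }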

  18. Click Publish to save the configuration.

  19. Once publishing completes, click the Debug button to run the pipeline. During execution, Data Factory will read the JSON, buffer the data, and insert the records into the user table in batches.

  20. Once the run completes, a green tick will appear against the Run ID.

  21. Now, we can query the table in the SQL Azure database to validate the data. You can see that the data in the JSON file has been ported to the SQL Azure table.

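    A quick query against the target table is enough to confirm the copy, using the illustrative table from step 6:

        -- Verify that the rows from user.json landed in the table
        SELECT Id, Name, Email, Country
        FROM dbo.[User];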

Summary

Data Factory is an awesome tool for executing ETL from a wide range of sources, such as JSON, CSV, and flat files, to a wide range of destinations, such as SQL Azure, Cosmos DB, AWS S3, Azure Table storage, Hadoop, and the list goes on. I highly recommend considering Data Factory for any ETL use case. There are also plenty of options for error reporting and notification.