Filter Realtime Error Rows From CSV To SQL Database Table In Azure Data Factory - Part One

Vinodh Kumar
3y
6.1k
0
3

Article

Azure Data Factory is a tool with tremendous capabilities when it comes to ETL operations. It has many features that would help the users to cure and transform the data that we load into it. The developers or the users face many real-time issues when performing their ETL operations one such common yet unavoidable scenario mostly faced is handling the error or incomplete data. We cannot pre-filter the error rows inside the ocean of data that we load into the ETL, hence we look for a method to pick those error or truncated rows from the raw data and store them in a separate database table or file on top of which a separate process will be carried out to decide on whether to cure and accept or simply delete those rows.

In this article, we are going to see how we could save the error rows into a separate SQL database table to review later. I have loaded my demo data files into blob storage which contains the sales figure of a sample company from Jan-Mar 2021.

We could see that my data has a lot of error rows in-between which we are going to segregate into separate SQL database tables.

I have created two identical SQL tables which would store the correct rows and error rows respectively.

One thing to note when creating the table is though both of them contains same structure and column name, I have made the ‘Date2’ column in ‘Sales_table_ErrorRows’ table as varchar instead of date type. This is because we are routing only the error fields into that column so if you create the structure with datetype, it won’t accept the incoming ambiguous data.

Open azure data factory studio and as a first step, I am going to create a new data flow as below. We have to add the source file

Once when I selected the new dataset option following are the steps I performed to link the data source into the dataflow.

Dataset - New - Azureblob Storage - Delimited text(CSV) - Connect to your linked server or create new if not available already.

Now that data source has been created, we need to create destination dataset on which the data is going to get stored. Here in our demo, it is the SQL database tables. Link to the SQL database and the first table that you want the right data to be stored into. I have created a new azure SQL database for this purpose, you can use the existing one or the one you already have.

Checking the schema of the table we just mapped to make sure that we have the ‘Sales_table’ first.

Let’s jump to the data flow to get started as we now have all the pre-requisites met. The next step is to add link to our existing data flow but before that we have to perform the validation and pick the error rows out which is the objective of this demo. As a first step, I will click on the ‘SalesData’ dataflow and use the ‘Data Preview’ tab to view the rows. One thing to note here is we must have ‘Data flow debug’ enabled in order for the data preview to work.

The highlighted field in the data preview below is one of the error rows which we are going to segregate into a separate table.

Now back to data flow where we can add transformation flow to filter out the data and divert the valid rows into ‘Sale_tables’ and the error rows into ‘Sales_table_ErrorRows’.

Click the ‘+’ button to add a ‘Conditional Split’ transformation into the data flow. Once opened the below pane would list out the fields where you can keep the default options for the first 3 fields. The split condition is the very important field which we are going to focus on.

The first row of the split condition is where we will identify the error rows based on the condition we are going to specify and the highlighted visual expression builder will be useful in building up our condition to check the Date value in our data is valid or junk row.

There are two conditions provided with the split condition, let’s look at the first condition. Click on the input schema to list out the columns from your source and the Date2 is our date column which we need to validate and split out the error rows.

We can straightforward convert our date value using “toDate” expression but since our column data might contain error or incomplete rows, it is better to include “isNull” so that it will return True if it resolves Null and False for any other values. Now save and finish.

Continued with Part Two….