Delete Files Based On Excel Column Name In Azure Data Factory

Introduction

In this article, we are going to discuss how to pick and delete only specific files from the ADLS directory based on filenames present inside the excel file. We are going to see this with a real-time use case.

File deletion

Recently I came across a requirement for file deletion in ADLS. Azure Data Factory’s delete activity is enough to complete this task, but the tricky part is the activity should delete only the objects that are present in the excel 'File Name' column. Let us see how this was achieved with a practical demo.

For this demo, I am going to upload few sample files including the ones which I am trying to delete into a folder in my ADLS Gen2 directory. Apart from this I must create an excel file that includes the filenames to be deleted present in column B; like below. One common question that might arrive here is how we can pick exactly the file name column if there are multiple columns present in the excel sheet. This can be achieved using the range option during creation of the source dataset which we will see in the next section.

Delete files based on excel column name in Azure Data Factory

File names in column B

Delete files based on excel column name in Azure Data Factory

Steps

Click and drag Lookup activity into your ADF pipeline canvas which is going to read out the file names that you want to be deleted.

Delete files based on excel column name in Azure Data Factory

Next go the settings tab and create the dataset which contain the excel file with the filename column. In the source dataset option click on +New and set file type as excel, then you will arrive at the following step.

Delete files based on excel column name in Azure Data Factory

  1. Path of the excel file which will function as source with all the filenames to be deleted.
  2. Select the sheet name if you excel sheet has more than one.
  3. Range is the one which you can mention to pick the cells starting and ending including the row and column name. In my case if you could scroll up the excel sheet screenshot you can see the filenames are present from column B cell 2 through 10, hence I have marked it as a range. In this way it will read only the specific data ignore all other which we do not need.

In range field, I have limited It to certain rows only as I know which column and up until which row the file names would be and that it will not be changing. In case you are going to append or delete the file names into the excel in future you can leave the range field open as ‘B2’ instead of ‘B2:B9’.

Now after the dataset is set, we are going to pass the output of the lookup activity into the foreach activity. Once under the foreach ‘settingsàitems’ tab enter the following expression for passing the output from lookup activity.

@activity('SourceNameFileLookup').output.value

Delete files based on excel column name in Azure Data Factory

Inside the foreach activity, create a delete activity and set the dataset pointing towards the folder where you want to delete the files from like below.

Delete files based on excel column name in Azure Data Factory

So far we have setup pipeline but did not pass the file name from the foreach to the delete activity. With the current setup the delete activity will delete the entire folder containing the files that we are searching for. Hence we have to make sure we have passed the filename variable to the filename parameter in the delete activity to make sure only the files are deleted.

In the dataset used by the delete activity, create a string variable called ‘filename’ and pass it on to the filename field of the dataset File path.

Delete files based on excel column name in Azure Data Factory

Delete files based on excel column name in Azure Data Factory

In the source settings of the delete activity, I am calling out the column name of the excel file that we are looking up initially. The column name is ‘Prop_0’ which I chose to leave it as a system generated one, in case your lookup file has headers you can mention in the dataset and call out here.

The pipeline is now ready, lets run it to see if we are getting the files deleted.

Delete files based on excel column name in Azure Data Factory

The pipeline has been triggered and completed successfully.

Delete files based on excel column name in Azure Data Factory

We can see it has deleted all the files compared to out input in the excel column and below are the only files those are left after the activity.

Delete files based on excel column name in Azure Data Factory

Summary

Most of the projects has this requirement to clean-up specific set of stale files to reduce cloud cost from the blob based on name/type referring them directly from a file rather than input them manually. The above steps would come handy in such scenarios.

Hope this helps, let me know for any questions.


Similar Articles