Change Data Capture In Azure Synapse Analytics

Introduction

In today’s blog, we will look at a new feature called 'Change Data Capture', which is now available in preview for both Azure Synapse Analytics and Azure Data Factory.

Change Data Capture

In data terminology, Change Data Capture (CDC) is a method to track and pick up only the data that has changed since the last known point in time. CDC is already available in SQL Server for finding changed records in a database table, so database users will be somewhat familiar with it. In Synapse, it is not limited to databases; you can also use it at the file level to detect only the changed files in ADLS blob storage. No special pipelines or activities are required for CDC, and the options for enabling it are simple and quick to configure. In the background, CDC works from a ‘Checkpoint’, which makes sure the changed records are read and copied correctly with no overlaps or duplicates; this is discussed further later in this article.

The easiest way to get started with CDC is to use the ‘+’ symbol in the designer canvas, the same way you would create a new pipeline. Once clicked, it walks you through configuring your source and destination and lets you apply additional transformations if required.

Checkpoint

The checkpoint performs the main function: it makes sure that only the data changed at the source since the last pipeline run is read. The checkpoint is tied to both the pipeline name and the activity name; if you rename either, the checkpoint is reset and the next run must start either from the beginning or from now. The ‘Checkpoint key’ option in the data flow activity is a workaround for this.
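
To make the checkpoint idea concrete, here is a minimal Python sketch. It is not how Synapse implements checkpoints internally; the names CheckpointStore, checkpoint_id and read_changes, and the 'modified' marker on each row, are all hypothetical and chosen only for illustration. It shows why renaming a pipeline or activity loses the checkpoint, and why an explicit checkpoint key avoids that.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointStore:
    """Hypothetical in-memory store: checkpoint id -> last processed marker."""
    marks: dict = field(default_factory=dict)


def checkpoint_id(pipeline, activity, checkpoint_key=None):
    # An explicit key survives renames; otherwise the id follows the names.
    return checkpoint_key or f"{pipeline}/{activity}"


def read_changes(rows, store, pipeline, activity, checkpoint_key=None):
    """Return only rows whose 'modified' marker is newer than the checkpoint."""
    key = checkpoint_id(pipeline, activity, checkpoint_key)
    last = store.marks.get(key, 0)  # no checkpoint yet: read from the beginning
    changed = [r for r in rows if r["modified"] > last]
    if changed:
        # Advance the checkpoint so the next run skips these rows.
        store.marks[key] = max(r["modified"] for r in changed)
    return changed


rows = [{"id": 1, "modified": 1}, {"id": 2, "modified": 2}]
store = CheckpointStore()
first_run = read_changes(rows, store, "pipeline1", "dataflow1")
second_run = read_changes(rows, store, "pipeline1", "dataflow1")
after_rename = read_changes(rows, store, "pipeline1_renamed", "dataflow1")
```

In this sketch the first run reads both rows, the second run reads nothing, and the run after the rename reads everything again because its checkpoint id no longer matches. Passing the same checkpoint_key before and after the rename would keep the id stable.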

Types

CDC for database records

Native CDC is the basic and recommended way to get changed data from database tables. It places a low burden on the source, because the CDC feature in Synapse Analytics or Data Factory extracts only the changed data for processing.

CDC for file-based storages

Mapping data flows provide options to pick up only new or updated files, which is the simplest way to get a delta load of files. The only limitation of file-based copy at this point is that it can detect new or updated files, but not deleted files or folders.
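
The sketch below illustrates the general idea of file-level change detection using last-modified times on a local folder. It is an assumption-laden illustration in Python, not the mapping data flow implementation: the function name changed_files and the use of the local file system in place of ADLS are hypothetical. It also shows why deletions go unnoticed: a deleted file simply no longer appears in the listing, so there is nothing to compare against the checkpoint.

```python
import os


def changed_files(folder, last_checkpoint):
    """Return (paths, new_checkpoint) for files modified after last_checkpoint.

    last_checkpoint is a modification time in seconds since the epoch.
    """
    changed = []
    newest = last_checkpoint
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # this sketch only looks at files in the top folder
            mtime = entry.stat().st_mtime
            if mtime > last_checkpoint:
                changed.append(entry.path)
                newest = max(newest, mtime)
    # Deleted files never show up here, so they cannot be reported.
    return sorted(changed), newest
```

A caller would keep the returned checkpoint and pass it to the next call, so each run sees only files that are new or were updated since the previous run.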

Billing

Billing is calculated only for the general-purpose data flow while data is being processed. CDC lets you set the wake-up latency to the interval you need, and you are billed only for the time spent looking for changed data to pick up from your source dataset.

Summary

This article gives a basic understanding of the Change Data Capture (preview) feature. A practical use case with steps and screenshots will be published in an upcoming article.