Change Data Capture - Another Way To Implement The Incremental Load

As we saw in the previous article, which was about the definition and implementation of full load and delta load using different SSIS components, in this article I'll discuss one of the functionalities of SQL SERVER: CDC (Change Data Capture). This feature has been present since the 2008 version. This presentation will be about the 2012 version.

What is the CDC?

The CDC Control task is used to control the life cycle of change data capture (CDC) packages. It handles CDC package synchronization with the initial load package, the management of Log Sequence Number (LSN) ranges that are processed in a run of a CDC package. In addition, the CDC Control task deals with error scenarios and recovery. [Microsoft Documentation]

The objective of the CDC is to optimize the integration of data (of the ETL process) by directly requesting the modifications made to a table instead of working on the entire table and thereby increasing processing times. Among other things, it allows basic auditing and synchronization between two databases.

Change data capture is a concept that is not specific to SQL Server (it's present in other DBMS such as Postgres, Oracle…), and which consists of tracking and recovering changes to data in a table.

The CDC was implemented at the SQL Server 2008 level, but only at the database engine level, and the concept of the Log Sequence Number (LSN) was used, which makes the implementation of the CDC under SSIS more complex.

The log sequence number (LSN) value is a three-part, uniquely incrementing value. It is used for maintaining the sequence of the transaction log records in the database. This allows SQL Server to maintain the ACID properties and to perform appropriate recovery actions.

Starting from SQL Server 2012, Microsoft goes further in its approach and introduced 3 main components to use the CDC directly in SSIS:

  • A CDC Control Task: managing the life cycle of CDC packages, and in particular all the mechanics of LSNs. 
    SQL Server CDC Control task
  • A Source CDC:  reads the information from a data capture exchange table
    SQL Server CDC Control task
  • A CDC Splitter: redirects the rows according to whether they need to be inserted, updated, or deleted.
    SQL Server CDC Control task

Setup Change Data Capture on the source database

On this table, I activate the CDC. Here it is the same as under SQL Server 2008, we find the same commands.

EXEC sp_changedbowner 'sa'

/* Activate the CDC */
EXEC sp_cdc_enable_db

/* Verify if the CDC is activated */
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name LIKE 'LearningDatabase'

/* Parametrize the CDC for the table STG.Employee  */
EXEC sys.sp_cdc_enable_table
@source_schema = N'SRC'
, @source_name = N'Employee'
, @role_name = NULL
, @supports_net_changes = 1

It is important to note that, there is no necessity for tables in the source database to include a column that indicates the date and time of the last modification. This means that no structural changes are needed in order to enable CDC for extraction.

Make sure that SQL Server Agent is running as a SQL Server Agent Job is used to capture CDC data.

Change Data Capture Control Flow Task in SSIS

Starting from SSIS 2012, the CDC Control Task was introduced as the new control flow task to implement the change data capture working with the CDC enabled databases and tables.

This new feature works on controlling the life cycle of change set for both CDC marked database and table :

  • it enables the SSIS package to use CDC change set
  • it applies the transfer of data as required, and finally
  • it marks the change set as accomplished, or in case of an error it retains the change set for further analysis.

 The CDC Control Task holds the state of CDC into a package variable (defined when configuring the component), used later in CDC Data Flow components.

SQL Server CDC Control task

1. Starting the CDC for a table

Let’s start by configuring the CDC Control Task where we need to Mark CDC for Start. 

Drag and drop a CDC Control Task into the package.

SQL Server CDC Control task

And follow the configuration like below:

  1. Set a connection manager (ADO.NET Connection Manager) to the source database.
  2. Set CDC Control Operation as: Mark CDC Start
  3. Set a variable of type string for CDC State.
  4. Set the connection for the database contains state data.
  5. Set the table for storing CDC state. You can create a new table here if you don’t have any table for it already : 
    CREATE TABLE [dbo].[cdc_states] 
    ([name] [nvarchar](256) NOT NULL, 
     [state] [nvarchar](256) NOT NULL) ON [PRIMARY]
     ( [name] ASC ) 

    SQL Server CDC Control task

  6. Verify or Set the State Name values.
    SQL Server CDC Control task

We can now run the package. But what happens if the task is run successfully?

Actually since we didn’t set any actions for the change set, no data will be transferred.

The aim of this task is to set the CDC state in cdc_state table.

Note that this task with the above configuration needs to be run one and only once.

SQL Server CDC Control task

If we query the cdc_states table, we can see that the state has a timestamp portion showing the data and time of the state storage.

By definition, this state represents the state of table stored so the SSIS can recognize the very first state of the Change Data Capture, and get the range of changes afterwards

SQL Server CDC Control task

We will disable the task, as we don’t want to run it again.

2. Working with the Range of changes with CDC Control Task

The next step is creating two CDC control tasks: one for getting the range and the other for marking it. To better explain it, we need to retrieve the range of data that has updates and then mark it as processed.

  1. We need to create a new CDC Control Task and configure it exactly as we did for the CDC Control Task – Start, with changing the CDC control operation as “Get Processing Range”.
    SQL Server CDC Control task
  2. We need to create another CDC Control task same as previous one and set the CDC control operation as Mark Processed.
    SQL Server CDC Control task
  3. We need to place a Data Flow Task between these two tasks. We will leave the data flow empty as we should fill it in the next stage. The aim of using the Data Flow is to read the change set and execute the appropriate action based on the ETL load actions (delete, insert, or update)
    SQL Server CDC Control task

3. Reading the Changed Set with CDC Source and CDC Splitter

In the data flow task we have to read the changed set with the help of CDC change set table, CDC data flow components (CDC Source, and CDC Splitter), and CDC state (stored by CDC Control Tasks). CDC Source is a component that read the change set and provide it as the main output so it can be used for CDC splitter. CDC Splitter split the change set into three data set: Inserted, Deleted, and Updated outputs. For this example, I’ve used a stage table as the destination of this step to be able to write result sets into it. my state table is exactly same as the source table plus single column for Status. I’ll fill the status column in Derived Column depends on the change happened on the data row.

Moving to the data flow task, we need to read the changed set using :

  • the CDC change set table
  • the CDC data flow components (CDC Source, and CDC Splitter)
  • the CDC state (stored by CDC Control Tasks)

The CDC Source reads the change set and supply it as the input which will be used by the CDC Splitter

The CDC Splitter splits the change set into 3 branchs: Inserted - Updated - Deleted

  1. Create a CDC Source component in the data flow.
  2. Set ADO.NET connection manager to the source database which has CDC enabled.
  3. Set CDC enabled table name
  4. Verify the CDC processing mode to be Net :  the CDC processing mode is set to Net in order to capture the net changes rather than capturing all records
  5. Set CDC state variable the same variable that we’ve used in CDC Control Tasks.

SQL Server CDC Control task

After that, we created a CDC Splitter component after the CDC Source. (it doesn’t require any configuration, we need just to connect the CDC source to it)

We need to create a Derived Column transformation and connect InsertOutput/ UpdateOutput/ DeleteOutput of the CDC Splitter to it.

In each one of it, we need to create a Status Column in Derived Column and set its value as :

  • 0 for InsertOutput
  • 1 for UpdateOutput
  • 2 for DeleteOutput

SQL Server CDC Control task

SQL Server CDC Control task

SQL Server CDC Control task

SQL Server CDC Control task

Then we use a Union All transformation to integrate all three data outputs together, so we can load them into the staging table using an OLE DB Destination.

Please note that we may encounter Data conversion issues between the source and the destination, in this case we may use a Data Conversion component.

SQL Server CDC Control task

You can find the repository here. Enjoy ! :D