Concept Of "Change Data Capture" (CDC) In SQL Server - Part One

Introduction

CDC records INSERTs, UPDATEs, and DELETEs are applied to SQL Server tables and make a record available of what, where, and when the changes happen, in simple relational ‘change tables’ rather than in an esoteric chopped salad of XML. These change tables contain columns that reflect the column structure of the source table you have chosen to track, along with the metadata needed to understand the changes made. Pinal Dave explains all, with plenty of examples in a simple introduction.

Change Data Capture

Often, you are told that the specification of an application requires that the value of data in an application's database must be recorded before it is changed. In other words, we are required to save all the history of the changes to the data. This feature is usually implemented for data security purposes. To implement this, I have seen a variety of solutions, from triggers, timestamps, and complicated queries (stored procedures) to audit data.

SQL Server introduced the new features of ‘after the update,’ ‘after insert,’ and ‘after delete’ triggers that almost solved the problem of tracking changes in data. A better solution was introduced in SQL Server called Change Data Capture (CDC). CDC has allowed SQL Server developers to deliver SQL Server data archiving and capturing without additional programming.

CDC is a new data tracking and capturing feature of SQL Server. It only tracks changes in user-created tables. Because captured data is then stored in relational tables, it can be easily accessed and retrieved using regular T-SQL.

When you apply the "Change Data Capture" features on a database table, a mirror of the tracked table is created with the same column structure of the original table but with additional columns that include the metadata used to summarize the nature of the change in the database table row. The SQL Server DBA can then easily monitor the activity for the logged table using these new audit tables.

Enabling Change Data Capture on a Database

First, CDC has to be enabled for the database. Because CDC is a table-level feature, it must be enabled for each table to be tracked. You can run the following query and check whether it is enabled for any database.

USE master   
GO   
SELECT [name], database_id, is_cdc_enabled   
FROM sys.databases   
GO   

This query will return the entire database name and a column showing whether CDC is enabled.

SQL Server

You can run this stored procedure in the context of each database to enable CDC at the database level. (The following script will enable the CDC in the AdventureWorks database. )

USE AdventureWorks   
GO   
EXEC sys.sp_cdc_enable_db   
GO   

As soon as CDC is enabled, it will show this result in SSMS.

SQL Server

Additionally, in the AdventureWorks database, you will see that a schema with the name ‘cdc’ has been created.

SQL Server

Some System Tables will have been created within the AdventureWorks database as part of the cdc schema.

SQL Server

The tables which have been created are listed here.

  • cdc.captured_columns– This table returns results for a list of captured columns.
  • cdc.change_tables– This table returns a list of all the tables enabled for capture.
  • cdc.ddl_history– This table contains the DDL changes history since capture data was enabled.
  • cdc.index_columns– This table contains indexes associated with the changing table.
  • cdc.lsn_time_mapping– This table maps the LSN number (which we will learn later) and time.

Enabling Change Data Capture on one or more Database Tables

The CDC feature can be applied at the table level to any database for which CDC is enabled. It has to be enabled for any table which needs to be tracked. First, run the following query to show which database tables have already been enabled for CDC.

USE AdventureWorks   
GO   
SELECT [name], is_tracked_by_cdc   
FROM sys.tables   
GO   

The above query will return a result that includes a column with the table name and a column that displays if CDC is enabled.

SQL Server

You can run the following stored procedure to enable each table. Before enabling CDC at the table level, ensure you have enabled SQL Server Agent. When CDC is enabled on a table, it creates two CDC-related jobs specific to the database and executes using SQL Server Agent. Without SQL Server Agent enabled, these jobs will not execute.

It is also essential to understand the role of the required parameter @role_name. Suppose there is any restriction on how data should be extracted from the database. In that case, this option is used to specify any role following restrictions and gating access to data to this option if there is one. If you do not specify any role and pass a NULL value instead, data access to this changed table will not be tracked and will be available to access everybody.

The following script will enable CDC on HumanResources—shift table.

USE AdventureWorks   
GO   
EXEC sys.sp_cdc_enable_table   
@source_schema = N’HumanResources’,   
@source_name = N’Shift’,   
@role_name = NULL   
GO  

SQL Server

As we are using the AdventureWorks database, it creates the jobs with the following names.

  1. cdc.AdventureWorks_capture
    The procedure sys.sp_cdc_scan is called internally by sys.sp_MScdc_capture_job. When this job is executed, it runs the system-stored procedure sys.sp_MScdc_capture_job. This procedure cannot be executed explicitly when a change data capture log scan operation is active, or the database is enabled for transactional replication. This system SP enables SQL Server Agent, which allows the Change Data Capture feature.
  2. cdc.AdventureWorks_cleanup
    When this job is executed it runs the system-stored procedure sys.sp_MScdc_cleanup_job. This system SP cleans up database changes tables.

The Stored Procedure sys.sp_cdc_enable_table enables CDC. CDC is a very powerful and versatile tool. Several options are available with this SP, but we will only mention the required options for this SP. By understanding the Stored Procedure sys.sp_cdc_enable_table, you will gain the true potential of the CDC feature. One more thing to notice is that when these jobs are created, they are also automatically enabled.

SQL Server

By default, all the columns of the specified table are considered in this operation. If you want only a few columns of this table to be tracked in that case, you can specify the columns as one of the parameters of the SP mentioned above.

When everything is completed, recheck the system tables, and you will find a new table called cdc.HumanResources_Shift_CT. This table will contain all the changes in the table HumanResources.Shift. If you expand this table, you will also find five additional columns.

As you will see, there are five additional columnsto the mirrored original table

  • __$start_lsn
  • __$end_lsn
  • __$seqval
  • __$operation
  • __$update_mask

Two values are essential to us: __$operation and __$update_mask.

Column _$operation contains a value that corresponds to DML Operations. Following is a quick list of values and their corresponding meaning.

  • Delete Statement= 1
  • Insert Statement= 2
  • Value before Update Statement= 3
  • Value after Update Statement= 4

This mask contains a value that is formed with Bit values. The column _$update_mask shows, via a bitmap, which columns were updated in the DML operation specified by _$operation. If this was a DELETE or INSERT operation, all columns are updated, so the mask contains a value with all 1’s in it.

Example of Change Data Capture

We will test this feature by doing DML operations such as INSERT, UPDATE, and DELETE on the table HumanResources. A shift that we have set up for CDC. We will observe the effects on the CDC table cdc.HumanResources_Shift_CT.

Before we start, let’s first SELECT from both tables and see what is in them.

USE AdventureWorks   
GO   
SELECT *   
FROM HumanResources.Shift   
GO   
USE AdventureWorks   
GO   
SELECT *   
FROM cdc.HumanResources_Shift_CT   
GO   

The result of the query is displayed here.

SQL Server

The original table HumanResources.Shift has three rows in it, whereas the table cdc.HumanResources_Shift_CT is empty. This table will have entries after an operation on the tracked table.

Insert Operation

Let's run an INSERT operation on the table HumanResources.Shift.

USE AdventureWorks   
GO   
INSERT INTO [HumanResources].[Shift]   
([Name],[StartTime],[EndTime],[ModifiedDate])   
VALUES (‘Tracked Shift’,GETDATE(), GETDATE(), GETDATE())   
GO   

Once the script is run, we will check the content of two of our tables HumanResources.Shift and cdc.HumanResources_Shift_CT.

SQL Server

Because of the INSERT operation, we have a newly inserted fourth row in the tracked table HumanResources.Shift. The tracking table also has the same row visible. The value of _operation is 2, meaning this is an INSERT operation.

Update Operation

To illustrate the effects of an UPDATE, we will update a newly inserted row. 

USE AdventureWorks  
GO  
UPDATE [HumanResources].[Shift]  
SET Name = ‘New Name’,  
ModifiedDate = GETDATE()  
WHERE ShiftID = 4  
GO   

Once more, we check our tables HumanResources.Shift and cdc.HumanResources_Shift_CT.

SQL Server

UPDATE operations always result in two different entries in the tracking table. One entry contains the previous values before the UPDATE is executed. The second entry is for new data after the UPDATE is executed.

The Change Data Capture mechanism always captures all the table columns unless, when CDC is set up on a table, it is restricted to tracking only a few columns. In our case, we have only changed two columns of the table, but we are tracking the complete table, so all the entries are logged before and after the update happens. We will see how this can be done later in this article.

Delete Operation

To verify this option, we will run a DELETE operation on a newly inserted row.

USE AdventureWorks   
GO   
DELETE   
FROM [HumanResources].[Shift]   
WHERE ShiftID = 4   
GO   

Once this script is run, we can see the contents of our tables HumanResources.Shift and cdc.HumanResources_Shift_CT.

SQL Server

Due to the DELETE operation, we now have only three rows in the tracked table HumanResources.Shift: The deleted row is visible in the tracking table as a new entry. The value of _operation is 4, meaning this is a delete operation.

Change Data Capture and Operations

We have now verified that using CDC can capture all the data affected by DML operations. In the tracked table, we have four values of the operation. We can see this operation’s value in the following image.

SQL Server

Understanding Update mask

It is important to understand the Update mask column in the tracking table. It is named _$update_mask. The value displayed in the field is hexadecimal but is stored as binary.

In our example, we have three different operations. INSERT and DELETE operations are done on the complete row, not individual columns. These operations are listed marked masked with 0x1F translated in binary as 0b11111, which means all five table columns.

In our example, we had an UPDATE on only two columns – the second and fifth. This is represented with 0x12 in hexadecimal value ( 0b10010 in binary). Here, this value stands for the second and fifth values if you look at it from the right as a bitmap. This is a valuable way of determining which columns are being updated or changed.

The tracking table shows two columns containing the suffix lsn, i.e., _$start_lsn and _$end_lsn. These two values correspond to the Log Sequential Number. This number is associated with the committed transaction of the DML operation on the tracked table.

Disabling Change Data Capture on a table

Disabling this feature is very simple. As we have seen earlier, if we have to enable CDC, we have to do this in two steps – at the table level and the database level; In the same way, when we have to disable this feature, we can do this at the same two levels. Let us see both of them one after one.

For dropping any table tracking, we need three values: the Source Schema, the Source Table name, and the Capture Instance. It is very easy to get the schema and table names. In our case, the schema is HumanResource, and the table name is Shift. However, we do not know the name of the Capture Instance. We can retrieve it very easily by running the following T-SQL Query.

USE AdventureWorks;   
GO   
EXEC sys.sp_cdc_help_change_data_capture   
GO   

This will return a result that contains all the three required information for disabling CDC on a table.

SQL Server

The Stored Procedure sys.sp_cdc_help_change_data_capture provides lots of other helpful information. Once we have the name of the capture instance, we can disable table tracking by running this T-SQL query. 

USE AdventureWorks;
   GO   EXECUTE sys.sp_cdc_disable_table   @source_schema   =  N’HumanResources’, 
   @source_name   =  N’Shift’, 
   @capture_instance   =  N’HumanResources_Shift’;
   GO   

Once Change Data Capture is disabled on any table, it drops the change data capture table as well as all the functions which were associated with them. It also deletes all the rows and data related to this feature from all the system tables and changes relevant data in catalog views.

SQL Server

In our example, we can see that capture table cdc.HumanResources_Shift_CT is dropped.

Disable Change Data Capture Feature on Database

This is the easiest task out of all processes. Running the following T-SQL query will disable CDC on the whole database.

USE AdventureWorks    
GO    

EXEC sys.sp_cdc_disable_db    
GO   

This Stored Procedure will delete all the data, functions, and tables related to CDC. If this data is needed, you must take a backup before dropping CDC from any database.

Capture Selected Column

When CDC is enabled on any table, it usually captures the data of all the columns. During INSERT or DELETE operations, it is necessary to capture all the data, but in UPDATE operations, only the data of the updated columns are required. CDC is not yet advanced enough to provide this kind of dynamic column selection, but CDC can let you select the columns from which data changes should be captured from the beginning.

This stored procedure should be run in the context of each database to enable it at the database level. The following script will enable CDC in the AdventureWorks database.

USE AdventureWorks   
GO   
EXEC sys.sp_cdc_enable_db   
GO   

Now we will enable this feature at the table level but only for selected columns of ShiftID and Name. This script will enable table-level change data capture for only two columns.

USE AdventureWorks    GO    EXEC sys.sp_cdc_enable_table    @source_schema   =  N’HumanResources’, 
    @source_name   =  N’Shift’, 
    @role_name   =  NULL, 
    @captured_column_list   =  ‘[ShiftID], 
[Name]’    GO   

So what’s in the system table which will be created for data capturing purposes in AdventureWorks Database?

SQL Server

So you can see that only two rows are now tracked.

We will change the data of one of the columns that weren’t specified to see the value in the cdc—HumanResources_Shift_CT table.

Before we start, let us select from both tables and observe their content.

USE AdventureWorks   
GO   
SELECT *   
FROM HumanResources.Shift   
GO   
USE AdventureWorks   
GO   
SELECT *   
FROM cdc.HumanResources_Shift_CT   
GO   

Here is the result.

SQL Server

The original table HumanResources.Shift now has three rows in it, whereas table cdc.HumanResources_Shift_CT is empty. Let's update ModifiedDate for ShiftID =1 and see if that record creates an entry in the tracking table.

USE AdventureWorks   
GO   
UPDATE [HumanResources].[Shift]   
SET ModifiedDate = GETDATE()   
WHERE ShiftID = 3   
GO   

Now, to check the contents of the tracking table cdc.HumanResources_Shift_CT and see whether that change is captured.

SQL Server

The tracking table is empty because it only tracks the changes it contains and ignores any changes in other columns.

Retrieve Captured Data of Specific Time Frame

Often, one is asked for data to be tracked over a time interval. If you look at the tracking data, no time is captured. It always provides all the information. However, a few fields can help us, i.e., _$start_lsn. LSN stands for the Last Sequence Number. An LSN uniquely identifies every record in the transaction log. They are always incrementing numbers.

LSN numbers are always associated with time, and their mapping can be found after querying system table cdc.lsn_time_mapping. This table is one of the tables which was created when the AdventureWorks database was enabled for CDC. You can run this query to get all the data in the table cdc.lsn_time_mapping.

USE AdventureWorks   
GO   
SELECT *   
FROM cdc.lsn_time_mapping   
GO   

When this query is run, it will give us all the rows of the table. Finding the necessary information from all the data is a little difficult. The usual case is when we must inspect a change that occurred in a particular period.

SQL Server

We can find the time that corresponds to the LSN by using the system function sys.fn_cdc_map_time_to_lsn. If we want all the changes done yesterday, we can run this function as described below, and it will return all the rows from yesterday.

Before we run this query, let us explore two table-valued functions (TVF) in the AdventureWorks database. You can see that there are two new TVFs created with schema cfc. These functions were created when table-level CDC was enabled.

SQL Server

The function cdc.fn_cdc_get_all_changes_HumanResources_Shift can be used to get events that occurred over a particular period. You can run this T-SQL script to get events during any specific period. In our case, we will be retrieving this data for the past 24 hours.

The following query should retrieve the data modified in the past 24 hours.

USE AdventureWorks   
GO   
DECLARE @begin_time DATETIME, @end_time DATETIME, @begin_lsn BINARY(10), @end_lsn BINARY(10);   
SELECT @begin_time = GETDATE()-1, @end_time = GETDATE();   
SELECT @begin_lsn = sys.fn_cdc_map_time_to_lsn(‘smallest greater than’, @begin_time);   
SELECT @end_lsn = sys.fn_cdc_map_time_to_lsn(‘largest less than or equal’, @end_time);   
SELECT *   
FROM cdc.fn_cdc_get_all_changes_HumanResources_Shift(@begin_lsn,@end_lsn,’all’)   
GO   

We have used relational operations in the function sys.fn_cdc_map_time_to_lsn. There can be a total of four different relational operations available to use in that function:

  • largest less than
  • largest less than or equal
  • smallest greater than
  • smallest greater than or equal

This way, the captured data can be queried easily and query based on time intervals.

Automatic Cleanup Process

If we track every change of all the data in our database, there is a chance that we will outgrow the main server's hard drive. This will also lead to maintenance and input/output buffer issues.

In CDC, an automatic cleanup process runs at regular intervals. By default, the gap is of 3 days, but it can be configured. When we enabled CDC on the database, we observed that one additional system-stored procedure was created with the name sys.sp_cdc_cleanup_change_table, which cleans up all the tracked data at intervals.

Summary

For years, programmers have tried to create systems that record all the changes made to the data in a database application. At last, with SQL Server, we have a robust way, CDC, that comes ‘out of the box’ to deliver this functionality in a standard way. This should be useful for auditing databases and tracking obscure problems requiring you to know exactly when and where a change to a base table was made.

This article has been written keeping in mind SQL Server SP1 Cumulative Update 3. I would encourage any of your suggestions or ideas on this subject as comments to the article.


Similar Articles