Concept Of Change Data Capture (CDC) In SQL Server - Part Three

Let's continue from my last two posts. Here is some more information regarding CDC

"Now, how does SQLCDC handle new columns being added to a source table being monitored/captured by SQLCDC?"

I’ll first address the inverse of this question. How does SQLCDC handle columns that are removed from a source table? The answer is that SQL CDC does the best it can do, by not altering the source table’s corresponding Change Table schema so as not to break any downstream consuming applications/services. And thus, any columns dropped from the source table remain in the change table with NULLs being inserted into the removed column going forward.

Now, moving on, in regard to this actual question, SQLCDC (for the same reasons cited above) does not alter a source table’s corresponding change table when a new column gets added to a source table being captured by SQLCDC. To enable SQLCDC to capture the new column you simply need to create a second capture instance which will then capture the new column and it’s data.

Note

A source table being captured by SQL CDC can only have up to 2 Capture Instances. If a source table incurs regular columns being added/removed you will need to recycle your capture instances. 

  1. USE [AdventureWorks]   
  2. GO   
Step 1

Add a New Column to the HumanResources.Employee Table.
  1. ALTER TABLE humanresources.employee   
  2. ADD [NewCol] INT;   
  3. GO   
A source table can only have 2 capture instances associated with it!

Step 2

Create a new, second capture instance to facilitate the capture of change data for the new column.
  1. EXEC Sp_cdc_enable_table   
  2. humanresources ,   
  3. employee ,   
  4. ‘HumanResources_Employee_New’ ,   
  5. 1 ,   
  6. dbo   
  7. GO   
Step 3

Sample DML Statement.
  1. UPDATE humanresources.employee   
  2. SET newcol = 1   
  3. WHERE employeeid = 1   
  4. GO   
Step 4

Create a new, second UDF that uses the new capture instance to collect the new column’s data.
  1. CREATE FUNCTION [dbo].[Udf_employee_newcol] ( @start_time DATETIME,   
  2. @end_time                                                 DATETIME )   
  3. returns @EMPLOYEE TABLE   
  4.                         (   
  5.                                                 employeeid       INT,   
  6.                                                 nationalidnumber NVARCHAR(15),   
  7.                                                 contactid        INT,   
  8.                                                 managerid        INT,   
  9.                                                 title            NVARCHAR(50),   
  10.                                                 birthdate        DATETIME,   
  11.                                                 maritalstatus NCHAR(1),   
  12.                                                 gender NCHAR(1),   
  13.                                                 hiredate DATETIME,   
  14.                                                 salariedflag [FLAG],   
  15.                                                 vacationhours  SMALLINT,   
  16.                                                 sickleavehours SMALLINT,   
  17.                                                 currentflag [FLAG],   
  18.                                                 rowguid UNIQUEIDENTIFIER,   
  19.                                                 [user]        NVARCHAR(50),   
  20.                                                 [NewCol]      INT,   
  21.                                                 cdc_operation VARCHAR(1)   
  22.                         )   
  23.                         AS BEGIN –declare local variables TO hold lsns   
  24. DECLARE @from_lsn binary(10),   
  25.   @to_lsn         binary(10) –map the time interval TO a change data capture query range.IF (@start_time IS NULL)   
  26. BEGIN   
  27.   SELECT @from_lsn = sys.fn_cdc_get_min_lsn( ‘humanresources_employee_new’ )   
  28. END   
  29. ELSE   
  30. BEGIN   
  31.   SELECT @from_lsn = sys.fn_cdc_map_time_to_lsn( ‘smallest greater than   
  32.   OR     equal’, @start_time )   
  33. END   
  34. IF (@end_time IS NULL)   
  35. BEGIN   
  36.   SELECT @to_lsn = sys.fn_cdc_get_max_lsn()   
  37. END   
  38. ELSE   
  39. BEGIN   
  40.   SELECT @to_lsn = sys.fn_cdc_map_time_to_lsn( ‘largest less than   
  41.   OR     equal’, @end_time )   
  42. END   
  43. –if same THEN   
  44. EXIT   
  45. IF ( @from_lsn = sys.fn_cdc_increment_lsn(@to_lsn) )   
  46. BEGIN   
  47.   RETURN   
  48. END   
  49. — query FOR change data   
  50. INSERT INTO @Employee   
  51. SELECT employeeid,   
  52.        nationalidnumber,   
  53.        contactid,   
  54.        managerid,   
  55.        title,   
  56.        birthdate,   
  57.        maritalstatus,   
  58.        gender,   
  59.        hiredate,   
  60.        salariedflag,   
  61.        vacationhours,   
  62.        sickleavehours,   
  63.        currentflag,   
  64.        rowguid,   
  65.        [user],   
  66.        [NewCol],   
  67.        CASE __$operation   
  68.               WHEN 1 THEN ‘d’   
  69.               WHEN 2 THEN ‘i’   
  70.               WHEN 4 THEN ‘u’   
  71.               ELSE NULL   
  72.        END AS cdc_operation   
  73. FROM   cdc.fn_cdc_get_net_changes_humanresources_employee_new(@from_lsn, @to_lsn, ’all’)   
  74. RETURN   
  75. ENDgo  
Step 5

Query the new UDF for the sample DML.
  1. SELECT *   
  2. FROM [dbo].[Udf_employee_newcol](NULL,NULL)   
  3. GO  

UPDATE 2

I have received several questions regarding how/if you can capture the user that commits DML statements to a change table. As I mentioned in my feedback for this post, there is no out-of-the-box solution to capture the user who commits the changes (DML) being captured, however you can build a custom solution, such as the script below.

  1. --WARNING! ERRORS ENCOUNTERED DURING SQL PARSING!   
  2. USE [AdventureWorks] GO –append new user COLUMN TO the HumanResources.Employee TABLE  
  3.   
  4. ALTER TABLE humanresources.employee ADD [User] NVARCHAR(50) GO –disable ANY triggers DISABLE TRIGGER ALL ON HumanResources.employee GO –enable CDC  
  5. FOR AW  
  6.   
  7. EXEC Sp_cdc_enable_db GO –enable CDC  
  8. FOR HumanResources.Employee  
  9.   
  10. EXEC Sp_cdc_enable_table humanresources  
  11.     ,employee  
  12.     ,NULL  
  13.     ,1  
  14.     ,dbo GO –create sample UDF TO  
  15.   
  16. RETURN change data  
  17.   
  18. CREATE FUNCTION [dbo].[Udf_employee] (  
  19.     @start_time DATETIME  
  20.     ,@end_time DATETIME  
  21.     )  
  22. RETURNS @Employee TABLE (  
  23.     employeeid INT  
  24.     ,nationalidnumber NVARCHAR(15)  
  25.     ,contactid INT  
  26.     ,managerid INT  
  27.     ,title NVARCHAR(50)  
  28.     ,birthdate DATETIME  
  29.     ,maritalstatus NCHAR(1)  
  30.     ,gender NCHAR(1)  
  31.     ,hiredate DATETIME  
  32.     ,salariedflag [FLAG]  
  33.     ,vacationhours SMALLINT  
  34.     ,sickleavehours SMALLINT  
  35.     ,currentflag [FLAG]  
  36.     ,rowguid UNIQUEIDENTIFIER  
  37.     ,[user] NVARCHAR(50)  
  38.     ,cdc_operation VARCHAR(1)  
  39.     )  
  40. AS  
  41. BEGIN  
  42.     –declare LOCAL variables TO hold LSNs  
  43.   
  44.     DECLARE @from_lsn BINARY (10)  
  45.         ,@to_lsn BINARY (10) –Map the TIME interval TO a change data capture query range.  
  46.   
  47.     IF (@start_time IS NULL)  
  48.     BEGIN  
  49.         SELECT @from_lsn = sys.Fn_cdc_get_min_lsn(‘HumanResources_Employee’)  
  50.     END  
  51.     ELSE  
  52.     BEGIN  
  53.         SELECT @from_lsn = sys.Fn_cdc_map_time_to_lsn(‘smallest greater than  
  54.                 OR equal’, @start_time)  
  55.     END  
  56.   
  57.     IF (@end_time IS NULL)  
  58.     BEGIN  
  59.         SELECT @to_lsn = sys.Fn_cdc_get_max_lsn()  
  60.     END  
  61.     ELSE  
  62.     BEGIN  
  63.         SELECT @to_lsn = sys.Fn_cdc_map_time_to_lsn(‘largest less than  
  64.                 OR equal’, @end_time)  
  65.     END –if same then EXIT  
  66.   
  67.     IF (@from_lsn = sys.Fn_cdc_increment_lsn(@to_lsn))  
  68.     BEGIN  
  69.         RETURN  
  70.     END — Query  
  71.         FOR change data  
  72.   
  73.     INSERT INTO @Employee  
  74.     SELECT employeeid  
  75.         ,nationalidnumber  
  76.         ,contactid  
  77.         ,managerid  
  78.         ,title  
  79.         ,birthdate  
  80.         ,maritalstatus  
  81.         ,gender  
  82.         ,hiredate  
  83.         ,salariedflag  
  84.         ,vacationhours  
  85.         ,sickleavehours  
  86.         ,currentflag  
  87.         ,rowguid  
  88.         ,[user]  
  89.         ,CASE __$operation  
  90.             WHEN 1  
  91.                 THEN ‘D’  
  92.             WHEN 2  
  93.                 THEN ‘I’  
  94.             WHEN 4  
  95.                 THEN ‘U’  
  96.             ELSE NULL  
  97.             END AS cdc_operation  
  98.     FROM cdc.Fn_cdc_get_net_changes_humanresources_employee(@from_lsn, @to_lsn, ’all’)  
  99.   
  100.     RETURN  
  101. END GO –sample DML statement  
  102.   
  103. UPDATE humanresources.employee  
  104. SET title = ‘Marketing Director’  
  105.     ,[User] = system_user  
  106. WHERE employeeid = 13 –get the net change OF the sample DML including the User  
  107.   
  108. SELECT *  
  109. FROM [dbo].[Udf_employee](NULLNULL)  

Introduction

One of the SQL Server 2008’s biggest BI features being touted is Change Data Capture (CDC). CDC is basically supposed to be a built-in solution to the old-age practice in ETL solutions of identifying and using change identifiers columns in source systems. I have now spent a fair amount of time using this feature and more importantly how to leverage it inside of SSIS packages for incremental ETL solutions. My work here has been to prepare for an upcoming demonstration of CDC in SSIS. This post/Q&A is a brief summary of my findings thus far…

*CDC is being positioned as the ‘design of choice’ for SQL Server 2008+ OLTP database servers for exposing changed relational data for data warehousing consumption purposes.

What is Change Data Capture (CDC)?

CDC records (or captures) DML activity on designated tables. CDC works by scanning the transaction log for a designated table’s ‘captured columns’ whose content has changed and then making those changes available for data syncronizing purposes in a relational format. As you can see, this feature in entrenched in transaction log architecture and thus, a lot of metadata in CDC is related around the concept of a Log Sequence Number (LSN).

So whats a LSN?

Here is the definition of a LSN per Books Online: "Every record in the Microsoft SQL Server transaction log is uniquely identified by a log sequence number (LSN). LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN. "

How do I get CDC?

CDC is a feature of SQL Server 2008 Enterprise, Developer, and Evaluation editions.

What are the target applications or consumers of the CDC technology?

ETL Solutions are the most common, however any data consuming application that requires syncronizing data could benefit from the technology.

Is CDC configurable via a UI or just TSQL?

As of this time, just TSQL.

How do you configure CDC?

  1. Enable CDC for a database
  2. Enables the current database via the USE statement *select is_cdc_enabled from sys.databases where [name] = ‘AdventureWorks’ to determine if DB is allready enabled *Also note that when you do this all of the below system objects get created in the selected database
  3. Enable CDC for a given table and it’s selected columns
  4. Specify the captured table’s schema, name, database role, capture instance name (defaults to schema_name), support net changes (bit, set it to 1 if you want both change data table-valued functions created), name of captured table’s unique index, captured column list (null/default to all columns), filegroup for the change table (null/defaults to default filegroup)
  5. Query Change Data via 1 of 2 built in table-valued functions created during step #2
  6. For all changes (meaning a row is returned for each DML) use cdc.fn_cdc_get_all_changes_<capture_instance>
  7. For the net changes (meaning one row returned for each source row modified among 1 or more DMLs) use cdc.fn_cdc_get_net_changes_<capture_instance>

What are all of the CDC system objects available to me?

System Tables

  • cdc.captured_columns
  • cdc.change_tables
  • cdc.ddl_history
  • cdc.index_columns
  • cdc.lsn_time_mapping
  • cdc.Schema_Name_CT (change tables) *this is just the default naming convention, configurable via the enable table sysproc

DMVs

  • sys.dm_cdc_log_scan_sessions
  • sys.dm_repl_traninfo
  • sys.dm_cdc_errors

System Stored Procedures

  • sys.sp_cdc_enabledb
  • sys.sp_cdc_disabledb
  • sys.sp_cdc_cleanup_change_table
  • sys.sp_cdc_disable_db_change_data_capture
  • sys.sp_cdc_disable_table_change_data_capture
  • sys.sp_cdc_enable_db_change_data_capture
  • sys.sp_cdc_enable_table_change_data_capture
  • sys.sp_cdc_get_ddl_history
  • sys.sp_cdc_get_captured_columns
  • sys.sp_cdc_help_change_data_capture

System Functions

  • cdc.fn_cdc_get_all_changes_<capture_instance>
  • cdc.fn_cdc_get_net_changes_<capture_instance>
  • sys.fn_cdc_decrement_lsn
  • sys.fn_cdc_get_column_ordinal ( ‘capture_instance’ , ‘column_name’ )
  • sys.fn_cdc_get_max_lsn
  • sys.fn_cdc_get_min_lsn
  • sys.fn_cdc_has_column_changed
  • sys.fn_cdc_increment_lsn
  • sys.fn_cdc_is_bit_set
  • sys.fn_cdc_map_lsn_to_time
  • sys.fn_cdc_map_time_to_lsn

Do the change tables keep growing?

No, there is an automatic cleanup process that occurs every three days (and this is configurable). For more intense environments, you can leverage the manual method using the system stored procedure: sys.sp_cdc_cleanup_change_table. When you execute this system procedure, you specify the low LSN and any change records occuring before this point are removed and the start_lsn is set to the low LSN you specified.

How do you leverage CDC in SSIS Packages?

Books Online in CTP5 (November) actually has a sample package in the topic ‘change data capture in integration services’ and I found this to be a good starting point to build from. For my CDC/SSIS demo here is the Control/Data Flow I am using,

  1. Calculate Date Intervals (these will correspond to LSNs later) *also note that in both BOL and my own package we are using fixed intervals, in the real world this will be driven by a table solution which tells the SSIS package when the last successful execution occurs (starting point of next package iteration)

  2. Check is any data is available in the selected date/time interval. This is important because the rest of the package will fail if no data is ready. BOL recommends performing Thead.Sleep/WAITFORs here. I am not for demo purposes but its not a bad idea.

  3. Build the query via a SSIS variable *BOL states that SSIS packages cannot call the cdc.fn_cdc_getnet|all functions and must use a wrapper. Whether or not we end up being forced to do this, it is a good design practice, below is my custom function that SSIS calls passing in the start/end datetime values to get the actual change data in the data flow step below.

  4. Create a data flow task that executes the SSIS variable query (OLEDB source), and then splits the rows into via a conditional split based on the CDC_OPERATION column calculated in the function below.
    1. --WARNING! ERRORS ENCOUNTERED DURING SQL PARSING!   
    2. CREATE FUNCTION [dbo].[Udf_promotion] (  
    3.     @start_time DATETIME  
    4.     ,@end_time DATETIME  
    5.     )  
    6. RETURNS @Promotion TABLE (  
    7.     specialofferid INT  
    8.     ,DESCRIPTION NVARCHAR(255)  
    9.     ,discountpct SMALLMONEY  
    10.     ,[Type] NVARCHAR(50)  
    11.     ,category NVARCHAR(50)  
    12.     ,startdate DATETIME  
    13.     ,enddate DATETIME  
    14.     ,minqty INT  
    15.     ,maxqty INT  
    16.     ,rowguid UNIQUEIDENTIFIER  
    17.     ,modifieddate DATETIME  
    18.     ,cdc_operation VARCHAR(1)  
    19.     )  
    20. AS  
    21. BEGIN  
    22.     –declare LOCAL variables TO hold LSNs  
    23.   
    24.     DECLARE @from_lsn BINARY (10)  
    25.         ,@to_lsn BINARY (10) –Map the TIME interval TO a change data capture query range.  
    26.   
    27.     IF (@start_time IS NULL)  
    28.     BEGIN  
    29.         SELECT @from_lsn = sys.Fn_cdc_get_min_lsn(‘Sales_SpecialOffer’)  
    30.     END  
    31.     ELSE  
    32.     BEGIN  
    33.         SELECT @from_lsn = sys.Fn_cdc_map_time_to_lsn(‘smallest greater than  
    34.                 OR equal’, @start_time)  
    35.     END  
    36.   
    37.     IF (@end_time IS NULL)  
    38.     BEGIN  
    39.         SELECT @to_lsn = sys.Fn_cdc_get_max_lsn()  
    40.     END  
    41.     ELSE  
    42.     BEGIN  
    43.         SELECT @to_lsn = sys.Fn_cdc_map_time_to_lsn(‘largest less than  
    44.                 OR equal’, @end_time)  
    45.     END –if same then EXIT  
    46.   
    47.     IF (@from_lsn = sys.Fn_cdc_increment_lsn(@to_lsn))  
    48.     BEGIN  
    49.         RETURN  
    50.     END — Query  
    51.         FOR change data  
    52.   
    53.     INSERT INTO @Promotion  
    54.     SELECT specialofferid  
    55.         ,DESCRIPTION  
    56.         ,discountpct  
    57.         ,[Type]  
    58.         ,category  
    59.         ,startdate  
    60.         ,enddate  
    61.         ,minqty  
    62.         ,maxqty  
    63.         ,rowguid  
    64.         ,modifieddate  
    65.         ,CASE __$operation  
    66.             WHEN 1  
    67.                 THEN ‘D’  
    68.             WHEN 2  
    69.                 THEN ‘I’  
    70.             WHEN 4  
    71.                 THEN ‘U’  
    72.             ELSE NULL  
    73.             END AS cdc_operation  
    74.     FROM cdc.Fn_cdc_get_net_changes_sales_specialoffer(@from_lsn, @to_lsn, ’all’)  
    75.   
    76.     RETURN  
    77. END GO  

Overall Impression?

I’m quite impressed with the new CDC feature built into the SQL Server relational engine. It is fairly easy to configure and use. In terms of enhancements, I think the matter of replacing the typical ‘ETL Run’ table should be considered. For example, when we configure this new technology going forward to replace our old source ‘change identifier’ columns we will still need a solution (usually table-based) for tracking when our ETL Packages have last ran (both successful and failed). It would be nice if we could somehow configure the CDC technology to say ‘this SSIS Package is the only consuming application, therefore purge the data in a transactional manner once I retrieve it (which we could do manual right now I believe)’. Basically, give the option of replacing the LSN filtering of records to just retrieve and purge to so I can rely on the fact that the change table only contains data I have yet to process. Also, CDC can only work with SQL Server. While this is an obvious constraint, it would be nice to overcome it since SQL Server BI Systems will typically query other types of RDBMS for their source system data and not just SQL Server.

Overall, the feature is great. It replaces the need for source column change identifers, and it relocates the ETL querying load to a seperate change table in the source system. If your organization has a complete SQL Server based environment, CDC is a great candidate for your future ETL solutions.

CDC is only available in SQL Server 2008 Enterprise, Developer ,and Evaluation editions. Furthermore, CDC is enabled at two levels, first a database level and then on specific tables you identify via the sys.sp_cdc_enable_table command.

  1. sys.sp_cdc_enable_table   
  2. [ @source_schema = ] ‘source_schema’,   
  3. [ @source_name = ] ‘source_name’ ,   
  4. [ @role_name = ] ‘role_name’   
  5. [,[ @capture_instance = ] ‘capture_instance’ ]   
  6. [,[ @supports_net_changes = ] supports_net_changes ]   
  7. [,[ @index_name = ] ‘index_name’ ]   
  8. [,[ @captured_column_list = ] ‘captured_column_list’ ]   
  9. [,[ @filegroup_name = ] ‘filegroup_name’ ]   
  10. [,[ @partition_switch = ] ‘partition_switch’ ]   
If you have a Master|Child detail tables you would have to enable both tables for CDC if you need to detect and consume the change data of both.