What is Change Data Capture (CDC)?

Change data capture (CDC) is pivotal in data management and integration, enabling organizations to track and propagate data modifications efficiently in real-time or near-real-time. Traditionally, keeping downstream systems synchronized with database changes meant periodic bulk loads, which were slow and resource-intensive.

CDC addresses this by continuously monitoring databases for changes and capturing them as they happen. Rather than periodically polling the entire database, CDC systems utilize transaction logs or similar mechanisms to capture only altered data, minimizing processing overhead and ensuring immediate synchronization.
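To make this concrete, a captured change is usually represented as a structured event describing the operation and the row's before/after state. The sketch below shows one plausible shape, loosely modeled on the envelope used by log-based CDC tools such as Debezium; the field names are illustrative, not a fixed standard.

```python
# Illustrative shape of a single CDC event for an UPDATE on a "customers" row.
# Field names are made up for illustration (loosely Debezium-style).
change_event = {
    "op": "u",                                  # "c"=insert, "u"=update, "d"=delete
    "ts_ms": 1700000000000,                     # when the change was committed
    "source": {"db": "shop", "table": "customers"},
    "before": {"id": 101, "email": "old@example.com"},  # row state before the change
    "after":  {"id": 101, "email": "new@example.com"},  # row state after the change
}

# A downstream consumer needs only this event -- not a full table scan --
# to bring its own copy of the row up to date.
def apply(event, replica):
    row_id = (event["after"] or event["before"])["id"]
    if event["op"] == "d":
        replica.pop(row_id, None)               # drop the replicated row
    else:
        replica[row_id] = event["after"]        # insert or update it

replica = {101: {"id": 101, "email": "old@example.com"}}
apply(change_event, replica)
print(replica[101]["email"])  # new@example.com
```

Because each event carries the full change, consumers never need to re-read unchanged rows from the source.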


Benefits of CDC

There are many use cases for CDC in your overall data integration strategy. You may be moving data into a data warehouse or data lake, creating an operational data store or a real-time replica of the source data, or even implementing a modern data fabric architecture. Ultimately, CDC helps your organization obtain greater value from its data by allowing you to integrate and analyze data faster, while using fewer system resources in the process. Here are some key benefits:

  • Real-Time Data Integration: CDC enables organizations to keep multiple databases, applications, or data warehouses in sync with minimal latency. This real-time data integration ensures that all systems have access to the most up-to-date information.
  • Reduced ETL Overhead: Unlike traditional Extract, Transform, Load (ETL) processes, which involve extracting entire datasets and transforming them before loading, CDC focuses only on the changed data, reducing processing time and resource consumption.
  • Improved Data Accuracy: By capturing changes as they happen, CDC minimizes the risk of data discrepancies or conflicts between different systems, ensuring data accuracy across the organization.
  • Support for Data Warehousing and Analytics: CDC plays a crucial role in feeding data warehouses and analytical systems with fresh, real-time data, enabling timely decision-making and actionable insights.
  • Operational Efficiency: With CDC, organizations can streamline data replication, synchronization, and distribution processes, leading to greater operational efficiency and agility.
  • Compliance and Auditability: CDC solutions often provide comprehensive auditing capabilities, allowing organizations to track and trace data changes for compliance purposes and regulatory requirements.

Why Change Data Capture?

Initially, CDC became popular as an alternative to batch data replication for populating data warehouses via Extract, Transform, Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.

The challenge is that data is constantly changing, which could cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. This has led to CDC becoming a popular solution to bridge the on-premises and cloud environments and let enterprises either migrate to the cloud at their own pace or continue to operate in a hybrid environment.

Why Change Data Capture Matters in the Modern Enterprise

Change Data Capture (CDC) is crucial for modern enterprises due to the central role of data in their operations. As businesses undergo digital transformation, data becomes fundamental for decision-making and operational efficiency. With the shift towards cloud-based data architectures and real-time data management, enterprises face challenges in keeping pace with increasing data volumes, variety, and velocity. The emergence of cloud data warehouses, data lakes, and streaming technologies addresses these challenges to some extent.

However, the value of data is time-sensitive, and real-time insights are essential for competitive advantage. Delayed action on insights leads to missed opportunities, often referred to as "perishable insights." Various sources such as log files, machine logs, IoT devices, weblogs, and social media generate perishable data, which can provide significant value but quickly lose relevance.

To avoid missing out on business opportunities due to perishable insights, enterprises need to capture data changes and updates from transactional sources in real-time. Change Data Capture (CDC) offers a solution by enabling the continuous monitoring and capture of data changes, ensuring that businesses can leverage real-time insights for informed decision-making and competitive advantage.

How Does Change Data Capture Work?

Change Data Capture (CDC) operates by monitoring and capturing data changes in the source database, typically a relational database like MySQL, Microsoft SQL Server, Oracle, or PostgreSQL, and then transmitting these changes to downstream systems such as caches, search indexes, data warehouses, or data lakes.

There are two main approaches to CDC: push and pull.

Push

In the push approach, the source database takes the lead in capturing changes and transmitting them to downstream systems in near real-time. This ensures that target systems receive the latest data promptly. However, there is a risk of data loss if target systems are unreachable or offline. To mitigate this, a messaging system is often used to buffer changes until they can be delivered to the target systems.

Pull

In contrast, the pull approach involves the source database simply logging data changes, with the responsibility for retrieving and processing these changes falling on the target systems. Target systems periodically poll the source database for updates, which are then processed. Similar to the push method, a messaging system is necessary to prevent data loss when target systems are unavailable. However, the drawback of the pull approach is that target systems are not immediately notified of data changes, resulting in a delay before they can act on the updates.

Both push and pull approaches have their own sets of advantages and disadvantages, which must be carefully considered based on the specific requirements of the use case.
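As a minimal sketch of the pull approach, the snippet below polls a table for rows whose last-updated timestamp is newer than the consumer's last checkpoint. It uses an in-memory SQLite database purely for illustration; the table and column names are made up.

```python
import sqlite3

# In-memory source database with a last_updated column (illustrative schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, last_updated INTEGER)")
db.execute("INSERT INTO customers VALUES (100, 'a@example.com', 10)")
db.execute("INSERT INTO customers VALUES (101, 'b@example.com', 20)")

def poll_changes(conn, checkpoint):
    """Pull rows modified since the last checkpoint; return (rows, new_checkpoint)."""
    rows = conn.execute(
        "SELECT id, email, last_updated FROM customers "
        "WHERE last_updated > ? ORDER BY last_updated",
        (checkpoint,),
    ).fetchall()
    new_checkpoint = rows[-1][2] if rows else checkpoint
    return rows, new_checkpoint

# The first poll sees both rows; later polls see only what changed since.
rows, checkpoint = poll_changes(db, 0)
print(len(rows))   # 2
db.execute("UPDATE customers SET email = 'c@example.com', last_updated = 30 WHERE id = 100")
rows, checkpoint = poll_changes(db, checkpoint)
print(rows)        # only the row updated after the checkpoint
```

Note that a row hard-deleted from `customers` would never appear in this query, which is exactly the soft-delete limitation of the timestamp-based pattern described below.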

Change Data Capture in ETL

Change Data Capture (CDC) plays a vital role within the ETL (Extract, Transform, Load) process.

  • Extract: Traditionally, data extraction involved bulk queries, which could be inefficient when dealing with continuously updated source data. However, CDC addresses this issue by enabling real-time or near-real-time extraction, providing a reliable stream of change data.
  • Transform: In the past, ETL tools transformed data in a staging area before loading, which could be time-consuming for large datasets. Nowadays, with the shift towards ELT (Extract, Load, Transform) pipelines, data is loaded immediately and transformed within the target system, such as a cloud-based data warehouse or data lake. This transformation can occur either on a micro-batch timescale, processing only the data modified since the last successful load, or on a CDC timescale, continuously processing data changes as they occur at the source.
  • Load: This phase involves loading the transformed data into the target system, where it can be analyzed by business intelligence (BI) or analytics tools.

Change Data Capture Patterns

Change Data Capture (CDC) employs various methods to detect data changes, with the following being the most commonly used patterns.

  1. Timestamp-based: Database designers can incorporate a column into table schemas to indicate the time of the latest change, such as LAST_MODIFIED or LAST_UPDATED. Downstream systems can query this field to retrieve records updated since the last execution time.
    • Pros
      • Simple to implement and utilize.
    • Cons
      • Cannot capture DELETE operations, since a hard-deleted row leaves no timestamp behind; only soft deletes (e.g. a DELETED flag column) can be detected.
      • Adds computational overhead to the source system, since target systems must query and scan rows to find the latest updated values.
    • Example: downstream systems compare LAST_UPDATED values against the previous run's checkpoint to pick up a newly created record, or an update to the record with ID=101.
  2. Trigger-based: Most databases support trigger functions, which are stored procedures automatically executed upon specific events like INSERT, UPDATE, or DELETE operations. Triggers capture data changes and store them in a separate table (commonly known as a shadow or event table) within the same database. Developers can integrate messaging systems to publish these changes to queues for relevant target systems to subscribe to.
    • Pros
      • Capable of detecting and capturing all types of changes (INSERT, UPDATE, and DELETE).
      • Widely supported by databases.
    • Cons
      • Adversely affects source database performance, as updating records requires multiple writes.
      • Requires modifications to the source database schema.
  3. Log-based: Transactional databases log all committed changes (INSERT, UPDATE, DELETE) along with their timestamps into transaction log files. These logs, primarily intended for backup and disaster recovery, can also be used to propagate changes to target systems, enabling real-time change capture without adding computational load to the source database.
    • Pros
      • Does not burden source databases with computational overhead.
      • Detects and captures all types of changes (INSERT, UPDATE, DELETE).
      • No need for schema changes in source databases.
    • Cons
      • Lack of standardization around transaction log formats, with each vendor implementing their own methodology, which may change in future releases.
      • Target systems must identify and discard changes that were written to source databases but subsequently rolled back.
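To illustrate the trigger-based pattern concretely, the sketch below uses SQLite (chosen only so the example is self-contained) to create a shadow table and one trigger per operation type; real deployments would use their own database's trigger syntax and typically publish the shadow rows to a message queue.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

-- Shadow (event) table that accumulates one row per change.
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, email TEXT
);

-- One trigger per operation; each writes the change to the shadow table.
CREATE TRIGGER trg_ins AFTER INSERT ON customers
BEGIN INSERT INTO customers_changes (op, id, email) VALUES ('INSERT', NEW.id, NEW.email); END;

CREATE TRIGGER trg_upd AFTER UPDATE ON customers
BEGIN INSERT INTO customers_changes (op, id, email) VALUES ('UPDATE', NEW.id, NEW.email); END;

CREATE TRIGGER trg_del AFTER DELETE ON customers
BEGIN INSERT INTO customers_changes (op, id, email) VALUES ('DELETE', OLD.id, OLD.email); END;
""")

db.execute("INSERT INTO customers VALUES (101, 'a@example.com')")
db.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 101")
db.execute("DELETE FROM customers WHERE id = 101")

# Unlike the timestamp pattern, the DELETE is captured too.
ops = [row[0] for row in db.execute("SELECT op FROM customers_changes ORDER BY change_id")]
print(ops)  # ['INSERT', 'UPDATE', 'DELETE']
```

The example also makes the main drawback visible: every change to `customers` now performs a second write into `customers_changes`.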

Change Data Capture Use Cases

Change Data Capture (CDC) offers various use cases across different domains and industries.

  1. Continuous Data Replication: Instead of the traditional batch mode copying of entire databases, CDC enables continuous replication of only changed data to downstream destinations, ensuring real-time synchronization without disrupting source databases.
  2. Integration with Microservices Architecture: CDC facilitates data transfer from source databases to multiple destination systems, maintaining synchronization between them during the transition to microservices architecture.
  3. Cloud Adoption: Organizations migrating to the cloud benefit from CDC by leveraging cloud-native services for data management, reducing total cost of ownership (TCO), and improving agility and scalability.
  4. Real-Time Fraud Detection: CDC, combined with machine learning, enables real-time ingestion, transformation, and analysis of transactional data for proactive fraud detection in sectors like finance, facilitating immediate action to mitigate risks.
  5. Real-Time Marketing Campaigns: CDC captures and processes real-time customer activity data, enabling dynamic adjustments to marketing strategies, personalized customer experiences, and immediate action based on browsing behavior.

Additional real-world use cases for CDC include:

  • Microservices Integration: Syncing data changes between legacy systems and microservices-based applications.
  • Compliance and Auditing: Saving data change history to meet compliance and auditing requirements.
  • Analytics Dashboards: Feeding data changes to analytics dashboards for informed decision-making.
  • Cache Invalidation: Automatically invalidating outdated cache entries to maintain data integrity.
  • Full-Text Search: Keeping full-text search indexes aligned with database changes.
  • CQRS Model Updates: Synchronizing Command Query Responsibility Segregation (CQRS) read models with primary models.
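Several of these use cases reduce to consuming a stream of change events and reacting to each one. The snippet below sketches cache invalidation: it evicts cache entries whose underlying rows have changed. The event shape and the dict-based cache are illustrative assumptions, not a specific library's API.

```python
# Illustrative cache-invalidation consumer; the event fields are made up.
cache = {"customer:101": {"email": "old@example.com"}}

def on_change(event, cache):
    """Evict the cache entry for any row that was updated or deleted."""
    key = f"{event['table']}:{event['id']}"
    if event["op"] in ("UPDATE", "DELETE"):
        cache.pop(key, None)   # next read repopulates from the source of truth

on_change({"table": "customer", "id": 101, "op": "UPDATE"}, cache)
print("customer:101" in cache)  # False -- the stale entry was evicted
```

The same consume-and-react loop drives search-index updates and CQRS read-model synchronization; only the action taken per event differs.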

Change Data Capture, Data Streaming, and ETL

In the realm of data management, Change Data Capture (CDC), Data Streaming, and Extract, Transform, Load (ETL) processes intersect, each serving a critical role in capturing, processing, and delivering data to target systems such as data warehouses or data lakes. Let's delve deeper into how these components complement each other and contribute to the broader landscape of data integration.

Change Data Capture (CDC)

Change Data Capture (CDC) focuses on capturing real-time changes in source data, ensuring that any new entries, updates, or deletions are promptly reflected in the target system. By concentrating on changes rather than entire datasets, CDC offers a more efficient and immediate approach to data replication.

Data Streaming

Data Streaming, on the other hand, is responsible for transmitting captured data to target systems without delay. It operates as a conduit, swiftly transporting freshly captured data to its destination. Data Streaming solutions provide robust and scalable platforms for near real-time data transmission, ensuring that changes are propagated swiftly and efficiently.

Extract, Transform, Load (ETL)

Extract, Transform, Load (ETL) encompasses a three-step process: extracting data from source systems, transforming it into the desired format, and loading it into target systems. While traditional ETL processes operated in batch modes, with data processed at scheduled intervals, the emergence of CDC and Data Streaming has ushered in the era of real-time ETL. In this paradigm, data is continuously extracted, transformed, and loaded as changes occur, ensuring that target systems receive up-to-date information in real-time.

The synergy of CDC, Data Streaming, and ETL is evident in modern data integration practices. CDC captures changes in real-time, Data Streaming transmits these changes promptly, and ETL processes ensure seamless processing and integration into target systems. Together, these components form a robust framework for real-time data integration, empowering businesses to leverage the full potential of their data assets.

Change Data Capture Tools

Here are a couple of popular open-source Change Data Capture tools worth considering.

  • Maxwell: Maxwell is a tool that reads MySQL binlogs and writes row updates in JSON format to platforms such as Kafka and Kinesis. Maxwell has low operational overhead and is easy to set up, requiring nothing more than MySQL and somewhere to write to.
  • Debezium: Debezium is an open-source CDC platform built on Apache Kafka. It offers connectors to capture change streams from databases like MySQL, PostgreSQL, and MongoDB, forwarding them to Kafka via Kafka Connect. Debezium can be used as a standalone server or embedded into application code as a library.

Change Data Capture and Apache Kafka

Change Data Capture (CDC) enables the capture of data changes within source databases, necessitating effective communication and propagation of these changes to downstream systems. Apache Kafka and Kafka Connect excel in this regard. Apache Kafka serves as an open-source event streaming platform, offering capabilities to durably write, store, and process streams of events in real-time or retrospectively. As a distributed system of servers and clients, Kafka ensures reliable and scalable performance.

A key component of the Apache Kafka ecosystem is the Kafka Connect API, introduced in version 0.9. Kafka Connect facilitates seamless integration between Kafka and other systems, enabling both the transmission of data to and the reception of data from these systems. Notably, Kafka Connect operates on a configuration-driven approach, eliminating the need for code development. Instead, it relies solely on configuration files, providing a straightforward integration solution for developers.

A prevalent use case for Kafka Connect is database change data capture. Utilizing connectors like Confluent's JDBC or Debezium CDC connectors, users can effortlessly integrate Kafka with their databases and stream data into Confluent. The JDBC connector, for instance, polls source databases for new or modified data based on updated timestamp columns. Additionally, Confluent Cloud offers fully managed source connectors for databases such as Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
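Because Kafka Connect is configuration-driven, deploying a CDC connector amounts to submitting a JSON payload to the Connect REST API. The sketch below builds such a payload for Debezium's MySQL connector; the hostnames and credentials are placeholders, and some property names vary between Debezium versions, so treat this as an outline rather than a working configuration.

```python
import json

def debezium_mysql_connector(name, host, user, password):
    """Build a Kafka Connect payload for a Debezium MySQL source connector.
    Property names follow recent Debezium releases; check your version's docs."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,          # placeholder values throughout
            "database.port": "3306",
            "database.user": user,
            "database.password": password,
            "topic.prefix": name,               # prefix for the per-table Kafka topics
        },
    }

payload = debezium_mysql_connector("shop-cdc", "mysql.internal", "cdc_user", "secret")
print(json.dumps(payload, indent=2))

# To deploy, POST this JSON to the Kafka Connect REST API, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @connector.json http://connect:8083/connectors
```

Once registered, the connector takes its initial snapshot and then streams row-level changes into Kafka topics with no application code written.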

Furthermore, Confluent offers log-based CDC source connectors for databases like Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. These connectors function by capturing an initial snapshot of existing data upon startup, subsequently monitoring and recording all row-level changes to that data.

Summary

CDC technology finds applications across various industries, including finance, healthcare, retail, manufacturing, and more. It serves as a foundational component in building modern data architectures, supporting initiatives such as real-time analytics, event-driven architectures, and microservices.

Overall, CDC empowers organizations to harness the full potential of their data by ensuring that it remains accurate, up-to-date, and readily available across the entire ecosystem of applications and systems.