跳轉到

Change Data Capture

What is Change Data Capture?

Change Data Capture is a software architecture that allows detecting and capturing changes made to data in a database and sending these changes, sometimes in real-time, to a downstream process or system. More specifically, CDC entails recording INSERT, UPDATE, and DELETE transactions applied to a table.

Capturing every change from a source database and moving it to a target – such as a data warehouse or data lake – helps keep systems in sync. Furthermore, because changes are captured with low latency, CDC enables near real-time analytics and data science.

Change Data Capture is also ideal for modern cloud architectures because it is a highly efficient way of moving data across a wide area network.

Change Data Capture vs. Batch processing

Data replication is a process used to keep datastores in sync, which is often achieved using extract, transform, and load (ETL) pipelines. Since CDC concerns the extraction process, it can be used in ETL and ELT pipelines.

Traditionally, an ETL pipeline loads data into a database, data warehouse, or data lake in batches. With batch processing, the ETL is scheduled to be periodically executed (every hour or every day, for example). ETLs can replicate the full source database or only the new and updated data – these types of replication are called full replication and incremental replication, respectively.

But traditional batch processing has a few general drawbacks:

  • Data is not replicated in real-time.
  • It can be inefficient, especially when performing full replication.
  • It can overload the source database by regularly querying for new or updated data.
  • It isn’t easy to track deletes in the source database when using incremental replication.

Change Data Capture methods

Table metadata

This method keeps track of metadata across every row in a table, including when the row was created and updated. Using this method requires additional columns in the original table (such as created_at and updated_at) or a separate table to track these different metadata elements.

Tracking metadata is commonly used in incremental batch processing to identify new and updated rows.

There are many ways to identify new and updated rows in the source table. The most common way is to look at the updated_at column in the destination table before replication to know the latest update and then identify the rows with a later updated_at in the source table. The result is the new and updated rows that should be merged at the destination.

Key challenges:

  • If there are hard deletes in the source database, it’s impossible to track them using this method.
  • Regularly querying the source database to identify new and updated rows can overload it.

Table differences

This method identifies the difference between the source and the destination tables to detect new, updated, and even deleted rows. The difference can be calculated using a SQL query or specific utilities provided by the database (for example, SQL Server provides a tablediff utility).

Key challenges:

  • Comparing tables row-by-row to identify differences requires extensive computational resources, and it’s not scalable.

Database triggers (Trigger-based CDC)

This method requires the creation of database triggers with logic to manage the metadata within the same table or in a separate book-keeping table, often called a shadow table.

Most databases allow the creation of triggers; you can see how to create a trigger for PostgreSQL.

Key challenges:

  • If a transaction fails, roll-back logic may need to be implemented to remove the operation from the shadow table.
  • The trigger needs to be modified in case of table schema changes.
  • Triggers cannot be reused for other databases, given the differences in SQL language.
  • The use of triggers can slow down the transactional workload.

Database transaction log (Log-based CDC)

Log-based CDC uses the transaction logs that some databases – such as Postgres, MySQL, SQL Server,  and Oracle – implement natively as part of their core functionality.

Log-based and trigger-based CDC are very similar – both keep a log of changes every time a database operation happens – so the shadow table and the transaction log contain the same information. The difference between log-based and trigger-based CDC is that the first one uses a core functionality of the database (transaction log); meanwhile, the triggers are created and defined by the user.

Since database logs are updated in every transaction, the experience is transparent, which means log-based CDC does not require any logical changes in database objects or the application running on top of the database. A system reads data directly from the database Change Data Capture logs to identify changes in a database, minimizing the impact of the capture process.

Key challenges:

  • Some database operations are not captured in the CDC logs, such as ALTER or TRUNCATE. In that case, additional logic needs to be configured to force the logging of those operations.
  • If the destination datastore is down, transaction logs must be kept intact until the subsequent replication happens.

Summary

As you can see, there are several approaches to implementing Change Data Capture. Before databases included native CDC, data engineers used techniques such as table differences, table metadata, and database triggers to capture changes made to a database. These methods, however, can be inefficient and intrusive, and they tend to impose overhead on source databases.

As a result, many modern and real-time data architectures employ log-based CDC, which uses a background process to scan database transaction logs for changed data. Transactions are unaffected, and the impact on source databases is minimal. It’s also easy to implement, provided you have the right Change Data Capture tool.

Reference