Change Data Capture (CDC)
A technique for identifying and capturing changes made to a database in real time by reading the database's transaction log, enabling downstream systems to react to data changes as they occur.
CDC reads the write-ahead log (WAL) or binary log of a database to capture every insert, update, and delete as it happens. Tools like Debezium, AWS DMS, and Fivetran use CDC to stream database changes to downstream systems with minimal impact on the source database's performance, since reading the log avoids querying the tables themselves. This is far more efficient than detecting changes with periodic full-table scans.
The captured change events contain the before and after state of each row, along with metadata like the transaction ID and timestamp. These events can be streamed to message brokers such as Kafka, data warehouses, search indexes, or caches, keeping downstream systems synchronized with the source database in near real time.
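A consumer of these events typically routes on the operation type and applies the before/after state to a downstream store. The sketch below uses a Debezium-style envelope (fields like `op`, `before`, `after`, `ts_ms`); the cache-routing logic is illustrative, not any particular tool's API.

```python
# Minimal sketch: apply a Debezium-style change event to a downstream cache.
# Field names ("op", "before", "after", "ts_ms") follow Debezium's envelope
# format; the routing logic itself is illustrative.

def apply_change_event(event, cache):
    """Keep a downstream key-value cache in sync with a captured row change."""
    op = event["op"]            # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]    # new state of the row
        cache[row["id"]] = row
    elif op == "d":
        row = event["before"]   # last known state before the delete
        cache.pop(row["id"], None)

cache = {}
apply_change_event(
    {"op": "c", "before": None,
     "after": {"id": 42, "email": "a@example.com"}, "ts_ms": 1700000000000},
    cache,
)
apply_change_event(
    {"op": "u", "before": {"id": 42, "email": "a@example.com"},
     "after": {"id": 42, "email": "b@example.com"}, "ts_ms": 1700000001000},
    cache,
)
print(cache[42]["email"])  # b@example.com
```

Because each event carries the full after-image of the row, the consumer can rebuild downstream state without ever querying the source table.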
For AI systems, CDC enables real-time feature updates. When a user updates their profile or makes a purchase, CDC captures the change and streams it to the feature store, ensuring that the next model prediction uses the latest information. CDC also enables incremental training-data pipelines that process only new and changed records instead of reprocessing the entire dataset.
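Folding change events into running per-user features looks something like the following sketch. The feature names (`purchase_count`, `total_spend`) and event shape are hypothetical; a production system would write to a real feature store rather than a dict.

```python
# Illustrative sketch: update per-user features incrementally from CDC
# events, rather than recomputing them with a full-table scan.
# Feature names and event shape are hypothetical.

def update_features(event, features):
    """Fold a purchase change event into a user's running features."""
    row = event["after"]
    user = features.setdefault(
        row["user_id"], {"purchase_count": 0, "total_spend": 0}
    )
    user["purchase_count"] += 1
    user["total_spend"] += row["amount"]

features = {}
purchase_events = [
    {"op": "c", "after": {"user_id": 7, "amount": 20}},
    {"op": "c", "after": {"user_id": 7, "amount": 5}},
]
for event in purchase_events:
    update_features(event, features)

print(features[7])  # {'purchase_count': 2, 'total_spend': 25}
```

The next prediction for user 7 reads these up-to-date aggregates instead of stale nightly-batch values.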
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
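The definition above translates directly to code: the dot product of the two vectors divided by the product of their magnitudes. A pure-Python sketch (production code would typically use a vectorized library such as NumPy):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))   # ~1.0 (same direction)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

Note that the result depends only on direction, not magnitude, which is why it suits comparing embeddings of different norms.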
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
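As a minimal sketch of the idea, the code below projects 2-D points onto their first principal component, reducing each point to one coordinate while preserving the direction of maximum variance. This is a toy PCA for the 2-D case only; real pipelines would use a library implementation such as scikit-learn's PCA.

```python
import math

def pca_1d(points):
    """Project 2-D points onto their first principal component.

    Toy PCA for the 2-D case, using the closed-form angle of the
    leading eigenvector of a 2x2 covariance matrix.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Direction of maximum variance
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Each 2-D point becomes a single coordinate along that direction
    return [(p[0] - mx) * ux + (p[1] - my) * uy for p in points]

points = [(0, 0), (1, 1), (2, 2), (3, 3)]  # points on the line y = x
print(pca_1d(points))  # equal spacing along the line survives in 1-D
```

The four collinear points collapse to evenly spaced 1-D coordinates: no structure is lost because the data was effectively one-dimensional to begin with.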
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
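The pattern can be sketched in a few lines: accumulate records, then score them in fixed-size groups, as a scheduled job would. The `model` function here is a stand-in for any scoring function that amortizes per-call overhead across a batch.

```python
# Minimal sketch of batch inference. `model` is a stand-in for any
# batch-scoring function; real models amortize setup cost per batch.

def model(batch):
    """Toy model: score each record in the batch."""
    return [2 * x + 1 for x in batch]

def batch_inference(records, batch_size=3):
    """Score records in fixed-size batches, as a scheduled job would."""
    predictions = []
    for i in range(0, len(records), batch_size):
        predictions.extend(model(records[i:i + batch_size]))
    return predictions

print(batch_inference([1, 2, 3, 4, 5]))  # [3, 5, 7, 9, 11]
```

Real-time inference, by contrast, would call `model([x])` per request, trading throughput for low latency.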
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
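The three stages map directly onto three functions. The sketch below runs a toy end-to-end ETL pass over in-memory data; the source rows, field names, and dict-backed "warehouse" are all illustrative stand-ins for real source systems and a real target database.

```python
# Toy ETL run over in-memory data; real pipelines read from source
# systems and load into a warehouse. All names are illustrative.

def extract():
    """Pull raw rows from a source system (stubbed as a list)."""
    return [{"name": " Ada ", "signup": "2024-01-05"},
            {"name": "Grace", "signup": "2024-02-11"}]

def transform(rows):
    """Normalize raw rows into an analysis-ready shape."""
    return [{"name": r["name"].strip().lower(),
             "signup_year": int(r["signup"][:4])} for r in rows]

def load(rows, warehouse):
    """Append transformed rows to the target table."""
    warehouse.setdefault("users", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["users"][0])  # {'name': 'ada', 'signup_year': 2024}
```

Because each stage is a pure function of its input, the same pipeline can be re-run idempotently, which is what makes ETL flows reliable and repeatable.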