Databricks CDC (Change Data Capture)

Change Data Capture (CDC) in Databricks is a pattern for processing only the changes (inserts, updates, deletes) from source systems instead of repeatedly reloading full tables. This enables near real-time analytics, efficient data pipelines, and simpler auditability on the Databricks Lakehouse.

What Is Change Data Capture?

Why Use CDC?

CDC in the Databricks Lakehouse

Databricks typically organizes CDC pipelines using the Bronze / Silver / Gold layering pattern on top of Delta Lake tables:

  1. Bronze: Raw, ingested CDC events (from logs, queues, or files).
  2. Silver: Cleaned, merged, de-duplicated tables with current state and history.
  3. Gold: Business-ready aggregates, marts, and feature tables.

Key Databricks Building Blocks for CDC

Typical CDC Architectures on Databricks

1. Log-Based CDC Into Databricks

In this pattern, a separate CDC tool reads the source database's transaction log and delivers the changes to a location that Databricks ingests from.

  1. A CDC tool (e.g., Debezium or Fivetran) captures database changes (INSERT, UPDATE, DELETE).
  2. Changes are written as events to a message queue or to files in cloud storage.
  3. Databricks Structured Streaming or Auto Loader ingests these events into a Bronze Delta table.
  4. Silver tables apply business logic: cleaning, de-duplicating, and merging the events into current state and history.
  5. Gold tables build aggregates and reporting layers.
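Before the Silver step applies changes, several events for the same key usually have to be collapsed into one, because events can arrive late or out of order. The following is a minimal plain-Python sketch of that de-duplication logic, not Databricks code. The event fields `id`, `op`, `seq`, and `data` are hypothetical names chosen for illustration; real CDC tools each have their own event format, and on Databricks the same idea would typically be expressed with a window function over the Bronze table.

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def latest_event_per_key(events: Iterable[Event]) -> List[Event]:
    """Keep only the highest-sequence event for each key.

    Assumes `seq` increases with every change in the source system,
    so the largest value per key is the most recent change.
    """
    latest: Dict[Any, Event] = {}
    for event in events:
        key = event["id"]
        current = latest.get(key)
        if current is None or event["seq"] > current["seq"]:
            latest[key] = event
    # Sort by key so the output order is deterministic.
    return sorted(latest.values(), key=lambda e: e["id"])


# Hypothetical Bronze events; the last one arrives late and must be ignored.
bronze_events = [
    {"id": 1, "op": "INSERT", "seq": 10, "data": {"name": "Ada"}},
    {"id": 1, "op": "UPDATE", "seq": 12, "data": {"name": "Ada L."}},
    {"id": 2, "op": "INSERT", "seq": 11, "data": {"name": "Bob"}},
    {"id": 2, "op": "DELETE", "seq": 13, "data": None},
    {"id": 1, "op": "UPDATE", "seq": 11, "data": {"name": "A."}},
]

deduped = latest_event_per_key(bronze_events)
```

Here `deduped` holds one event per key: the update with `seq` 12 for key 1 and the delete with `seq` 13 for key 2. Ordering by a source-side sequence number rather than by arrival time is what makes late events harmless.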

2. Using Delta Change Data Feed (CDF)

When your source is already a Delta table, you can use Change Data Feed to read only changed rows instead of scanning the full table.
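Change Data Feed has to be enabled on the source table (the `delta.enableChangeDataFeed` table property) and is read by setting the `readChangeFeed` option on a Delta read. Each returned row carries `_change_type` (`insert`, `update_preimage`, `update_postimage`, or `delete`), `_commit_version`, and `_commit_timestamp`. The sketch below models, in plain Python over dictionaries rather than a Spark DataFrame, one way a consumer might reduce those rows to net upserts and deletes. The function name and the sample `id` and `name` columns are illustrative assumptions.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_change_feed(rows: List[Row], key: str) -> Tuple[List[Row], List[Any]]:
    """Reduce change-feed rows to net upserts and deleted keys."""
    net: Dict[Any, Row] = {}
    for row in rows:
        if row["_change_type"] == "update_preimage":
            continue  # old values; not needed to reproduce current state
        seen = net.get(row[key])
        if seen is None or row["_commit_version"] >= seen["_commit_version"]:
            net[row[key]] = row
    upserts = [r for r in net.values() if r["_change_type"] != "delete"]
    deletes = [r[key] for r in net.values() if r["_change_type"] == "delete"]
    return upserts, deletes


# Hypothetical change-feed rows for a table with columns id and name.
feed = [
    {"id": 1, "name": "Ada", "_change_type": "insert", "_commit_version": 1},
    {"id": 2, "name": "Bob", "_change_type": "insert", "_commit_version": 1},
    {"id": 1, "name": "Ada", "_change_type": "update_preimage", "_commit_version": 2},
    {"id": 1, "name": "Ada L.", "_change_type": "update_postimage", "_commit_version": 2},
    {"id": 2, "name": "Bob", "_change_type": "delete", "_commit_version": 3},
]

upserts, deletes = split_change_feed(feed, key="id")
```

With this input, `upserts` contains the post-update row for key 1 and `deletes` contains key 2. A downstream job would track the last `_commit_version` it processed and start the next read from the following version.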

3. Batch CDC from Snapshots (Pseudo-CDC)

If the source system provides only full snapshots, Databricks can compute changes between current and previous snapshots.

  1. Ingest each snapshot into a Bronze Delta table with a snapshot_date.
  2. Compute differences between the latest and previous snapshots to identify inserted, updated, and deleted rows.
  3. Apply results as MERGE operations to the Silver table.
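The comparison in step 2 can be sketched in plain Python as a set difference on primary keys plus a value comparison for keys present in both snapshots. This is an illustration of the logic only: it assumes each snapshot fits in memory as a dictionary keyed by primary key, and the function name is hypothetical. On Databricks the same comparison would normally be a join between the two snapshot partitions.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[Any, Dict[str, Any]]  # primary key -> row values


def diff_snapshots(
    previous: Snapshot, current: Snapshot
) -> Tuple[Snapshot, Snapshot, List[Any]]:
    """Return (inserts, updates, deleted_keys) between two snapshots."""
    # Keys only in the current snapshot are new rows.
    inserts = {k: current[k] for k in current.keys() - previous.keys()}
    # Keys only in the previous snapshot were deleted at the source.
    deleted_keys = sorted(previous.keys() - current.keys())
    # Keys in both snapshots are updates only if the values differ.
    updates = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserts, updates, deleted_keys


previous_snapshot = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
current_snapshot = {1: {"name": "Ada"}, 2: {"name": "Bobby"}, 4: {"name": "Dee"}}

inserts, updates, deleted_keys = diff_snapshots(previous_snapshot, current_snapshot)
```

Key 4 is reported as an insert, key 2 as an update, and key 3 as a delete; the unchanged key 1 produces no change record, which is what keeps the subsequent MERGE small.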

How CDC Is Applied in Delta Tables

Upserts Using MERGE

A common CDC pattern in Databricks is to MERGE change events into a target Delta table, updating rows that already exist, inserting new ones, and removing rows deleted at the source.
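As a hedged illustration, the helper below assembles a Delta Lake `MERGE INTO` statement for such a change set. The function, the table names `silver.customers` and `customer_changes`, the `op` column, and its `'DELETE'` value are all assumptions made for the example; adapt them to the event format your CDC tool produces. The source is assumed to hold at most one row per key, because MERGE fails when several source rows match the same target row.

```python
from typing import Sequence


def build_cdc_merge(
    target: str,
    source: str,
    key: str,
    columns: Sequence[str],
    op_column: str = "op",
) -> str:
    """Build a MERGE statement that applies deletes, updates, and inserts."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    all_columns = [key, *columns]
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)
    return "\n".join(
        [
            f"MERGE INTO {target} AS t",
            f"USING {source} AS s",
            f"ON t.{key} = s.{key}",
            # Delete events remove the matching target row.
            f"WHEN MATCHED AND s.{op_column} = 'DELETE' THEN DELETE",
            # Any other matched event overwrites the listed columns.
            f"WHEN MATCHED THEN UPDATE SET {set_clause}",
            # Unmatched events become new rows, unless they are deletes.
            f"WHEN NOT MATCHED AND s.{op_column} <> 'DELETE' THEN",
            f"  INSERT ({insert_cols}) VALUES ({insert_vals})",
        ]
    )


merge_sql = build_cdc_merge(
    target="silver.customers",
    source="customer_changes",
    key="id",
    columns=["name", "email"],
)
# On Databricks the statement could then be run with spark.sql(merge_sql).
```

Listing the columns explicitly keeps the operation column out of the target table; the delete clause comes first because MERGE evaluates `WHEN MATCHED` clauses in order.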

Slowly Changing Dimensions (SCD) with CDC

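For dimensional models, Databricks CDC is often combined with SCD Type 1 and SCD Type 2 patterns. SCD Type 1 overwrites the existing row with the latest values, while SCD Type 2 preserves history by closing the current row and inserting a new version.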
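The difference between the two patterns is easiest to see on a small example. The sketch below models both in plain Python. The column names `valid_from`, `valid_to`, and `is_current`, and the change fields `id`, `attrs`, and `effective`, are common conventions used here as assumptions, not a Databricks-defined schema.

```python
from typing import Any, Dict, List

Change = Dict[str, Any]


def apply_scd1(dimension: Dict[Any, Dict[str, Any]], change: Change) -> None:
    """Type 1: overwrite the attributes, keeping no history."""
    dimension[change["id"]] = change["attrs"]


def apply_scd2(dimension: List[Dict[str, Any]], change: Change) -> None:
    """Type 2: close the current version and append a new one."""
    for row in dimension:
        if row["id"] == change["id"] and row["is_current"]:
            if row["attrs"] == change["attrs"]:
                return  # no attribute changed, so no new version
            row["is_current"] = False
            row["valid_to"] = change["effective"]
    dimension.append(
        {
            "id": change["id"],
            "attrs": change["attrs"],
            "valid_from": change["effective"],
            "valid_to": None,  # open-ended until the next change
            "is_current": True,
        }
    )


changes = [
    {"id": 1, "attrs": {"city": "Oslo"}, "effective": "2024-01-01"},
    {"id": 1, "attrs": {"city": "Bergen"}, "effective": "2024-06-01"},
]

dim_type1: Dict[Any, Dict[str, Any]] = {}
dim_type2: List[Dict[str, Any]] = []
for change in changes:
    apply_scd1(dim_type1, change)
    apply_scd2(dim_type2, change)
```

After both changes, `dim_type1` holds only the Bergen row, while `dim_type2` holds two versions: the Oslo row closed at `2024-06-01` and the Bergen row marked current. On Delta tables the Type 2 logic is usually expressed as a MERGE that updates the current row and inserts the new version in the same operation.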

CDC with Delta Live Tables (DLT)

Delta Live Tables can simplify CDC implementations by managing dependencies, ordering, and fault tolerance for you.

Best Practices for Databricks CDC

Summary