Liquid Clustering in Delta Lake

Liquid Clustering is Delta Lake's modern data layout strategy — the successor to Hive-style partitioning and to Z-Ordering. It addresses two longstanding pain points in Lakehouse engineering: partitioning's brittleness when access patterns change, and Z-Order's expensive, all-or-nothing rewrites. Liquid Clustering is incremental, has no fixed boundaries, and lets you change cluster columns without rebuilding the table.

For most new Delta tables, Liquid Clustering is the right default. For most existing tables that suffer from small-file problems or partition skew, it is the right migration target.


1. Why Hive-Style Partitioning Fails

Hive-style partitioning physically subdivides a table into directories by the values of one or more columns: /orders/year=2025/month=04/day=22/. This works beautifully when queries always filter on the partition columns and the cardinality is moderate. It collapses in three failure modes that show up at scale:

  1. High cardinality. Every distinct value combination becomes a directory, and each directory holds at least one file, so small files and listing overhead explode.
  2. Skew. One hot value (a large customer, a busy region) produces a giant partition next to thousands of near-empty ones, and no OPTIMIZE can rebalance across a directory boundary.
  3. Changing access patterns. Queries that do not filter on the partition columns scan everything, and changing the partition columns requires rewriting the entire table.

Once partition cardinality goes above a few thousand distinct values, or once any column with skew is used as a partition, performance degrades. The fix is usually to stop partitioning and let the data layout be managed automatically.
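A quick back-of-envelope shows how the cardinality failure mode compounds. The numbers below are hypothetical, chosen only to illustrate the multiplicative blowup:

```python
# Illustrative arithmetic: partition (directory) count grows multiplicatively.
days = 3 * 365          # three years of daily partitions
regions = 50            # moderate-cardinality second partition column
customers = 100_000     # high-cardinality column someone adds "for speed"

date_only = days
date_region = days * regions
date_customer = days * customers

print(f"date only:       {date_only:>12,} partitions")
print(f"date x region:   {date_region:>12,} partitions")
print(f"date x customer: {date_customer:>12,} partitions")
# Each partition holds at least one file, so the last layout guarantees
# roughly 100M tiny files regardless of how much data actually exists.
```

The same column that destroys a partition layout (customer_id here) is a perfectly safe Liquid cluster key, which is the core of the argument below.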

2. Z-Order Limitations

Z-Order was Delta's first-generation answer to "I want fast filters on more than one column without partitioning by all of them." It uses a space-filling curve to colocate rows by multiple columns within each file, then relies on Delta's data-skipping min/max statistics to prune files at query time.
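The space-filling-curve idea can be sketched with Morton (Z-order) codes: interleave the bits of two column values so that rows close in both dimensions get nearby sort keys. This is an illustration of the concept, not Delta's internal implementation:

```python
def morton_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits land at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits land at odd positions
    return z

# Rows close in both columns get close keys, so sorting by the key
# colocates them in the same files and keeps min/max stats tight.
rows = [(200, 7), (3, 6), (201, 7), (3, 5)]
rows.sort(key=lambda r: morton_2d(*r))
```

After the sort, the two `(3, …)` rows and the two `(20x, 7)` rows end up adjacent, which is exactly the colocation that makes min/max file pruning effective on either column.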

Two structural limitations:

  1. Not incremental. OPTIMIZE ... ZORDER BY rewrites every file it touches on each run, and newly written data is unclustered until the next run, so clustering quality decays continuously between runs.
  2. No persistent layout declaration. The Z-Order columns are an argument to the OPTIMIZE command, not a property of the table, so every maintenance job must repeat them, and changing them means another full rewrite of the affected data.

Z-Order is still useful in narrow situations (one-shot historical optimization on a static table) but it is no longer the recommended pattern for evolving tables.

3. What Liquid Clustering Is

Liquid Clustering replaces both partitioning and Z-Order with a single mechanism. The cluster columns are stored in table metadata. Files are grouped into clusters by those columns, and Delta maintains the layout incrementally as data is written or compacted — new files are clustered in place, and only files that need rebalancing are touched. There are no directory boundaries, no skew traps, and no rewrite of the entire table when the layout drifts.

Properties to internalize:

  - The cluster columns are table metadata, not a directory structure; nothing breaks when they change.
  - Maintenance is incremental: OPTIMIZE rewrites only the files that need rebalancing, never the whole table.
  - High-cardinality and skewed columns are safe cluster keys; there are no per-value directories to explode.
  - Clustering works with row-level concurrency, so concurrent writers conflict less than under file-level partition layouts.

4. Enabling Liquid Clustering

On a New Table

Use CLUSTER BY in CREATE TABLE. Do not also use PARTITIONED BY — the two are mutually exclusive on the same table.


CREATE TABLE main.sales.orders (
  order_id        STRING NOT NULL,
  customer_id     STRING NOT NULL,
  order_ts        TIMESTAMP NOT NULL,
  region          STRING,
  product_sku     STRING,
  amount_usd      DECIMAL(18,2),
  status          STRING
)
CLUSTER BY (customer_id, order_ts);

Picking customer_id first and order_ts second optimizes for the common access pattern "give me everything for customer X in the last 30 days," which is the dominant pattern in transactional analytics.

On an Existing Table

Convert in two steps: declare cluster columns, then run OPTIMIZE to migrate the data layout. The conversion is online — readers and writers continue to operate.


-- Cannot apply CLUSTER BY directly to a partitioned table without first dropping the partition spec.
-- For an unpartitioned table, you can simply alter:
ALTER TABLE main.sales.orders
  CLUSTER BY (customer_id, order_ts);

-- Migrate existing files to the new layout incrementally
OPTIMIZE main.sales.orders;

To change cluster columns later, alter the table again. New writes immediately use the new columns; subsequent OPTIMIZE runs migrate older files.


ALTER TABLE main.sales.orders
  CLUSTER BY (region, order_ts);

OPTIMIZE main.sales.orders;

To stop clustering entirely (rare):


ALTER TABLE main.sales.orders CLUSTER BY NONE;

Python Equivalents


# Create a clustered table from a DataFrame using DataFrameWriterV2
(df.writeTo("main.sales.orders")
   .using("delta")
   .clusterBy("customer_id", "order_ts")
   .create())

# Or via the spark.sql shorthand
spark.sql("""
  CREATE TABLE main.sales.orders_v2 (
    order_id STRING, customer_id STRING, order_ts TIMESTAMP, amount_usd DECIMAL(18,2)
  )
  CLUSTER BY (customer_id, order_ts)
""")

# Run OPTIMIZE
spark.sql("OPTIMIZE main.sales.orders")

5. Choosing Cluster Columns

The rule of thumb: pick the columns that appear in WHERE clauses or join keys for the queries you care about. Within that set, prefer:

  - Columns filtered in the most frequent or highest-value queries, listed most-selective first.
  - High-cardinality columns such as IDs and timestamps; these are exactly the columns partitioning cannot handle.
  - Columns that are filtered together, so one layout serves the combined predicate.

Avoid:

  - Columns that rarely appear in predicates; they dilute the layout without speeding anything up.
  - More than about four columns; each added dimension weakens colocation on all the others.
  - Very low-cardinality columns (booleans, tiny enums) as the leading key; they offer little file skipping on their own.

6. Running OPTIMIZE and Auto-Clustering

Liquid Clustering is not "set and forget" — the layout improves only when data is rewritten. There are three ways to drive that:

  1. Manual OPTIMIZE. Schedule it nightly or after large batch loads. Cheap, because it only touches files that need rebalancing.
  2. Predictive Optimization. A workspace-level setting (Unity Catalog managed tables) that automatically schedules OPTIMIZE and VACUUM based on observed query patterns and write activity. Recommended.
  3. Auto-clustering during writes. Recent Delta versions cluster newly written files in place so the layout stays close to optimal even without an explicit OPTIMIZE.

-- Enable Predictive Optimization at the schema level (UC managed tables)
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;

-- Or schedule via a job
OPTIMIZE main.sales.orders;

A typical before/after on a 2 TB orders table clustered by (customer_id, order_ts):


-- Before clustering: full scan
SELECT SUM(amount_usd)
FROM main.sales.orders
WHERE customer_id = 'C-7841'
  AND order_ts >= current_date() - INTERVAL 30 DAYS;
-- 2 TB scanned, ~45 s

-- After CLUSTER BY (customer_id, order_ts) + OPTIMIZE
-- Same query: 1.4 GB scanned, ~1.2 s
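The mechanism behind the scan reduction is file-level data skipping: after clustering, each file's min/max statistics for customer_id cover a narrow range, so a point filter prunes almost everything. A toy model of the pruning decision, with hypothetical per-file stats rather than anything read from a real Delta log:

```python
# Each file exposes (min, max) stats per column, as recorded in the Delta log.
# A file can be skipped when the predicate cannot match its value range.
files_clustered = [("C-0000", "C-2499"), ("C-2500", "C-4999"),
                   ("C-5000", "C-7499"), ("C-7500", "C-9999")]
files_unclustered = [("C-0000", "C-9999")] * 4  # every file spans everything

def files_to_scan(files, key):
    """Keep only files whose min/max range could contain the key."""
    return [f for f in files if f[0] <= key <= f[1]]

print(len(files_to_scan(files_clustered, "C-7841")))    # 1 of 4 files scanned
print(len(files_to_scan(files_unclustered, "C-7841")))  # 4 of 4 files scanned
```

Before clustering, every file's range spans the whole key space and nothing can be skipped; after clustering, the ranges are disjoint and the planner reads one file.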

7. Partitioning vs Z-Order vs Liquid

| Property | Partitioning | Z-Order | Liquid Clustering |
|---|---|---|---|
| Layout boundaries | Hard directory boundaries | Within-file sort, no dir boundaries | Soft cluster groups, no dir boundaries |
| Cardinality limit | Low (thousands) | High | High |
| Skew handling | Poor | Better | Good |
| Incremental rewrite | N/A | No (full rewrite per OPTIMIZE) | Yes |
| Change layout columns | Full table rewrite | Full table rewrite | ALTER + incremental migrate |
| Concurrency | File-level conflicts | File-level conflicts | Row-level concurrency |
| Coexists with the others | Yes (with Z-Order) | Yes (with partitioning) | No (mutually exclusive) |

8. When Partitioning Still Wins

Liquid Clustering is the new default, but partitioning has not been deleted from the toolbox. Two scenarios where partitioning is still the right choice:

  1. Partition-aligned lifecycle operations: per-partition retention policies, jobs that process one date at a time, or bulk deletes that align with partition values and drop whole files.
  2. Very large append-only tables where virtually every query filters on a single low-cardinality column (typically a date): the directory layout gives the planner free file skipping at no metadata cost.

Outside those cases, default to Liquid Clustering on new tables and migrate existing high-cardinality or skewed partitioned tables to it.


Common Interview Questions:

What is Liquid Clustering and how is it different from partitioning?

Liquid Clustering is a Delta feature that physically organizes table data by chosen clustering keys without creating directory-level partitions. Unlike Hive-style partitioning, it adapts to high-cardinality keys without producing tiny files, supports changing the clustering keys after the table is in production, and avoids the small-files problem that kills over-partitioned tables. It is the right default for most new tables on modern Databricks.

Liquid Clustering vs Z-Order — when does each win?

Z-Order is a clustering operation applied periodically by OPTIMIZE ... ZORDER BY on a partitioned table; clustering quality decays between runs and the operation rewrites entire partitions. Liquid Clustering is incremental — new writes are placed near existing clusters automatically — and there is no partition column, so high-cardinality keys are safe. Liquid wins on continuously-loaded tables, mutable tables, and high-cardinality keys; Z-Order is still acceptable on stable date-partitioned append-only tables built before Liquid existed.
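The operational difference is visible in the commands themselves, sketched below with hypothetical table names:

```sql
-- Z-Order: the layout is an argument to the OPTIMIZE command,
-- re-stated and fully re-applied on every maintenance run
OPTIMIZE main.sales.orders_legacy ZORDER BY (customer_id, order_ts);

-- Liquid: the layout is a property of the table itself,
-- maintained incrementally by plain OPTIMIZE from then on
ALTER TABLE main.sales.orders CLUSTER BY (customer_id, order_ts);
OPTIMIZE main.sales.orders;
```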

When should you still partition a Delta table instead of using Liquid Clustering?

Two cases. First, when you genuinely need partition-aligned operations: dropping a year of data with a DELETE on the partition column (Delta has no Hive-style DROP PARTITION, but a partition-aligned DELETE is cheap because it removes whole files), running a job that processes one date at a time, or applying differential retention policies per partition. Second, very large append-only tables with a single low-cardinality date filter on every query: directory partitioning gives the planner free file skipping with zero metadata cost. Outside these, Liquid is the better default; partitioning becomes a footgun the moment cardinality exceeds a few thousand values per partition column.

How many clustering keys should you specify and how do you pick them?

Up to four keys, ordered by selectivity in your query workload — the columns that appear most often in WHERE and JOIN predicates. Picking too many dilutes the clustering benefit because the data has to be organized in too many dimensions; picking the wrong ones means OPTIMIZE rewrites the table without speeding up queries. Use the query history in Databricks SQL to find the actual filter columns rather than guessing — it is almost always 1-2 columns plus a date.
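The "find the actual filter columns" step can be grounded with a simple tally. In practice you would pull predicates from the Databricks query history; the regex and sample queries below are a hypothetical toy version of that analysis:

```python
import re
from collections import Counter

# Hypothetical sample of workload queries (stand-in for query history)
queries = [
    "SELECT * FROM orders WHERE customer_id = 'C-1' AND order_ts > '2025-01-01'",
    "SELECT SUM(amount_usd) FROM orders WHERE customer_id = 'C-2'",
    "SELECT * FROM orders WHERE region = 'EU' AND order_ts > '2025-03-01'",
]

# Naive tally: identifiers appearing to the left of a comparison operator
pattern = re.compile(r"(\w+)\s*(?:>=|<=|=|>|<)")
counts = Counter(col for q in queries for col in pattern.findall(q))
print(counts.most_common(2))  # top filter columns are cluster-key candidates
```

Even this crude count surfaces the 1-2 columns plus a timestamp that dominate the workload, which is usually the whole answer.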

What does it cost to change the clustering keys on an existing Liquid table?

The ALTER TABLE ... CLUSTER BY (...) statement itself is metadata-only and instantaneous — no data is rewritten. New writes start clustering on the new keys immediately. Existing files keep their old organization until the next OPTIMIZE, which incrementally re-clusters them. That is the headline win over partitioning: re-partitioning a multi-terabyte table requires a full rewrite and a maintenance window, while changing Liquid keys is a single statement that re-converges over time.

How do you tell whether Liquid Clustering is actually helping a query?

Look at the Spark UI's scan node for the table — specifically the "files read" and "bytes read" vs "files in table" metrics. If a filter on a clustering key reads only a few percent of files, clustering is working; if it reads most files, either the key is wrong or OPTIMIZE has not run recently. Also check the explain plan for "PartitionFilters" / "DataFilters" being pushed down. Run DESCRIBE DETAIL to see the current clustering columns and the last optimization timestamp; if the table has had heavy writes since the last OPTIMIZE, schedule one.
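The metadata checks mentioned above, in SQL (reusing the table name from earlier sections):

```sql
-- Current clustering columns, file count, and table size
DESCRIBE DETAIL main.sales.orders;

-- Per-operation history: look for recent OPTIMIZE entries
-- relative to the volume of writes since
DESCRIBE HISTORY main.sales.orders;
```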

