Lakehouse

The lakehouse stack stores raw data as Parquet (or Avro / ORC) files on object storage, then layers an open table format — Hudi, Iceberg, Delta — on top to add ACID transactions, time travel, schema evolution, and incremental processing. A separate catalog (Polaris, Unity, Nessie, Glue) tells engines where the tables live, and a query engine (Trino, StarRocks, Spark, Flink) reads them. This page maps the open ecosystem.
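
For concreteness, a minimal sketch of those layers wired together in PySpark, using Iceberg as the table format and a REST catalog for discovery. The endpoint, bucket, and table names are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath:

```python
# Sketch of the stack: Parquet on S3 (storage), Iceberg (table format),
# a REST catalog (discovery), Spark (engine). All names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-stack")
    # Register an Iceberg catalog backed by a REST catalog service.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# The engine asks the catalog where the table lives, reads the table
# format's metadata, then scans only the Parquet files it needs.
spark.sql("SELECT count(*) FROM lake.analytics.events").show()
```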

The Big Three Open Table Formats

Apache Hudi

Record-level upserts with merge-on-read and copy-on-write tables for streaming-friendly incremental ingest. Originally built at Uber for CDC and slowly-changing dimensions on the data lake.
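
A hedged sketch of that upsert path through the Hudi Spark datasource, assuming the hudi-spark bundle is available; the table and path names are invented:

```python
# Upsert a change record into a merge-on-read Hudi table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()
updates = spark.createDataFrame(
    [(42, "2024-06-01 10:00:00", "shipped")], ["order_id", "ts", "status"]
)

(updates.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")  # dedupe key
    .option("hoodie.datasource.write.precombine.field", "ts")       # latest version wins
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # streaming-friendly
    .mode("append")
    .save("s3://my-bucket/lake/orders"))
```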

Apache Iceberg

Metadata-tree design with manifests and snapshots, an engine-neutral catalog, hidden partitioning, and time travel. Adopted by Snowflake, Databricks, and AWS as the de-facto open warehouse format.
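
A sketch of hidden partitioning and time travel in Spark SQL (Spark 3.3+ syntax), reusing the `spark` session and `lake` catalog from the sketch above; table and column names are placeholders:

```python
# Hidden partitioning: the table is partitioned by a transform of ts,
# and queries filter on ts directly rather than on a partition column.
spark.sql("""
    CREATE TABLE lake.analytics.events (
        event_id BIGINT, ts TIMESTAMP, payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Time travel: read the table as of a past point in time via a snapshot.
spark.sql("""
    SELECT count(*) FROM lake.analytics.events
    TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()
```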

Delta Lake

A transaction log layered over Parquet files, created at Databricks and donated to the Linux Foundation. Native on Databricks, increasingly first-class on Snowflake, Microsoft Fabric, and AWS.
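
A minimal sketch of the log in action, using the delta-spark pip package; the path is a placeholder:

```python
# Every Delta commit appends a JSON entry to _delta_log/; versionAsOf
# replays the log up to that commit.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/lake/events")

# Time travel back to the first commit of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events")
print(v0.count())
```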

Catalogs & Governance

Apache Polaris

Open-source REST catalog for Iceberg, originally from Snowflake and donated to Apache in 2024. Implements the Iceberg REST spec with credential vending and multi-tenancy.
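
A hedged sketch of connecting to a Polaris-style REST catalog from pyiceberg; the URI, credential, and warehouse values are placeholders:

```python
# Connect to an Iceberg REST catalog (Polaris implements this spec).
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "credential": "client_id:client_secret",  # OAuth2 client credentials
        "warehouse": "my_catalog",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)

# With credential vending, the catalog can hand the client short-lived,
# table-scoped storage credentials instead of long-lived S3 keys.
table = catalog.load_table("analytics.events")
print(table.schema())
```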

Unity Catalog

Databricks’ unified governance layer, open-sourced 2024. Tables, files, ML models, and functions in one RBAC-aware namespace; supports both Delta and Iceberg.

Project Nessie

Git-for-data catalog for Iceberg — branches, tags, atomic multi-table commits, and time travel by reference. Brings the review-and-promote workflow to data.
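
A sketch of the branch-and-merge workflow through Nessie's Spark SQL extensions, assuming a Spark session already configured with a Nessie catalog named `nessie` and an existing `analytics.events` table:

```python
# Create an isolated branch off main, write to it, then promote.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")

# Writes on the branch are invisible to readers of main...
spark.sql("INSERT INTO nessie.analytics.events VALUES (1, current_timestamp(), 'x')")

# ...until the merge, which promotes all table changes in one atomic commit.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```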

Query Engines for the Lakehouse

Trino & Presto

Distributed SQL engines with native readers for Iceberg, Hudi, Delta, and 30+ other sources. The de-facto open query layer for the lakehouse and the engine behind Starburst.
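
A minimal sketch using Trino's Python client (pip install trino); host, catalog, and table names are placeholders:

```python
# Query an Iceberg table through a running Trino cluster.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080,
    user="analyst", catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# The same engine could join this against a delta or hudi catalog in one
# query; federation across sources is the point.
cur.execute("SELECT status, count(*) FROM orders GROUP BY status")
for row in cur.fetchall():
    print(row)
```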

StarRocks & Apache Doris

MPP analytical engines with native lakehouse readers, materialized views, and sub-second response under high concurrency — tuned for customer-facing BI.

Interoperability & Streaming

Apache XTable

Metadata-only translation between Hudi, Iceberg, and Delta — same Parquet files, three different format readers. Sidesteps the “format wars” entirely.

Apache Paimon

Streaming-first table format from the Flink community. LSM-tree storage tuned for high-frequency CDC and primary-key UPSERT, where snapshot-based formats struggle.
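
A hedged sketch of a Paimon primary-key table through PyFlink SQL, assuming the paimon-flink jar is on the classpath; the warehouse path is a placeholder:

```python
# Define a Paimon catalog and a primary-key table for UPSERT writes.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 's3://my-bucket/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# Repeated writes to the same key become LSM-style upserts rather than
# whole new immutable snapshots of the partition.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id BIGINT,
        name STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (customer_id) NOT ENFORCED
    )
""")
t_env.execute_sql(
    "INSERT INTO customers VALUES (1, 'Ada', CURRENT_TIMESTAMP)"
).wait()
```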

Emerging Formats

Lance

Columnar format optimized for ML — cheap random access, native vector type with HNSW/IVF indexes, zero-copy into PyTorch. The format behind LanceDB.
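
A minimal sketch through LanceDB's Python client (pip install lancedb); the vectors and table name are invented:

```python
# Store rows with a native vector column, then run a nearest-neighbor search.
import lancedb

db = lancedb.connect("/tmp/lancedb")
tbl = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "hello", "vector": [0.1, 0.2, 0.3, 0.4]},
        {"id": 2, "text": "world", "vector": [0.9, 0.8, 0.7, 0.6]},
    ],
)

# Approximate nearest-neighbor search over the vector column.
hits = tbl.search([0.1, 0.2, 0.3, 0.5]).limit(2).to_list()
print(hits[0]["text"])
```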

DuckLake

2025 lakehouse from the DuckDB team. Catalog metadata in a regular SQL database (Postgres / SQLite / DuckDB), data as Parquet on S3 — the simplest lakehouse design shipping today.
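
A hedged sketch with the duckdb Python package (DuckDB 1.3+); the metadata file and data path are placeholders:

```python
# Catalog metadata lives in a plain SQL database (here a local DuckDB
# file); table data lands as Parquet under DATA_PATH.
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake;")
con.execute("LOAD ducklake;")

con.execute(
    "ATTACH 'ducklake:/tmp/meta.ducklake' AS lake (DATA_PATH '/tmp/lake_data/')"
)
con.execute("CREATE TABLE lake.events AS SELECT 1 AS id, 'hello' AS msg")
print(con.execute("SELECT * FROM lake.events").fetchall())
```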

Related on this site

Delta Lake on Databricks

Databricks-specific Delta patterns — the Lakehouse Platform’s native runtime, optimization commands, and DBR integration.

Managed vs External Tables

When Databricks owns the data lifecycle (managed) vs. when it just queries existing storage (external) — and the trade-offs for governance and portability.

Live Tables Comparison

Delta Live Tables flavors — managed, external, and live — and when each fits a streaming or batch lakehouse pipeline.

DLT SCD Type 2

Slowly-changing dimensions on a lakehouse — historical tracking via DLT, the open-format counterpart to classic warehouse SCD2.

Medallion Architecture

Bronze → Silver → Gold layered design on a lakehouse. The canonical pattern for refining raw events into curated fact tables.

Apache Parquet

The columnar file format that sits underneath Hudi / Iceberg / Delta. Column pruning, predicate pushdown, dictionary and run-length compression.

Amazon S3

The object store underneath every cloud lakehouse. Storage classes, lifecycle policies, and the durability/cost properties that table formats build on.

AWS Glue Data Catalog

Hive-compatible metastore that lets Athena, Redshift, and Spark see Hudi/Iceberg/Delta tables under a single catalog — the default metastore across the AWS analytics stack.


File Format vs. Table Format

The single most-missed distinction in the lakehouse: Parquet, ORC, and Avro are file formats — how a single file lays out columns and rows on disk. Hudi, Iceberg, and Delta are table formats — the transactional metadata layer that turns a directory of Parquet files into a queryable table with ACID semantics, time travel, and schema evolution. A table format almost always uses Parquet underneath; switching file format (Parquet → ORC) is rare, while switching table format (Hudi → Iceberg) is increasingly common.

The query engine reads the table format’s metadata to decide which Parquet files to read for a query, then hands those files to its Parquet reader. Without the table format, you have a directory of files. Without the file format, you have nothing — the table format always builds on a file format.
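
A sketch of that division of labor with pyiceberg, assuming a configured catalog named `lake` and a table with a `ts` column; all names are placeholders:

```python
# The table format's metadata decides which Parquet files a filtered
# scan must touch; the files themselves are plain Parquet.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake")                  # catalog: where the table lives
table = catalog.load_table("analytics.events")  # table format: metadata tree

# Manifests and column stats prune the file list before any data is read...
scan = table.scan(row_filter="ts >= '2024-06-01T00:00:00'")
for task in scan.plan_files():
    print(task.file.file_path)                  # ...file format: Parquet underneath
```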


About this section. The Big Three open table formats overlap heavily but lean different ways: Hudi is upsert-first and streaming-friendly, Iceberg is engine-neutral and metadata-scalable, and Delta offers the most polished read/write experience inside Databricks. Catalogs (Polaris, Unity, Nessie) are increasingly where vendor differentiation lives as the formats themselves converge. Query engines (Trino, StarRocks, Doris) keep the lakehouse honest as an open alternative to vertically integrated warehouses. Track Paimon and Lance as workload-specific specialists, and watch DuckLake for the smallest-possible-lakehouse story.
