Feature | Delta Live Tables (DLT) | Managed Tables | External Tables |
---|---|---|---|
Data Management | Managed pipelines with automation for data ingestion, transformation, and output | Fully managed by Databricks | Data stored externally, metadata managed by Databricks |
Storage Location | Can use managed or external storage | Databricks File System (DBFS) or default cloud storage | External storage (e.g., S3, Blob, HDFS) |
Data Lifecycle | Lifecycle managed by DLT pipelines | Data is deleted when the table is dropped | Data remains after the table is dropped |
Use Case | Automated ETL pipelines and real-time data processing | Temporary or internal datasets managed by Databricks | Persistent or shared datasets |
Automation & Monitoring | Automated pipeline execution, monitoring, and data quality checks | No built-in pipeline automation or monitoring | No built-in pipeline automation or monitoring |
Delta Live Tables (DLT) is a framework for building and managing ETL pipelines. It automates data processing, dependency handling, and workflow optimization for both batch and streaming pipelines. A minimal table definition looks like this:
```python
import dlt
from pyspark.sql.functions import col

@dlt.table
def clean_data():
    # Read the raw data (Delta format assumed here) and keep only adult records
    return spark.read.format("delta").load("path/to/raw_data").filter(col("age") > 18)
```
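The quality checks mentioned in the comparison table are expressed as expectations on DLT tables. Below is a minimal sketch, assuming the `clean_data` table defined above and an illustrative `email` column; rows that violate the rule are dropped and surfaced in the pipeline's data quality metrics.

```python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_email", "email IS NOT NULL")  # the "email" column is an assumption
def validated_data():
    # Read the upstream DLT table and drop rows that fail the expectation
    return dlt.read("clean_data")
```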
Managed tables are fully controlled by Databricks, which manages both the metadata and the underlying data storage. When you drop a managed table, the metadata and the data files are both deleted.
```sql
CREATE TABLE my_managed_table (
  id INT,
  name STRING
) USING DELTA;
```
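The same table can also be created from the DataFrame API. The sketch below is a minimal illustration, assuming an active Spark session, and shows that dropping the managed table removes the data files along with the metadata.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing without an explicit LOCATION creates a managed Delta table:
# Databricks owns both the data files and the metadata.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").saveAsTable("my_managed_table")

# Dropping the table deletes the underlying data files as well.
spark.sql("DROP TABLE my_managed_table")
```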
External tables store data outside Databricks, which manages only the metadata. The actual data remains in external storage such as AWS S3, Azure Blob Storage, or HDFS, so it is preserved even if the table is dropped.
```sql
CREATE TABLE my_external_table (
  id INT,
  name STRING
) USING DELTA LOCATION '/mnt/external_data/';
```
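For comparison, here is a hedged PySpark sketch of the same pattern: data is written to the external path from the SQL example above, a table is registered over it, and dropping that table removes only the metadata while the files stay in place.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write data to the external location; the files live outside Databricks' control.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/mnt/external_data/")

# Register an external table over that path, then drop it:
# the metadata disappears, but the files under /mnt/external_data/ remain.
spark.sql("CREATE TABLE IF NOT EXISTS my_external_table USING DELTA LOCATION '/mnt/external_data/'")
spark.sql("DROP TABLE my_external_table")
```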