Databricks
Delta Live Tables (DLT), Managed Tables, and External Tables

Key Differences:

| Feature | Delta Live Tables (DLT) | Managed Tables | External Tables |
| --- | --- | --- | --- |
| Data Management | Managed pipelines with automation for data ingestion, transformation, and output | Fully managed by Databricks | Data stored externally; metadata managed by Databricks |
| Storage Location | Can use managed or external storage | Databricks File System (DBFS) or default cloud storage | External storage (e.g., S3, Azure Blob, HDFS) |
| Data Lifecycle | Lifecycle managed by DLT pipelines | Data is deleted when the table is dropped | Data remains after the table is dropped |
| Use Case | Automated ETL pipelines and real-time data processing | Temporary or internal datasets managed by Databricks | Persistent or shared datasets |
| Automation & Monitoring | Automated pipeline execution, monitoring, and quality checks | No built-in automation | No built-in automation |

1. Delta Live Tables (DLT)

Delta Live Tables (DLT) is a framework designed for building and managing ETL pipelines. It automates data processing, handles dependencies, and optimizes workflows in both batch and streaming data pipelines.

Example of Delta Live Tables Pipeline:


import dlt
from pyspark.sql.functions import col

# DLT materializes the returned DataFrame as a table named after the function.
@dlt.table
def clean_data():
    # spark.read is a DataFrameReader, so call load() with the source path,
    # then keep only rows where age is greater than 18.
    return spark.read.load("path/to/raw_data").filter(col("age") > 18)
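
The comparison table above also lists automated quality checks as a DLT feature; in DLT these are written as expectations. The sketch below is a minimal illustration, assuming the clean_data table from the previous example and an id column in the source data:

import dlt

# Expectation: drop any row whose id is NULL (the "id" column is assumed here
# for illustration). Violation counts appear in the pipeline's event log.
@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def validated_data():
    # dlt.read() reads another table defined in the same DLT pipeline.
    return dlt.read("clean_data")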
    

2. Managed Tables

Managed tables are fully controlled by Databricks, which manages both the metadata and the underlying data storage. When you drop a managed table, both the metadata and the underlying data files are deleted.

Example of Creating a Managed Table:


CREATE TABLE my_managed_table (
    id INT,
    name STRING
) USING DELTA;
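
A managed table can also be created from a DataFrame in PySpark. The sketch below uses a made-up two-row DataFrame and the hypothetical table name my_managed_table_py; dropping the table afterwards removes the data files as well as the metadata:

# Save a small DataFrame as a managed Delta table; Databricks chooses the
# storage location (DBFS or the default cloud storage) automatically.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").saveAsTable("my_managed_table_py")

# Dropping a managed table deletes both the metadata and the data files.
spark.sql("DROP TABLE my_managed_table_py")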
    

3. External Tables

External tables store their data outside Databricks, which manages only the metadata. The actual data remains in external storage, such as AWS S3, Azure Blob Storage, or HDFS, and is not deleted when the table is dropped.

Example of Creating an External Table:


CREATE TABLE my_external_table (
    id INT,
    name STRING
) USING DELTA LOCATION '/mnt/external_data/';
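
Because only the metadata lives in the metastore, dropping an external table leaves the files in place. Below is a minimal PySpark sketch of that behavior, assuming the table and location from the example above:

# Drop the external table: this removes only the metastore entry.
spark.sql("DROP TABLE my_external_table")

# The Delta files are still present at the external location and can be
# read directly by path (or registered as a table again later).
df = spark.read.format("delta").load("/mnt/external_data/")
df.show()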