Databricks Asset Bundles (DAB)

Databricks Asset Bundles are the platform's first-party answer to "how do I treat my workspace like infrastructure?" A bundle is a directory containing a databricks.yml manifest and the source files it references — notebooks, Python wheels, DLT pipeline definitions, ML experiments, dashboards. The Databricks CLI reads the manifest, resolves variables and target overrides, and deploys the whole package to a workspace as a coherent unit. The same bundle definition produces dev, staging, and prod environments with only target-level differences.

Bundles replace the older "deploy notebooks with a Bash script and configure jobs in the UI" workflow with declarative IaC that is checked into source control, code-reviewed, and applied through CI/CD.


1. What a Bundle Is

A bundle is a directory tree like this:


my_bundle/
├── databricks.yml              # the manifest
├── README.md
├── src/
│   ├── ingest_orders.py
│   ├── transform_silver.py
│   └── dlt_pipeline.py
├── notebooks/
│   └── exploration.ipynb
├── resources/
│   ├── orders_job.yml          # split out for readability
│   └── orders_pipeline.yml
└── tests/
    └── test_transform_silver.py

Most of what you configure in the workspace UI — jobs, DLT pipelines, ML model registrations, model serving endpoints, ML experiments, AI/BI dashboards, Unity Catalog schemas and volumes — can be declared in YAML and shipped via a bundle. Workspaces become deterministic and disposable.

2. The databricks.yml Structure

The manifest is organized around five top-level blocks: bundle, workspace, variables, resources, and targets. A representative manifest looks like this:


bundle:
  name: orders_pipeline

workspace:
  host: https://adb-1234567890123456.7.azuredatabricks.net

variables:
  catalog:
    description: Unity Catalog target catalog
    default: dev_orders
  notifications_email:
    default: data-eng@example.com

resources:
  jobs:
    orders_etl:
      name: orders_etl_${bundle.target}
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest_orders.py
          job_cluster_key: shared_cluster
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./src/transform_silver.py
          job_cluster_key: shared_cluster
      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 2
      email_notifications:
        on_failure: [${var.notifications_email}]

targets:
  dev:
    mode: development
    default: true
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev
    variables:
      catalog: dev_orders

  prod:
    mode: production
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/prod
    run_as:
      service_principal_name: 0123abcd-...-....
    variables:
      catalog: prod_orders
      notifications_email: oncall@example.com

Key concepts:

- bundle.name identifies the bundle and namespaces its deployment paths in the workspace.
- variables declare inputs with optional defaults. Reference them as ${var.catalog}; override them per target or with --var on the CLI.
- Substitutions such as ${bundle.target} and ${workspace.current_user.userName} resolve at deploy time, so one definition yields per-environment names and paths.
- resources is keyed first by resource type (jobs, pipelines, and so on), then by a resource key (orders_etl) that you use with bundle run. The fields beneath it mirror the corresponding REST API payloads.
- targets override any of the above per environment; exactly one target may set default: true.

3. Targets: dev / staging / prod

The mode field on a target is the most consequential setting. It changes how the CLI rewrites resources before deploy:

- mode: development prefixes resource names with [dev <username>], pauses job schedules and triggers, and deploys into a per-user path, so several engineers can deploy the same bundle into one workspace without colliding.
- mode: production deploys names unprefixed and adds validation; for example, it expects an explicit run_as identity or permissions so the deployment is not tied to a personal account.

This makes it easy to databricks bundle deploy from a laptop into a personal sandbox while the same bundle goes through CI for production.
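
Targets can override individual resource fields too, not just variables; the override merges into the base definition. A sketch (reusing the orders_etl job above) that runs dev on a smaller cluster:

```yaml
targets:
  dev:
    mode: development
    default: true
    resources:
      jobs:
        orders_etl:
          job_clusters:
            - job_cluster_key: shared_cluster
              new_cluster:
                spark_version: 15.4.x-scala2.12
                node_type_id: Standard_D4ds_v5
                num_workers: 1   # smaller than the base definition's 2
```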

4. Resource Patterns

Multi-task Job

Already shown above. The pattern that matters: declare job_clusters once and reference them via job_cluster_key from each task so all tasks share warm compute.
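
Beyond tasks and clusters, production jobs usually declare a schedule and a concurrency limit. A sketch extending the orders_etl job above (the cron expression is illustrative):

```yaml
resources:
  jobs:
    orders_etl:
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # daily at 06:00
        timezone_id: UTC
        pause_status: UNPAUSED   # mode: development pauses this automatically
      max_concurrent_runs: 1
```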

DLT Pipeline


resources:
  pipelines:
    orders_dlt:
      name: orders_dlt_${bundle.target}
      catalog: ${var.catalog}
      target: silver
      libraries:
        - notebook:
            path: ./src/dlt_pipeline.py
      configuration:
        bronze.path: /Volumes/${var.catalog}/raw/orders
      clusters:
        - label: default
          autoscale:
            min_workers: 1
            max_workers: 4
      development: false
      photon: true
      # For serverless compute, drop the clusters and photon settings above
      # and set serverless: true instead; the two are mutually exclusive.
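
A related pattern worth knowing: a job task can drive the pipeline through a pipeline_task, using the ${resources.pipelines.<key>.id} substitution so the job always points at the pipeline this bundle deployed. A sketch with a hypothetical orders_refresh job:

```yaml
resources:
  jobs:
    orders_refresh:
      name: orders_refresh_${bundle.target}
      tasks:
        - task_key: refresh_dlt
          pipeline_task:
            pipeline_id: ${resources.pipelines.orders_dlt.id}
```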

Model Serving Endpoint


resources:
  model_serving_endpoints:
    fraud_scorer:
      name: fraud-scorer-${bundle.target}
      config:
        served_entities:
          - entity_name: ${var.catalog}.ml.fraud_model
            entity_version: "7"
            workload_size: Small
            scale_to_zero_enabled: true
        traffic_config:
          routes:
            - served_model_name: fraud_model-7
              traffic_percentage: 100
        auto_capture_config:
          catalog_name: ${var.catalog}
          schema_name: inference_logs
          table_name_prefix: fraud_scorer

MLflow Experiment


resources:
  experiments:
    fraud_training:
      name: /Shared/experiments/fraud_training_${bundle.target}
      tags:
        - key: owner
          value: ml-platform
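
Access grants can ship with the deploy as well. A sketch using the top-level permissions block, which applies to every resource the bundle creates (group names are illustrative):

```yaml
permissions:
  - level: CAN_MANAGE
    group_name: ml-platform
  - level: CAN_VIEW
    group_name: data-analysts
```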

5. Deploy and Run from the CLI

The CLI ships as databricks. Bundle commands are under databricks bundle.


# Validate the manifest, resolve variables, show what will be deployed
databricks bundle validate --target dev

# Upload sources and apply resources to the dev target
databricks bundle deploy --target dev

# Trigger a job by its resource key (not its display name)
databricks bundle run orders_etl --target dev

# Trigger without waiting, then look up runs via the deployed job's ID
databricks bundle run orders_etl --target dev --no-wait
databricks jobs list-runs --job-id $(databricks bundle summary --target dev -o json | jq -r '.resources.jobs.orders_etl.id')

# Promote to production
databricks bundle deploy --target prod

# Tear down everything the bundle created in a target
databricks bundle destroy --target dev

6. CI/CD with GitHub Actions

The standard pattern is: validate on every PR, deploy to staging on merge to main, deploy to prod on a tagged release. Authenticate with the OAuth machine-to-machine (M2M) credentials of a service principal, stored as GitHub secrets.


name: deploy-bundle

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
    tags: ['v*']

jobs:
  validate:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}

  deploy-staging:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}

  deploy-prod:
    if: startsWith(github.ref, 'refs/tags/v')
    runs-on: ubuntu-latest
    environment: production    # require manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}

The environment: production line ties the prod job to a GitHub Environment, which can require reviewer approval before the step runs — a cheap and effective production guardrail.

7. Bundle Templates

For new projects, bootstrap with databricks bundle init, which scaffolds a bundle from a template. Built-in templates include default-python (jobs plus an optional DLT pipeline and wheel), default-sql, dbt-sql, and mlops-stacks, and you can author custom templates as Go templates driven by a databricks_template_schema.json of input parameters.


# Use a built-in template
databricks bundle init default-python

# Use a custom template from a git repo
databricks bundle init https://github.com/example/my-dab-template

8. Bundles vs the Terraform Databricks Provider

The Terraform Databricks provider has been the IaC tool for Databricks for years. It is broader in scope — it manages workspace creation, networking, accounts, and other control-plane resources. Asset Bundles cover only what lives inside a workspace, but they cover it more idiomatically.

| Concern | Asset Bundles | Terraform Databricks Provider |
|---|---|---|
| Scope | In-workspace assets (jobs, pipelines, dashboards, model serving). | Workspaces, accounts, networking, plus all in-workspace assets. |
| State management | State stored in a workspace folder; no external backend to operate. | Standard Terraform state — you operate S3/Azure Blob/Cloud Storage backends. |
| Source bundling | Native — the CLI uploads notebooks/wheels alongside the resources. | Manual — you script the artifact upload yourself. |
| Dev experience | mode: development auto-namespaces and pauses schedules. | You build the dev/prod separation yourself with workspaces or modules. |

The pragmatic split: Terraform owns the platform (workspace creation, metastore, account-level groups). Bundles own the application (this team's jobs, pipelines, ML models, dashboards). The two compose well — Terraform stands up the workspace, then each application team ships into it with their own bundle and CI/CD.


Common Interview Questions:

What are Databricks Asset Bundles and what problem do they solve?

DABs package the code, configuration, and resource definitions for a Databricks project (jobs, DLT pipelines, ML experiments, dashboards) into a single directory: a declarative YAML manifest plus the source files it references, version-controlled, reviewed, and deployed as a unit via the CLI. Before bundles, projects were assembled by clicking through the UI or piecing together databricks-cli calls in shell scripts — there was no single source of truth and no way to diff a deployment. DABs make a Databricks project look and behave like a normal application repository.

DAB vs Terraform — when do you reach for which?

Terraform is for platform infrastructure: workspaces, metastores, account-level groups, network configuration — things that exist once per environment and rarely change. DABs are for application artifacts: the jobs, pipelines, and notebooks that a data team ships every sprint. Terraform's resource graph is too heavy for daily app deploys, and DABs do not model account-level objects. The clean split is Terraform owns the workspace, DABs own what runs inside it.

How do bundle targets work and what are they for?

Targets (dev, staging, prod) are named overrides that let one bundle deploy to multiple environments with different cluster sizes, storage paths, schedule cadences, and run-as identities. The base configuration sits at the top level; each target overrides only the fields it needs to change, and you select one with databricks bundle deploy -t prod. The pattern keeps the application code identical across environments while letting cost and access policies vary — exactly the same idea as Helm values files.

How do you wire a bundle into CI/CD?

The standard pipeline: on PR, run databricks bundle validate and unit tests against a dev target; on merge to main, deploy to staging with databricks bundle deploy -t staging and run integration tests via databricks bundle run; on tagged release, deploy to prod. Authenticate using OAuth machine-to-machine credentials (service principals) stored as GitHub Actions secrets — never personal access tokens. Pin the Databricks CLI version in the workflow (for example, by pinning the databricks/setup-cli action to a release tag rather than @main) so a CLI release does not silently change deployment behavior.

How should secrets be handled in a bundle?

Never put secrets in bundle YAML — even with variable substitution they end up in plaintext in the workspace's deployed artifact. The right pattern is to reference Databricks secret scopes from job configuration (the {{secrets/<scope>/<key>}} placeholder in spark_conf or spark_env_vars) and have the secrets themselves provisioned out-of-band by Terraform or a dedicated secret-sync job pulling from AWS Secrets Manager / Azure Key Vault. The bundle ships the reference; the platform provides the value.
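
The reference pattern can be sketched like this (scope and key names are illustrative); the placeholder is resolved on the cluster at runtime, so the value never lands in the deployed YAML:

```yaml
resources:
  jobs:
    fraud_scoring:
      tasks:
        - task_key: score
          notebook_task:
            notebook_path: ./src/score.py
          job_cluster_key: main
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 1
            spark_env_vars:
              FRAUD_API_TOKEN: "{{secrets/fraud_scope/api_token}}"
```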

How do you handle shared library code across multiple bundles?

Build the shared code as a Python wheel published to a private package index (Artifactory, AWS CodeArtifact, or a Unity Catalog volume), then reference it as a library dependency in each bundle's job clusters. Avoid the temptation to copy a notebook into multiple repos — drift is inevitable. For very early-stage shared code, a UC volume holding the latest wheel works fine; graduate to a versioned package index once more than two teams depend on it.
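
One way to wire this up inside a bundle is the artifacts block, which builds the wheel at deploy time and uploads it alongside the sources (paths are illustrative):

```yaml
artifacts:
  shared_lib:
    type: whl
    path: ./libs/shared_lib     # directory containing pyproject.toml or setup.py

resources:
  jobs:
    orders_etl:
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./src/transform_silver.py
          job_cluster_key: shared_cluster
          libraries:
            - whl: ./libs/shared_lib/dist/*.whl
```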

