Performance Optimization — Databricks & Tableau
Concise, interview-ready points in priority order.
What steps would you take to optimize the performance of a Databricks job?
- Use Photon runtime & autoscaling clusters.
- Store data in Delta/Parquet, compact small files (OPTIMIZE), Z-ORDER on common filter columns.
- Enable Adaptive Query Execution (AQE).
- Tune shuffle partitions & broadcast small tables in joins (see the config sketch after this list).
- Filter/prune columns early, cache selectively.
- Prefer built-ins over UDFs.
- Monitor Spark UI & fix bottlenecks.
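
A minimal PySpark sketch of these knobs, assuming Delta tables named `sales` and `regions` (illustrative names, not from a real workspace):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution: coalesces shuffle partitions and adjusts join
# strategies at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Starting point for shuffle parallelism; tune to data volume and cluster size.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Prune columns early so less data moves through the job.
sales = spark.table("sales").select("order_id", "region_id", "amount")
regions = spark.table("regions")

# Broadcast the small dimension so the join avoids a full shuffle.
joined = sales.join(F.broadcast(regions), "region_id")

# Compact small files and cluster by a common filter column (Delta tables only).
spark.sql("OPTIMIZE sales ZORDER BY (region_id)")
```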
How would you handle data skew in a Databricks job?
- Detect with Spark UI (slow tasks, uneven partitions).
- Enable AQE skew handling.
- Broadcast small tables, pre-aggregate before joins.
- Apply salting or composite keys (see the salting sketch after this list).
- Repartition by uniform keys or process hot keys separately.
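
A hedged sketch of key salting, assuming a fact table `events` skewed on `user_id` joined to a dimension `users` (names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # buckets to spread each hot key across

events = spark.table("events")
users = spark.table("users")

# Add a random salt to the skewed side so one hot key maps to many partitions.
events_salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small dimension once per salt value so every salted key matches.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
users_salted = users.crossJoin(salts)

joined = events_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```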
What are the best practices for writing efficient ETL pipelines in Databricks?
- Use Delta tables with Auto Loader for ingestion (see the ingestion sketch after this list).
- Build bronze → silver → gold layers; load incrementally rather than full reloads.
- Optimize/Z-ORDER for query speed.
- Parameterize code, add data quality checks.
- Avoid UDFs, reduce shuffles, prune columns early.
- Orchestrate with retries, alerts, checkpoints.
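
A minimal sketch of an Auto Loader bronze ingest with a checkpoint; the paths and table name (`bronze_orders`) are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("cloudFiles")              # Auto Loader source
    .option("cloudFiles.format", "json")                # incoming file format
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")  # restart-safe, no reprocessing
    .trigger(availableNow=True)                          # incremental batch, not a full reload
    .toTable("bronze_orders")
)
```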
How would you optimize a Tableau dashboard for performance when dealing with large datasets?
- Use extracts (Hyper), aggregated/incremental refresh.
- Pre-aggregate data, reduce marks (<100k per view).
- Use context filters, avoid high-cardinality quick filters.
- Push heavy calcs upstream (see the pre-aggregation sketch after this list); limit complex LODs.
- Keep dashboards simple, fewer worksheets.
- Monitor with Performance Recording & query plans.
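
A hedged sketch of pushing aggregation upstream: build a small summary table in Databricks and point the Tableau Hyper extract at it. Table and column names (`fact_sales`, `gold_sales_monthly`) are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

summary = (
    spark.table("fact_sales")
    .groupBy("region", F.date_trunc("month", "order_date").alias("order_month"))
    .agg(
        F.sum("amount").alias("total_amount"),                # heavy math done once here,
        F.countDistinct("customer_id").alias("customers"),    # not on every Tableau render
    )
)

# Tableau connects to this much smaller gold table for its extract.
summary.write.mode("overwrite").saveAsTable("gold_sales_monthly")
```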