Performance Optimization — Databricks & Tableau
Concise, interview-ready points in priority order.
What steps would you take to optimize the performance of a Databricks job?
- Use Photon runtime & autoscaling clusters.
- Store data in Delta/Parquet, compact small files (OPTIMIZE), Z-ORDER on common filter columns.
- Enable Adaptive Query Execution (AQE).
- Tune shuffle partitions & broadcast small tables in joins (see the config sketch after this list).
- Filter/prune columns early, cache selectively.
- Prefer built-ins over UDFs.
- Monitor Spark UI & fix bottlenecks.
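
A minimal PySpark sketch of these knobs, assuming Delta tables named `sales` and `regions` (illustrative names, not from a real workspace):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution: coalesces shuffle partitions and adjusts join
# strategies at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Starting point for shuffle parallelism; tune to data volume and cluster size.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Prune columns early so less data moves through the job.
sales = spark.table("sales").select("order_id", "region_id", "amount")
regions = spark.table("regions")

# Broadcast the small dimension so the join avoids a full shuffle.
joined = sales.join(F.broadcast(regions), "region_id")

# Compact small files and cluster by a common filter column (Delta tables only).
spark.sql("OPTIMIZE sales ZORDER BY (region_id)")
```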
How would you handle data skew in a Databricks job?
- Detect with Spark UI (slow tasks, uneven partitions).
- Enable AQE skew handling.
- Broadcast small tables, pre-aggregate before joins.
- Apply salting or composite keys (see the salting sketch after this list).
- Repartition by uniform keys or process hot keys separately.
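
A hedged sketch of key salting, assuming a fact table `events` skewed on `user_id` joined to a dimension `users` (names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # buckets to spread each hot key across

events = spark.table("events")
users = spark.table("users")

# Add a random salt to the skewed side so one hot key maps to many partitions.
events_salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small dimension once per salt value so every salted key matches.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
users_salted = users.crossJoin(salts)

joined = events_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```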
What are the best practices for writing efficient ETL pipelines in Databricks?
- Use Delta tables with Auto Loader for ingestion (see the ingestion sketch after this list).
- Build bronze → silver → gold layers; load incrementally rather than full reloads.
- Optimize/Z-ORDER for query speed.
- Parameterize code, add data quality checks.
- Avoid UDFs, reduce shuffles, prune columns early.
- Orchestrate with retries, alerts, checkpoints.
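
A minimal sketch of an Auto Loader bronze ingest with a checkpoint; the paths and table name (`bronze_orders`) are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("cloudFiles")              # Auto Loader source
    .option("cloudFiles.format", "json")                # incoming file format
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")  # restart-safe, no reprocessing
    .trigger(availableNow=True)                          # incremental batch, not a full reload
    .toTable("bronze_orders")
)
```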
How would you optimize a Tableau dashboard for performance when dealing with large datasets?
- Use extracts (Hyper), aggregated/incremental refresh.
- Pre-aggregate data, reduce marks (<100k per view).
- Use context filters, avoid high-cardinality quick filters.
- Push heavy calcs upstream (see the pre-aggregation sketch after this list); limit complex LODs.
- Keep dashboards simple, fewer worksheets.
- Monitor with Performance Recording & query plans.
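
A hedged sketch of pushing aggregation upstream: build a small summary table in Databricks and point the Tableau Hyper extract at it. Table and column names (`fact_sales`, `gold_sales_monthly`) are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

summary = (
    spark.table("fact_sales")
    .groupBy("region", F.date_trunc("month", "order_date").alias("order_month"))
    .agg(
        F.sum("amount").alias("total_amount"),                # heavy math done once here,
        F.countDistinct("customer_id").alias("customers"),    # not on every Tableau render
    )
)

# Tableau connects to this much smaller gold table for its extract.
summary.write.mode("overwrite").saveAsTable("gold_sales_monthly")
```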