Amazon Athena

Amazon Athena is a serverless, interactive query service that runs standard SQL directly against data in Amazon S3 — no cluster to provision, no data to load. It's powered by Trino (for SQL) and Apache Spark (for notebook workloads) and integrates with the AWS Glue Data Catalog for schema and table metadata.

Key Features:

Serverless SQL over S3: Issue SELECT queries against Parquet, ORC, Avro, JSON, CSV, Iceberg, and Hudi tables without provisioning compute.
Two Pricing Models: Per-TB scanned for on-demand queries (the default), or per-DPU-hour via provisioned capacity (capacity reservations assigned to workgroups) for predictable-cost workloads.
Glue Data Catalog Integration: Tables defined once in Glue are queryable from Athena, Redshift Spectrum, EMR, and SageMaker.
Federated Queries: Connectors query RDS, DynamoDB, Redshift, DocumentDB, HBase, and custom sources — join S3 data with operational stores in one query.
Iceberg & Hudi Support: Iceberg tables offer ACID writes with time travel, schema evolution, and row-level updates/deletes; Hudi tables can be queried, but Athena does not write to them.
Athena for Spark: Notebook-style PySpark on the same data lake, managed without an EMR cluster.
CTAS & INSERT: Create partitioned, columnar tables from query results to materialize aggregates and accelerate downstream queries.
Lake Formation Governance: Fine-grained row-, column-, and cell-level access control.
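
To make the per-TB pricing model concrete, here is a minimal Python sketch of the on-demand cost arithmetic. The $5.00 per TB rate, the 10 MB per-query minimum, and the binary definition of a terabyte are assumptions for illustration (rates vary by region, so check the current pricing page), and estimate_scan_cost is a hypothetical helper, not an AWS API.

```python
TB = 1024 ** 4  # assumption: binary terabyte
MB = 1024 ** 2


def estimate_scan_cost(bytes_scanned: int,
                       price_per_tb: float = 5.0,      # assumed on-demand rate in USD
                       min_bytes: int = 10 * MB) -> float:  # assumed per-query minimum
    """Estimate the on-demand cost in USD of one query that scans data."""
    billable = max(bytes_scanned, min_bytes)
    return billable / TB * price_per_tb


# A query that scans 200 GB under the assumed rate.
print(round(estimate_scan_cost(200 * 1024 ** 3), 2))  # 0.98
```

Because the cost is linear in bytes scanned, anything that halves the scan (partition pruning, columnar formats, narrower projections) halves the bill for that query.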

Cost & Performance Best Practices:

Partition tables on common filter columns (date, region) to prune scanned data.
Use columnar formats (Parquet, ORC) with Snappy or ZSTD compression; this often means 5–10× fewer bytes scanned than JSON/CSV.
Select only the columns you need (SELECT col_a, col_b rather than SELECT * on large tables): Athena bills on bytes scanned, not rows returned.
Compact small files periodically — Athena performs poorly over millions of tiny files.
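
Partitioning, columnar storage, and column projection all reduce the same quantity: bytes read from S3. The toy Python model below is only a sketch of that effect, not of how Athena works internally. Partition and bytes_scanned are hypothetical names, and every size is made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Partition:
    """One Hive-style partition, e.g. the prefix year=2026/month=04/."""
    year: str
    month: str
    column_bytes: Dict[str, int]  # bytes stored per column (columnar layout)


def bytes_scanned(partitions: Iterable[Partition], year: str, month: str,
                  columns: List[str]) -> int:
    """Toy model: read only matching partitions and only the requested columns."""
    total = 0
    for p in partitions:
        if p.year == year and p.month == month:  # partition pruning
            total += sum(p.column_bytes[c] for c in columns)  # column projection
    return total


# Made-up sizes: a wide payload column dominates each partition.
sizes = {"event_time": 100, "region": 20, "status": 10, "payload": 870}
table = [Partition("2026", "03", dict(sizes)), Partition("2026", "04", dict(sizes))]

full_scan = sum(sum(p.column_bytes.values()) for p in table)
pruned = bytes_scanned(table, "2026", "04", ["event_time", "region", "status"])
print(full_scan, pruned)  # 2000 130
```

Row-oriented formats such as CSV or JSON cannot skip columns, so in that case only the partition filter would reduce the scan.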

Example Query:


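-- Daily error and total event counts per region for April 2026.
SELECT
    date_trunc('day', event_time)                  AS day,
    region,
    count_if(status = 'error')                     AS errors,
    count(*)                                       AS total
FROM   logs.app_events
-- Assuming year and month are string-typed partition columns,
-- this filter prunes the scan to a single month of data.
WHERE  year = '2026' AND month = '04'
GROUP BY 1, 2
ORDER BY day, region;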
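
A query like the one above can also be submitted programmatically. The Python sketch below assumes client is a boto3 Athena client (created with boto3.client("athena")) and uses its start_query_execution and get_query_execution calls. run_query is a hypothetical helper, and the database name and S3 output location passed to it are placeholders.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, poll until it finishes, and return (state, bytes_scanned).

    `client` is assumed to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            # DataScannedInBytes is the figure that on-demand billing is based on.
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)
```

A call might look like run_query(client, sql, "logs", "s3://example-bucket/athena-results/"), where the bucket is a placeholder. The sketch has no timeout or retry logic; a production version would likely want both, and would read the reported bytes scanned to track cost per query.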

Athena vs. Redshift Spectrum vs. EMR:

Athena: Best for ad-hoc, interactive SQL over S3 — serverless and pay-per-query.
Redshift Spectrum: Queries S3 from within Redshift — pick it when you already have a Redshift warehouse and want to join warehouse tables with lake data.
EMR / EMR Serverless: Full Spark/Hive/Presto workloads — pick it for heavy, long-running ETL or when you need engine-level control.

Athena is the everyday entry point to the S3-based data lake — it lets analysts and engineers query raw and curated data with SQL without standing up any infrastructure.