Data Pipelines

Modern data pipelines span the full spectrum from nightly batch ETL jobs to always-on real-time streaming, capturing raw events from operational systems and landing them — clean, partitioned, and queryable — into a warehouse or lakehouse. The pages below walk through the canonical tooling at each stage: ingestion, the streaming spine, columnar storage, and the transformation patterns that turn raw bytes into analytics-ready tables.

Core Topics

ETL Pipeline

The canonical Extract / Transform / Load pattern, its typical stages, and where modern ELT on cloud warehouses now beats classic ETL by pushing transforms down to the engine.
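
A minimal sketch of the ELT half of that argument, assuming a generic DB-API cursor against the warehouse (the connection object, stage URI, and table/column names are placeholders, not tied to any particular engine): load raw files as-is, then push the transform down as SQL.

```python
# Hypothetical DB-API connection to a cloud warehouse (Snowflake, Redshift,
# BigQuery, ...); stage URI and table names are illustrative only.
def load_then_transform(conn, staged_files_uri: str) -> None:
    cur = conn.cursor()

    # Load: copy the raw files into a staging table with no Python-side transform.
    cur.execute(f"""
        COPY INTO raw_events
        FROM '{staged_files_uri}'
        FILE_FORMAT = (TYPE = PARQUET)
    """)

    # Transform: let the warehouse engine do the heavy lifting in SQL.
    cur.execute("""
        CREATE OR REPLACE TABLE daily_orders AS
        SELECT order_id,
               CAST(event_ts AS DATE) AS order_date,
               SUM(amount)            AS total_amount
        FROM raw_events
        WHERE event_type = 'order'
        GROUP BY order_id, CAST(event_ts AS DATE)
    """)
    conn.commit()
```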

Large-Scale Ingestion

High-throughput ingest design at billion-row scale: parallelism, partitioning, backpressure, and idempotent loads that survive retries without duplicating data.
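
One common way to get idempotent loads is to deduplicate on a natural event key and MERGE into the target, so a retried batch upserts the same rows rather than appending duplicates. A rough PySpark sketch with made-up paths and table names (the MERGE syntax assumes a Delta Lake or warehouse target):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one ingest batch; path and schema are illustrative.
batch = (spark.read.json("s3://landing/events/batch_0421/")
         .dropDuplicates(["event_id"]))          # dedupe within the batch

batch.createOrReplaceTempView("incoming")

# MERGE on the natural key so a retried batch is a no-op for rows already loaded.
spark.sql("""
    MERGE INTO bronze.events AS t
    USING incoming AS s
      ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```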

Apache NiFi

Flow-based dataflow tool with visual DAGs, pluggable processors, and built-in provenance — a strong fit between raw heterogeneous sources and the warehouse.

Kafka Producer/Consumer

The Kafka log abstraction, producer and consumer semantics, partition keys, and why Kafka is the streaming backbone of nearly every modern data stack.
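
A bare-bones sketch with the kafka-python client (the library choice, broker address, and topic name are assumptions): the producer keys each record so all events for one entity land on the same partition, and the consumer reads as part of a consumer group.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # assumption: local broker for illustration

# Producer: the message key determines the partition, so events for the same
# user stay ordered within one partition.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", key="user-42", value={"order_id": 1001, "amount": 19.99})
producer.flush()

# Consumer: group_id gives at-least-once delivery and partition rebalancing
# across all consumers in the same group.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BROKERS,
    group_id="billing-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```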

Apache Parquet

Columnar file format powering lakehouses: column pruning, predicate pushdown, dictionary and run-length compression — the lingua franca of cloud object storage.
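
A quick pyarrow sketch of what column pruning and predicate pushdown look like from the reader's side (the path and column names are made up): only the requested columns are fetched, and row groups whose min/max statistics cannot satisfy the filter are skipped.

```python
import pyarrow.parquet as pq

# Column pruning: read just two columns.
# Predicate pushdown: skip row groups that cannot contain matching rows.
table = pq.read_table(
    "s3://lake/events/date=2024-06-01/part-0000.parquet",  # illustrative path
    columns=["user_id", "amount"],
    filters=[("amount", ">=", 100)],
)
print(table.num_rows, table.schema)
```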

Collibra & Data Quality

Governance and data-quality enforcement at pipeline boundaries — Collibra rules validate ingested data before it lands in the warehouse.

Related on this site

AWS Glue ETL

Serverless Spark ETL on AWS — managed crawlers, jobs, and catalog. The default ETL service for S3-based data lakes.
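
The trimmed-down shape of a Glue PySpark job, with placeholder database, table, and bucket names; the awsglue imports are provided by the Glue runtime, not a plain local Python install.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read via the Data Catalog (crawler-populated metadata), not raw S3 paths.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"   # placeholder names
)

# Light transform, then write Parquet back to the lake.
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")],
)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/orders/"},
    format="parquet",
)
```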

AWS Glue Data Catalog

The metastore that ties S3 + Athena + Redshift Spectrum + EMR together — Hive-compatible metadata for the entire AWS analytics stack.

AWS Glue Workflow

Multi-step Glue orchestration — chained crawlers and jobs with triggers, the AWS-native alternative to Airflow for simple ETL DAGs.
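
Workflows are usually defined in the console or infrastructure-as-code; starting one and checking its status from code is a couple of boto3 calls (the workflow name here is assumed).

```python
import boto3

glue = boto3.client("glue")

# Start the workflow; its internal triggers then chain the crawlers and jobs.
run = glue.start_workflow_run(Name="nightly-orders-etl")   # assumed name

# Poll the run status.
status = glue.get_workflow_run(
    Name="nightly-orders-etl", RunId=run["RunId"]
)["Run"]["Status"]
print(status)
```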

Kinesis Data Streams

AWS-native streaming primitive — partitioned shards, sub-second latency, retention up to a year. The Kafka alternative inside AWS.
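
Putting a record onto a stream with boto3 (the stream name is a placeholder); the partition key plays the same role as a Kafka message key, pinning related records to one shard so they stay ordered.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# PartitionKey hashes to a shard, so all events for one user stay ordered
# within that shard.
kinesis.put_record(
    StreamName="clickstream",   # placeholder stream name
    Data=json.dumps({"user_id": "42", "page": "/checkout"}).encode("utf-8"),
    PartitionKey="42",
)
```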

Kinesis Data Firehose

Managed delivery of streaming data into S3/Redshift/Splunk — the zero-code path from event source to lake landing zone.

Delta Live Tables

Declarative Databricks pipelines — you write SQL/Python table definitions and the runtime handles incrementalization, retries, and data-quality enforcement.
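
The declarative style looks roughly like this inside a DLT pipeline (the source path, column names, and expectation are illustrative, and the code only runs in the Databricks DLT runtime, where `spark` is provided):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from the landing zone")
def orders_bronze():
    # Auto Loader source; the path is a placeholder.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://landing/orders/"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # data-quality rule
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn(
        "order_date", F.to_date("event_ts")
    )
```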

PySpark Data Streaming

Structured Streaming on Databricks — micro-batch and continuous processing over Kafka or file sources, using the same DataFrame API as batch.
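
A minimal Structured Streaming sketch reading from Kafka and landing Parquet (the broker address, topic, and paths are assumptions); apart from readStream/writeStream it is the same DataFrame API as a batch job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source: a Kafka topic (broker and topic name are placeholders).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.col("value").cast("string").alias("json")))

# Sink: micro-batches of Parquet files, with a checkpoint for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://lake/bronze/orders/")
         .option("checkpointLocation", "s3://lake/_checkpoints/orders/")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()
```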

PySpark Lazy Evaluation

Why PySpark transformations build a DAG and only execute when an action runs — the foundation of Catalyst's whole-plan optimization for ETL pipelines.
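
The laziness is easy to see in a few lines (paths are illustrative): the filter and select only extend the logical plan, and nothing is scanned until the count() action, at which point Catalyst optimizes the whole DAG at once.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://lake/bronze/orders/")   # schema only; no data scanned yet

# Transformations: these just build up the logical plan (the DAG).
big_orders = (df.filter(F.col("amount") > 100)
                .select("order_id", "amount"))

big_orders.explain()      # shows the optimized plan Catalyst produced
n = big_orders.count()    # action: this is when the job actually executes
print(n)
```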

Snowflake Cortex RAG Workflow

End-to-end ingest → embed → index → retrieve pipeline inside Snowflake — a modern AI pipeline atop the warehouse.
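
A very rough sketch of just the retrieve step, using the Snowflake Python connector and Cortex SQL functions (connection parameters, table, column, and model names are assumptions; check the current Cortex documentation for what is available in your account):

```python
import snowflake.connector

# Placeholder credentials and objects, for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="me", password="***",
    warehouse="ANALYTICS_WH", database="DOCS", schema="RAG",
)
cur = conn.cursor()

question = "How do we rotate API keys?"

# Embed the question in-database and rank stored chunks by cosine similarity.
cur.execute("""
    SELECT chunk_text
    FROM doc_chunks
    ORDER BY VECTOR_COSINE_SIMILARITY(
        chunk_embedding,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', %s)
    ) DESC
    LIMIT 5
""", (question,))
context = "\n".join(row[0] for row in cur.fetchall())
print(context)
```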


About this section: the pages flow naturally from source to sink. Operational systems and external feeds enter through ingestion tools like Apache NiFi or custom large-scale loaders, then ride the streaming spine on Kafka topics for durable, replayable transport. From there, events land in columnar storage as Parquet files on S3 or ADLS, and finally an ETL/ELT layer transforms and merges those raw files into curated tables inside Snowflake, Databricks, or BigQuery for analytics and machine learning.

