Medallion Architecture


Storing Data in the Bronze Layer

Medallion Architecture Best Practices

Benefits of Keeping Original Format:


Cost Savings


Overview of Layers


Key Benefits

In summary, the Medallion Architecture improves data quality, scalability, and governance, and it lowers cost by optimizing data storage, processing, and resource allocation across the entire data pipeline.


Example Code for Data Ingestion in Python

# Ingesting raw data into the Bronze layer
# Requires a Spark session with Delta Lake configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BronzeLayerIngestion").getOrCreate()

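# Read data from different source formats
json_df    = spark.read.json("/path/to/source/data.json")
parquet_df = spark.read.parquet("/path/to/source/data.parquet")
# Excel is not built into Spark; this needs the third-party spark-excel package.
# The header option name may differ by spark-excel version ("useHeader" vs. "header").
excel_df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .load("/path/to/source/data.xlsx")
)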

# Write data to Bronze layer
json_df.write.format("delta").mode("append").save("/path/to/bronze/json")
parquet_df.write.format("delta").mode("append").save("/path/to/bronze/parquet")
excel_df.write.format("delta").mode("append").save("/path/to/bronze/excel")
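
The verification query later in this section groups by a source_format column, which the writes above do not add. A minimal sketch of one way to attach that metadata before landing records in Bronze is shown below. Plain Python dicts stand in for Spark DataFrames so the idea can be run anywhere; the function name tag_for_bronze and the ingested_at field are hypothetical and do not come from the original pipeline. In PySpark the same effect would typically be achieved with DataFrame.withColumn and pyspark.sql.functions.lit.

```python
from datetime import datetime, timezone


def tag_for_bronze(records, source_format, ingested_at=None):
    """Return copies of raw records with Bronze metadata attached.

    The original fields are left untouched; only two illustrative
    metadata fields (source_format, ingested_at) are added.
    """
    if ingested_at is None:
        ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {**record, "source_format": source_format, "ingested_at": ingested_at}
        for record in records
    ]


# Hypothetical sample rows standing in for data read from each source.
json_rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
excel_rows = [{"id": 3, "amount": 3.25}]

# A fixed timestamp keeps the example deterministic.
batch_time = "2024-01-01T00:00:00+00:00"
bronze_rows = (
    tag_for_bronze(json_rows, "JSON", ingested_at=batch_time)
    + tag_for_bronze(excel_rows, "Excel", ingested_at=batch_time)
)

for row in bronze_rows:
    print(row)
```

Because the function copies each record instead of mutating it, the raw input stays exactly as it was read, which matches the Bronze layer's role of preserving source data.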
    

SQL Query for Verifying Bronze Data

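-- Verify data ingestion in the Bronze layer.
-- Assumes the ingested records are exposed as bronze_layer_table
-- with a source_format column identifying where each record came from.
SELECT COUNT(*) AS record_count, source_format
FROM bronze_layer_table
GROUP BY source_format;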

Expected Output

| record_count | source_format |
|--------------|---------------|
|         5000 | JSON          |
|        12000 | Parquet       |
|          800 | Excel         |
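
The verification query can be tried locally without a Spark cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for the Spark SQL engine, with a handful of made-up rows in a table named bronze_layer_table. The table layout and the sample rows are assumptions for illustration only, and an ORDER BY is added so the result order is deterministic.

```python
import sqlite3

# In-memory database standing in for the Bronze table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_layer_table (id INTEGER, source_format TEXT)")

# Hypothetical sample rows: two JSON, three Parquet, one Excel.
sample_rows = [
    (1, "JSON"),
    (2, "JSON"),
    (3, "Parquet"),
    (4, "Parquet"),
    (5, "Parquet"),
    (6, "Excel"),
]
conn.executemany("INSERT INTO bronze_layer_table VALUES (?, ?)", sample_rows)

# Same aggregation as the verification query, plus ORDER BY for stable output.
query = """
SELECT COUNT(*) AS record_count, source_format
FROM bronze_layer_table
GROUP BY source_format
ORDER BY source_format
"""
counts = conn.execute(query).fetchall()

for record_count, source_format in counts:
    print(f"{source_format}: {record_count}")

conn.close()
```

A count of zero rows for a format that was expected, or a format missing from the result altogether, would suggest that the corresponding ingestion step did not run or wrote to a different location.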