Database Schema Design

Database schema design is the process of translating logical data models (conceptual ERDs) into physical database structures optimized for performance, integrity, scalability, and maintainability. It bridges the gap between what data you need and how it's actually stored and accessed in the database.

1. Overview: From Logical to Physical

Schema design involves critical decisions:

- Workload fit: optimize for transactional writes (OLTP) or analytical reads (OLAP)
- Normalization level: how much redundancy to trade for read speed
- Keys: surrogate vs. business keys, and how relationships are enforced
- Physical layout: indexes, partitions, and clustering that match query patterns
- Integrity: which invariants the database itself enforces via constraints

2. Schema Types: OLTP vs OLAP

OLTP (Online Transaction Processing) schemas prioritize write performance and normalization:

- Normalized (typically 3NF) to avoid update anomalies
- Many narrow tables; short, frequent transactions and point lookups
- Examples: order entry, banking, user account systems

OLAP (Online Analytical Processing) schemas prioritize query performance via denormalization:

- Fact and dimension tables arranged in star, snowflake, or galaxy layouts
- Wide, redundant dimensions to minimize join depth
- Large scans and aggregations over historical data

2a. Star Schema

The most common OLAP pattern. A central fact table (containing measures and foreign keys) surrounded by dimension tables (describing attributes).


        ┌─────────────────┐
        │  dim_customer   │
        │  - cust_id (PK) │
        └─────────────────┘
                │ FK
                │
    ┌───────────┼───────────┐
    │           │           │
┌─────────────────────────────────────┐
│        fact_sales (FACT)            │
│  - sale_id (PK)                     │
│  - cust_id (FK)                     │
│  - prod_id (FK)                     │
│  - store_id (FK)                    │
│  - date_id (FK)                     │
│  - quantity (measure)               │
│  - revenue (measure)                │
└─────────────────────────────────────┘
    │           │           │
    │           │           │
┌─────────────┐ │ ┌──────────────┐
│ dim_product │ │ │  dim_store   │
└─────────────┘ │ └──────────────┘
                │
            ┌──────────┐
            │ dim_date │
            └──────────┘

2b. Snowflake Schema

Dimensions are normalized into multiple related tables, reducing redundancy but increasing query complexity.


fact_sales ──┬──> dim_product ──> dim_category
             ├──> dim_customer ──> dim_region
             ├──> dim_store
             └──> dim_date

Each dimension table references other dimension tables,
eliminating attribute duplication.

2c. Galaxy (Constellation) Schema

Multiple fact tables sharing common dimensions. Used for complex analytical domains with different granularities.


fact_sales ──────┬──> dim_product
                 ├──> dim_customer
                 └──> dim_date

fact_returns ────┬──> dim_product
                 ├──> dim_customer
                 └──> dim_date

Shared dimensions across multiple fact tables.

3. Star Schema Deep Dive

Star schemas excel in query performance because:

- Every dimension is exactly one join away from the fact table
- Dimension tables are small and often cached entirely in memory
- The predictable join shape lets optimizers pick efficient hash-join and star-join plans
- Aggregations scan a single narrow fact table rather than many joined tables

Complete Star Schema Example: Retail Analytics


-- Dimension: Customer
CREATE TABLE dim_customer (
  cust_key SERIAL PRIMARY KEY,           -- Surrogate key
  cust_id VARCHAR(50) NOT NULL UNIQUE,   -- Business key
  cust_name VARCHAR(100) NOT NULL,
  email VARCHAR(100),
  country VARCHAR(50),
  city VARCHAR(50),
  is_active BOOLEAN DEFAULT TRUE,
  scd_version INT DEFAULT 1,             -- For SCD Type 2
  effective_date DATE DEFAULT CURRENT_DATE,
  end_date DATE DEFAULT '9999-12-31',
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_dim_customer_cust_id ON dim_customer(cust_id);

-- Dimension: Product
CREATE TABLE dim_product (
  prod_key SERIAL PRIMARY KEY,
  prod_id VARCHAR(50) NOT NULL UNIQUE,
  prod_name VARCHAR(100) NOT NULL,
  category VARCHAR(50) NOT NULL,
  subcategory VARCHAR(50),
  brand VARCHAR(50),
  list_price DECIMAL(10, 2),
  is_active BOOLEAN DEFAULT TRUE,
  scd_version INT DEFAULT 1,
  effective_date DATE DEFAULT CURRENT_DATE,
  end_date DATE DEFAULT '9999-12-31'
);
CREATE INDEX idx_dim_product_category ON dim_product(category);

-- Dimension: Store
CREATE TABLE dim_store (
  store_key SERIAL PRIMARY KEY,
  store_id VARCHAR(50) NOT NULL UNIQUE,
  store_name VARCHAR(100) NOT NULL,
  region VARCHAR(50) NOT NULL,
  district VARCHAR(50),
  country VARCHAR(50),
  city VARCHAR(50),
  state VARCHAR(10),
  postal_code VARCHAR(10),
  store_type VARCHAR(20),
  opened_date DATE,
  is_active BOOLEAN DEFAULT TRUE
);
CREATE INDEX idx_dim_store_region ON dim_store(region);

-- Dimension: Date (conformed dimension)
CREATE TABLE dim_date (
  date_key INT PRIMARY KEY,              -- YYYYMMDD format
  full_date DATE NOT NULL UNIQUE,
  year INT,
  quarter INT,
  month INT,
  day_of_month INT,
  day_of_week INT,
  week_of_year INT,
  is_weekday BOOLEAN,
  is_holiday BOOLEAN DEFAULT FALSE,
  holiday_name VARCHAR(50)
);
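Populating the date dimension is usually scripted rather than typed by hand. A minimal Python sketch (the row layout mirrors the dim_date DDL above; the helper names are illustrative):

```python
from datetime import date, timedelta

def date_dim_row(d: date) -> dict:
    """Build one dim_date row; date_key uses the YYYYMMDD convention."""
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "full_date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.isoweekday(),       # 1 = Monday .. 7 = Sunday
        "week_of_year": d.isocalendar()[1],
        "is_weekday": d.isoweekday() <= 5,
    }

def generate_date_dim(start: date, end: date):
    """Yield one row per day in [start, end]."""
    d = start
    while d <= end:
        yield date_dim_row(d)
        d += timedelta(days=1)

rows = list(generate_date_dim(date(2024, 1, 1), date(2024, 12, 31)))
print(rows[0]["date_key"], len(rows))  # 20240101 366
```

The generated rows can then be bulk-inserted into dim_date with any driver.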

-- Fact Table: Sales
CREATE TABLE fact_sales (
  sale_id BIGSERIAL PRIMARY KEY,
  cust_key INT NOT NULL REFERENCES dim_customer(cust_key),
  prod_key INT NOT NULL REFERENCES dim_product(prod_key),
  store_key INT NOT NULL REFERENCES dim_store(store_key),
  date_key INT NOT NULL REFERENCES dim_date(date_key),
  quantity INT NOT NULL CHECK (quantity > 0),
  unit_price DECIMAL(10, 2) NOT NULL,
  discount_pct DECIMAL(5, 4) DEFAULT 0,
  revenue DECIMAL(12, 2) GENERATED ALWAYS AS
    (quantity * unit_price * (1 - discount_pct)) STORED,
  cost DECIMAL(12, 2),
  profit DECIMAL(12, 2) GENERATED ALWAYS AS
    (quantity * unit_price * (1 - discount_pct) - cost) STORED,  -- cannot reference the generated revenue column
  tax DECIMAL(10, 2),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexing strategy for fact table
CREATE INDEX idx_fact_sales_cust_key ON fact_sales(cust_key);
CREATE INDEX idx_fact_sales_prod_key ON fact_sales(prod_key);
CREATE INDEX idx_fact_sales_store_key ON fact_sales(store_key);
CREATE INDEX idx_fact_sales_date_key ON fact_sales(date_key);
CREATE INDEX idx_fact_sales_composite ON fact_sales(date_key, store_key, prod_key);

-- Analytical Query Example
SELECT
  dc.cust_name,
  dp.prod_name,
  EXTRACT(YEAR FROM dd.full_date) AS year,
  EXTRACT(MONTH FROM dd.full_date) AS month,
  SUM(fs.quantity) AS total_units,
  SUM(fs.revenue) AS total_revenue,
  ROUND(AVG(fs.profit), 2) AS avg_profit
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key
JOIN dim_store ds ON fs.store_key = ds.store_key
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE ds.region = 'North America'
  AND EXTRACT(YEAR FROM dd.full_date) = 2024
  AND dp.category = 'Electronics'
GROUP BY dc.cust_name, dp.prod_name, year, month
HAVING SUM(fs.revenue) > 10000
ORDER BY total_revenue DESC;

4. Snowflake Schema

Normalizes dimension tables to eliminate redundancy. Pros: smaller storage, easier maintenance. Cons: more joins, slower queries.


-- Snowflake Dimension: Product hierarchy
CREATE TABLE dim_brand (
  brand_key SERIAL PRIMARY KEY,
  brand_id VARCHAR(50) NOT NULL UNIQUE,
  brand_name VARCHAR(100) NOT NULL,
  country_of_origin VARCHAR(50)
);

CREATE TABLE dim_category (
  category_key SERIAL PRIMARY KEY,
  category_id VARCHAR(50) NOT NULL UNIQUE,
  category_name VARCHAR(100) NOT NULL,
  category_desc TEXT
);

CREATE TABLE dim_product_snowflake (
  prod_key SERIAL PRIMARY KEY,
  prod_id VARCHAR(50) NOT NULL UNIQUE,
  prod_name VARCHAR(100) NOT NULL,
  category_key INT NOT NULL REFERENCES dim_category(category_key),
  brand_key INT NOT NULL REFERENCES dim_brand(brand_key),
  list_price DECIMAL(10, 2)
);

-- Snowflake query (more joins, but smaller tables)
SELECT
  dc.category_name,
  db.brand_name,
  COUNT(*) AS num_products,
  AVG(dps.list_price) AS avg_price
FROM dim_product_snowflake dps
JOIN dim_category dc ON dps.category_key = dc.category_key
JOIN dim_brand db ON dps.brand_key = db.brand_key
GROUP BY dc.category_name, db.brand_name
ORDER BY num_products DESC;

5. Indexing Strategies

Indexes accelerate data retrieval but slow writes. Choose carefully based on query patterns.

5a. B-tree Index (Default)

Balanced tree structure, excellent for range queries and sorting.


-- Simple B-tree on column
CREATE INDEX idx_employee_salary ON employees(salary);

-- Range query benefits from B-tree
SELECT * FROM employees
WHERE salary BETWEEN 50000 AND 100000
  AND hire_date > '2020-01-01'
ORDER BY salary;

-- Composite (multi-column) B-tree
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);

-- Composite query
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 123 AND order_date >= '2024-01-01'
ORDER BY order_date DESC;

5b. Hash Index

Perfect for equality lookups, not range queries.


-- Hash index (PostgreSQL)
CREATE INDEX idx_user_email ON users USING HASH (email);

-- Excellent for exact match
SELECT * FROM users WHERE email = 'john@example.com';

-- Poor for range or pattern queries
SELECT * FROM users WHERE email LIKE '%@example.com';  -- leading wildcard: no hash (or B-tree) index helps
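The hash-vs-range trade-off mirrors Python's built-in structures: a dict answers exact-key probes in O(1) but keeps no order, while a sorted sequence (the B-tree analogue) supports range slices. A small illustrative sketch:

```python
import bisect

# Hash structure (dict): constant-time equality probes, no ordering maintained
emails = {"ada@example.com": 2, "bob@other.com": 3, "john@example.com": 1}
user_id = emails["john@example.com"]       # exact match: one hash probe

# Range queries need an ordered structure instead
ordered = sorted(emails)
lo = bisect.bisect_left(ordered, "a")
hi = bisect.bisect_right(ordered, "b")
in_range = ordered[lo:hi]                  # keys in ['a', 'b')
print(user_id, in_range)                   # 1 ['ada@example.com']
```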

5c. GiST / GIN (PostgreSQL)

GiST (Generalized Search Tree) suits geometric and range data; GIN (Generalized Inverted Index) suits full-text search, arrays, and JSONB.


-- Full-text search with GIN
CREATE INDEX idx_article_content_gin
  ON articles USING GIN(to_tsvector('english', content));

SELECT id, title FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('database & optimization');

-- JSON document search
CREATE INDEX idx_metadata_gin ON events USING GIN(metadata);

SELECT * FROM events WHERE metadata @> '{"event_type": "purchase"}';

5d. Bitmap Index

Efficient for low-cardinality columns (few distinct values).


-- Bitmap indexes suit low-cardinality fields: status, gender, region
-- Oracle supports explicit CREATE BITMAP INDEX; PostgreSQL builds bitmaps
-- at query time from B-trees; columnar warehouses (Redshift, Snowflake,
-- Databricks) get equivalent behavior implicitly. Example query pattern:

SELECT region, status, COUNT(*) AS cnt
FROM orders
WHERE is_active = TRUE
  AND region IN ('North', 'South')
GROUP BY region, status;
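The reason bitmaps win on low-cardinality columns is that a WHERE clause becomes pure bitwise arithmetic. A toy Python sketch (integers as bitmaps; the data is invented for illustration):

```python
# One bitmap per (column, value); bit i set means row i matches.
# Python's arbitrary-precision ints serve as the bitmaps.
rows = [
    {"region": "North", "is_active": True},
    {"region": "South", "is_active": False},
    {"region": "North", "is_active": True},
    {"region": "East",  "is_active": True},
]

bitmaps: dict = {}
for i, r in enumerate(rows):
    key = ("region", r["region"])
    bitmaps[key] = bitmaps.get(key, 0) | (1 << i)
    if r["is_active"]:
        bitmaps[("is_active", True)] = bitmaps.get(("is_active", True), 0) | (1 << i)

# WHERE is_active = TRUE AND region IN ('North', 'South')  -> bitwise ops, no row scan
region_bits = bitmaps.get(("region", "North"), 0) | bitmaps.get(("region", "South"), 0)
match = region_bits & bitmaps.get(("is_active", True), 0)
matching_rows = [i for i in range(len(rows)) if match >> i & 1]
print(matching_rows)  # [0, 2]
```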

5e. Covering Index

Includes non-key columns so query can be satisfied without accessing the table.


-- Covering index (PostgreSQL, SQL Server)
CREATE INDEX idx_customer_name_email
  ON customers(customer_id) INCLUDE (name, email);

-- Query satisfied entirely from index (index-only scan)
SELECT name, email FROM customers WHERE customer_id = 456;
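SQLite has no INCLUDE clause, but a composite index covers a query the same way, and EXPLAIN QUERY PLAN reports the index-only access. A small sketch (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT, email TEXT, notes TEXT);
CREATE INDEX idx_cust_cover ON customers(customer_id, name, email);
INSERT INTO customers VALUES (456, 'John', 'john@example.com', '...');
""")

# Every column the query touches is in the index -> no table access needed
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name, email FROM customers WHERE customer_id = 456"
).fetchall()
print(plan[0][-1])   # mentions: USING COVERING INDEX idx_cust_cover
```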

5f. Partial Index

Index only rows matching a condition, saving space.


-- Partial index for active customers only
CREATE INDEX idx_active_customers
  ON customers(email) WHERE status = 'active';

-- Query uses partial index
SELECT * FROM customers
WHERE status = 'active' AND email = 'john@example.com';

-- Partial index for recent orders
CREATE INDEX idx_pending_orders
  ON orders(order_date DESC)
  WHERE status = 'pending';
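SQLite also supports partial indexes, so the behavior is easy to verify locally; the planner uses the index only when the query's predicate implies the index's WHERE clause. A sketch (names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, email TEXT, status TEXT);
CREATE INDEX idx_active_customers ON customers(email) WHERE status = 'active';
INSERT INTO customers VALUES
  (1, 'john@example.com', 'active'),
  (2, 'jane@example.com', 'churned');
""")

# The predicate status = 'active' implies the index condition, so the
# partial index is eligible for the email lookup.
plan = conn.execute("""
  EXPLAIN QUERY PLAN
  SELECT * FROM customers WHERE status = 'active' AND email = 'john@example.com'
""").fetchall()
print(plan[0][-1])
```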

6. Partitioning

Splitting large tables into smaller physical pieces for performance, maintainability, and parallelism.

6a. Range Partitioning (Time-series)

Most common in data warehouses and time-series systems.


-- PostgreSQL Range Partitioning
CREATE TABLE fact_events (
  event_id BIGSERIAL,
  user_id INT,
  event_type VARCHAR(50),
  event_timestamp TIMESTAMP NOT NULL,
  properties JSONB,
  PRIMARY KEY (event_id, event_timestamp)  -- the partition key must be part of the PK
) PARTITION BY RANGE (event_timestamp);

-- Create partitions for each month
CREATE TABLE fact_events_2024_01 PARTITION OF fact_events
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE fact_events_2024_02 PARTITION OF fact_events
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Databricks SQL (Delta Lake)
CREATE TABLE fact_events (
  event_id BIGINT,
  user_id INT,
  event_type STRING,
  event_timestamp TIMESTAMP,
  properties MAP<STRING, STRING>
)
USING DELTA
PARTITIONED BY (year INT, month INT);
-- (PARTITIONED BY and CLUSTER BY are mutually exclusive in Delta;
--  see 6d for the clustering alternative)

-- Inserts route rows to partitions automatically
INSERT INTO fact_events VALUES
  (1, 100, 'click', '2024-03-15 10:30:00', map(), 2024, 3);

6b. List Partitioning

Partition by discrete values (region, country, tenant_id).


-- PostgreSQL List Partitioning
CREATE TABLE orders (
  order_id BIGSERIAL,
  customer_id INT,
  region VARCHAR(50),
  order_date DATE,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id, region)
) PARTITION BY LIST (region);

CREATE TABLE orders_north PARTITION OF orders
  FOR VALUES IN ('North', 'Northeast', 'Northwest');

CREATE TABLE orders_south PARTITION OF orders
  FOR VALUES IN ('South', 'Southeast', 'Southwest');

CREATE TABLE orders_central PARTITION OF orders
  FOR VALUES IN ('Central', 'Midwest');

6c. Hash Partitioning

Distribute rows evenly across partitions using a hash function.


-- PostgreSQL Hash Partitioning
CREATE TABLE user_events (
  event_id BIGSERIAL,
  user_id INT,
  event_type VARCHAR(50),
  event_timestamp TIMESTAMP,
  PRIMARY KEY (event_id, user_id)
) PARTITION BY HASH (user_id);

-- 4 hash partitions
CREATE TABLE user_events_0 PARTITION OF user_events
  FOR VALUES WITH (modulus 4, remainder 0);

CREATE TABLE user_events_1 PARTITION OF user_events
  FOR VALUES WITH (modulus 4, remainder 1);

-- ...create 2 and 3 similarly
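The routing rule behind hash partitioning is simply `hash(key) mod N`. A toy Python sketch (PostgreSQL uses its own internal hash function; plain modulo is shown for illustration):

```python
def hash_partition(user_id: int, modulus: int = 4) -> str:
    """Route a row to one of `modulus` partitions, as PARTITION BY HASH does."""
    return f"user_events_{user_id % modulus}"

# Sequential IDs spread evenly across the four partitions
counts: dict = {}
for user_id in range(10_000):
    name = hash_partition(user_id)
    counts[name] = counts.get(name, 0) + 1

print(counts)  # each of the 4 partitions receives 2500 rows
```

Even distribution is the whole point: no partition becomes a hotspot regardless of key skew in ranges.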

6d. Composite Partitioning

Combine multiple partitioning strategies (range + hash).


-- Databricks: partition by date columns, then Z-order within partitions
CREATE TABLE fact_sales_composite (
  sale_id BIGINT,
  customer_id INT,
  product_id INT,
  sale_date DATE,
  amount DECIMAL(12, 2)
)
USING DELTA
PARTITIONED BY (year INT, month INT);

-- Co-locate rows for the same customer inside each partition
OPTIMIZE fact_sales_composite ZORDER BY (customer_id);

-- Query prunes partitions first, then benefits from clustering
SELECT SUM(amount) FROM fact_sales_composite
WHERE year = 2024 AND month = 3 AND customer_id = 100;

7. Constraints & Integrity

Constraints enforce data quality at the database level, preventing invalid states.


CREATE TABLE organizations (
  org_id SERIAL PRIMARY KEY,                          -- PRIMARY KEY
  org_name VARCHAR(100) NOT NULL,                     -- NOT NULL
  email VARCHAR(100) UNIQUE,                          -- UNIQUE (allows NULL)
  industry VARCHAR(50),
  founded_year INT CHECK (founded_year >= 1900),      -- CHECK
  status VARCHAR(20) DEFAULT 'active',                -- DEFAULT
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
  subscription_id SERIAL PRIMARY KEY,
  org_id INT NOT NULL,
  plan VARCHAR(20) DEFAULT 'basic',
  monthly_seats INT CHECK (monthly_seats > 0),
  billing_email VARCHAR(100) NOT NULL,

  -- FOREIGN KEY with CASCADE delete
  CONSTRAINT fk_org FOREIGN KEY (org_id)
    REFERENCES organizations(org_id) ON DELETE CASCADE,

  -- Unique constraint on composite columns
  UNIQUE(org_id, plan),

  -- CHECK constraint across columns
  CHECK (monthly_seats <= 1000)
);

CREATE TABLE invoices (
  invoice_id SERIAL PRIMARY KEY,
  subscription_id INT,
  invoice_date DATE NOT NULL,
  amount DECIMAL(10, 2) NOT NULL CHECK (amount > 0),
  status VARCHAR(20) DEFAULT 'pending',
  paid_date DATE,

  CONSTRAINT fk_subscription FOREIGN KEY (subscription_id)
    REFERENCES subscriptions(subscription_id)
    ON DELETE SET NULL,                                -- SET NULL on parent delete

  CHECK (status IN ('pending', 'paid', 'cancelled')),
  CHECK (paid_date IS NULL OR paid_date >= invoice_date)
);

-- Referential integrity examples
-- ON DELETE options:
--   CASCADE: delete child rows
--   SET NULL: set FK to NULL
--   RESTRICT: prevent deletion if children exist
--   SET DEFAULT: set FK to default value
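The CASCADE and SET NULL actions chain through the hierarchy when a parent row is deleted. A runnable SQLite sketch of the organizations → subscriptions → invoices chain above (columns simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE subscriptions (
  subscription_id INTEGER PRIMARY KEY,
  org_id INTEGER REFERENCES organizations(org_id) ON DELETE CASCADE
);
CREATE TABLE invoices (
  invoice_id INTEGER PRIMARY KEY,
  subscription_id INTEGER REFERENCES subscriptions(subscription_id) ON DELETE SET NULL
);
INSERT INTO organizations VALUES (1, 'Acme');
INSERT INTO subscriptions VALUES (10, 1);
INSERT INTO invoices VALUES (100, 10);
""")

conn.execute("DELETE FROM organizations WHERE org_id = 1")
# CASCADE removed the subscription; that deletion set the invoice's FK to NULL
subs = conn.execute("SELECT COUNT(*) FROM subscriptions").fetchone()[0]
inv_fk = conn.execute("SELECT subscription_id FROM invoices").fetchone()[0]
print(subs, inv_fk)  # 0 None
```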

CREATE TABLE audit_log (
  log_id SERIAL PRIMARY KEY,
  org_id INT,
  action VARCHAR(50) NOT NULL,
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  CONSTRAINT fk_audit_org FOREIGN KEY (org_id)
    REFERENCES organizations(org_id)
    ON DELETE RESTRICT                                -- Prevent org deletion if audit log exists
);

8. Example: SaaS Multi-Tenant Application Schema

Design pattern for Software-as-a-Service with organization isolation and partitioning on org_id.


-- Organizations (tenants)
CREATE TABLE organizations (
  org_id SERIAL PRIMARY KEY,
  org_name VARCHAR(100) NOT NULL,
  country VARCHAR(50),
  industry VARCHAR(50),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_org_name ON organizations(org_name);

-- Users within organizations
CREATE TABLE users (
  user_id SERIAL PRIMARY KEY,
  org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
  email VARCHAR(100) NOT NULL,
  password_hash VARCHAR(255),
  full_name VARCHAR(100),
  role VARCHAR(20) DEFAULT 'member',
  is_active BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  UNIQUE (org_id, email)                             -- Email unique per org
);
CREATE INDEX idx_users_org_id ON users(org_id);

-- Subscriptions / Plans
CREATE TABLE subscriptions (
  subscription_id SERIAL PRIMARY KEY,
  org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
  plan_type VARCHAR(50) DEFAULT 'starter',
  billing_cycle VARCHAR(20) DEFAULT 'monthly',
  monthly_seats INT DEFAULT 10,
  price_per_month DECIMAL(10, 2),
  subscription_date DATE NOT NULL,
  next_billing_date DATE,
  status VARCHAR(20) DEFAULT 'active',
  UNIQUE (org_id)                                    -- One sub per org
);

-- Invoices
CREATE TABLE invoices (
  invoice_id SERIAL,
  org_id INT NOT NULL REFERENCES organizations(org_id) ON DELETE CASCADE,
  subscription_id INT NOT NULL REFERENCES subscriptions(subscription_id),
  invoice_date DATE NOT NULL,
  amount DECIMAL(10, 2) NOT NULL,
  status VARCHAR(20) DEFAULT 'pending',
  paid_date DATE,
  due_date DATE,
  payment_method VARCHAR(50),
  PRIMARY KEY (invoice_id, org_id)                   -- partition key must be in the PK
)
PARTITION BY LIST (org_id);                          -- Partition by tenant for isolation

-- Usage events (time-series, high cardinality)
CREATE TABLE usage_events (
  event_id BIGSERIAL,
  org_id INT NOT NULL,
  user_id INT,
  event_type VARCHAR(50),
  event_timestamp TIMESTAMP NOT NULL,
  properties JSONB,
  PRIMARY KEY (event_id, org_id, event_timestamp)
)
PARTITION BY RANGE (event_timestamp);                -- expression keys would disallow the PK

-- Audit log with org isolation
CREATE TABLE audit_log (
  log_id BIGSERIAL,
  org_id INT NOT NULL,
  user_id INT,
  action VARCHAR(100),
  resource_type VARCHAR(50),
  resource_id INT,
  old_values JSONB,
  new_values JSONB,
  logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (log_id, logged_at)                    -- partition key must be in the PK
)
PARTITION BY RANGE (logged_at);

-- Row-level security policy (PostgreSQL example)
ALTER TABLE subscriptions ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON subscriptions
  USING (org_id = current_setting('app.current_org_id')::INT);

9. Example: Analytics Data Warehouse

Complete dimensional model for web analytics with multiple fact tables (Galaxy schema).


-- Conformed Dimension: Date
CREATE TABLE dim_date (
  date_key INT PRIMARY KEY,
  full_date DATE NOT NULL UNIQUE,
  year INT, quarter INT, month INT, day_of_month INT,
  day_of_week INT, week_of_year INT,
  is_weekend BOOLEAN, is_holiday BOOLEAN
);

-- Dimension: User / Visitor
CREATE TABLE dim_user (
  user_key SERIAL PRIMARY KEY,
  user_id VARCHAR(100) NOT NULL UNIQUE,
  first_visit_date DATE,
  country VARCHAR(50),
  state VARCHAR(50),
  city VARCHAR(50),
  device_type VARCHAR(20),
  os VARCHAR(50),
  browser VARCHAR(50),
  cohort_month DATE,
  lifetime_value DECIMAL(12, 2),
  is_active BOOLEAN DEFAULT TRUE
);

-- Dimension: Page
CREATE TABLE dim_page (
  page_key SERIAL PRIMARY KEY,
  page_id VARCHAR(200) NOT NULL UNIQUE,
  page_url TEXT NOT NULL,
  page_title VARCHAR(255),
  section VARCHAR(50),
  subsection VARCHAR(50),
  page_type VARCHAR(50),
  is_indexed BOOLEAN DEFAULT TRUE
);

-- Dimension: Campaign
CREATE TABLE dim_campaign (
  campaign_key SERIAL PRIMARY KEY,
  campaign_id VARCHAR(100) NOT NULL UNIQUE,
  campaign_name VARCHAR(255),
  channel VARCHAR(50),
  source VARCHAR(100),
  medium VARCHAR(50),
  campaign_start_date DATE,
  campaign_end_date DATE,
  budget DECIMAL(10, 2)
);

-- Fact Table: Page Views (grain = one row per page view)
CREATE TABLE fact_page_views (
  pageview_id BIGSERIAL,
  user_key INT NOT NULL REFERENCES dim_user(user_key),
  page_key INT NOT NULL REFERENCES dim_page(page_key),
  campaign_key INT REFERENCES dim_campaign(campaign_key),
  date_key INT NOT NULL REFERENCES dim_date(date_key),
  time_of_day TIME,
  session_id VARCHAR(100),
  time_on_page INT,                                 -- seconds
  scroll_depth DECIMAL(5, 2),                       -- percent
  bounce INT CHECK (bounce IN (0, 1)),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (pageview_id, date_key)               -- partition key must be in the PK
)
PARTITION BY RANGE (date_key);

-- Indexes on fact_page_views
CREATE INDEX idx_fpv_user ON fact_page_views(user_key);
CREATE INDEX idx_fpv_page ON fact_page_views(page_key);
CREATE INDEX idx_fpv_date ON fact_page_views(date_key);
CREATE INDEX idx_fpv_campaign ON fact_page_views(campaign_key);

-- Fact Table: Events (grain = one row per user interaction)
CREATE TABLE fact_events (
  event_id BIGSERIAL,
  user_key INT NOT NULL REFERENCES dim_user(user_key),
  page_key INT NOT NULL REFERENCES dim_page(page_key),
  date_key INT NOT NULL REFERENCES dim_date(date_key),
  event_type VARCHAR(50) NOT NULL,
  event_name VARCHAR(100),
  event_value DECIMAL(10, 2),
  event_category VARCHAR(50),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (event_id, date_key)                  -- partition key must be in the PK
)
PARTITION BY RANGE (date_key);

CREATE INDEX idx_fe_user ON fact_events(user_key);
CREATE INDEX idx_fe_event_type ON fact_events(event_type);
CREATE INDEX idx_fe_date ON fact_events(date_key);

-- Analytical Queries

-- Page performance analysis
SELECT
  dp.section,
  dp.page_title,
  COUNT(DISTINCT fpv.pageview_id) AS page_views,
  COUNT(DISTINCT fpv.user_key) AS unique_visitors,
  ROUND(100.0 * SUM(fpv.bounce) / COUNT(*), 2) AS bounce_rate,
  ROUND(AVG(fpv.time_on_page), 0) AS avg_time_seconds,
  ROUND(AVG(fpv.scroll_depth), 2) AS avg_scroll_depth
FROM fact_page_views fpv
JOIN dim_page dp ON fpv.page_key = dp.page_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE dd.full_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY dp.section, dp.page_title
ORDER BY page_views DESC
LIMIT 20;

-- Campaign effectiveness
SELECT
  dc.campaign_name,
  dc.channel,
  COUNT(DISTINCT fpv.user_key) AS users_reached,
  COUNT(DISTINCT fe.event_id) AS total_events,
  SUM(CASE WHEN fe.event_type = 'purchase' THEN fe.event_value ELSE 0 END)
    AS revenue,
  ROUND(100.0 * COUNT(CASE WHEN fe.event_type = 'purchase' THEN 1 END)
    / NULLIF(COUNT(DISTINCT fpv.user_key), 0), 2) AS conversion_rate
FROM fact_page_views fpv
LEFT JOIN fact_events fe ON fpv.user_key = fe.user_key
  AND fpv.date_key = fe.date_key
JOIN dim_campaign dc ON fpv.campaign_key = dc.campaign_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE dd.full_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY dc.campaign_name, dc.channel
ORDER BY revenue DESC;

-- User cohort analysis
SELECT
  du.cohort_month,
  dd.full_date - du.first_visit_date AS days_since_first_visit,
  COUNT(DISTINCT fpv.user_key) AS active_users,
  COUNT(DISTINCT fe.event_id) AS events,
  ROUND(AVG(du.lifetime_value), 2) AS avg_ltv
FROM fact_page_views fpv
JOIN fact_events fe ON fpv.user_key = fe.user_key AND fpv.date_key = fe.date_key
JOIN dim_user du ON fpv.user_key = du.user_key
JOIN dim_date dd ON fpv.date_key = dd.date_key
WHERE du.cohort_month IS NOT NULL
  AND dd.full_date >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY du.cohort_month, days_since_first_visit
ORDER BY du.cohort_month, days_since_first_visit;

10. Performance Optimization

Schema design is only part of optimization; query execution matters equally.

10a. EXPLAIN and Query Plans

Understand how the database executes your query.


-- PostgreSQL EXPLAIN ANALYZE
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
  dc.cust_name,
  SUM(fs.revenue) AS total_revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
WHERE EXTRACT(YEAR FROM fs.created_at) = 2024
GROUP BY dc.cust_name
HAVING SUM(fs.revenue) > 5000
ORDER BY total_revenue DESC;

/*
Output (example):
  Finalize GroupAggregate (cost=5000.00..6000.00 rows=100)
  -> Gather Merge (cost=4500.00..5500.00 rows=200)
    -> Sort (cost=4500.00..4600.00 rows=200)
      -> Partial GroupAggregate (cost=3500.00..4500.00 rows=200)
        -> Hash Join (cost=1500.00..3000.00 rows=50000) <-- JOIN
          -> Seq Scan on fact_sales (cost=0.00..1000.00 rows=100000)
              Filter: (EXTRACT(YEAR ...) = 2024)  <-- FILTER
          -> Hash (cost=500.00..500.00 rows=5000)
            -> Seq Scan on dim_customer (cost=0.00..500.00)

Key observations:
- Seq Scan on fact_sales: the EXTRACT() wrapper makes the filter
  non-sargable, so a plain index on created_at cannot be used
- Hash Join: good for large joins
- Seq Scan on small dim_customer: expected, dimension table is small
*/

-- Fix 1: rewrite the predicate as a sargable range...
--   WHERE fs.created_at >= '2024-01-01' AND fs.created_at < '2025-01-01'
-- ...and index the bare column:
CREATE INDEX idx_fact_sales_created_at ON fact_sales(created_at);

-- Fix 2: or index the expression itself:
CREATE INDEX idx_fact_sales_created_year
  ON fact_sales ((EXTRACT(YEAR FROM created_at)));

-- Re-run EXPLAIN to confirm the improvement
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...  -- same query as above, with a matching predicate

/*
With a usable index the planner replaces the Seq Scan with an Index Scan
(or Bitmap Index Scan), reducing I/O significantly for large tables.
*/
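The sargability pitfall is easy to reproduce in any engine: wrapping the indexed column in a function forces a scan, while a bare-column range predicate uses the index. A SQLite sketch using EXPLAIN QUERY PLAN (schema simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (sale_id INTEGER, created_at TEXT, revenue REAL);
CREATE INDEX idx_fact_sales_created_at ON fact_sales(created_at);
""")

def plan(sql: str) -> str:
    """Return the first detail line of EXPLAIN QUERY PLAN for a query."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Function-wrapped column: not sargable, the index cannot be used
p1 = plan("SELECT * FROM fact_sales WHERE strftime('%Y', created_at) = '2024'")

# Bare-column range predicate: index range scan
p2 = plan("SELECT * FROM fact_sales "
          "WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'")

print(p1)  # a full-table SCAN
print(p2)  # SEARCH using idx_fact_sales_created_at
```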

10b. Materialized Views for Pre-aggregation

Pre-compute expensive aggregations and refresh incrementally.


-- Materialized view for daily sales summary
CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
SELECT
  dd.full_date,
  ds.region,
  COUNT(*) AS num_transactions,
  SUM(fs.quantity) AS total_units,
  SUM(fs.revenue) AS total_revenue,
  ROUND(AVG(fs.profit), 2) AS avg_profit
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
JOIN dim_store ds ON fs.store_key = ds.store_key
GROUP BY dd.full_date, ds.region;

-- CONCURRENTLY requires a UNIQUE index on the materialized view
CREATE UNIQUE INDEX idx_mv_daily_sales_key
  ON mv_daily_sales_summary(full_date, region);
CREATE INDEX idx_mv_daily_sales_date ON mv_daily_sales_summary(full_date);

-- Refresh periodically without blocking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales_summary;

-- Fast queries against pre-aggregated data
SELECT full_date, region, total_revenue
FROM mv_daily_sales_summary
WHERE full_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY full_date DESC;
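Engines without materialized views emulate the pattern with a summary table plus a refresh job. A SQLite sketch of the same daily summary (columns simplified, sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (full_date TEXT, region TEXT, revenue REAL);
INSERT INTO fact_sales VALUES
  ('2024-03-01', 'North', 100.0),
  ('2024-03-01', 'North', 50.0),
  ('2024-03-01', 'South', 70.0);
""")

def refresh_daily_summary(conn):
    """Rebuild the summary table, as a full REFRESH MATERIALIZED VIEW would."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_daily_sales_summary;
    CREATE TABLE mv_daily_sales_summary AS
    SELECT full_date, region, COUNT(*) AS num_transactions,
           SUM(revenue) AS total_revenue
    FROM fact_sales GROUP BY full_date, region;
    """)

refresh_daily_summary(conn)
rows = conn.execute(
    "SELECT region, total_revenue FROM mv_daily_sales_summary ORDER BY region"
).fetchall()
print(rows)  # [('North', 150.0), ('South', 70.0)]
```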

10c. Denormalization Trade-offs

Strategically duplicate data to avoid joins.


-- Normalized (joins required)
SELECT
  fs.sale_id,
  dc.cust_name,
  dc.email,
  dp.prod_name,
  fs.quantity,
  fs.revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key
WHERE fs.date_key = 20240315;

-- Denormalized (no joins, but redundant data)
CREATE TABLE fact_sales_denorm AS
SELECT
  fs.sale_id,
  dc.cust_name,
  dc.email,
  dp.prod_name,
  dp.category,
  fs.quantity,
  fs.revenue,
  fs.date_key
FROM fact_sales fs
JOIN dim_customer dc ON fs.cust_key = dc.cust_key
JOIN dim_product dp ON fs.prod_key = dp.prod_key;

-- Much faster for OLAP queries, but harder to maintain
-- Use only if join cost is prohibitive and data refresh is infrequent

10d. Connection Pooling and Scaling

Manage database connections efficiently.


-- PgBouncer configuration example (pgbouncer.ini; ini comments use ';')
[databases]
analytics_db = host=db.internal port=5432 dbname=analytics

[pgbouncer]
pool_mode = transaction              ; return connection after each transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

-- Typical production setup:
-- - Read replicas for analytical queries
-- - Write to primary, replicate to replicas
-- - Use connection pooler between app and DB
-- - Enable query caching layer (Redis, Memcached)
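The core of a pooler fits in a few lines: a fixed set of open connections handed out and returned. A toy Python sketch over SQLite (illustrative only; production systems use PgBouncer or a driver-level pool):

```python
import queue
import sqlite3

class ConnectionPool:
    """Toy fixed-size pool; the idea behind PgBouncer's default_pool_size."""
    def __init__(self, size: int, dsn: str = ":memory:"):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self, timeout: float = 3.0):
        # Blocks when all connections are in use, like reserve_pool_timeout
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        # Transaction pooling returns the connection here after each txn
        self._pool.put(conn)

pool = ConnectionPool(size=2)
conn = pool.acquire()
result = conn.execute("SELECT 1").fetchone()[0]
pool.release(conn)
print(result)  # 1
```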

10e. Table Statistics and Maintenance

Keep statistics fresh for optimal query planning.


-- Analyze tables to update statistics
ANALYZE fact_sales;
ANALYZE dim_customer;

-- Vacuum to reclaim space and cleanup dead tuples
VACUUM ANALYZE fact_sales;

-- Schedule regular maintenance via cron:
-- 0 2 * * * psql -d analytics -c 'VACUUM ANALYZE fact_sales;'
-- 0 3 * * * psql -d analytics -c 'REINDEX INDEX idx_fact_sales_date_key;'

Summary

Effective database schema design is the foundation of data system performance. Key principles:

- Match the schema to the workload: normalized for OLTP, dimensional for OLAP
- Pick star, snowflake, or galaxy layouts based on query and maintenance trade-offs
- Index deliberately: B-tree by default, specialized index types for specific access patterns
- Partition large tables along the dominant filter dimension, keeping partition keys in primary keys
- Enforce integrity with constraints rather than trusting application code alone
- Verify with EXPLAIN, pre-aggregate with materialized views, and keep statistics fresh