Database Normalization

Normalization is the process of organizing columns and tables in a relational database to reduce redundancy and eliminate undesirable characteristics such as insertion, update, and deletion anomalies. It was introduced by Edgar F. Codd in a series of papers from 1970 onward, starting with the seminal "A Relational Model of Data for Large Shared Data Banks." Codd defined the first three normal forms; later researchers (Raymond Boyce, Ronald Fagin, and others) added stricter forms. Normalization is primarily applied to OLTP systems where data integrity and write performance matter. Analytical systems often deliberately denormalize for read speed.

1. Overview

At its core, normalization is a set of guidelines — actually, a set of mathematical definitions — that tell you how to decompose tables so that each fact is stored in exactly one place. When a fact appears in only one row, it cannot become inconsistent with itself. That is the whole motivation.

Codd's relational model gave the database community three things at once:

A data model (tables, tuples, attributes) grounded in set theory and first-order predicate logic.
A query language family (relational algebra and calculus) that led to SQL.
A design theory — normalization — for producing schemas that avoid redundancy-driven bugs.

Normalization matters because without it, the same fact is stored in many places. When facts diverge, queries return contradictory answers depending on where they look. Normalization is not about saving bytes — modern disks are cheap. It is about preventing contradictions and making correctness a structural property of the schema rather than a discipline the application layer has to enforce.

The normal forms form a hierarchy. Each higher form includes the constraints of those below it:


     6NF   (temporal / irreducible)
      |
     5NF   (project-join)
      |
     4NF   (multi-valued dependency)
      |
     BCNF  (every determinant is a key)
      |
     3NF   (no transitive dependency)
      |
     2NF   (no partial dependency)
      |
     1NF   (atomic values)
      |
  Unnormalized Form (UNF)

DKNF (Domain-Key Normal Form) sits to the side — it is a theoretical ideal rather than a step in the same chain.

2. Update Anomalies: The Problem Normalization Solves

Before discussing the normal forms themselves, look at the problems they exist to solve. Consider a single denormalized table tracking employees, their departments, and the projects they work on:


employee_projects
+--------+----------+---------+-----------+-----------+--------------+----------+
| emp_id | emp_name | dept_id | dept_name | dept_head | project_code | hours    |
+--------+----------+---------+-----------+-----------+--------------+----------+
| 101    | Alice    | D10     | Research  | Dr. Smith | P-ALPHA      | 20       |
| 101    | Alice    | D10     | Research  | Dr. Smith | P-BETA       | 15       |
| 102    | Bob      | D10     | Research  | Dr. Smith | P-ALPHA      | 35       |
| 103    | Carol    | D20     | Sales     | Ms. Jones | P-GAMMA      | 40       |
| 104    | Dave     | D20     | Sales     | Ms. Jones | P-GAMMA      | 30       |
+--------+----------+---------+-----------+-----------+--------------+----------+

Three distinct pathologies appear in this single table.

2a. Insertion Anomaly

Suppose a new department D30 Marketing is formed but no one has been hired into it yet. There is no way to record the department's existence: the primary key must include emp_id and project_code, and both would be NULL. You cannot add a fact about a department without an employee.

2b. Update Anomaly

Alice's department head changes from Dr. Smith to Dr. Patel. Every row where dept_id = D10 must be updated in lockstep. If the UPDATE statement misses even one row (a bug, an abort, a concurrent writer), the table now asserts two different department heads for D10 simultaneously. The same fact stored in many rows can drift apart.

2c. Deletion Anomaly

Carol leaves the company, and her only row is deleted. If Dave also leaves at the same time, all knowledge of department D20 and its head Ms. Jones is gone. Deleting an employee should not erase facts about a department.

Normalization solves these by splitting this one table into three:


employees              departments            assignments
+--------+-------+     +---------+----------+ +--------+-----------+-------+
| emp_id | name  |     | dept_id | name     | | emp_id | project   | hours |
+--------+-------+     +---------+----------+ +--------+-----------+-------+
| 101    | Alice |     | D10     | Research | | 101    | P-ALPHA   | 20    |
| 102    | Bob   |     | D20     | Sales    | | 101    | P-BETA    | 15    |
| 103    | Carol |     | D30     | Marketing| | 102    | P-ALPHA   | 35    |
| 104    | Dave  |     +---------+----------+ | 103    | P-GAMMA   | 40    |
+--------+-------+                             | 104    | P-GAMMA   | 30    |
                                               +--------+-----------+-------+

Now the Marketing department exists with no employees. The department head is stored once, not four times. Deleting Carol erases nothing about D20.

3. Functional Dependencies: The Foundation

Normalization theory is built on the concept of a functional dependency (FD). An FD is a constraint between two sets of attributes in a relation.

Notation: X → Y reads as "X functionally determines Y" or "Y is functionally dependent on X." It means: whenever two tuples agree on the values of attributes in X, they must also agree on the values in Y.

For example, in an employees table, emp_id → emp_name says: if two rows have the same employee id, they must have the same employee name. That is a statement about the real world, not about the current data — it holds for every possible instance of the table.

3a. Trivial vs Non-Trivial FDs

An FD X → Y is trivial if Y is a subset of X. For example, {emp_id, name} → emp_id is trivial — of course two rows that agree on id and name agree on id. Trivial FDs hold in every relation and carry no design information.

A non-trivial FD is one where Y is not a subset of X. These are the FDs that drive normalization.

3b. Armstrong's Axioms

William W. Armstrong (1974) gave a sound and complete set of inference rules for deriving all FDs that follow from a given set. The axioms are:

Reflexivity: if Y ⊆ X then X → Y.
Augmentation: if X → Y then XZ → YZ for any Z.
Transitivity: if X → Y and Y → Z then X → Z.

From these three, additional rules can be derived:

Union: if X → Y and X → Z then X → YZ.
Decomposition: if X → YZ then X → Y and X → Z.
Pseudotransitivity: if X → Y and WY → Z then WX → Z.

These rules together are sound (they derive only FDs that hold) and complete (they derive every FD that logically follows from the given set).

3c. Closure of an Attribute Set

Given a set of FDs F and an attribute set X, the closure X⁺ is the set of all attributes that X functionally determines. Computing X⁺ is the workhorse algorithm of normalization theory — it answers the questions "is X a superkey?" and "does X → Y follow from F?".

Algorithm (informal):


closure(X, F):
  result ← X
  repeat:
    for each FD  A → B  in F:
      if A ⊆ result:
        result ← result ∪ B
  until result stops growing
  return result

3d. Finding Candidate Keys with Closures

A superkey is any attribute set whose closure contains every attribute. A candidate key is a minimal superkey — no proper subset is itself a superkey. The primary key is a candidate key chosen by the designer.

Example. Consider R(A, B, C, D) with FDs:


F = { A → B,  B → C,  CD → A }

Compute candidate keys by finding minimal X with X⁺ = {A, B, C, D}.

{A}⁺   = {A, B, C}            not a key (missing D)
{D}⁺   = {D}                  not a key
{A,D}⁺ = {A, D, B, C}         key
{B,D}⁺ = {B, D, C, A}         key  (using B→C, CD→A)
{C,D}⁺ = {C, D, A, B}         key

Candidate keys: {A,D}, {B,D}, {C,D}.
Prime attributes: A, B, C, D.

3e. Functional Dependency Diagrams

A compact way to visualize FDs in a single relation:


orders_denormalized(order_id, product_id, customer_id, customer_name, product_name, qty)

  order_id  ────►  customer_id
                       │
                       ▼
                   customer_name

  product_id ────►  product_name

  {order_id, product_id} ────►  qty         (the PK)

Arrows originating from non-key attributes or from proper subsets of the composite key are red flags — they point at 2NF or 3NF violations.

4. First Normal Form (1NF)

Rule: every attribute value is atomic; there are no repeating groups, arrays, or nested structures. Every row has the same number of columns, and every cell holds one value from the attribute's domain.

4a. Before: Repeating Groups


contacts (violates 1NF)
+--------+---------+---------------------------------+
| cid    | name    | phone_numbers                   |
+--------+---------+---------------------------------+
| 1      | Alice   | 555-1111, 555-2222              |
| 2      | Bob     | 555-3333                        |
| 3      | Carol   | 555-4444, 555-5555, 555-6666    |
+--------+---------+---------------------------------+

The phone_numbers cell holds a list. Querying "who has phone 555-2222?" forces a substring search. Updating a single number is awkward.

4b. After: Atomic Values in a Child Table


CREATE TABLE contacts (
  cid   INT PRIMARY KEY,
  name  VARCHAR(100) NOT NULL
);

CREATE TABLE contact_phones (
  cid          INT NOT NULL REFERENCES contacts(cid) ON DELETE CASCADE,
  phone_number VARCHAR(20) NOT NULL,
  phone_type   VARCHAR(10),
  PRIMARY KEY (cid, phone_number)
);


contacts              contact_phones
+-----+-------+       +-----+--------------+------------+
| cid | name  |       | cid | phone_number | phone_type |
+-----+-------+       +-----+--------------+------------+
| 1   | Alice |       | 1   | 555-1111     | mobile     |
| 2   | Bob   |       | 1   | 555-2222     | home       |
| 3   | Carol |       | 2   | 555-3333     | mobile     |
+-----+-------+       | 3   | 555-4444     | mobile     |
                      | 3   | 555-5555     | home       |
                      | 3   | 555-6666     | work       |
                      +-----+--------------+------------+

4c. 1NF and JSON Columns

Modern databases (PostgreSQL, MySQL, SQL Server) offer JSON and JSONB column types. Strictly speaking, a JSON column can hold a nested array, which looks like a 1NF violation. Practitioners are split:

Purist view: the relational model is about atomic cells, and an array in JSON is a repeating group by another name.
Pragmatic view: modern engines can index inside JSON (GIN in PostgreSQL, path expressions in SQL Server), so the database treats it as first-class data, not an opaque blob.

A useful rule of thumb: if you will filter, join, or aggregate on the nested values, they belong in their own table. If the JSON is a self-contained payload (webhook body, event envelope, extension metadata), a JSON column is fine.

5. Second Normal Form (2NF)

Rule: the relation is in 1NF, and every non-prime attribute is fully functionally dependent on every candidate key — no partial dependency on a proper subset of a composite key.

2NF is only a concern when a table has a composite primary key. If the PK is a single column, the table is automatically in 2NF (there are no proper non-empty subsets of a single attribute).

5a. Violation


order_items (1NF but not 2NF)
PK = (order_id, product_id)

+----------+------------+---------------+---------------+----------+
| order_id | product_id | product_name  | product_price | quantity |
+----------+------------+---------------+---------------+----------+
| 1001     | P-100      | USB-C Cable   | 12.99         | 2        |
| 1001     | P-200      | HDMI Adapter  | 19.99         | 1        |
| 1002     | P-100      | USB-C Cable   | 12.99         | 3        |
| 1003     | P-200      | HDMI Adapter  | 19.99         | 1        |
+----------+------------+---------------+---------------+----------+

FDs:
  {order_id, product_id} → quantity
  product_id             → product_name
  product_id             → product_price

The last two FDs say product attributes depend on only PART of the PK.
That is a partial dependency — 2NF violation.

Consequences: the product name and price repeat on every order_item. Raising a product's price requires updating every order_items row. A product that has never been ordered cannot be stored here at all.

5b. Decomposition


CREATE TABLE products (
  product_id    VARCHAR(10) PRIMARY KEY,
  product_name  VARCHAR(100) NOT NULL,
  product_price DECIMAL(10, 2) NOT NULL CHECK (product_price >= 0)
);

CREATE TABLE order_items (
  order_id   INT NOT NULL,
  product_id VARCHAR(10) NOT NULL REFERENCES products(product_id),
  quantity   INT NOT NULL CHECK (quantity > 0),
  PRIMARY KEY (order_id, product_id)
);


products                                   order_items
+------------+---------------+---------+   +----------+------------+----------+
| product_id | product_name  | price   |   | order_id | product_id | quantity |
+------------+---------------+---------+   +----------+------------+----------+
| P-100      | USB-C Cable   | 12.99   |   | 1001     | P-100      | 2        |
| P-200      | HDMI Adapter  | 19.99   |   | 1001     | P-200      | 1        |
+------------+---------------+---------+   | 1002     | P-100      | 3        |
                                           | 1003     | P-200      | 1        |
                                           +----------+------------+----------+

Price is now stored once per product. A new product can exist without any order. Raising a price is one UPDATE.

6. Third Normal Form (3NF)

Rule: the relation is in 2NF, and no non-prime attribute is transitively dependent on a candidate key. Equivalently (Zaniolo's definition): for every non-trivial FD X → A, either X is a superkey or A is a prime attribute.

A transitive dependency is when K → X and X → A where X is not a superkey and A is not prime. The non-key attribute A depends on the key only "through" another non-key attribute.

The classic rhyme, attributed to the database-design folklore of the 1980s:

"The key, the whole key, and nothing but the key — so help me Codd."

1NF enforces the key (atomic values keyed by the PK). 2NF enforces the whole key (no partial dependencies). 3NF enforces nothing but the key (no transitive dependencies).

6a. Violation


employees (2NF but not 3NF)
PK = emp_id

+--------+----------+---------+-----------+---------------+
| emp_id | name     | dept_id | dept_name | dept_location |
+--------+----------+---------+-----------+---------------+
| 101    | Alice    | D10     | Research  | Building A    |
| 102    | Bob      | D10     | Research  | Building A    |
| 103    | Carol    | D20     | Sales     | Building C    |
| 104    | Dave     | D20     | Sales     | Building C    |
+--------+----------+---------+-----------+---------------+

FDs:
  emp_id   → name, dept_id
  dept_id  → dept_name, dept_location

dept_name depends on emp_id only via dept_id — transitive dependency.

6b. Decomposition


CREATE TABLE departments (
  dept_id       VARCHAR(10) PRIMARY KEY,
  dept_name     VARCHAR(100) NOT NULL,
  dept_location VARCHAR(100)
);

CREATE TABLE employees (
  emp_id   INT PRIMARY KEY,
  name     VARCHAR(100) NOT NULL,
  dept_id  VARCHAR(10) REFERENCES departments(dept_id)
);


departments                                employees
+---------+-----------+---------------+    +--------+-------+---------+
| dept_id | dept_name | dept_location |    | emp_id | name  | dept_id |
+---------+-----------+---------------+    +--------+-------+---------+
| D10     | Research  | Building A    |    | 101    | Alice | D10     |
| D20     | Sales     | Building C    |    | 102    | Bob   | D10     |
+---------+-----------+---------------+    | 103    | Carol | D20     |
                                           | 104    | Dave  | D20     |
                                           +--------+-------+---------+

Moving a department to a new building is one UPDATE. Inserting D30 Marketing with no employees yet is a simple INSERT into departments. Deleting the last Sales employee does not erase facts about Sales.

7. Boyce-Codd Normal Form (BCNF)

Rule: for every non-trivial FD X → Y, X must be a superkey. BCNF is stricter than 3NF because it does not allow the "A is prime" escape hatch.

Raymond F. Boyce and Edgar F. Codd proposed BCNF in 1974 as a refinement of 3NF. In practice, most well-designed 3NF schemas are already in BCNF. The difference shows up only with overlapping candidate keys.

7a. A 3NF Table That Violates BCNF

Suppose a consulting firm tracks which consultant advises which client, and each client has one "primary consultant per specialty":


consulting (in 3NF, not BCNF)
Candidate keys:  (client_id, specialty)  and  (client_id, consultant_id)

+-----------+------------+---------------+
| client_id | specialty  | consultant_id |
+-----------+------------+---------------+
| C-1       | Tax        | CON-100       |
| C-1       | Audit      | CON-200       |
| C-2       | Tax        | CON-100       |
| C-2       | IT         | CON-300       |
+-----------+------------+---------------+

FDs:
  {client_id, specialty}     → consultant_id
  consultant_id              → specialty   ←← each consultant has one specialty

Every non-prime attribute? There aren't any — every attribute is prime.
Therefore this satisfies 3NF trivially.

But consultant_id → specialty has a non-superkey determinant.
consultant_id alone is not a superkey (you can't derive client_id).
BCNF violated.

The anomaly: if consultant CON-100's specialty shifts from Tax to Advisory, every row mentioning CON-100 must be updated in lockstep, or the table will claim CON-100 has two specialties.

7b. Decomposition to BCNF


CREATE TABLE consultants (
  consultant_id VARCHAR(10) PRIMARY KEY,
  specialty     VARCHAR(50) NOT NULL
);

CREATE TABLE client_consultants (
  client_id     VARCHAR(10) NOT NULL,
  consultant_id VARCHAR(10) NOT NULL REFERENCES consultants(consultant_id),
  PRIMARY KEY (client_id, consultant_id)
);

The decomposition is lossless (joining the two back reconstructs the original), but it is not dependency-preserving. The FD {client_id, specialty} → consultant_id can no longer be enforced by a single-table key constraint — you would need an application-level or trigger-based check. This trade-off is the reason textbooks teach "decompose to 3NF if dependency preservation matters, to BCNF if anomaly-freedom matters."

8. Fourth Normal Form (4NF)

Ronald Fagin introduced 4NF in 1977 to handle multi-valued dependencies (MVDs), which FDs cannot describe.

Notation: X ↠ Y (a double arrow) reads "X multidetermines Y." An MVD says: the set of Y-values associated with a given X-value is independent of any other attributes in the relation.

Rule: a relation is in 4NF if it is in BCNF and every non-trivial MVD X ↠ Y has X as a superkey.

8a. Violation

An employee can know several programming languages and several spoken languages, and the two sets are independent of each other. Stuffing them into one table creates a spurious cross-product:


employee_skills (BCNF, not 4NF)
PK = (emp_id, prog_lang, spoken_lang)   -- the whole row

+--------+-----------+-------------+
| emp_id | prog_lang | spoken_lang |
+--------+-----------+-------------+
| 101    | Python    | English     |
| 101    | Python    | Spanish     |
| 101    | Go        | English     |
| 101    | Go        | Spanish     |
+--------+-----------+-------------+

MVDs:
  emp_id ↠ prog_lang
  emp_id ↠ spoken_lang

Adding a third spoken language forces 2 new rows (one per prog_lang).
Adding a third prog_lang forces 2 new rows (one per spoken_lang).

8b. Decomposition


CREATE TABLE employee_prog_langs (
  emp_id    INT NOT NULL,
  prog_lang VARCHAR(30) NOT NULL,
  PRIMARY KEY (emp_id, prog_lang)
);

CREATE TABLE employee_spoken_langs (
  emp_id      INT NOT NULL,
  spoken_lang VARCHAR(30) NOT NULL,
  PRIMARY KEY (emp_id, spoken_lang)
);


employee_prog_langs        employee_spoken_langs
+--------+-----------+     +--------+-------------+
| emp_id | prog_lang |     | emp_id | spoken_lang |
+--------+-----------+     +--------+-------------+
| 101    | Python    |     | 101    | English     |
| 101    | Go        |     | 101    | Spanish     |
+--------+-----------+     +--------+-------------+

Now adding a third spoken language is one row. The Cartesian explosion is gone.

9. Fifth Normal Form (5NF) / Project-Join Normal Form (PJNF)

Rule: a relation is in 5NF if every non-trivial join dependency is implied by the candidate keys. Informally: the table cannot be losslessly decomposed into smaller projections except via its keys.

5NF eliminates certain three-way (or higher) relationships that cannot be captured by binary FDs or MVDs. In practice, 5NF violations are rare and often contrived, but they do exist.

9a. Classic Example

Suppose agents sell products from companies, with the business rule: if agent A represents company C, and company C makes product P, and agent A sells product P (to any company), then agent A must sell product P for company C as well. This is a three-way symmetric constraint.


agent_company_product (not 5NF)
+-------+---------+---------+
| agent | company | product |
+-------+---------+---------+
| A1    | Acme    | Widget  |
| A1    | Acme    | Gadget  |
| A1    | Beta    | Widget  |
| A2    | Acme    | Gadget  |
+-------+---------+---------+

No FD or MVD violation. But under the three-way rule, the table can be
losslessly reconstructed from three binary projections:

  agent_company(agent, company)
  company_product(company, product)
  agent_product(agent, product)

That reconstruction-via-three-projections is a JOIN DEPENDENCY.
It is not implied by any key. Hence not 5NF.

Decomposition: store the three binary tables separately. The original is recovered by the natural three-way join.

In real-world design, most developers never encounter a 5NF violation. It is worth knowing the form exists mostly to reason about data-vault-style EAV designs and certain many-to-many-to-many junction tables.

10. Domain-Key Normal Form (DKNF)

Ronald Fagin introduced DKNF in 1981. Rule: every constraint on the relation is a logical consequence of the domain constraints (valid values for each attribute) and the key constraints.

If a schema is in DKNF, no insert, update, or delete anomaly can occur by definition. In principle, achieving DKNF guarantees correctness at the schema level.

In practice, DKNF is mostly theoretical:

There is no general algorithm to transform an arbitrary schema into DKNF.
Many real-world constraints (e.g., "pay grade must match job title rule table entries") are not expressible purely via domains and keys.
Achieving DKNF may require decomposition that destroys dependency preservation or useful query shapes.

DKNF is worth studying for the insight it gives into why the lower forms exist — every lower form is a step toward (but not reaching) this ideal.

11. Sixth Normal Form (6NF)

6NF was introduced by C.J. Date, Hugh Darwen, and Nikos Lorentzos in the 2000s, primarily for temporal databases. Rule: a relation satisfies 6NF if it admits no non-trivial join dependencies at all — equivalently, it cannot be decomposed further without losing information.

In practice, 6NF means each table has a key plus at most one non-key attribute. Every fact is stored in its own tiny table.


Anchor Modeling / 6NF style:

employees            emp_name                emp_salary
+--------+           +--------+-------+      +--------+-------+----------+
| emp_id |           | emp_id | name  |      | emp_id | amt   | valid_to |
+--------+           +--------+-------+      +--------+-------+----------+
| 101    |           | 101    | Alice |      | 101    | 80000 |2024-06-30|
| 102    |           | 102    | Bob   |      | 101    | 90000 |9999-12-31|
+--------+           +--------+-------+      | 102    | 75000 |9999-12-31|
                                             +--------+-------+----------+

Usage:

Temporal systems — every attribute is time-versioned independently. A change to salary does not force a new name row.
Anchor Modeling — an entire modeling methodology built around 6NF anchors, attributes, and ties.
Data Vault 2.0 — satellites are essentially 6NF, with one (or a few) attribute per satellite plus load-date and hash diff.

Trade-off: extreme decomposition produces huge join counts. A simple "show this employee" query may need 15 joins. Columnar engines (Snowflake, BigQuery, Databricks) handle this better than row stores, which is why 6NF is associated with modern data warehousing rather than traditional OLTP.

12. The Normalization Workflow in Practice

A normalization exercise typically proceeds through these steps:

Collect a user view. Start from a form, a report, a spreadsheet, or an API payload — the flattest possible representation of what the business needs.
Identify the attributes. List every column mentioned in the view.
Identify the functional dependencies. Talk to domain experts. Ask questions of the form "given X, is Y always determined?" This is the hardest part — FDs are statements about the business rules, not about the current sample data.
Compute candidate keys. Use closures to find minimal superkeys.
Decompose to 3NF (or BCNF). Synthesize relations from the FDs. The standard algorithm (Bernstein synthesis for 3NF) guarantees lossless join and dependency preservation.
Validate. Check two properties: lossless join (the natural join of the decomposed relations reproduces the original) and dependency preservation (every original FD can be enforced by a key in some decomposed relation).
Consider 4NF/5NF. Look for multi-valued and join dependencies the FD analysis missed.
Consider denormalization. If the workload is read-heavy and join cost dominates, selectively denormalize with discipline.

13. Denormalization: When to Break the Rules

Normalization is a rule set; denormalization is a deliberate exception. The question is not "normalized or denormalized?" but "which parts of the schema benefit from which approach?"

13a. When Denormalization Is Justified

Read-heavy analytical workloads. A fact table with dimensional descriptors (cust_name, prod_category) avoids joins on every query.
Star and snowflake schemas. These are deliberately denormalized for query performance in data warehouses.
Materialized views. Pre-computed aggregates that trade storage for query speed, refreshed on a schedule.
Derived / computed values. order_total stored alongside order_items rather than recomputed per query.
Search and navigation. Breadcrumb paths, category lineages, and search documents are often flattened.

13b. The Cost

Every denormalized copy of a fact is a potential inconsistency waiting to happen. The cost is paid in:

Update complexity. Changing a product name now requires updating N places.
Drift risk. Any bug, partial failure, or concurrent writer can leave copies out of sync.
ETL/ELT discipline. Mitigation lives in the pipeline: idempotent upserts, full-refresh patterns, hash-based change detection, and tests that compare source vs. target.

13c. The Middle Ground

Materialized views and columnar caches give the query-speed benefit of denormalization while keeping the system of record normalized. The warehouse pattern is:


  OLTP (3NF/BCNF)  ───►  ELT  ───►  Warehouse (Star / Snowflake)  ───►  BI
    system of record       pipeline       query-optimized            dashboards

14. Normalization vs Data Warehousing

Two schools of thought govern enterprise warehouse design:

14a. Kimball (Bottom-Up, Star-First)

Ralph Kimball's The Data Warehouse Toolkit advocates dimensional modeling from day one. The warehouse is a collection of conformed star schemas, each serving a business process. Normalization is relegated to the source systems; the warehouse is denormalized on purpose.

14b. Inmon (Top-Down, 3NF-First)

Bill Inmon's Corporate Information Factory advocates a single, enterprise-wide 3NF warehouse as the integrated source of truth, with downstream dependent data marts in star form for specific analytical use cases.

14c. The Modern Synthesis

Most real architectures combine both:

OLTP systems in 3NF/BCNF (write-optimized).
A lakehouse raw/bronze layer that mirrors source tables as-is.
A silver layer in 3NF or Data Vault 2.0 (integrated, 6NF-style in the Data Vault case).
A gold layer of star/snowflake marts for dashboards and self-service BI.

15. End-to-End Worked Example: Sales Database

Start with a single denormalized "sales report" table — the kind you might export from a spreadsheet.

15a. Unnormalized Form


sales_flat (0NF / UNF)

+--------+------------+--------------+--------------------+---------+----------------+--------+
| sale_id| sale_date  | customer     | customer_email     | region  | items          | total  |
+--------+------------+--------------+--------------------+---------+----------------+--------+
| 1      | 2026-01-15 | Alice Chen   | alice@example.com  | West    | P1:2, P2:1     | 45.97  |
| 2      | 2026-01-16 | Bob Ortiz    | bob@example.com    | East    | P1:1           | 12.99  |
| 3      | 2026-01-17 | Alice Chen   | alice@example.com  | West    | P3:3, P1:1     | 57.96  |
+--------+------------+--------------+--------------------+---------+----------------+--------+

The items column is a repeating group — violates 1NF.

15b. Step 1: 1NF (Atomic Values)


CREATE TABLE sales_1nf (
  sale_id        INT NOT NULL,
  sale_date      DATE NOT NULL,
  customer       VARCHAR(100) NOT NULL,
  customer_email VARCHAR(100),
  region         VARCHAR(50),
  product_id     VARCHAR(10) NOT NULL,
  product_name   VARCHAR(100) NOT NULL,
  unit_price     DECIMAL(10, 2) NOT NULL,
  quantity       INT NOT NULL,
  line_total     DECIMAL(10, 2) NOT NULL,
  PRIMARY KEY (sale_id, product_id)
);


+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+
| sale_id | sale_date  | customer   | customer_email    | region | product_id | product_name | price | qty | line_tot |
+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+
| 1       | 2026-01-15 | Alice Chen | alice@example.com | West   | P1         | Cable        | 12.99 | 2   | 25.98    |
| 1       | 2026-01-15 | Alice Chen | alice@example.com | West   | P2         | Adapter      | 19.99 | 1   | 19.99    |
| 2       | 2026-01-16 | Bob Ortiz  | bob@example.com   | East   | P1         | Cable        | 12.99 | 1   | 12.99    |
| 3       | 2026-01-17 | Alice Chen | alice@example.com | West   | P3         | Hub          | 14.99 | 3   | 44.97    |
| 3       | 2026-01-17 | Alice Chen | alice@example.com | West   | P1         | Cable        | 12.99 | 1   | 12.99    |
+---------+------------+------------+-------------------+--------+------------+--------------+-------+-----+----------+

Now atomic, but still a mess:
  FDs:
    sale_id    → sale_date, customer, customer_email, region
    product_id → product_name, unit_price
    {sale_id, product_id} → quantity, line_total

15c. Step 2: 2NF (Remove Partial Dependencies)

The PK is (sale_id, product_id). Attributes that depend only on sale_id go to a sales table; attributes that depend only on product_id go to products.


CREATE TABLE sales_2nf (
  sale_id        INT PRIMARY KEY,
  sale_date      DATE NOT NULL,
  customer       VARCHAR(100) NOT NULL,
  customer_email VARCHAR(100),
  region         VARCHAR(50)
);

CREATE TABLE products_2nf (
  product_id   VARCHAR(10) PRIMARY KEY,
  product_name VARCHAR(100) NOT NULL,
  unit_price   DECIMAL(10, 2) NOT NULL
);

CREATE TABLE sale_items_2nf (
  sale_id    INT NOT NULL REFERENCES sales_2nf(sale_id),
  product_id VARCHAR(10) NOT NULL REFERENCES products_2nf(product_id),
  quantity   INT NOT NULL,
  line_total DECIMAL(10, 2) NOT NULL,
  PRIMARY KEY (sale_id, product_id)
);


sales_2nf
+---------+------------+------------+-------------------+--------+
| sale_id | sale_date  | customer   | customer_email    | region |
+---------+------------+------------+-------------------+--------+
| 1       | 2026-01-15 | Alice Chen | alice@example.com | West   |
| 2       | 2026-01-16 | Bob Ortiz  | bob@example.com   | East   |
| 3       | 2026-01-17 | Alice Chen | alice@example.com | West   |
+---------+------------+------------+-------------------+--------+

products_2nf
+------------+--------------+------------+
| product_id | product_name | unit_price |
+------------+--------------+------------+
| P1         | Cable        | 12.99      |
| P2         | Adapter      | 19.99      |
| P3         | Hub          | 14.99      |
+------------+--------------+------------+

sale_items_2nf
+---------+------------+-----+-----------+
| sale_id | product_id | qty | line_tot  |
+---------+------------+-----+-----------+
| 1       | P1         | 2   | 25.98     |
| 1       | P2         | 1   | 19.99     |
| 2       | P1         | 1   | 12.99     |
| 3       | P3         | 3   | 44.97     |
| 3       | P1         | 1   | 12.99     |
+---------+------------+-----+-----------+

15d. Step 3: 3NF (Remove Transitive Dependencies)

In sales_2nf, customer_email actually depends on the customer identity, not on sale_id directly. If Alice appears on three sales rows, her email is stored three times. Same for the fact that Alice is in the West region. Promote customers to their own table.

Similarly, line_total is a derived attribute — it equals quantity * unit_price. Strict 3NF design drops derived attributes (or makes them GENERATED columns).


CREATE TABLE customers_3nf (
  customer_id    SERIAL PRIMARY KEY,
  customer_name  VARCHAR(100) NOT NULL,
  customer_email VARCHAR(100) UNIQUE,
  region         VARCHAR(50)
);

CREATE TABLE products_3nf (
  product_id   VARCHAR(10) PRIMARY KEY,
  product_name VARCHAR(100) NOT NULL,
  unit_price   DECIMAL(10, 2) NOT NULL
);

CREATE TABLE sales_3nf (
  sale_id      SERIAL PRIMARY KEY,
  sale_date    DATE NOT NULL,
  customer_id  INT NOT NULL REFERENCES customers_3nf(customer_id)
);

CREATE TABLE sale_items_3nf (
  sale_id    INT NOT NULL REFERENCES sales_3nf(sale_id) ON DELETE CASCADE,
  product_id VARCHAR(10) NOT NULL REFERENCES products_3nf(product_id),
  quantity   INT NOT NULL CHECK (quantity > 0),
  -- line_total is derived; either drop it, or make it a generated column:
  line_total DECIMAL(10, 2) GENERATED ALWAYS AS (quantity * 0.00) STORED,
  PRIMARY KEY (sale_id, product_id)
);


customers_3nf
+----+------------+-------------------+--------+
| id | name       | email             | region |
+----+------------+-------------------+--------+
| 1  | Alice Chen | alice@example.com | West   |
| 2  | Bob Ortiz  | bob@example.com   | East   |
+----+------------+-------------------+--------+

sales_3nf
+---------+------------+-------------+
| sale_id | sale_date  | customer_id |
+---------+------------+-------------+
| 1       | 2026-01-15 | 1           |
| 2       | 2026-01-16 | 2           |
| 3       | 2026-01-17 | 1           |
+---------+------------+-------------+

Now Alice's email and region are stored once. A customer moves from
West to East: one UPDATE. A product price changes: one UPDATE.

15e. Step 4: BCNF Check

Is every determinant a superkey?

customer_id → name, email, region — customer_id is the PK of customers_3nf. OK.
product_id → name, unit_price — product_id is the PK of products_3nf. OK.
sale_id → date, customer_id — sale_id is the PK of sales_3nf. OK.
{sale_id, product_id} → quantity — that's the PK of sale_items_3nf. OK.

Every non-trivial FD has a superkey determinant. The schema is in BCNF.

15f. Step 5: Denormalize Selectively for Reporting

For the BI dashboard that runs "sales by region by month," joining four tables on every query is wasteful. Build a gold-layer star schema on top:


CREATE TABLE dim_customer (
  cust_key       SERIAL PRIMARY KEY,
  customer_id    INT NOT NULL,
  customer_name  VARCHAR(100) NOT NULL,
  region         VARCHAR(50)
);

CREATE TABLE dim_product (
  prod_key     SERIAL PRIMARY KEY,
  product_id   VARCHAR(10) NOT NULL,
  product_name VARCHAR(100) NOT NULL
);

CREATE TABLE dim_date (
  date_key  INT PRIMARY KEY,    -- YYYYMMDD
  full_date DATE NOT NULL,
  year      INT, month INT, quarter INT
);

CREATE TABLE fact_sales (
  sale_id    BIGSERIAL PRIMARY KEY,
  cust_key   INT NOT NULL REFERENCES dim_customer(cust_key),
  prod_key   INT NOT NULL REFERENCES dim_product(prod_key),
  date_key   INT NOT NULL REFERENCES dim_date(date_key),
  quantity   INT NOT NULL,
  unit_price DECIMAL(10, 2) NOT NULL,
  revenue    DECIMAL(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);

OLTP stays in BCNF. OLAP stays in star. ELT pipes data from one to the other on a schedule. Each world is normalized (or denormalized) for its own access pattern.

16. Tools and Utilities

No tool replaces human reasoning about functional dependencies — FDs are business rules, and business rules come from people. But several tools help validate and maintain a normalized schema:

pg_stats / pg_stat_statements (PostgreSQL). Surface tables with skewed cardinality or high-cost queries that often indicate missing indexes or over-denormalization.
pgTAP / SchemaSpy. Document table relationships and generate ER diagrams from an existing schema.
ER modeling tools (dbdiagram.io, Lucidchart, DBeaver's ER view, ERwin, MySQL Workbench). Let you draw the model and generate DDL, or reverse-engineer DDL into a diagram. Good for spotting missing junction tables and overloaded entities.
SQL linters and migration review tools (sqlfluff, squawk). Catch anti-patterns such as columns that should be foreign keys, nullable columns without a reason, and missing indexes on FK columns.
Query plan analyzers (EXPLAIN (ANALYZE, BUFFERS)). Reveal whether denormalization is paying off or whether a well-placed index makes the normalized form fast enough.
dbt tests. Enforce uniqueness, referential integrity, and value-set constraints in warehouse models — a practical way to keep star schemas honest despite their denormalized shape.
Data Vault and Anchor Modeling tools (WhereScape, datavault4dbt, TimeXtender). Generate 6NF-style satellite tables and load patterns automatically.

17. Common Mistakes

17a. Over-Normalizing OLTP

Every join has a cost. A transactional workload that decomposes its schema into 20 tables when 6 would do is paying for joins on every user request. If a 3NF decomposition produces tables that are always joined together in every query, consider whether the decomposition is serving a real business constraint or an academic one.

17b. Under-Normalizing Analytics

The opposite mistake: assuming "it's OLAP, it's denormalized, anything goes." Even star schemas have rules. Conformed dimensions must be consistent across fact tables. A customer attribute duplicated across three fact tables that drift apart is the same update anomaly that 3NF was designed to prevent. Dimensional modeling is a different kind of normalization, not a lack of it.

17c. Confusing 1NF with "No JSON Allowed"

1NF says values are atomic from the perspective of the data model. A JSON column holding a self-contained document that the database treats as a single opaque value is defensible. A JSON column that applications routinely pick apart to filter and join on is a repeating group by another name — move those fields to proper columns or child tables.

17d. Mistaking Sample Data for FDs

An FD is a statement about every possible instance of the relation, not just the rows you have today. Deriving FDs by inspecting current data will miss real constraints that sample data happens to satisfy by accident, and will infer fake constraints that the business never promised. FDs come from domain conversation, not from SELECT COUNT(DISTINCT ...).

17e. Treating "Normal Forms" as a Checklist

The normal forms are diagnostic tools, not a certification. A schema that passes every form on paper but is painful to query or evolve is not a success. The goal is a schema that accurately models the business and supports the workload. Normalization is a method, not the destination.

17f. Forgetting Dependency Preservation

When decomposing to BCNF, it is possible to end up with a schema where some original FD cannot be enforced by any single-table key. The FD must then be enforced by triggers or application code, which is error-prone. If dependency preservation matters more than strict anomaly-freedom, stop at 3NF.

Summary

Normalization is Codd's prescription for storing each fact in exactly one place. Functional dependencies express which attributes determine which others; normal forms translate those dependencies into progressively stricter structural rules. The practical workflow is:

Identify FDs from business rules, not from sample data.
Compute closures and candidate keys.
Decompose to 3NF by default; go to BCNF when overlapping candidate keys force you there.
Consider 4NF when independent multi-valued attributes appear in the same table.
Use 5NF, DKNF, and 6NF selectively — mostly for temporal or data-vault-style warehouses.
Denormalize on purpose for analytical workloads, and pay the cost in pipeline discipline.

The normalized OLTP store and the denormalized OLAP mart are not competitors — they are two sides of the same well-designed data architecture.