Differential Privacy for Aggregates

A firm running document intelligence across hundreds of matters will eventually want cross-matter analytics: "how many contracts reference arbitration clauses?", "average redline turnaround by practice group", "which opposing firms appear most often in our settlements?". Every one of these aggregates can leak information about individual documents or individuals, especially at the tails where a single distinctive record dominates the result.

Differential privacy (DP) adds calibrated noise so that the presence or absence of any single record cannot meaningfully change the published aggregate. The guarantee is mathematical and composable across queries under a privacy budget (ε, δ).



1. The Definition in Plain Terms

A randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' differing in a single record, and any set of outputs S:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S] + δ

Smaller ε means stronger privacy and noisier answers. Typical production values put ε between 0.1 and 3 per release, with δ on the order of 10⁻⁶, chosen well below 1/n for a dataset of n records.
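For the Laplace mechanism used later in this piece, the bound above can be verified directly: the output density on D and on a neighboring D' differ pointwise by at most a factor of exp(ε). A small numerical sketch, assuming a count query whose true values on the two datasets differ by 1 (the grid and true counts are illustrative):

```python
import math


def laplace_pdf(x: float, loc: float, scale: float) -> float:
    # Density of the Laplace distribution centered at `loc`.
    return math.exp(-abs(x - loc) / scale) / (2 * scale)


eps = 0.5
scale = 1.0 / eps          # b = Δf / ε with sensitivity Δf = 1 for a count
c, c_prime = 100.0, 101.0  # true counts on neighboring datasets D and D'

# At every output x, the density ratio is bounded by exp(ε).
worst = max(
    laplace_pdf(x / 10, c, scale) / laplace_pdf(x / 10, c_prime, scale)
    for x in range(800, 1300)  # grid over plausible outputs 80.0 .. 129.9
)
print(worst <= math.exp(eps) + 1e-12)  # True: the (ε, 0)-DP bound holds
```

The worst-case ratio is attained at outputs on the far side of the smaller count, where it equals exp(ε) exactly; this is why the Laplace scale b = Δf/ε is the right calibration.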


2. Mechanisms: Laplace and Gaussian
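The examples in the next section use the Laplace mechanism, which gives pure (ε, 0)-DP under L1 sensitivity. The Gaussian mechanism is its (ε, δ) counterpart: it admits a small δ in exchange for lighter tails and calibrates to L2 sensitivity, which matters for vector-valued queries. A minimal sketch of both, using the classical calibration σ = Δ₂·√(2 ln(1.25/δ))/ε (valid for ε < 1); the function names are illustrative:

```python
import math
import numpy as np


def laplace_mechanism(value: float, l1_sensitivity: float, epsilon: float) -> float:
    # (ε, 0)-DP: Laplace noise with scale b = Δ1 / ε.
    return value + np.random.laplace(0.0, l1_sensitivity / epsilon)


def gaussian_mechanism(value: float, l2_sensitivity: float,
                       epsilon: float, delta: float) -> float:
    # (ε, δ)-DP for ε < 1 via the classical analytic bound:
    # σ = Δ2 * sqrt(2 ln(1.25 / δ)) / ε.
    sigma = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + np.random.normal(0.0, sigma)


print(laplace_mechanism(42.0, l1_sensitivity=1.0, epsilon=0.5))
print(gaussian_mechanism(42.0, l2_sensitivity=1.0, epsilon=0.5, delta=1e-6))
```

Note the trade-off: at ε = 0.5 and δ = 10⁻⁶ the Gaussian σ is roughly 10.6, much larger than the Laplace scale of 2 — the Gaussian mechanism only pays off when a query's L2 sensitivity is much smaller than its L1 sensitivity, or under tight composition accounting.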


3. Example: Counts and Averages with Noise

import numpy as np


def laplace_noise(sensitivity: float, epsilon: float) -> float:
    # Laplace mechanism scale: b = Δf / ε.
    return np.random.laplace(loc=0.0, scale=sensitivity / epsilon)


def dp_count(records: list, predicate, epsilon: float) -> float:
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing a single record changes any count by at most 1.
    return true_count + laplace_noise(sensitivity=1.0, epsilon=epsilon)


def dp_mean(values: list[float], lo: float, hi: float, epsilon: float) -> float:
    # Clip to a known range [lo, hi] so the sum's sensitivity is bounded.
    clipped = np.clip(values, lo, hi)
    # Split the budget between numerator and denominator; by sequential
    # composition the total cost is eps_sum + eps_cnt = epsilon.
    eps_sum, eps_cnt = epsilon / 2, epsilon / 2
    noisy_sum = clipped.sum() + laplace_noise(sensitivity=hi - lo, epsilon=eps_sum)
    noisy_cnt = len(clipped) + laplace_noise(sensitivity=1.0, epsilon=eps_cnt)
    return noisy_sum / max(noisy_cnt, 1.0)


# Example: noisy count of contracts with an arbitration clause.
records = [{"has_arb": True}, {"has_arb": False}, {"has_arb": True}]
print(dp_count(records, lambda r: r["has_arb"], epsilon=0.5))

4. Privacy Budget & Accounting

Every query consumes budget. A dashboard that runs 20 DP queries at ε=0.1 each has a cumulative privacy cost of ε=2.0 under basic (sequential) composition — tighter under advanced composition or RDP. Track the budget per-dataset and refuse queries that would exceed the cap:


5. Where to Apply DP in the Pipeline


6. Pitfalls

