Personally Identifiable Information (PII) & Privileged-Content Redaction on Ingest

In a legal document-intelligence platform, the single riskiest moment in the data lifecycle is ingest. Once a document has been split, embedded, and written into a vector store, any personally identifiable information (PII) or attorney–client privileged material it contains is effectively published to every downstream system that can read that store — including retrieval caches, prompt logs, and the context windows of external model providers. Redaction must happen before persistence, not after.

This page describes the redaction pipeline used on ingest: what is detected, how it is masked, how reversibility is preserved for authorized users, and how the system fails safely when classification confidence is low.

1. What Counts as PII and Privileged Content

The pipeline recognizes three overlapping categories, each with its own handling policy:

PII — direct identifiers (SSN, DOB, driver's license, account numbers, phone, email, home address) and indirect identifiers (employer + ZIP + DOB combinations, medical record numbers). Covered by GDPR, CCPA, and HIPAA where applicable.
Privileged content — attorney–client communications, attorney work product, settlement communications under FRE 408, and mediation-privileged material. Detection relies on document metadata (matter ID, custodian role), header/footer markers ("PRIVILEGED & CONFIDENTIAL"), and structural cues (salutations naming outside counsel).
Regulated content — PHI under HIPAA, payment data under PCI, export-controlled technical data under ITAR/EAR. These trigger additional routing constraints on top of redaction.

2. Pipeline Overview

Upload ──> Virus scan ──> Text extraction ──> Classifier ──> Redactor ──> Persistence
                                                    │              │              │
                                                    │              │              └─> Embeddings
                                                    │              │                 (redacted only)
                                                    │              │
                                                    │              └─> Token vault
                                                    │                 (encrypted,
                                                    │                  reversible map)
                                                    │
                                                    └─> Sensitivity tier
                                                        (drives provider routing)

The classifier and redactor are deliberately separated: the classifier assigns a sensitivity tier to the whole document (and to each chunk), while the redactor masks specific spans. A document may be marked privileged as a whole (pinned to the on-prem model) even after individual PII spans are redacted — the two decisions compose.

3. Detection: Regex + NER + Domain Lexicons

No single technique catches everything. The production pipeline layers three:

Regex for structured identifiers with low false-positive rates (SSN, credit-card Luhn-valid numbers, phone, email, IPv4/IPv6).
spaCy NER (en_core_web_trf for accuracy, en_core_web_sm for throughput) for PERSON, ORG, GPE, DATE, MONEY spans that regex cannot express.
Domain lexicons — curated lists of matter codes, client names, opposing-counsel firms, and custodian names loaded per-tenant. This is the piece that distinguishes privileged context from generic business text.

Example: Minimal Redactor

import re
import hashlib
import hmac
from dataclasses import dataclass
from typing import Iterable

import spacy

# Load once at process start; en_core_web_trf for higher accuracy on legal text.
NLP = spacy.load("en_core_web_sm")

# Regex patterns with low false-positive rates.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "CC":    re.compile(r"\b(?:\d[ -]*?){13,16}\b"),  # Luhn-check downstream
}

SPACY_LABELS = {"PERSON", "ORG", "GPE", "DATE", "MONEY"}


@dataclass
class Span:
    start: int
    end: int
    label: str
    text: str


def _detect_regex(text: str) -> Iterable[Span]:
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            yield Span(m.start(), m.end(), label, m.group())


def _detect_ner(text: str) -> Iterable[Span]:
    doc = NLP(text)
    for ent in doc.ents:
        if ent.label_ in SPACY_LABELS:
            yield Span(ent.start_char, ent.end_char, ent.label_, ent.text)


def _merge(spans: list[Span]) -> list[Span]:
    """Collapse overlapping spans; longer match wins, ties broken by label priority."""
    priority = {"SSN": 0, "CC": 1, "EMAIL": 2, "PHONE": 3, "PERSON": 4,
                "ORG": 5, "GPE": 6, "DATE": 7, "MONEY": 8}
    spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    out: list[Span] = []
    for s in spans:
        if out and s.start < out[-1].end:
            if (s.end - s.start) > (out[-1].end - out[-1].start):
                out[-1] = s
            elif priority.get(s.label, 99) < priority.get(out[-1].label, 99):
                out[-1] = s
            continue
        out.append(s)
    return out


def _token_for(span: Span, secret: bytes) -> str:
    """Deterministic, non-reversible token. Re-identification uses the vault, not this."""
    h = hmac.new(secret, f"{span.label}:{span.text}".encode(), hashlib.sha256).hexdigest()[:12]
    return f"[{span.label}:{h}]"


def redact(text: str, secret: bytes) -> tuple[str, list[Span]]:
    spans = _merge(list(_detect_regex(text)) + list(_detect_ner(text)))
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(_token_for(s, secret))
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out), spans


if __name__ == "__main__":
    sample = (
        "Jane Doe (SSN 123-45-6789, jane@example.com) met with counsel "
        "at Acme Corp on March 4, 2024 regarding matter M-2024-071."
    )
    redacted, spans = redact(sample, secret=b"rotate-me-per-tenant")
    print(redacted)
    for s in spans:
        print(f"  {s.label:<6} @ [{s.start}:{s.end}] -> {s.text!r}")

The HMAC token is deterministic within a tenant — so the same SSN redacts to the same placeholder across chunks, preserving coreference for retrieval — but it is not reversible on its own. Re-identification is a separate, audited path (see section 5).

4. Privileged-Document Detection

Privilege is a document-level property, not a span-level one. The classifier looks at metadata and structure, not just content:

from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    PUBLIC       = "public"
    CONFIDENTIAL = "confidential"
    PRIVILEGED   = "privileged"
    REGULATED    = "regulated"


@dataclass
class DocumentContext:
    matter_id: str | None
    custodian_role: str | None        # "attorney" | "paralegal" | "client" | ...
    header_markers: list[str]
    body_text: str


PRIVILEGE_MARKERS = {
    "privileged and confidential",
    "attorney-client privileged",
    "attorney work product",
    "prepared in anticipation of litigation",
    "subject to fre 408",
    "mediation privileged",
}


def classify(ctx: DocumentContext) -> Tier:
    headers = {h.lower() for h in ctx.header_markers}
    if headers & PRIVILEGE_MARKERS:
        return Tier.PRIVILEGED

    if ctx.matter_id and ctx.custodian_role in {"attorney", "co-counsel"}:
        # Communications authored by or addressed to counsel on an active matter.
        return Tier.PRIVILEGED

    if any(m in ctx.body_text.lower() for m in PRIVILEGE_MARKERS):
        return Tier.PRIVILEGED

    # Fall-through classifiers for PHI / PCI not shown.
    return Tier.CONFIDENTIAL

The Tier assigned here is the single input the provider-routing layer uses to decide whether a request may reach a cloud model at all. A document marked PRIVILEGED is pinned to the on-prem model for the remainder of its lifecycle — the decision is made by policy at ingest, not by a user choosing a dropdown at query time.

5. Reversible Tokenization for Authorized Re-identification

Lawyers need to see the real names. Redacted placeholders are useless if a partner reviewing a summary cannot tell which plaintiff is which. The solution is a token vault: redaction replaces each span with a token, and the vault stores the mapping token → original under envelope encryption with a per-matter data key.

Redaction output (what the model sees) contains only tokens.
The vault row is encrypted with a data key wrapped by a per-matter KMS CMK.
Re-identification on render requires the caller to pass the RBAC check for that matter; every re-identification is logged as a separate event in the audit trail.
Keys rotate on a schedule; vault rows are re-wrapped without changing tokens.

import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")


def wrap_data_key(matter_cmk_arn: str) -> tuple[bytes, bytes]:
    """Generate a data key; return (plaintext, ciphertext) pair."""
    resp = kms.generate_data_key(KeyId=matter_cmk_arn, KeySpec="AES_256")
    return resp["Plaintext"], resp["CiphertextBlob"]


def vault_put(token: str, original: str, matter_cmk_arn: str, store) -> None:
    plaintext_key, wrapped_key = wrap_data_key(matter_cmk_arn)
    # Fernet expects a url-safe base64 32-byte key.
    import base64
    f = Fernet(base64.urlsafe_b64encode(plaintext_key))
    ciphertext = f.encrypt(original.encode())
    store.put(token=token, wrapped_key=wrapped_key, ciphertext=ciphertext)
    del plaintext_key  # best-effort clear


def vault_get(token: str, actor, matter_id, store, audit) -> str:
    if not actor.may_read_matter(matter_id):
        audit.log("reidentify.denied", actor=actor.id, matter=matter_id, token=token)
        raise PermissionError("actor lacks access to this matter")
    row = store.get(token)
    plaintext_key = kms.decrypt(CiphertextBlob=row.wrapped_key)["Plaintext"]
    import base64
    f = Fernet(base64.urlsafe_b64encode(plaintext_key))
    original = f.decrypt(row.ciphertext).decode()
    audit.log("reidentify.ok", actor=actor.id, matter=matter_id, token=token)
    return original

6. Failure Modes and the Quarantine Path

A redactor that silently lets PII through is worse than one that refuses to process a document. The pipeline enforces three guardrails:

Confidence threshold — Spans below the NER confidence threshold are redacted anyway; false positives are cheaper than false negatives.
Unknown-language quarantine — Documents whose language cannot be identified, or for which no NER model is loaded, go to a quarantine queue and are never persisted to the vector store.
Privilege-marker override — A document carrying a privilege header marker is routed to on-prem regardless of other classification signals; the classifier cannot downgrade it.

Every quarantine event is logged with the document hash, upload actor, and reason code, so compliance can review why a file was held back.

7. Testing the Redactor

Redaction logic is tested against a synthetic corpus of ~2,000 documents with hand-labeled PII and privilege spans. The test suite asserts two properties:

Recall >= 0.99 on labeled PII — false negatives are the expensive error; a regression here blocks the release.
Precision >= 0.90 — over-redaction degrades retrieval quality but does not leak data, so it is caught in review rather than gating CI.

import json
import pytest

from redactor import redact


@pytest.fixture(scope="session")
def corpus():
    with open("tests/fixtures/pii_corpus.jsonl") as f:
        return [json.loads(line) for line in f]


def test_pii_recall(corpus):
    tp = fn = 0
    for doc in corpus:
        _, spans = redact(doc["text"], secret=b"test")
        found = {(s.start, s.end) for s in spans}
        gold  = {(g["start"], g["end"]) for g in doc["spans"]}
        tp += len(found & gold)
        fn += len(gold - found)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    assert recall >= 0.99, f"redactor recall dropped to {recall:.3f}"

8. What This Buys You

Model providers never see raw PII or privileged text — the prompt and retrieval context are already redacted before the outbound call.
Embeddings cannot leak identifiers — the vectors are computed over redacted text, so similarity search cannot return a result whose relevance depends on a name, SSN, or account number.
Re-identification is an audited, RBAC-gated action — not a side effect of reading a document.
Privilege routing is a policy decision, not a user decision — the on-prem pin is assigned at ingest and cannot be bypassed by a user choosing a different model at query time.

↑ Back to Top