Personally Identifiable Information (PII) & Privileged-Content Redaction on Ingest

In a legal document-intelligence platform, the single riskiest moment in the data lifecycle is ingest. Once a document has been split, embedded, and written into a vector store, any personally identifiable information (PII) or attorney–client privileged material it contains is effectively published to every downstream system that can read that store — including retrieval caches, prompt logs, and the context windows of external model providers. Redaction must happen before persistence, not after.

This page describes the redaction pipeline used on ingest: what is detected, how it is masked, how reversibility is preserved for authorized users, and how the system fails safely when classification confidence is low.



1. What Counts as PII and Privileged Content

The pipeline recognizes three overlapping categories, each with its own handling policy:

- PII spans: names, Social Security numbers, email addresses, phone numbers, and payment card numbers. Masked in place as individual spans before persistence (section 3).
- Privileged content: attorney–client communications and work product. Treated as a document-level property that drives provider routing (section 4).
- Regulated data: PHI and PCI material subject to statutory handling rules. Assigned its own sensitivity tier; its fall-through classifiers are noted but not shown on this page.


2. Pipeline Overview

Upload ──> Virus scan ──> Text extraction ──> Classifier ──> Redactor ──> Persistence
                                                    │              │              │
                                                    │              │              └─> Embeddings
                                                    │              │                 (redacted only)
                                                    │              │
                                                    │              └─> Token vault
                                                    │                 (encrypted,
                                                    │                  reversible map)
                                                    │
                                                    └─> Sensitivity tier
                                                        (drives provider routing)

The classifier and redactor are deliberately separated: the classifier assigns a sensitivity tier to the whole document (and to each chunk), while the redactor masks specific spans. A document may be marked privileged as a whole (pinned to the on-prem model) even after individual PII spans are redacted — the two decisions compose.
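The chunk-level and document-level decisions can compose by escalation; a minimal sketch (the `OrderedTier` ordering and the max rule are illustrative assumptions, not the production classifier):

```python
from enum import IntEnum


class OrderedTier(IntEnum):
    # Ordered by sensitivity so tiers can be compared; placing REGULATED
    # above PRIVILEGED is an illustrative assumption.
    PUBLIC = 0
    CONFIDENTIAL = 1
    PRIVILEGED = 2
    REGULATED = 3


def document_tier(chunk_tiers: list[OrderedTier]) -> OrderedTier:
    # One privileged chunk escalates the whole document: take the maximum
    # chunk tier, never an average, and default to CONFIDENTIAL when empty.
    return max(chunk_tiers, default=OrderedTier.CONFIDENTIAL)
```

Escalation is deliberately one-way: a document never becomes less sensitive because most of its chunks are harmless.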


3. Detection: Regex + NER + Domain Lexicons

No single technique catches everything. The production pipeline layers three:

- Regex patterns for rigidly formatted identifiers (SSNs, emails, phone numbers, card numbers), tuned for low false-positive rates.
- Named-entity recognition (spaCy) for free-text entities: people, organizations, places, dates, and monetary amounts.
- Domain lexicons for legal vocabulary that generic NER misses, such as privilege markers and matter numbers.

Example: Minimal Redactor

import re
import hashlib
import hmac
from dataclasses import dataclass
from typing import Iterable

import spacy

# Load once at process start. en_core_web_sm keeps this example light;
# production uses en_core_web_trf for higher accuracy on legal text.
NLP = spacy.load("en_core_web_sm")

# Regex patterns with low false-positive rates.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "CC":    re.compile(r"\b(?:\d[ -]*?){13,16}\b"),  # Luhn-check downstream
}

SPACY_LABELS = {"PERSON", "ORG", "GPE", "DATE", "MONEY"}


@dataclass
class Span:
    start: int
    end: int
    label: str
    text: str


def _detect_regex(text: str) -> Iterable[Span]:
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            yield Span(m.start(), m.end(), label, m.group())


def _detect_ner(text: str) -> Iterable[Span]:
    doc = NLP(text)
    for ent in doc.ents:
        if ent.label_ in SPACY_LABELS:
            yield Span(ent.start_char, ent.end_char, ent.label_, ent.text)


def _merge(spans: list[Span]) -> list[Span]:
    """Collapse overlapping spans; longer match wins, ties broken by label priority."""
    priority = {"SSN": 0, "CC": 1, "EMAIL": 2, "PHONE": 3, "PERSON": 4,
                "ORG": 5, "GPE": 6, "DATE": 7, "MONEY": 8}
    spans = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    out: list[Span] = []
    for s in spans:
        if out and s.start < out[-1].end:
            if (s.end - s.start) > (out[-1].end - out[-1].start):
                out[-1] = s
            elif priority.get(s.label, 99) < priority.get(out[-1].label, 99):
                out[-1] = s
            continue
        out.append(s)
    return out


def _token_for(span: Span, secret: bytes) -> str:
    """Deterministic, non-reversible token. Re-identification uses the vault, not this."""
    h = hmac.new(secret, f"{span.label}:{span.text}".encode(), hashlib.sha256).hexdigest()[:12]
    return f"[{span.label}:{h}]"


def redact(text: str, secret: bytes) -> tuple[str, list[Span]]:
    spans = _merge(list(_detect_regex(text)) + list(_detect_ner(text)))
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(_token_for(s, secret))
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out), spans


if __name__ == "__main__":
    sample = (
        "Jane Doe (SSN 123-45-6789, jane@example.com) met with counsel "
        "at Acme Corp on March 4, 2024 regarding matter M-2024-071."
    )
    redacted, spans = redact(sample, secret=b"rotate-me-per-tenant")
    print(redacted)
    for s in spans:
        print(f"  {s.label:<6} @ [{s.start}:{s.end}] -> {s.text!r}")

The HMAC token is deterministic within a tenant — so the same SSN redacts to the same placeholder across chunks, preserving coreference for retrieval — but it is not reversible on its own. Re-identification is a separate, audited path (see section 5).
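The CC pattern in the detector is deliberately loose and marked `# Luhn-check downstream`; a minimal Luhn filter for those candidate matches might look like this (the digit-count bound mirrors the regex, and the function name is illustrative):

```python
def luhn_valid(candidate: str) -> bool:
    """Luhn checksum for spans matched by the loose CC regex."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 16:   # mirror the regex's digit-count bound
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9                # equivalent to summing the two digits
        total += d
    return total % 10 == 0
```

Spans that fail the check are dropped from the CC category rather than redacted, which keeps invoice numbers and matter IDs out of the card-number bucket.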


4. Privileged-Document Detection

Privilege is a document-level property, not a span-level one. The classifier looks at metadata and structure, not just content:

from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    PUBLIC       = "public"
    CONFIDENTIAL = "confidential"
    PRIVILEGED   = "privileged"
    REGULATED    = "regulated"


@dataclass
class DocumentContext:
    matter_id: str | None
    custodian_role: str | None        # "attorney" | "paralegal" | "client" | ...
    header_markers: list[str]
    body_text: str


PRIVILEGE_MARKERS = {
    "privileged and confidential",
    "attorney-client privileged",
    "attorney work product",
    "prepared in anticipation of litigation",
    "subject to fre 408",
    "mediation privileged",
}


def classify(ctx: DocumentContext) -> Tier:
    headers = {h.lower() for h in ctx.header_markers}
    if headers & PRIVILEGE_MARKERS:
        return Tier.PRIVILEGED

    if ctx.matter_id and ctx.custodian_role in {"attorney", "co-counsel"}:
        # Communications authored by or addressed to counsel on an active matter.
        return Tier.PRIVILEGED

    if any(m in ctx.body_text.lower() for m in PRIVILEGE_MARKERS):
        return Tier.PRIVILEGED

    # Fall-through classifiers for PHI / PCI not shown.
    return Tier.CONFIDENTIAL

The Tier assigned here is the single input the provider-routing layer uses to decide whether a request may reach a cloud model at all. A document marked PRIVILEGED is pinned to the on-prem model for the remainder of its lifecycle — the decision is made by policy at ingest, not by a user choosing a dropdown at query time.
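The routing gate itself can be a pure function of the tier; in this sketch the provider names, and the assumption that REGULATED is also pinned on-prem, are illustrative (the source only pins PRIVILEGED explicitly):

```python
# Tiers whose documents may never reach an external model provider.
ONPREM_ONLY = {"privileged", "regulated"}


def route(tier: str) -> str:
    """Pick a provider class from the ingest-time tier alone: no user
    input, no per-query override."""
    return "on-prem" if tier in ONPREM_ONLY else "cloud"
```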


5. Reversible Tokenization for Authorized Re-identification

Lawyers need to see the real names. Redacted placeholders are useless if a partner reviewing a summary cannot tell which plaintiff is which. The solution is a token vault: redaction replaces each span with a token, and the vault stores the mapping token → original under envelope encryption with a per-matter data key.

import base64

import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")


def wrap_data_key(matter_cmk_arn: str) -> tuple[bytes, bytes]:
    """Generate a data key; return (plaintext, ciphertext) pair."""
    resp = kms.generate_data_key(KeyId=matter_cmk_arn, KeySpec="AES_256")
    return resp["Plaintext"], resp["CiphertextBlob"]


def vault_put(token: str, original: str, matter_cmk_arn: str, store) -> None:
    plaintext_key, wrapped_key = wrap_data_key(matter_cmk_arn)
    # Fernet expects a url-safe base64 encoding of a 32-byte key.
    f = Fernet(base64.urlsafe_b64encode(plaintext_key))
    ciphertext = f.encrypt(original.encode())
    store.put(token=token, wrapped_key=wrapped_key, ciphertext=ciphertext)
    del plaintext_key  # best-effort clear; Python offers no guaranteed zeroing


def vault_get(token: str, actor, matter_id, store, audit) -> str:
    if not actor.may_read_matter(matter_id):
        audit.log("reidentify.denied", actor=actor.id, matter=matter_id, token=token)
        raise PermissionError("actor lacks access to this matter")
    row = store.get(token)
    plaintext_key = kms.decrypt(CiphertextBlob=row.wrapped_key)["Plaintext"]
    f = Fernet(base64.urlsafe_b64encode(plaintext_key))
    original = f.decrypt(row.ciphertext).decode()
    audit.log("reidentify.ok", actor=actor.id, matter=matter_id, token=token)
    return original

6. Failure Modes and the Quarantine Path

A redactor that silently lets PII through is worse than one that refuses to process a document. The pipeline enforces three guardrails:

- Fail closed: an error in extraction, classification, or redaction halts persistence; no stage is ever skipped.
- Confidence floor: when the classifier's confidence in a tier falls below threshold, the document is quarantined for human review instead of persisted with a guessed tier.
- No silent release: a quarantined document requires explicit compliance sign-off before it re-enters the pipeline.

Every quarantine event is logged with the document hash, upload actor, and reason code, so compliance can review why a file was held back.


7. Testing the Redactor

Redaction logic is tested against a synthetic corpus of ~2,000 documents with hand-labeled PII and privilege spans. The test suite asserts two properties: recall on labeled spans stays at or above 0.99, and tokenization is deterministic, so the same value always yields the same placeholder under a given tenant secret.

import json
import pytest

from redactor import redact


@pytest.fixture(scope="session")
def corpus():
    with open("tests/fixtures/pii_corpus.jsonl") as f:
        return [json.loads(line) for line in f]


def test_pii_recall(corpus):
    tp = fn = 0
    for doc in corpus:
        _, spans = redact(doc["text"], secret=b"test")
        found = {(s.start, s.end) for s in spans}
        gold  = {(g["start"], g["end"]) for g in doc["spans"]}
        tp += len(found & gold)
        fn += len(gold - found)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    assert recall >= 0.99, f"redactor recall dropped to {recall:.3f}"
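The determinism property can be asserted without the corpus; a self-contained sketch that mirrors section 3's `_token_for` (the standalone `token_for` helper is introduced here for the test):

```python
import hashlib
import hmac


def token_for(label: str, value: str, secret: bytes) -> str:
    # Mirrors _token_for from section 3: deterministic per (label, value, secret).
    digest = hmac.new(secret, f"{label}:{value}".encode(), hashlib.sha256)
    return f"[{label}:{digest.hexdigest()[:12]}]"


def test_token_determinism():
    same = token_for("SSN", "123-45-6789", b"tenant-a")
    # Same value and tenant secret: identical placeholder across chunks.
    assert same == token_for("SSN", "123-45-6789", b"tenant-a")
    # Different tenant secret: unlinkable placeholder.
    assert same != token_for("SSN", "123-45-6789", b"tenant-b")
```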

8. What This Buys You

- Unredacted text never persists: masking happens before the vector store, retrieval caches, or prompt logs can see it.
- Deterministic per-tenant tokens preserve coreference across chunks, so retrieval quality survives redaction.
- Re-identification stays possible but gated: per-matter envelope encryption, an access check, and an audit entry on every reveal.
- Privilege routing is decided once, by policy, at ingest; privileged documents stay pinned to the on-prem model for their lifetime.