Structured-PII Detection with Microsoft Presidio

The regex + spaCy pipeline on the PII redaction page is fast and transparent, but extending it to new entity types (bar numbers, docket numbers, FDA submission IDs, billing codes) means writing glue code for every case. Microsoft Presidio is an open-source framework that normalizes this work: pluggable recognizers, confidence scoring, language-aware NLP engines, and reversible anonymizers all share a common RecognizerResult shape.

In production it sits as one layer of a composite classifier — not a replacement for the hand-written regex/NER stack, but a way to plug in domain-specific detectors without growing the core pipeline.



1. What Presidio Gives You


2. Architecture: Analyzer + Anonymizer

The pipeline has two stages. The analyzer produces RecognizerResult spans carrying an entity type, start/end character offsets, and a confidence score; the anonymizer consumes those spans and applies a per-entity-type operator (replace, mask, hash, encrypt). The two are independent services, which matches the redactor/tokenizer separation on the PII page.
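The contract between the stages is just spans over text — any producer of (entity_type, start, end, score) tuples can feed the anonymization step. A minimal pure-Python sketch of that contract (field names mirror Presidio's RecognizerResult; the `apply_replacements` helper is illustrative, not part of Presidio):

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Mirrors the fields of Presidio's RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_replacements(text: str, spans: list[Span],
                       operators: dict[str, str]) -> str:
    """Replace each span with its per-entity placeholder.

    Spans are applied right-to-left so that earlier offsets stay
    valid after each substitution.
    """
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        placeholder = operators.get(s.entity_type, operators["DEFAULT"])
        text = text[:s.start] + placeholder + text[s.end:]
    return text


text = "Email jane@example.com about docket 3:24-cv-01234."
spans = [
    Span("EMAIL_ADDRESS", 6, 22, 1.0),
    Span("DOCKET_NUMBER", 36, 49, 0.9),
]
print(apply_replacements(text, spans,
                         {"EMAIL_ADDRESS": "[EMAIL]",
                          "DEFAULT": "[REDACTED]"}))
# → Email [EMAIL] about docket [REDACTED].
```

The right-to-left application order is the same reason Presidio's anonymizer sorts spans before operating: replacing left-to-right would invalidate every offset to the right of the first substitution.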


3. Example: Custom Recognizer for Bar & Docket Numbers

from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


# --- Custom recognizers for legal identifiers ---
bar_pattern = Pattern(
    name="CA bar number",
    regex=r"\b(?:CA\s*Bar\s*(?:No\.?|#)?\s*)(\d{5,7})\b",
    score=0.85,
)
bar_recognizer = PatternRecognizer(
    supported_entity="BAR_NUMBER",
    patterns=[bar_pattern],
    context=["bar", "attorney", "counsel"],
)

docket_pattern = Pattern(
    name="Federal docket",
    regex=r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{5}\b",   # e.g. 3:24-cv-01234
    score=0.9,
)
docket_recognizer = PatternRecognizer(
    supported_entity="DOCKET_NUMBER",
    patterns=[docket_pattern],
    context=["docket", "case", "civil action"],
)


def build_analyzer() -> AnalyzerEngine:
    nlp_engine = NlpEngineProvider(nlp_configuration={
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }).create_engine()

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()
    registry.add_recognizer(bar_recognizer)
    registry.add_recognizer(docket_recognizer)

    return AnalyzerEngine(nlp_engine=nlp_engine, registry=registry,
                          supported_languages=["en"])


if __name__ == "__main__":
    analyzer = build_analyzer()
    anonymizer = AnonymizerEngine()

    text = (
        "Per CA Bar No. 234567, counsel appeared in 3:24-cv-01234 on behalf "
        "of Jane Doe (jane@example.com, SSN 123-45-6789)."
    )

    results = analyzer.analyze(text=text, language="en")
    for r in results:
        print(r)   # RecognizerResult(entity_type=..., start=..., end=..., score=...)

    anon = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "DEFAULT":        OperatorConfig("replace", {"new_value": "[REDACTED]"}),
            "BAR_NUMBER":     OperatorConfig("mask",
                                             {"chars_to_mask": 10, "masking_char": "*",
                                              "from_end": True}),
            "DOCKET_NUMBER":  OperatorConfig("hash", {"hash_type": "sha256"}),
            "US_SSN":         OperatorConfig("replace", {"new_value": "[SSN]"}),
            "EMAIL_ADDRESS":  OperatorConfig("replace", {"new_value": "[EMAIL]"}),
        },
    )
    print(anon.text)

4. Composing with the In-House Pipeline

Presidio results are merged with the hand-written regex/spaCy spans from the PII redaction page. The merge step resolves overlaps by confidence and category priority (the same ordering used inside the house redactor):
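A minimal sketch of that merge step (the priority table and helper names here are illustrative, not the house redactor's actual code):

```python
from collections import namedtuple

# Minimal span shape shared by both detector stacks; field names
# mirror Presidio's RecognizerResult.
Span = namedtuple("Span", "entity_type start end score")

# Illustrative category priority — lower number wins a confidence tie.
PRIORITY = {"US_SSN": 0, "BAR_NUMBER": 1, "DOCKET_NUMBER": 2,
            "EMAIL_ADDRESS": 3, "PERSON": 4}


def merge_spans(*span_lists):
    """Greedy overlap resolution: rank all candidate spans by score
    (ties broken by category priority), then keep a span only if it
    does not overlap one already kept."""
    candidates = [s for lst in span_lists for s in lst]
    ranked = sorted(candidates,
                    key=lambda s: (-s.score,
                                   PRIORITY.get(s.entity_type, 99)))
    kept = []
    for s in ranked:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


presidio_spans = [Span("DOCKET_NUMBER", 36, 49, 0.90)]
house_spans = [Span("PERSON", 30, 40, 0.70),     # overlaps the docket span
               Span("EMAIL_ADDRESS", 6, 22, 0.99)]
# The lower-scoring PERSON span loses to the docket span it overlaps.
print(merge_spans(presidio_spans, house_spans))
```

Greedy highest-score-first resolution is simple and deterministic; it does mean a single high-confidence span can suppress two adjacent lower-confidence spans that each overlap it, which is usually the desired behavior for redaction.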


5. Confidence Thresholds & Context
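Presidio's analyze() accepts a global score_threshold, and the context word lists on each PatternRecognizer let its context-aware enhancer boost a match's score when words like "docket" or "counsel" appear nearby. A single global threshold is often too blunt, though: per-entity thresholds need a post-filter. A minimal sketch (the threshold values are illustrative, not tuned):

```python
from collections import namedtuple

Span = namedtuple("Span", "entity_type start end score")

# Illustrative per-entity thresholds: tightly structured patterns
# (docket numbers) can demand more confidence than broad NER
# entities (PERSON).
THRESHOLDS = {"DOCKET_NUMBER": 0.85, "PERSON": 0.60}
DEFAULT_THRESHOLD = 0.50


def filter_by_confidence(spans):
    """Drop spans scoring below their entity type's threshold."""
    return [s for s in spans
            if s.score >= THRESHOLDS.get(s.entity_type,
                                         DEFAULT_THRESHOLD)]


spans = [Span("DOCKET_NUMBER", 36, 49, 0.90),
         Span("PERSON", 50, 58, 0.55),     # below the PERSON bar → dropped
         Span("BAR_NUMBER", 4, 21, 0.85)]  # uses DEFAULT_THRESHOLD
print([s.entity_type for s in filter_by_confidence(spans)])
# → ['DOCKET_NUMBER', 'BAR_NUMBER']
```

For redaction pipelines the usual bias is toward recall: set DEFAULT_THRESHOLD low and raise it only for entity types whose false positives are demonstrably noisy.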


6. Trade-offs vs. Rolling Your Own

