Output Filtering & Canary Tokens

Ingest-time redaction (see the redaction page) removes PII and privileged content from prompts before inference. But an LLM can still synthesize or reconstruct sensitive content in its response — by paraphrasing a tokenized name back into the original, by guessing identifiers from context, or by echoing content that slipped through the redactor. Output filtering is the last line of defense: every response is scanned before it leaves the trust boundary.

Canary tokens complement filtering by turning exfiltration attempts into detectable events: unique strings planted in prompts that, if they ever appear in an outbound response or external system, reveal a leak path.



1. Why Filter Output

Three failure classes motivate output filtering:

- Reconstruction: the model paraphrases a redacted placeholder back into the original value, using context clues or parametric knowledge.
- Inference: the model guesses identifiers that were never in the prompt at all, synthesizing plausible names or account numbers from context.
- Pass-through: sensitive content that slipped past the ingest redactor is echoed verbatim in the response.

2. Leak-Detection Scanner

The scanner runs the same regex + NER stack as the redactor, but over the response. Any PII it finds in an outbound message is a leak by definition — the prompt was already redacted, so the model should have nothing left to reveal.

from dataclasses import dataclass
from redactor import redact   # same module from the PII page


@dataclass
class LeakReport:
    leaked_spans: list
    canary_hits: list
    redacted_reemerged: list


def scan_response(response: str, prompt_tokens: set[str],
                  canaries: set[str], secret: bytes) -> LeakReport:
    _, spans = redact(response, secret=secret)

    # 1. Any new PII in the response is a leak.
    leaked = list(spans)

    # 2. A redacted placeholder that appeared in the prompt should stay a
    #    placeholder in the response. If the ORIGINAL text resurfaces, that is
    #    model reconstruction.
    reemerged = []
    for s in spans:
        token = f"[{s.label}:...]"   # lookup real token via vault in production
        if token in prompt_tokens and s.text not in prompt_tokens:
            reemerged.append(s)

    # 3. Canary tokens must never appear in output.
    canary_hits = [c for c in canaries if c in response]

    return LeakReport(leaked, canary_hits, reemerged)

3. Re-emergence of Redacted Spans

The most subtle failure: the prompt contains [PERSON:a3f91b2c] and the response contains Jane Doe. The model has re-identified the redacted span, either from context clues ("the CFO mentioned earlier") or from parametric knowledge (if Jane Doe is a well-known public figure). To detect this, the scanner needs the token→original map from the vault for the current session and checks whether any response span matches an original whose token appears in the prompt.

When re-emergence is detected, the default policy is refuse and log rather than re-mask — because the mere fact that the model could reconstruct the identifier is information leakage about the underlying data.
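The check described above can be sketched as follows. Names like `vault_map` and `response_spans` are illustrative; in production the token→original lookup goes through the vault API, not an in-memory dict:

```python
def find_reemerged(prompt: str,
                   response_spans: list[tuple[str, str]],   # (label, text)
                   vault_map: dict[str, str]) -> list[str]:
    """Return original values that resurfaced in the response.

    vault_map maps placeholder token -> original text for this session.
    A span counts as re-emerged only if its placeholder actually appeared
    in the (already-redacted) prompt the model saw.
    """
    # Originals whose placeholders were shown to the model in this prompt.
    at_risk = {orig for tok, orig in vault_map.items() if tok in prompt}
    return [text for _, text in response_spans if text in at_risk]
```

With the example from above, a vault map of `{"[PERSON:a3f91b2c]": "Jane Doe"}` and a response span `("PERSON", "Jane Doe")` yields a re-emergence hit, while an unrelated name does not.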


4. Canary Tokens

Plant a unique, high-entropy string in each prompt (or in sentinel documents in the corpus). If that string ever appears in: (a) an outbound response, (b) a model provider's logs, (c) a third-party tool's request, or (d) an external mail/HTTP destination, you have a verifiable leak signal and a timestamp narrowing the source.

import secrets

import alert  # assumed internal paging client, not a published library


def mint_canary(session_id: str) -> str:
    # High-entropy, URL-safe, recognizable prefix for grep-ability.
    return f"CANARY-{session_id[:6]}-{secrets.token_urlsafe(16)}"


def inject_canary_into_system_prompt(system_prompt: str, canary: str) -> str:
    return (
        f"{system_prompt}\n\n"
        f"Internal trace id: {canary}. "
        "Do not repeat this trace id in any response or tool call."
    )


def webhook_canary_alert(canary: str, where: str, detail: dict) -> None:
    # Page oncall; canary sighting is a high-signal event.
    alert.page(severity="high", title=f"Canary {canary} seen in {where}",
               detail=detail)

Canary tokens also appear in sentinel rows in the vector store: a fake "matter" containing a canary string, never referenced in any legitimate query. Any retrieval that hits the sentinel, or any response that echoes its canary, is definitionally abnormal.
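A minimal check over retrieved chunks, assuming the retrieval layer exposes chunk text and that `sentinel_canaries` holds whatever `mint_canary` produced for the planted documents:

```python
def sentinel_hits(retrieved_chunks: list[str],
                  sentinel_canaries: set[str]) -> list[str]:
    # Legitimate queries never reference the sentinel matter, so any hit
    # here means the retriever (or an attacker-steered query) reached it.
    return [c for c in sentinel_canaries
            if any(c in chunk for chunk in retrieved_chunks)]
```

Any non-empty result should route to the same high-severity alert path as an outbound canary sighting.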


5. Response Policy: Block, Mask, or Refuse

The scanner reports a leak; the orchestration layer chooses a policy:

- Block: drop the response and return a generic error; nothing leaves the trust boundary.
- Mask: re-apply redaction to the leaked spans in place and release the rest of the response. Suitable for ordinary PII the scanner can localize.
- Refuse: return an explicit refusal and log the event. The default for re-emergence (Section 3), where even a masked response would confirm that reconstruction happened.
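One way to wire the decision, as a sketch over the `LeakReport` fields; the `Action` enum and the precedence order are illustrative, not a fixed API:

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"       # release as-is
    MASK = "mask"       # re-redact leaked spans, then release
    REFUSE = "refuse"   # block the response and log


def choose_action(leaked_spans: list, canary_hits: list,
                  reemerged: list) -> Action:
    # Canary sightings and re-emergence are near-certain signals: refuse.
    if canary_hits or reemerged:
        return Action.REFUSE
    # Ordinary PII spans can be masked in place and the rest released.
    if leaked_spans:
        return Action.MASK
    return Action.PASS
```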


6. Tuning Precision vs. Recall

Output filtering has the opposite tuning from ingest redaction. At ingest, recall matters most — false negatives leak data. At output, precision matters too — false positives make the assistant unusable when every response is refused. The recommended approach is to tier detectors by precision: canary matches and vault-confirmed re-emergence are near-exact and can refuse outright; regex hits for well-structured identifiers can mask; statistical NER hits, the noisiest tier, should log for review before they are ever allowed to block. Measure each detector's false-positive rate on benign traffic before enabling a blocking action for it.
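One illustrative tiering as a config fragment; the detector names and actions are assumptions for this sketch, not fixed policy:

```python
# Higher-precision detectors earn more aggressive actions.
POLICY_BY_DETECTOR = {
    "canary": "refuse",       # exact string match: near-zero false positives
    "reemergence": "refuse",  # vault-confirmed reconstruction
    "regex_pii": "mask",      # precise patterns (card numbers, SSNs)
    "ner_pii": "log",         # statistical NER: noisiest, review offline
}
```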

