Right-to-Erasure (GDPR Art. 17) in a Vector Store

GDPR Article 17 — the right to erasure, colloquially the "right to be forgotten" — requires that, when a data subject withdraws consent or objects to processing, the controller delete their personal data "without undue delay." For a relational database the implementation is familiar: DELETE WHERE user_id = ?. For a RAG pipeline with embeddings, summaries, fine-tunes, and retrieval caches, the data in scope has proliferated into many derivatives — and an embedding is still personal data when it was computed from personal data.



1. What "Delete My Data" Actually Means

An erasure request reaches into every derivative of the subject's content:

- the raw documents and chunks in the source store,
- chunk embeddings in every vector index,
- generated summaries and extracted metadata,
- retrieval caches and prompt/response logs,
- fine-tuned model artifacts trained on the subject's data.

2. Tombstoning & Soft Delete

Hard-deleting embeddings from a live index is expensive and often impossible in approximate-nearest-neighbor structures (HNSW, IVF) without a full rebuild. Tombstoning is the standard interim measure: mark the vector as deleted, filter it from search results, and compact on a schedule.
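A minimal sketch of that pattern. All names here are illustrative: BruteForceIndex is an exact-search stand-in for a real ANN index (HNSW, IVF), and the wrapper shows only the three moves the text describes — mark, filter, compact.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    id: str
    score: float


class BruteForceIndex:
    """Exact-search stand-in for a real ANN index (illustrative only)."""

    def __init__(self):
        self.vectors: dict[str, list[float]] = {}

    def add(self, vid: str, vec: list[float]) -> None:
        self.vectors[vid] = vec

    def search(self, query: list[float], k: int) -> list[Hit]:
        def dist(v: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, v))
        ranked = sorted(self.vectors.items(), key=lambda kv: dist(kv[1]))
        return [Hit(vid, dist(vec)) for vid, vec in ranked[:k]]

    def rebuild(self, exclude: set[str]) -> None:
        self.vectors = {v: x for v, x in self.vectors.items() if v not in exclude}


class TombstoningIndex:
    """Soft delete: vectors stay in the ANN structure but are filtered
    from every result set until the next scheduled compaction."""

    def __init__(self, inner):
        self.inner = inner
        self.tombstones: set[str] = set()

    def delete(self, vector_id: str) -> None:
        self.tombstones.add(vector_id)

    def search(self, query: list[float], k: int) -> list[Hit]:
        # Over-fetch so filtering tombstones can still yield k live hits.
        hits = self.inner.search(query, k + len(self.tombstones))
        return [h for h in hits if h.id not in self.tombstones][:k]

    def compact(self) -> None:
        # Rebuild without tombstoned vectors, then clear the markers.
        self.inner.rebuild(self.tombstones)
        self.tombstones.clear()
```

The over-fetch in search is the usual trade-off: tombstoned vectors still cost memory and query time until compaction runs.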


3. Tenant-Scoped Indexes

Keep indexes scoped to the narrowest safe unit — per tenant, and inside a tenant per matter. When a whole matter is erased, the cheapest implementation is to drop its index outright. This is dramatically cheaper than erasing individual vectors from a shared index and removes ambiguity about cross-matter residue.
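A sketch of the routing this implies, assuming a registry keyed by (tenant, matter); IndexRegistry and make_index are hypothetical names, and the factory stands for whatever builds a physical index in your stack.

```python
class IndexRegistry:
    """One physical index per (tenant, matter). Illustrative sketch:
    'make_index' is any factory returning a fresh vector index."""

    def __init__(self, make_index):
        self.make_index = make_index
        self.indexes: dict[tuple[str, str], object] = {}

    def index_for(self, tenant_id: str, matter_id: str):
        key = (tenant_id, matter_id)
        if key not in self.indexes:
            self.indexes[key] = self.make_index()
        return self.indexes[key]

    def erase_matter(self, tenant_id: str, matter_id: str) -> bool:
        # Erasing a whole matter is a single index drop: no per-vector
        # tombstoning, no cross-matter residue in a shared structure.
        return self.indexes.pop((tenant_id, matter_id), None) is not None

    def erase_tenant(self, tenant_id: str) -> int:
        keys = [k for k in self.indexes if k[0] == tenant_id]
        for k in keys:
            del self.indexes[k]
        return len(keys)
```

The narrower the scope, the more often erasure degenerates into the cheap case of dropping an index outright.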


4. Re-embedding Schedules

Embedding models change. When you re-embed the corpus (new model version, new chunking strategy), the erasure work must re-run: old vectors are rebuilt only from documents that still exist in the raw store. Wire erasure into the re-embed pipeline so a tombstoned raw document cannot accidentally produce fresh embeddings on the next rebuild.
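The wiring can be sketched as follows. The store interfaces (RawStore, VectorStore, the embed callable) are assumptions for illustration; the point is only the control flow, in which a tombstoned document is never re-embedded and any stale vectors it left behind are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    id: str
    text: str
    tombstoned: bool = False


@dataclass
class RawStore:
    docs: list[Doc] = field(default_factory=list)

    def all_docs(self) -> list[Doc]:
        return list(self.docs)


@dataclass
class VectorStore:
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def replace(self, doc_id: str, vec: list[float]) -> None:
        self.vectors[doc_id] = vec

    def tombstone_by_doc(self, doc_ids: list[str]) -> None:
        for d in doc_ids:
            self.vectors.pop(d, None)


def reembed_corpus(raw: RawStore, vecs: VectorStore, embed) -> int:
    """Erasure-aware rebuild: skip tombstoned raw documents and purge
    any vectors they produced under the previous model version."""
    fresh = 0
    for doc in raw.all_docs():
        if doc.tombstoned:
            vecs.tombstone_by_doc([doc.id])
            continue
        vecs.replace(doc.id, embed(doc.text))
        fresh += 1
    return fresh
```

Because the check runs on every rebuild, an erasure that happened between model versions cannot silently resurface as fresh embeddings.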


5. Example: Erasure Handler

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErasureRequest:
    tenant_id: str
    subject_id: str          # the data subject
    reason: str              # "gdpr-art17" | "ccpa-deletion" | "contract-termination"
    received_at: datetime


def handle_erasure(req: ErasureRequest, stores, audit) -> None:
    audit.log("erasure.start", tenant=req.tenant_id, subject=req.subject_id,
              reason=req.reason)

    # 1. Identify every document tied to the subject.
    doc_ids = stores.metadata.docs_for_subject(req.tenant_id, req.subject_id)

    # 2. Tombstone the raw documents so re-embed jobs skip them.
    for d in doc_ids:
        stores.raw.mark_tombstoned(d, ts=req.received_at)

    # 3. Tombstone vectors and purge retrieval cache immediately (user-visible).
    stores.vectors.tombstone_by_doc(doc_ids)
    stores.retrieval_cache.purge_by_doc(doc_ids)

    # 4. Purge prompt/response logs EXCEPT audit trail (legal-hold exemption).
    stores.prompt_logs.purge_by_subject(req.tenant_id, req.subject_id,
                                        keep_audit=True)

    # 5. Schedule index compaction within the internal SLA.
    stores.jobs.enqueue("compact_index", tenant=req.tenant_id,
                        run_after=req.received_at, priority="gdpr")

    # 6. Record the completion promise; follow up after compaction verifies.
    audit.log("erasure.tombstoned", tenant=req.tenant_id,
              subject=req.subject_id, doc_count=len(doc_ids))


def verify_erasure(tenant_id: str, subject_id: str, stores) -> bool:
    """Post-compaction check: no live records remain."""
    return (
        stores.vectors.count_live_by_subject(tenant_id, subject_id) == 0 and
        stores.retrieval_cache.count_by_subject(tenant_id, subject_id) == 0
    )

6. The Hard Cases: Fine-Tunes & Caches

Fine-tuned model weights cannot be selectively erased: once a subject's data has influenced training, there is no per-record delete. The practical controls are upstream — keep erased subjects out of the next training set and retrain on a schedule, the same way erasure is wired into the re-embed pipeline. Caches are simpler but easy to miss: every retrieval and prompt/response cache needs either subject-keyed purging or a TTL short enough to satisfy "without undue delay."
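For the fine-tune side, the upstream control can be as small as an exclusion filter on the training manifest; build_finetune_manifest and its field names are hypothetical.

```python
def build_finetune_manifest(candidate_docs: list[dict],
                            erased_subjects: set[str]) -> list[dict]:
    """Weights can't be selectively erased, so the practical control is
    keeping erased subjects out of the next training run (illustrative)."""
    return [d for d in candidate_docs if d["subject_id"] not in erased_subjects]
```

Run this against the erasure ledger immediately before every training job, so the filter reflects requests received since the last fine-tune.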
