Right-to-Erasure (GDPR Art. 17) in a Vector Store

GDPR Article 17 — the right to erasure, colloquially the "right to be forgotten" — requires that, when a data subject withdraws consent or objects to processing, the controller delete their personal data "without undue delay." For a relational database the implementation is familiar: DELETE WHERE user_id = ?. For a RAG pipeline with embeddings, summaries, fine-tunes, and retrieval caches, the data in scope has proliferated into many derivatives — and an embedding is still personal data when it was computed from personal data.



1. What "Delete My Data" Actually Means

An erasure request reaches into every derivative of the subject's content:

- the raw documents and chunks in the source store,
- chunk embeddings in every vector index,
- generated summaries and extracted metadata,
- retrieval caches and prompt/response logs,
- fine-tuned model artifacts trained on the subject's data.

2. Tombstoning & Soft Delete

Hard-deleting embeddings from a live index is expensive and often impossible in approximate-nearest-neighbor structures (HNSW, IVF) without a full rebuild. Tombstoning is the standard interim measure: mark the vector as deleted, filter it from search results, and compact on a schedule.
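A minimal sketch of that pattern. All names here are illustrative: BruteForceIndex is an exact-search stand-in for a real ANN index (HNSW, IVF), and the wrapper shows only the three moves the text describes — mark, filter, compact.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    id: str
    score: float


class BruteForceIndex:
    """Exact-search stand-in for a real ANN index (illustrative only)."""

    def __init__(self):
        self.vectors: dict[str, list[float]] = {}

    def add(self, vid: str, vec: list[float]) -> None:
        self.vectors[vid] = vec

    def search(self, query: list[float], k: int) -> list[Hit]:
        def dist(v: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, v))
        ranked = sorted(self.vectors.items(), key=lambda kv: dist(kv[1]))
        return [Hit(vid, dist(vec)) for vid, vec in ranked[:k]]

    def rebuild(self, exclude: set[str]) -> None:
        self.vectors = {v: x for v, x in self.vectors.items() if v not in exclude}


class TombstoningIndex:
    """Soft delete: vectors stay in the ANN structure but are filtered
    from every result set until the next scheduled compaction."""

    def __init__(self, inner):
        self.inner = inner
        self.tombstones: set[str] = set()

    def delete(self, vector_id: str) -> None:
        self.tombstones.add(vector_id)

    def search(self, query: list[float], k: int) -> list[Hit]:
        # Over-fetch so filtering tombstones can still yield k live hits.
        hits = self.inner.search(query, k + len(self.tombstones))
        return [h for h in hits if h.id not in self.tombstones][:k]

    def compact(self) -> None:
        # Rebuild without tombstoned vectors, then clear the markers.
        self.inner.rebuild(self.tombstones)
        self.tombstones.clear()
```

The over-fetch in search is the usual trade-off: tombstoned vectors still cost memory and query time until compaction runs.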


3. Tenant-Scoped Indexes

Keep indexes scoped to the narrowest safe unit — per tenant, and inside a tenant per matter. When a whole matter is erased, the cheapest implementation is to drop its index outright. This is dramatically cheaper than erasing individual vectors from a shared index and removes ambiguity about cross-matter residue.
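A sketch of the routing this implies, assuming a registry keyed by (tenant, matter); IndexRegistry and make_index are hypothetical names, and the factory stands for whatever builds a physical index in your stack.

```python
class IndexRegistry:
    """One physical index per (tenant, matter). Illustrative sketch:
    'make_index' is any factory returning a fresh vector index."""

    def __init__(self, make_index):
        self.make_index = make_index
        self.indexes: dict[tuple[str, str], object] = {}

    def index_for(self, tenant_id: str, matter_id: str):
        key = (tenant_id, matter_id)
        if key not in self.indexes:
            self.indexes[key] = self.make_index()
        return self.indexes[key]

    def erase_matter(self, tenant_id: str, matter_id: str) -> bool:
        # Erasing a whole matter is a single index drop: no per-vector
        # tombstoning, no cross-matter residue in a shared structure.
        return self.indexes.pop((tenant_id, matter_id), None) is not None

    def erase_tenant(self, tenant_id: str) -> int:
        keys = [k for k in self.indexes if k[0] == tenant_id]
        for k in keys:
            del self.indexes[k]
        return len(keys)
```

The narrower the scope, the more often erasure degenerates into the cheap case of dropping an index outright.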


4. Re-embedding Schedules

Embedding models change. When you re-embed the corpus (new model version, new chunking strategy), the erasure work must re-run: old vectors are rebuilt only from documents that still exist in the raw store. Wire erasure into the re-embed pipeline so a tombstoned raw document cannot accidentally produce fresh embeddings on the next rebuild.
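The wiring can be sketched as follows. The store interfaces (RawStore, VectorStore, the embed callable) are assumptions for illustration; the point is only the control flow, in which a tombstoned document is never re-embedded and any stale vectors it left behind are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    id: str
    text: str
    tombstoned: bool = False


@dataclass
class RawStore:
    docs: list[Doc] = field(default_factory=list)

    def all_docs(self) -> list[Doc]:
        return list(self.docs)


@dataclass
class VectorStore:
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def replace(self, doc_id: str, vec: list[float]) -> None:
        self.vectors[doc_id] = vec

    def tombstone_by_doc(self, doc_ids: list[str]) -> None:
        for d in doc_ids:
            self.vectors.pop(d, None)


def reembed_corpus(raw: RawStore, vecs: VectorStore, embed) -> int:
    """Erasure-aware rebuild: skip tombstoned raw documents and purge
    any vectors they produced under the previous model version."""
    fresh = 0
    for doc in raw.all_docs():
        if doc.tombstoned:
            vecs.tombstone_by_doc([doc.id])
            continue
        vecs.replace(doc.id, embed(doc.text))
        fresh += 1
    return fresh
```

Because the check runs on every rebuild, an erasure that happened between model versions cannot silently resurface as fresh embeddings.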


5. Example: Erasure Handler

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErasureRequest:
    tenant_id: str
    subject_id: str          # the data subject
    reason: str              # "gdpr-art17" | "ccpa-deletion" | "contract-termination"
    received_at: datetime


def handle_erasure(req: ErasureRequest, stores, audit) -> None:
    audit.log("erasure.start", tenant=req.tenant_id, subject=req.subject_id,
              reason=req.reason)

    # 1. Identify every document tied to the subject.
    doc_ids = stores.metadata.docs_for_subject(req.tenant_id, req.subject_id)

    # 2. Tombstone the raw documents so re-embed jobs skip them.
    for d in doc_ids:
        stores.raw.mark_tombstoned(d, ts=req.received_at)

    # 3. Tombstone vectors and purge retrieval cache immediately (user-visible).
    stores.vectors.tombstone_by_doc(doc_ids)
    stores.retrieval_cache.purge_by_doc(doc_ids)

    # 4. Purge prompt/response logs EXCEPT audit trail (legal-hold exemption).
    stores.prompt_logs.purge_by_subject(req.tenant_id, req.subject_id,
                                        keep_audit=True)

    # 5. Schedule index compaction within the internal SLA.
    stores.jobs.enqueue("compact_index", tenant=req.tenant_id,
                        run_after=req.received_at, priority="gdpr")

    # 6. Record the completion promise; follow up after compaction verifies.
    audit.log("erasure.tombstoned", tenant=req.tenant_id,
              subject=req.subject_id, doc_count=len(doc_ids))


def verify_erasure(tenant_id: str, subject_id: str, stores) -> bool:
    """Post-compaction check: no live records remain."""
    return (
        stores.vectors.count_live_by_subject(tenant_id, subject_id) == 0 and
        stores.retrieval_cache.count_by_subject(tenant_id, subject_id) == 0
    )

6. The Hard Cases: Fine-Tunes & Caches

Fine-tuned model weights cannot be selectively erased: once a subject's data has influenced training, there is no per-record delete. The practical controls are upstream — keep erased subjects out of the next training set and retrain on a schedule, the same way erasure is wired into the re-embed pipeline. Caches are simpler but easy to miss: every retrieval and prompt/response cache needs either subject-keyed purging or a TTL short enough to satisfy "without undue delay."
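For the fine-tune side, the upstream control can be as small as an exclusion filter on the training manifest; build_finetune_manifest and its field names are hypothetical.

```python
def build_finetune_manifest(candidate_docs: list[dict],
                            erased_subjects: set[str]) -> list[dict]:
    """Weights can't be selectively erased, so the practical control is
    keeping erased subjects out of the next training run (illustrative)."""
    return [d for d in candidate_docs if d["subject_id"] not in erased_subjects]
```

Run this against the erasure ledger immediately before every training job, so the filter reflects requests received since the last fine-tune.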
