Large-Scale RAG: 10M Documents, 100k Users, p95 < 2s

The brief is concrete: build a retrieval-augmented generation service that indexes 10 million documents, serves 100,000 monthly active users, and returns grounded answers with a p95 latency under 2 seconds end-to-end. The corpus is a mix of PDFs, internal wiki pages, and Slack exports averaging 4 KB of cleaned text per document. Users ask natural-language questions and expect citations to source spans.

Most "RAG demo to production" failures come from skipping the boring work: hybrid retrieval, a real reranker, query-result caching, and a freshness strategy that does not melt the embedding bill. This walk-through is opinionated about all four.



1. Problem & Functional Requirements

The system must:

- Ingest and index a 10M-document corpus of PDFs, internal wiki pages, and Slack exports, with new uploads searchable within 5 minutes.
- Answer natural-language questions with grounded answers that cite specific source spans.
- Enforce per-tenant isolation: a query never returns another tenant's content.
- Support document update and deletion, including on-demand GDPR erasure.

Out of scope for this design: agentic tool use, multi-turn conversational memory beyond the current question, and fine-tuning. Those are separate problems.


2. Non-Functional Requirements & SLOs

Metric                    Target
End-to-end p50 latency    800 ms
End-to-end p95 latency    2,000 ms
End-to-end p99 latency    4,000 ms
Sustained QPS             50 (peak ~200)
Availability              99.9% monthly (43 min downtime budget)
Document freshness        ≤ 5 min from upload to searchable
Cost ceiling              < $0.02 per query all-in

The 50 QPS figure is derived from 100k MAU × 30 queries/user/month ÷ (30 days × 86,400 s) ≈ 1.2 QPS average; a 4× daily peak puts busy-hour load near 5 QPS, and a further 10× burst factor gives the ~50 QPS sustained design point. Plan for ~200 QPS burst capacity on top of that.
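The derivation can be checked directly (numbers from the SLO table; the 4× and 10× factors are the planning multipliers stated above):

```python
# Back-of-envelope QPS derivation from the SLO table.
MAU = 100_000
QUERIES_PER_USER_PER_MONTH = 30
SECONDS_PER_MONTH = 30 * 86_400

avg_qps = MAU * QUERIES_PER_USER_PER_MONTH / SECONDS_PER_MONTH  # ~1.16
design_qps = avg_qps * 4 * 10   # 4x daily peak, 10x burst -> ~46, rounded up to 50
peak_qps = design_qps * 4       # burst planning target, ~200 after rounding

print(round(avg_qps, 2), round(design_qps), round(peak_qps))
```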


3. Capacity Estimates

Chunking and embedding count. 10M docs × 4 KB cleaned text ÷ 1 KB per chunk (about 250 tokens) ≈ 40M chunks.

Embedding storage. Using bge-large-en-v1.5 at 1024 dimensions, float16 = 2 bytes/dim: 40M chunks × 1024 dims × 2 B ≈ 82 GB of raw vectors, plus HNSW graph overhead on top. The whole index needs to fit in the database instance's RAM (see §5).

Embedding throughput at ingest. Initial backfill of 40M chunks on a single A10G running vLLM-served bge-large batched at 256 sequences ≈ 3,000 chunks/sec → 40M / 3,000 ≈ 3.7 hours. With 4× parallelism < 1 hour. After backfill, steady-state < 1 chunk/sec.

Query-time inference. Embedding the user query is one forward pass ≈ 15 ms on an L4. Cross-encoder rerank of top-50 candidates with bge-reranker-large ≈ 60 ms batched. Vector ANN search at k=50 ≈ 30 ms. LLM answer generation (Claude Haiku or Llama 3.1 8B) at ~4,000 input tokens (system prompt + 8 reranked chunks) + 200 output tokens ≈ 700 ms, dominated by decode.
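These estimates are easy to sanity-check in a few lines (chunk size, dimensions, and throughput are the figures assumed above):

```python
# Capacity sanity check for the figures in this section.
DOCS = 10_000_000
CHUNKS_PER_DOC = 4               # 4 KB cleaned text / ~1 KB per chunk
DIMS, BYTES_PER_DIM = 1024, 2    # bge-large-en-v1.5, float16

chunks = DOCS * CHUNKS_PER_DOC                      # 40M
raw_vector_bytes = chunks * DIMS * BYTES_PER_DIM    # ~82 GB before HNSW overhead
backfill_hours = chunks / 3_000 / 3_600             # at ~3,000 chunks/sec on one A10G

print(chunks, raw_vector_bytes / 1e9, round(backfill_hours, 1))
```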

Latency budget breakdown (p95 target 2000 ms):

query_embedding:        50 ms   # cold tokenizer + forward pass
hybrid_retrieval:      120 ms   # BM25 + dense ANN in parallel
rerank_top50:          150 ms   # cross-encoder, batched
context_assembly:       30 ms   # fetch chunk text + dedupe
llm_generation:      1,400 ms   # 700 ms typical + 700 ms tail
network/serialization: 250 ms
---
total p95:           2,000 ms

4. High-Level Architecture

                  +------------------+
   user query --->|  API Gateway     |  (auth, tenant_id injection, rate limit)
                  +--------+---------+
                           |
                           v
                  +------------------+        +-------------------+
                  |  Query Planner   |<------>|  Redis Cache      |  (query_hash -> answer)
                  +--------+---------+        +-------------------+
                           |
            +--------------+--------------+
            |                             |
            v                             v
  +-------------------+         +-------------------+
  | Embed Service     |         |  BM25 (Postgres   |
  | (vLLM, bge-large) |         |   GIN / OpenSearch)|
  +---------+---------+         +---------+---------+
            |                             |
            v                             |
  +-------------------+                   |
  | Vector Store      |                   |
  | (pgvector HNSW)   |                   |
  +---------+---------+                   |
            |                             |
            +--------------+--------------+
                           |
                           v
                  +------------------+
                  |  Reranker        |  (cross-encoder, top-50 -> top-8)
                  +--------+---------+
                           |
                           v
                  +------------------+
                  |  LLM Generator   |  (Claude Haiku via Bedrock OR vLLM Llama 8B)
                  +--------+---------+
                           |
                           v
                  +------------------+
                  |  Response        |  (answer + citations + confidence)
                  +------------------+

Component responsibilities, one line each:

- API Gateway: authentication, tenant_id injection, rate limiting.
- Query Planner: cache lookup, retrieval orchestration, degradation logic.
- Redis Cache: query_hash -> answer, 30-minute TTL.
- Embed Service: vLLM-served bge-large for query (and ingest) embeddings.
- BM25: lexical retrieval over a Postgres GIN index (or OpenSearch).
- Vector Store: dense ANN over a pgvector HNSW index.
- Reranker: cross-encoder scoring, top-50 in, top-8 out.
- LLM Generator: Claude Haiku via Bedrock or vLLM-served Llama 3.1 8B.
- Response: answer plus citations and a confidence signal.


5. Data Model & Storage

Postgres is the source of truth for both chunk metadata and vectors. One database, one transaction boundary, fewer integration bugs.

CREATE TABLE documents (
    doc_id        UUID PRIMARY KEY,
    tenant_id     UUID NOT NULL,
    source_uri    TEXT NOT NULL,
    title         TEXT,
    sha256        BYTEA NOT NULL,           -- content hash for dedupe + idempotent re-ingest
    created_at    TIMESTAMPTZ DEFAULT now(),
    updated_at    TIMESTAMPTZ DEFAULT now(),
    deleted_at    TIMESTAMPTZ              -- soft-delete; physically purged by GC job
);
CREATE INDEX ON documents (tenant_id, updated_at);

CREATE TABLE chunks (
    chunk_id      UUID PRIMARY KEY,
    doc_id        UUID NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
    tenant_id     UUID NOT NULL,            -- denormalized for RLS predicate efficiency
    ordinal       INT NOT NULL,             -- chunk index within document
    text          TEXT NOT NULL,
    text_tsv      TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
    embedding     HALFVEC(1024),            -- pgvector >= 0.7 half-precision type, 2 B/dim
                                            -- (the plain VECTOR type is float32, 4 B/dim)
    embed_version SMALLINT NOT NULL,        -- bumped on model upgrade
    page          INT,
    offset_start  INT,
    offset_end    INT
);

CREATE INDEX chunks_hnsw ON chunks USING hnsw (embedding halfvec_cosine_ops)
    WITH (m = 32, ef_construction = 200);
CREATE INDEX chunks_bm25 ON chunks USING gin (text_tsv);
CREATE INDEX chunks_tenant ON chunks (tenant_id);

ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id')::UUID);
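Concretely, the two retrieval legs map onto this schema as two short queries, run in parallel by the planner. A sketch: `$tenant_id`, `$query_vec`, and `$query` are bind-parameter placeholders, the tenant GUC is set per transaction so the RLS policy applies, and ef_search = 100 is a starting point rather than a tuned value.

```sql
BEGIN;
-- Scope RLS to the caller's tenant for this transaction only.
SELECT set_config('app.tenant_id', $tenant_id, true);
SET LOCAL hnsw.ef_search = 100;

-- Dense leg: cosine ANN over the HNSW index (<=> is cosine distance here).
SELECT chunk_id, embedding <=> $query_vec AS distance
FROM chunks
ORDER BY embedding <=> $query_vec
LIMIT 50;

-- Lexical leg: full-text match ranked over the GIN index.
SELECT chunk_id, ts_rank(text_tsv, plainto_tsquery('english', $query)) AS score
FROM chunks
WHERE text_tsv @@ plainto_tsquery('english', $query)
ORDER BY score DESC
LIMIT 50;
COMMIT;
```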

Sharding. At 40M chunks a single Postgres instance (db.r6i.4xlarge, 128 GB RAM) holds the HNSW index in memory. We do not shard until the largest tenant's index exceeds a single instance's RAM budget. When that happens, shard by tenant_id using Citus or a routing layer; never by chunk_id hash, because cross-shard ANN search loses recall.

Retention. Soft delete via deleted_at; nightly GC job hard-deletes after 30 days and removes vectors from the HNSW index. For tenants under GDPR, the GC runs on demand within 72 hours of an erasure request.


6. Critical Path: Query Flow

The hot path for a single query, with timings:

async def answer(query: str, tenant_id: UUID) -> Answer:
    # Step 1: cache lookup (5 ms p99)
    cache_key = hash_key(tenant_id, normalize(query), INDEX_VERSION)
    if cached := await redis.get(cache_key):
        return Answer.parse_raw(cached)

    # Step 2: parallel embed + BM25 (120 ms)
    embed_task  = embed_service.embed(query)
    bm25_task   = pg.fetch_bm25(query, tenant_id, k=50)
    q_vec, bm25_hits = await asyncio.gather(embed_task, bm25_task)

    # Step 3: dense ANN search (30 ms)
    dense_hits = await pg.fetch_ann(q_vec, tenant_id, k=50)

    # Step 4: reciprocal rank fusion (1 ms)
    fused = rrf(dense_hits, bm25_hits, k=60)[:50]

    # Step 5: cross-encoder rerank (150 ms); keep top 8 by score, best first
    scores = await reranker.score(query, [c.text for c in fused])
    top_k = [c for _, c in sorted(zip(scores, fused), key=lambda p: p[0], reverse=True)[:8]]

    # Step 6: LLM generation with citations (1,400 ms p95)
    context = format_context(top_k)
    answer  = await llm.generate(SYSTEM_PROMPT, query, context)

    # Step 7: cache + emit telemetry
    await redis.setex(cache_key, 1800, answer.json())
    return answer

Reciprocal Rank Fusion (RRF) is the boring-and-correct way to combine BM25 and dense scores without learning per-tenant weights. The score is sum(1 / (k + rank_i)) across retrievers; k=60 is the canonical default and works well in practice.
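A minimal implementation, as a sketch that fuses ranked lists of chunk IDs (the planner's version operates on hit objects, but the scoring is identical):

```python
from collections import defaultdict

def rrf(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for hits in ranked_lists:
        for rank, chunk_id in enumerate(hits, start=1):   # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # best fused score first

# A chunk ranked #1 by both retrievers outranks one ranked #1 by only one.
fused = rrf(["a", "b", "c"], ["a", "d", "b"])
```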


7. Scaling & Bottlenecks

What breaks first as load or corpus grows:

  1. LLM generation latency tail. The single largest contributor to p95. Mitigations: (a) stream tokens to the client so time-to-first-token is the user-perceived metric; (b) cap output at 256 tokens with a "want more?" prompt; (c) provision Bedrock Provisioned Throughput Units rather than on-demand if queue waits exceed 200 ms p95.
  2. HNSW recall at high tenant_id filter selectivity. pgvector's HNSW does the filter after graph traversal, so a tenant holding 0.1% of the corpus may need ef_search raised to 200–400 to recover recall. At > 100 tenants, switch to per-tenant partial indexes or move that tenant to a dedicated index.
  3. Postgres connection storms. A spike in QPS turns into a spike in pool checkouts. Run PgBouncer in transaction-pooling mode and size the pool at (cores × 4) + spare, not at max_connections.
  4. Embedding service cold start. A new replica needs ~25 s to load the model into GPU memory. Keep one warm pool replica per AZ; pre-warm by hitting /health with a synthetic embed call after deploy.
  5. Cache stampede on viral queries. When a popular doc lands and everyone asks "what's new?", a single key miss can fan out to N concurrent LLM calls. Use SETNX + short lock or the cachetools single-flight pattern; serve a stale answer for ≤ 60 s while the fresh one computes.

8. Failure Modes & Resilience

Each external dependency has a defined degradation path:

- Redis down: skip the cache and go straight to retrieval; latency rises, correctness is unaffected.
- Embed service down: fall back to BM25-only retrieval and flag the response as degraded.
- Reranker down: serve the RRF-fused order directly (top 8 of the fused list).
- LLM provider down or throttled: return the top chunks as extractive snippets with citations, no generated prose.
- Postgres down: hard failure; it is the source of truth, so serve cached answers where present and 503 otherwise.

Idempotency. Every request carries a client-generated request_id; the planner checks Redis for a 5-min in-flight marker before starting work. Duplicate requests return the same answer or wait on the marker, never trigger duplicate LLM bills.


9. Cost Analysis

Per 1,000 queries, assuming 40% cache hit rate (so 600 LLM calls):

embed_service:
  amortized_l4_gpu:    $0.05   # 1 L4 / 1000 QPS capacity
bedrock_haiku:
  input_tokens:        $0.60   # 600 calls x 4,000 tok x $0.25 / 1M
  output_tokens:       $0.15   # 600 calls x 200 tok x $1.25 / 1M
reranker_gpu:           $0.05  # batched
postgres:               $0.20  # db.r6i.4xlarge amortized
redis_cache:            $0.05
network_egress:         $0.10
api_gateway:            $0.05
---
total_per_1000:         $1.25   # = $0.00125 per query
total_with_overhead:    $0.0025 # per query, doubled for monitoring, logs, support; well under $0.02 ceiling
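Deriving the per-query figure from these assumptions (600 uncached calls; Haiku rates of $0.25 and $1.25 per million input/output tokens):

```python
# Cost-per-query derivation from the stated assumptions.
calls = 1000 * (1 - 0.40)                  # 40% cache hit rate -> 600 LLM calls
input_cost  = calls * 4_000 * 0.25 / 1e6   # Haiku input tokens
output_cost = calls * 200 * 1.25 / 1e6     # Haiku output tokens
infra = 0.05 + 0.05 + 0.20 + 0.05 + 0.10 + 0.05  # embed, rerank, pg, redis, egress, gateway

per_1000 = input_cost + output_cost + infra
per_query_with_overhead = 2 * per_1000 / 1000     # doubled for monitoring/logs/support
print(round(per_1000, 2), round(per_query_with_overhead, 4))
```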

The dominant line item is Bedrock token spend. If usage grows 10×, switch hot tenants to provisioned throughput (~30% cheaper at scale) or to self-hosted Llama 3.1 8B (fixed GPU cost, near-zero marginal token cost).


10. Tradeoffs & Alternatives

pgvector vs Pinecone vs Qdrant. I would pick pgvector at this scale because it keeps the chunk text, metadata, and vector in one transactional store; one less system to operate, one less consistency boundary, and Postgres RLS gives tenant isolation for free. Pinecone wins above ~500M vectors or when you need geo-replication out of the box. Qdrant wins when you need rich payload filtering with high cardinality, and its scalar quantization saves real money at > 100M vectors. The test is operational complexity, not benchmarks: choose Postgres until it hurts.

Bedrock Knowledge Bases vs custom. Bedrock KB ships the same ingest → chunk → embed → OpenSearch pipeline I just described, and removes one team's worth of glue code. The reasons to build custom anyway: (1) control over the chunking algorithm (KB's defaults are weak for legal/medical docs); (2) hybrid retrieval with rerank, which KB does not expose well; (3) multi-region or on-prem deployment; (4) cost transparency at scale.

Embedding refresh strategy. When a new embedding model ships (say bge-large-v2), naively re-embedding 40M chunks ties up the GPU pool for hours, or costs on the order of $2k via a hosted embedding API. The cheap approach: dual-write embeddings under embed_version=2 alongside v1; route 5% of traffic to v2; promote when eval metrics improve; lazy-migrate the long tail by re-embedding chunks on next read. Avoids any downtime.

Tenant isolation: shared index + RLS vs per-tenant indexes. Shared index with RLS scales operationally (one Postgres, one HNSW). Per-tenant indexes scale on the recall axis (no filter penalty) and simplify deletion. Hybrid policy: shared index up to 100 tenants or until any tenant's chunks exceed 5% of total; promote large or regulated tenants to dedicated indexes.

Cache the embeddings of common queries. Below the answer cache, keep a smaller cache of normalized-query → embedding vector. Saves 50 ms and one GPU forward pass on common queries; the embedding is small (2 KB) so a 100k entry cache fits in 200 MB of Redis.


11. Common Interview Q&A

Q1: Why hybrid retrieval instead of dense-only?

Dense embeddings are great at semantic similarity but underperform on rare tokens, exact identifiers (case numbers, SKUs, error codes), and acronyms the model has never seen. BM25 catches those. On internal evals across legal, code, and support corpora, hybrid + RRF beats dense-only by 8–15% on top-5 recall, and the implementation cost is one Postgres GIN index. There is no defensible reason to ship dense-only in production unless you literally have no exact identifiers in your corpus.

Q2: How do you decide chunk size?

Start at 512 tokens with 64-token overlap; this is the sweet spot for most corpora and matches what the embedding model was trained on. Tune by measuring recall@k on a held-out eval set: if recall is low and chunks are long, the relevant span is being diluted by surrounding text — shrink to 256. If recall is high but the LLM gets confused, chunks are too short and missing context — grow to 1024. Never just pick a number from a blog post.
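A sliding-window splitter matching those defaults is a few lines. This sketch operates on a pre-tokenized list; a real pipeline would use the embedding model's tokenizer and prefer semantic boundaries as Q6 describes.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Fixed-size sliding window; each chunk shares `overlap` tokens with the previous."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```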

Q3: How do you handle a tenant requesting GDPR erasure of a single document?

Set deleted_at on the document row; retrieval excludes soft-deleted documents immediately, and the on-demand GC job hard-deletes within 72 h, which cascades to chunks via ON DELETE CASCADE. pgvector's HNSW does not physically remove graph entries on row delete: dead tuples are skipped at query time by heap visibility checks, and the GC job's vacuum/reindex reclaims the space, satisfying GDPR Art. 17 timing. Until the hard delete lands, RLS and the soft-delete filter prevent the chunks from being returned. Audit logs of the original ingest remain (legitimate-interest basis) but never contain the chunk text itself.

Q4: When would you switch from pgvector to a dedicated vector DB?

Three triggers. (1) The HNSW index no longer fits in a single Postgres instance's RAM and partitioning by tenant is no longer enough — typically above 200–500M vectors. (2) Filter selectivity becomes pathological because RLS forces post-filter on a large shared index, and the pgvector roadmap for pre-filtered HNSW has not landed yet. (3) The team needs operational features Postgres does not have: snapshot replication of just the vector tier, real-time multi-region, or scalar quantization for cost. Until then, the integration cost of a separate system is not worth the marginal performance.

Q5: How do you prevent hallucination?

Three layers: (1) the system prompt requires every claim to cite a chunk ID from the provided context, and the planner post-validates that every cited chunk was actually in the retrieval; (2) when the top reranker score falls below a tuned threshold (e.g. 0.3), the planner returns "I don't have enough information to answer" rather than calling the LLM at all; (3) an output classifier flags answers whose entities do not appear in the cited chunks for human review. None of this guarantees zero hallucination, but it makes the common ones cheap to detect and fix.
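Layer (1)'s post-validation reduces to a set check. A sketch; the `[chunk:<id>]` citation marker format is an illustrative assumption, not a fixed convention:

```python
import re

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Reject answers that cite chunk IDs absent from the retrieval set.

    Returns (ok, bogus_ids); ok also requires at least one citation.
    """
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    bogus = cited - retrieved_ids
    return (bool(cited) and not bogus), bogus
```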

Q6: Walk me through what happens when a customer uploads a 200-page PDF.

S3 PUT triggers an SQS message to the ingest worker. Worker fetches the PDF, runs OCR-or-text-extract via unstructured.io or Textract, normalizes to clean text. Splits at semantic boundaries (headings, paragraphs) targeting 512 tokens with 64 overlap, producing roughly 400 chunks for a 200-page legal PDF. Inserts the document row and chunk rows in one transaction (chunks get embedding=NULL). An async embed job picks up null-embedding chunks in batches of 256, calls the vLLM-served bge-large endpoint, and updates the rows. Postgres' HNSW index incrementally absorbs the new vectors. Total time from PUT to "fully searchable": about 90 seconds for a 200-page PDF, well under the 5-minute SLO. The user sees a "ready" notification via WebSocket.

