MyDocumentIntelligence.com — Architecture

MyDocumentIntelligence.com is a document-intelligence platform I built for legal and healthcare teams — the kind of customers who cannot paste a contract or a chart note into a public chatbot. The product answers natural-language questions over a customer's own document corpus and returns answers with page-level citations, optional bounding-box highlights, and a verifiable content hash. This page describes the production architecture: what is in the box, why each piece is there, and the tradeoffs that shaped the design.

The framing I keep coming back to is frontier + local hybrid. Frontier models (Claude on Bedrock, GPT-4o on Azure) handle the hard reasoning when the customer's data classification allows it. A local stack (Llama 3.x or Mistral served on vLLM, with bge embeddings) handles everything else — either because the customer is on a self-hosted tier, or because the document was flagged as containing protected information that should never leave the tenant boundary. The same retrieval, prompt, and evaluation layers serve both sides; only the generation endpoint changes.

1. Problem & Users

The two pilot customer profiles are: (a) a mid-size litigation firm whose paralegals spend hours-per-matter pulling clauses out of agreements, motions, and discovery PDFs, and (b) a healthcare operations team that needs to answer "what does this 200-page payor contract say about prior-authorization timelines for procedure X?" without uploading the contract to a public LLM. In both cases the question is not really "summarize this document" — it is "find the specific clause, quote it verbatim, and tell me which page it is on so I can defend the answer."

The failure modes I designed against, all of which I have personally watched generic chat tools commit on real legal documents:

Hallucinated clauses. A frontier model, asked about an indemnification carve-out, will paraphrase plausibly-sounding language that does not exist in the document. For an attorney that is worse than no answer.
No provenance. Even when the answer is correct, "trust me, it is in there somewhere" is not acceptable. The user needs page numbers and the literal source quote.
PII / PHI leakage. Pasting a document with a Social Security number or a patient identifier into a public LLM is a compliance event for these customers. The platform has to make that physically impossible for flagged documents.
Stale model knowledge. Off-the-shelf chat does not know the customer's specific master service agreement, payor contract, or internal policy. Generic RAG tutorials hand-wave this; production needs strict grounding.
Multi-document questions. "Across these 14 contracts, which ones have a most-favored-nation clause?" needs retrieval across a corpus, not chat over one document at a time.

Off-the-shelf RAG (drop documents into a vector DB, top-k cosine, stuff the prompt) does not survive contact with this domain. Contracts have nested numbered sections, defined terms that bind across the document, scanned amendments stapled onto digital originals, and signatures on the last page that need to be retrieved when the question is "who signed this?". The architecture below is everything I learned trying to make those questions work end to end.

2. High-level Architecture

The pipeline is built so that every stage is independently swappable — this is what lets the same platform run a SaaS multi-tenant deployment and a self-hosted single-tenant Docker stack from the same codebase.

                        +-----------------------------+
                        |  Document Sources           |
                        |  S3 / SharePoint / Upload   |
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Ingestion Worker           |
                        |  (SQS-driven, idempotent)   |
                        +--------------+--------------+
                                       |
                                       v
        +------------------------------+------------------------------+
        |  OCR / Layout Extraction                                    |
        |  PyMuPDF (native) | Textract / Azure DI / Tesseract (scan)  |
        +------------------------------+------------------------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Section-Aware Chunking     |
                        |  (clause/heading boundaries)|
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Embedding                  |
                        |  bge-large (local) /        |
                        |  text-embedding-3-large     |
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Vector + Metadata Store    |
                        |  Postgres + pgvector (HNSW) |
                        +--------------+--------------+
                                       |
                          query path   |
                                       v
                        +-----------------------------+
                        |  Hybrid Retrieval           |
                        |  BM25 + dense, RRF fusion   |
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Cross-Encoder Reranker     |
                        |  bge-reranker-large         |
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  LLM Router                 |
                        |  Frontier (Claude / GPT) or |
                        |  Local (Llama / Mistral)    |
                        +--------------+--------------+
                                       |
                                       v
                        +-----------------------------+
                        |  Cited Response + Audit Log |
                        |  JSON schema enforced       |
                        +-----------------------------+

The control plane (FastAPI + Postgres) holds tenants, users, document metadata, and audit records. The data plane (S3 + pgvector + the model endpoints) holds the actual content. The two are kept separate so I can hand a customer the data plane to run inside their VPC without giving them the multi-tenant control plane.

3. Document Ingestion & OCR

Ingestion is the unglamorous half of the system that determines whether everything downstream works. The corpus is heterogeneous: text-native PDFs from Word exports, scanned PDFs (some with skewed pages and stamped signatures), Word documents with tracked changes, and the occasional image attachment. The router decides per-page whether to use a fast text extractor or fall back to OCR.

PyMuPDF (fitz) for text-native PDFs. Fast, preserves bounding boxes per text span, no external service.
pdfplumber for tables — it gives row/column structure that I serialize into Markdown so the LLM can read it.
AWS Textract for scanned PDFs in the SaaS tier. It returns lines, words, tables, and key-value pairs with confidence scores.
Azure Document Intelligence for customers already on Azure. Comparable accuracy; better in my testing on dense legal layouts.
Tesseract for the self-hosted on-prem deployment where calling out to a managed OCR service is not allowed. Lower accuracy, much cheaper.

The decision rule is simple: if PyMuPDF returns less than 100 characters of extractable text on a page, treat the page as scanned and route to OCR. Per-page granularity matters because one document may have native text for the body and a scanned amendment glued onto the back.


from dataclasses import dataclass
from pathlib import Path
import fitz  # PyMuPDF

@dataclass
class PageContent:
    page_number: int           # 1-indexed
    text: str
    spans: list[dict]          # [{"bbox": (x0, y0, x1, y1), "text": "..."}, ...]
    extraction_method: str     # "native" | "textract" | "azure-di" | "tesseract"
    ocr_confidence: float | None  # average per-token confidence; None for native

def ingest(path: Path, ocr_backend: str = "textract") -> list[PageContent]:
    """Per-page extraction. Falls back to OCR when native text is sparse."""
    doc = fitz.open(path)
    pages: list[PageContent] = []

    for i, page in enumerate(doc, start=1):
        native_text = page.get_text("text").strip()
        if len(native_text) >= 100:
            spans = [
                {"bbox": s["bbox"], "text": s["text"]}
                for block in page.get_text("dict")["blocks"]
                for line in block.get("lines", [])
                for s in line.get("spans", [])
            ]
            pages.append(PageContent(i, native_text, spans, "native", None))
        else:
            # Scanned page — render to PNG and OCR it
            pix = page.get_pixmap(dpi=300)
            pages.append(_ocr_page(pix.tobytes("png"), i, backend=ocr_backend))

    return pages

Every ingested page is written back to Postgres with a content SHA-256, the extraction method, and the OCR confidence. That metadata travels with the chunk all the way to the answer, which is what makes the citation chain auditable.

4. Chunking Strategy

Fixed-size token chunking is the single biggest reason naive RAG fails on contracts. A clause that begins "Section 8.3(b)(ii) — Indemnification" and runs across a page break gets sliced in half by a 512-token splitter, the embedding for each half is mediocre, and neither half retrieves cleanly when the user asks about indemnification. The fix is to chunk by document structure first and only fall back to token-windowed splitting inside oversized sections.

The splitter walks the extracted text and produces chunks whose boundaries respect:

Numbered section headings (regex on patterns like ^\s*\d+(\.\d+)*\s+[A-Z]).
"ARTICLE" / "SECTION" / "WHEREAS" markers.
Paragraph breaks within an oversized section.
A hard cap of 800 tokens with 120-token overlap as the last resort.

Each chunk carries the metadata needed to rebuild the citation: document id, page numbers it spans, and the bounding-box hull of the source spans on each page.


import re
import tiktoken
from dataclasses import dataclass, field

ENC = tiktoken.get_encoding("cl100k_base")
SECTION_RE = re.compile(r"^\s*(?:ARTICLE|SECTION|\d+(?:\.\d+)*)\s+", re.MULTILINE)

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    pages: list[int]
    bboxes: dict[int, tuple]  # page -> bbox hull (x0, y0, x1, y1)
    section_path: list[str] = field(default_factory=list)
    token_count: int = 0

def chunk_document(pages: list, doc_id: str,
                   max_tokens: int = 800, overlap: int = 120) -> list[Chunk]:
    """Section-aware splitter with a token-window fallback."""
    full_text = "\n".join(p.text for p in pages)
    sections = _split_on_headings(full_text)  # uses SECTION_RE

    chunks: list[Chunk] = []
    for section in sections:
        toks = ENC.encode(section.text)
        if len(toks) <= max_tokens:
            chunks.append(_make_chunk(doc_id, section, pages))
            continue
        # Section is too long — slide a window over it
        for start in range(0, len(toks), max_tokens - overlap):
            window_text = ENC.decode(toks[start:start + max_tokens])
            chunks.append(_make_chunk(doc_id, section.with_text(window_text), pages))

    return chunks

The single most important property of this chunker is that chunk.pages and chunk.bboxes are filled in correctly — they are what the UI later uses to draw the yellow highlight on the source PDF when the user clicks a citation.

5. Embeddings & Vector Store

The platform supports two embedding modes, picked per tenant at provisioning time:

Mode	Model	Dim	Where it runs	Cost / 1M tokens
Frontier	OpenAI text-embedding-3-large	3072	OpenAI API	$0.13
Frontier (alt)	Cohere embed-english-v3	1024	Bedrock / Cohere	$0.10
Local / private	BAAI/bge-large-en-v1.5	1024	vLLM on g5.xlarge	~$0.02 amortized
Local / fast	BAAI/bge-small-en-v1.5	384	CPU on the API box	negligible

For most legal corpora the bge-large model is within ~2 points of MTEB on the frontier alternatives and runs entirely inside the customer's network. That is the single biggest reason the local mode is viable — the embedding gap is small, and the privacy gain is total.

The vector store is Postgres with the pgvector extension. I evaluated FAISS (in-memory; great for prototypes, awkward for multi-tenant updates and ACL filtering), Chroma (fine for development; not production-ready for the scale I needed), Pinecone and Weaviate (excellent products, but adding a managed dependency for a feature I could get from Postgres did not justify the bill or the data-residency conversation). Postgres gives me transactional inserts, row-level security per tenant, and metadata filtering in the same query as the vector search.


CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    chunk_id      UUID PRIMARY KEY,
    tenant_id     UUID NOT NULL,
    doc_id        UUID NOT NULL,
    text          TEXT NOT NULL,
    pages         INT[] NOT NULL,
    section_path  TEXT[],
    doc_type      TEXT,         -- 'contract' | 'policy' | 'medical_record' | ...
    jurisdiction  TEXT,
    effective_date DATE,
    embedding     VECTOR(1024) NOT NULL,
    content_sha   CHAR(64) NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT now()
);

-- HNSW index tuned for ~5M chunks per tenant
CREATE INDEX chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Row-level security so a tenant can never see another tenant's chunks
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

HNSW parameters: m = 16 and ef_construction = 200 at index build, ef_search = 80 at query time. Those numbers came out of a sweep against my evaluation set — pushing ef_search higher gave me single-digit recall improvements at noticeable latency cost; pushing it lower started to drop the cross-encoder's input candidates below what I needed.

6. Hybrid Retrieval & Reranking

Dense retrieval alone misses queries that hinge on a specific term — "MFN clause", a docket number, a defined term like "Effective Date" used as a proper noun. BM25 nails those. Conversely, BM25 fails when the user's wording does not match the document ("when can I get out of this contract?" vs. "termination for convenience"). The two are complementary, so I run both and fuse them with Reciprocal Rank Fusion.

Reciprocal Rank Fusion is the right merge function here because it is rank-based, so I do not need to calibrate the BM25 score and the cosine similarity into the same units. I take the top 50 from each retriever, fuse, then send the top 25 through a cross-encoder reranker. The reranker is what turns "the relevant chunk is somewhere in the top 25" into "the relevant chunk is in the top 3" — which is what actually matters because the LLM context window is finite.


from collections import defaultdict
from typing import Sequence
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

def reciprocal_rank_fusion(rankings: Sequence[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Standard RRF: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def hybrid_retrieve(query: str, tenant_id: str, filters: dict,
                    top_k_each: int = 50, top_n_final: int = 6) -> list[dict]:
    bm25_ids  = bm25_search(query, tenant_id, filters, limit=top_k_each)
    dense_ids = pgvector_search(query, tenant_id, filters, limit=top_k_each)

    fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
    candidate_ids = [doc_id for doc_id, _ in fused[:25]]
    candidates = load_chunks(candidate_ids)

    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

    return [c for c, _ in ranked[:top_n_final]]

Metadata filters (jurisdiction, doc_type, effective_date ranges) are pushed into both the BM25 query and the pgvector WHERE clause. A "find the termination clause in our California vendor MSAs signed after 2023" query narrows the candidate pool by metadata first, then runs hybrid retrieval over the narrowed set. This is the single biggest precision win after section-aware chunking.

7. LLM Routing: Frontier vs Local

The LLM router is what makes the "frontier + local hybrid" promise real. Every request carries a routing context with the tenant's tier, the document's classification, and a per-request flag indicating whether the retrieved chunks tripped any PII / PHI detector. The router resolves those into one of two endpoints:

Frontier path: Claude (Anthropic, or via Bedrock anthropic.claude-opus-4-7) for the hardest reasoning — multi-clause questions, contradiction-finding, comparative questions across many documents. GPT-4o on Azure is the secondary frontier endpoint for customers who already have Azure.
Local path: Llama 3.1 70B Instruct or Mistral Large served on vLLM on a single g5.12xlarge (4×A10G) node inside the customer's VPC. Used whenever any of: tenant tier is "self-hosted", the document classification is PHI / privileged, the PII detector flagged the retrieved chunks, or the customer policy forbids cross-region data flow.

Endpoint	p50 latency	p95 latency	$ / 1M out	Used for
Claude Opus 4.x (Bedrock)	2.1s	4.6s	$15.00	Hard reasoning, default frontier
Claude Sonnet 4.x (Bedrock)	1.1s	2.4s	$3.00	Most cited-answer queries
GPT-4o (Azure)	1.3s	2.9s	$10.00	Azure-tier customers
Llama 3.1 70B (vLLM, g5.12xl)	1.6s	3.8s	~$0.80	Self-hosted, PHI, privileged
Mistral Large (vLLM)	1.4s	3.2s	~$0.70	Local fallback


from enum import Enum
from dataclasses import dataclass

class Endpoint(Enum):
    CLAUDE_OPUS   = "anthropic.claude-opus-4-7"
    CLAUDE_SONNET = "anthropic.claude-sonnet-4-7"
    GPT_4O        = "azure.gpt-4o"
    LLAMA_LOCAL   = "vllm.llama3.1-70b-instruct"

@dataclass
class RouteContext:
    tenant_tier: str           # "saas" | "self-hosted"
    doc_classification: str    # "public" | "confidential" | "phi" | "privileged"
    pii_in_context: bool       # any retrieved chunk flagged by detector
    question_complexity: str   # "simple" | "comparative" | "multi-doc"
    allow_cross_region: bool

def route(ctx: RouteContext) -> Endpoint:
    must_be_local = (
        ctx.tenant_tier == "self-hosted"
        or ctx.doc_classification in ("phi", "privileged")
        or ctx.pii_in_context
        or not ctx.allow_cross_region
    )
    if must_be_local:
        return Endpoint.LLAMA_LOCAL

    if ctx.question_complexity == "multi-doc":
        return Endpoint.CLAUDE_OPUS
    return Endpoint.CLAUDE_SONNET

The router is a hard gate, not a recommendation. A request that resolves to LLAMA_LOCAL never instantiates an outbound HTTP client to the frontier providers — the network calls are not even reachable from that code path. That is what lets me put "your data never leaves your VPC" in the contract and mean it.

8. Prompts & Structured Output

The system prompt does three things and only three things: it sets the role, it specifies the refusal policy when the retrieved context does not support an answer, and it enforces the citation contract. Every other behavior I want is in the user prompt or, more importantly, in the JSON schema the response has to match.


SYSTEM_PROMPT = """You are a document-intelligence assistant for legal and
healthcare professionals. You answer questions ONLY using the provided
document excerpts. You do not use prior knowledge of any specific contract,
case, statute, or patient record.

Rules, in order of precedence:

1. If the provided excerpts do not contain enough information to answer the
   question, return answer = null and explanation describing exactly what is
   missing. Do NOT guess.
2. Every factual claim in your answer must be supported by a citation that
   names the chunk_id, page number, and a verbatim supporting_quote of less
   than 240 characters from that chunk.
3. Never reproduce more than 240 characters of source text in any single
   field. If the user asks for the full clause, instruct them to view the
   original document via the citation.
4. If the question asks for legal advice or a clinical recommendation,
   answer the underlying factual question and add a one-sentence note that
   the user should confirm with a licensed professional.
"""

The output is constrained by a Pydantic schema that I pass to the model as a tool definition (Anthropic tool-use) or a structured-output schema (OpenAI). The model cannot produce free-form prose — it has to produce a JSON object that matches the schema, every time. That is the single highest-leverage change I made for production reliability.


from pydantic import BaseModel, Field
from typing import Literal

class Citation(BaseModel):
    chunk_id: str
    doc_id: str
    page: int = Field(ge=1)
    supporting_quote: str = Field(max_length=240)
    confidence: float = Field(ge=0.0, le=1.0)

class StructuredAnswer(BaseModel):
    answer: str | None
    citations: list[Citation]
    explanation: str
    refusal_reason: Literal["insufficient_context", "out_of_scope", None] = None

ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit the final answer in the required structured form.",
    "input_schema": StructuredAnswer.model_json_schema(),
}

def call_claude_with_schema(client, model_id, system, user, retrieved):
    return client.messages.create(
        model=model_id,
        system=system,
        max_tokens=1500,
        temperature=0,
        tools=[ANSWER_TOOL],
        tool_choice={"type": "tool", "name": "submit_answer"},
        messages=[{
            "role": "user",
            "content": _format_context(user, retrieved),
        }],
    )

Forcing tool_choice to the answer tool means the model has no path to produce free text. Combined with temperature=0 and the Pydantic validator on the way out, the failure mode "model returns prose I cannot parse" went from a measurable slice of production traffic to zero.

9. Citations & Provenance

Every citation in the response carries a chunk_id that resolves through the chunks table to a doc_id, a list of pages, and the bounding-box hull on each page. The web UI uses those bounding boxes to draw a yellow highlight on the source PDF rendered alongside the answer. From the user's perspective: they read the answer, click "page 14", the PDF panel scrolls to page 14 and highlights the cited region. That single interaction was what convinced the legal pilot customer to sign.

For the audit story I add a per-response provenance record:


import hashlib, json, time
from uuid import uuid4

def write_audit_record(tenant_id: str, request_id: str, question: str,
                       retrieved_chunks: list[dict], structured_answer: dict,
                       model_id: str) -> dict:
    """One immutable record per answered question."""
    payload = {
        "request_id": request_id,
        "tenant_id":  tenant_id,
        "ts":         int(time.time()),
        "model_id":   model_id,
        "question":   question,
        "retrieved": [
            {"chunk_id": c["chunk_id"], "doc_id": c["doc_id"],
             "content_sha": c["content_sha"]}
            for c in retrieved_chunks
        ],
        "answer": structured_answer,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    payload["content_hash"] = hashlib.sha256(canonical).hexdigest()
    s3_put_object_lock(payload, bucket="mdi-audit", retention_days=2555)
    return payload

The audit record goes to an S3 bucket with Object Lock in compliance mode and a seven-year retention. A customer who later needs to prove "on this date, given these source documents, the system returned this answer" can produce the record, recompute the SHA, and verify nothing was edited. That is what "verifiable" means in this product, and it is most of why the regulated-industry conversations actually go anywhere.

10. Evaluation

There are two evaluation regimes that run continuously: an offline gold dataset and a nightly RAGAS sweep against a held-out portion of each tenant's corpus.

The gold dataset is human-labeled: a question, the documents from which the answer must come, the expected page number, and the expected supporting quote. Roughly 250 questions across legal and healthcare, expanded as the customers send me the queries that broke. For each question I compute:

Retrieval recall@k — did the expected page show up in the top-k retrieved chunks (k = 6 by default).
Citation correctness — did the answer cite the expected page number.
Answer faithfulness — LLM-as-judge with Claude as grader, prompted to flag any claim in the answer that is not supported by the supporting quote.
Refusal correctness — on questions seeded with intentionally insufficient context, did the model correctly return answer = null.


from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

def run_ragas(eval_rows: list[dict]) -> dict:
    """Each row: {question, answer, contexts: list[str], ground_truth}."""
    ds = Dataset.from_list(eval_rows)
    result = evaluate(
        ds,
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
    )
    return result.to_pandas().mean(numeric_only=True).to_dict()

Run output is shipped to LangSmith for the SaaS deployment and to Phoenix for self-hosted (Phoenix runs in-cluster, no outbound calls). The threshold I gate releases on: faithfulness must not drop more than 0.02 from the previous release on the gold set; if it does, the deployment does not promote.

11. Deployment & Infrastructure

The SaaS deployment is FastAPI on AWS Fargate, fronted by CloudFront with WAF rules for rate limiting and basic prompt-injection patterns. Postgres + pgvector runs on RDS (db.r6g.2xlarge for the current load). Documents live in S3 with bucket-level KMS, server-side encryption, and a per-tenant prefix. Secrets are in AWS Secrets Manager, retrieved at boot via the task's IAM role — no credentials in environment files.

The local-LLM endpoint (Llama 3.1 70B) is vLLM on a single g5.12xlarge for SaaS customers who opted into the local-only tier. The same Docker image runs on a customer-provided GPU box for fully self-hosted deployments — that is the entire point of building local-first.


# docker-compose.yml — self-hosted single-tenant deployment
services:
  api:
    image: ghcr.io/mydocintel/api:1.14.0
    environment:
      - DB_URL=postgresql://app:${DB_PASSWORD}@postgres:5432/mdi
      - LLM_BACKEND=vllm
      - VLLM_ENDPOINT=http://vllm:8000/v1
      - EMBEDDING_BACKEND=bge-large
      - DEPLOYMENT_MODE=self-hosted
      - ALLOW_FRONTIER=false
    depends_on: [postgres, vllm]
    ports: ["8443:8443"]

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - pg-data:/var/lib/postgresql/data

  vllm:
    image: vllm/vllm-openai:v0.6.3
    command: >
      --model meta-llama/Meta-Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 16384
      --gpu-memory-utilization 0.92
    deploy:
      resources:
        reservations:
          devices: [{ driver: nvidia, count: 4, capabilities: [gpu] }]

volumes:
  pg-data:

The single bash command below is what the customer runs to bring up the self-hosted stack on a fresh EC2 g5.12xlarge with the NVIDIA container toolkit installed. That deliberate shortness is itself a feature — legal IT teams will not adopt anything that takes a week to install.


# On a fresh g5.12xlarge with Docker + nvidia-container-toolkit
git clone https://github.com/mydocintel/self-hosted.git
cd self-hosted
cp .env.sample .env && vi .env   # set DB_PASSWORD, license key, KMS key id
docker compose pull
docker compose up -d
./scripts/healthcheck.sh         # verifies api, db, vllm are all green

12. Observability & Guardrails

Structured logging is non-negotiable for a system that has to produce an audit trail. Every request emits a single JSON log line with the request id, tenant id, model id, retrieved chunk ids, token counts (input / output / total), latency, and the boolean refusal flag. I use those records to build a per-tenant cost dashboard (input tokens × price + output tokens × price, summed by day) and a latency dashboard with p50 / p95 / p99 by endpoint.

PII / PHI detection runs on both sides of the LLM call. On the input side I run Microsoft Presidio across the retrieved chunks; if any high-confidence detection lands the request is rerouted to the local model regardless of the original routing decision. On the output side, the same detector runs on the structured answer and any leaked identifiers cause a hard refusal — the user sees a "this answer was suppressed because it contained protected information" message rather than the leak.


from presidio_analyzer import AnalyzerEngine

ANALYZER = AnalyzerEngine()
PHI_ENTITIES = {"US_SSN", "MEDICAL_LICENSE", "US_DRIVER_LICENSE",
                "PERSON", "DATE_TIME", "PHONE_NUMBER", "EMAIL_ADDRESS"}

def scan_for_pii(text: str, threshold: float = 0.6) -> list[dict]:
    results = ANALYZER.analyze(text=text, language="en",
                               entities=list(PHI_ENTITIES))
    return [
        {"type": r.entity_type, "score": r.score,
         "start": r.start, "end": r.end}
        for r in results if r.score >= threshold
    ]

def enforce_pii_routing(ctx, retrieved_chunks):
    for chunk in retrieved_chunks:
        if scan_for_pii(chunk["text"]):
            ctx.pii_in_context = True
            return ctx
    return ctx

For the SaaS tier I also attach Bedrock Guardrails to the Claude calls as a defense in depth — denied topics, profanity, and prompt-injection detection. Rate limiting is per-tenant token-bucket in Redis (60 questions / minute soft cap, configurable). The audit log goes to S3 Object Lock in compliance mode so that even an account compromise cannot delete prior records before the retention period expires.

13. What I’d Do Differently

The honest list of things I would change next, in roughly the order I plan to ship them:

Agentic decomposition for multi-doc questions. Today, "across these 14 contracts find the most-favored-nation clauses" runs as a single retrieve-then-generate. It would be much more reliable as a planner that issues one retrieval per contract and then synthesizes. I built the structured-output layer with this in mind; the planner is the next thing on the roadmap.
Fine-tuned clause classifier. The same five questions account for a large share of legal traffic ("termination", "indemnification", "limitation of liability", "governing law", "assignment"). A small fine-tuned classifier on the document-type level could pre-extract those clauses at ingest time, so the answer is a database lookup instead of a retrieval round-trip. Faster, cheaper, and more reliable for the head of the distribution.
Adopt MCP for customer system integration. The current "connect this to your case-management system" story is a custom integration per customer. Model Context Protocol gives me a clean tool-server abstraction to let customers expose their own data sources, and frontier models already speak it natively.
GraphRAG for cross-document entity linking. Contracts reference parties, defined terms, and prior agreements. A knowledge-graph layer over the corpus — populated at ingest with an extraction pass — would let me answer questions like "show me every agreement where Acme Corp granted exclusivity to a counterparty" without scanning every document. The retrieval layer is general enough to plug a graph store next to pgvector; the work is in the extraction quality.

Common Interview Questions:

How did you decide on chunk size and overlap for legal contracts?

I started with the conventional 512-token chunks at 64-token overlap and ran a small RAGAS eval on a held-out set of contract questions. Legal text has long, defined-term-heavy sentences, so smaller chunks fragmented clauses across boundaries and dropped context_recall. I ended up at ~800 tokens with 100-token overlap and a hard rule never to split inside a numbered clause — I parse the document into clause-level units first and only re-chunk if a single clause exceeds the budget. The eval moved faithfulness up about 6 points after that change.

Why hybrid retrieval instead of pure dense?

Legal queries contain a lot of exact tokens that embeddings under-weight — party names, statute citations like "12 U.S.C. § 1841", section numbers, and defined terms in quotes. BM25 nails those; dense vectors handle the paraphrase cases ("can the lessee assign?" vs "assignment by tenant"). I fuse the two ranked lists with Reciprocal Rank Fusion (k=60) which avoids having to calibrate score scales per corpus, then a cross-encoder reranks the top 50 down to the 5 we send to the LLM. Recall@50 from retrieval and nDCG@5 after rerank are tracked separately because they answer different questions.

How do you route between frontier and local models?

Routing is on three signals: document sensitivity (customer-flagged "do not send to third party" forces a local Llama-3.1 70B on a self-hosted vLLM endpoint), query complexity (a small classifier flags multi-document synthesis vs simple lookup), and cost budget per tenant. The default is Claude Sonnet for everyday Q&A because the price/quality is hard to beat; Opus is reserved for synthesis across >5 chunks. Every routed call logs the chosen model, latency, and token counts so I can run a monthly review and re-tune the thresholds.

How do you handle multi-tenancy in the vector store?

I use pgvector with a tenant_id column on the chunks table and a btree index on (tenant_id, document_id). Every retrieval query has WHERE tenant_id = $1 enforced at the application layer and again as a Postgres row-level security policy — defense in depth, because a missing filter is a data-leak bug. For the few large customers I move them to their own schema so their HNSW index isn't competing with smaller tenants for shared_buffers. Weaviate has native multi-tenancy that's nicer ergonomically but pgvector wins on operational simplicity at my scale.

What does your evaluation pipeline actually look like?

I keep a gold set of about 200 (question, document, expected_answer, expected_chunks) tuples that I built by hand with a paralegal. CI runs RAGAS on every PR that touches retrieval, prompts, or model versions — faithfulness, answer_relevance, context_precision, context_recall — and compares against a baseline_scores.json checked into the repo. Regressions block the merge unless I explicitly bump the baseline with a justification. I also run a weekly LLM-as-judge pairwise comparison on production traffic samples to catch slow drift the gold set wouldn't see.

What would you change if you rebuilt MyDocumentIntelligence today?

Three things. First, I'd add a small fine-tuned clause classifier at ingest so the head of the question distribution (termination, indemnification, governing law) becomes a database lookup instead of a retrieval round-trip — faster, cheaper, more deterministic. Second, I'd build a knowledge-graph layer alongside pgvector for cross-document entity questions ("every agreement where Acme granted exclusivity") because pure vector retrieval can't aggregate. Third, I'd adopt MCP for customer-system integrations so I'm not writing a custom adapter every time a customer wants their case-management system in scope.

↑ Back to Top