LLM Eval Platform: 50 Teams, Offline Suites + Online A/B + Regression Detection

The brief: build an evaluation platform for LLM applications that supports offline eval suites (gated against curated datasets), online A/B testing (production traffic split between model variants), and continuous regression detection (every commit on every product team). It must serve 50 product teams, scale to ~10,000 evals per day per team, keep judge-model token spend predictable, and surface results to a dashboard within five minutes of run completion.

Most "we built our own eval platform" projects fail in one of three ways: the schema can't represent the relationship between datasets, runs, and judgments cleanly; the judge cost spirals because nobody put a budget on LLM-as-judge calls; or the regression detector cries wolf so often that teams disable alerts. This design is opinionated about all three.



1. Problem & Functional Requirements

The platform must support these workflows:

  1. Offline eval suites: curated, versioned datasets per team, run on demand or from CI, with score thresholds that gate merges.
  2. Online A/B testing: production traffic split between model or prompt variants in the team's serving layer, with sampled interactions judged post hoc by the platform.
  3. Continuous regression detection: every commit on every product team triggers a smoke suite, and score drops against the trailing baseline alert the owning team.

Out of scope: prompt authoring UI (use the team's repo), model training, RAG-pipeline-specific evals beyond providing the hooks. Adversarial/red-team evaluation is a sibling system.


2. Non-Functional Requirements & SLOs

| Metric | Target |
|---|---|
| Run throughput per team | 10,000 cases/day sustained, 50,000/day burst |
| CI smoke-suite latency | p95 < 5 min for 100-case smoke |
| Time to dashboard after run completion | p95 < 5 min |
| Judge cost ceiling | $0.005 per judged case (target); $0.02 hard cap |
| Data retention | Hot 90 days, cold 2 years |
| Availability | 99.5% (eval is async; not customer-facing) |
| Regression alert false-positive rate | < 5% (or teams will mute it) |

The dashboard SLO matters: if engineers wait more than five minutes to see results, they context-switch and the eval loop loses its compounding effect. The judge cost ceiling matters more than people expect — LLM-as-judge using GPT-4-class models at $0.01–0.03 per case can outspend the production system being evaluated.


3. Capacity Estimates

Aggregate eval volume. 50 teams × 10,000 cases/day = 500,000 evals/day, peak burst 2.5M. At ~1.5s per case (LLM call + judge call in series) on average, that is ~210 hours (≈8.7 machine-days) of serial work per calendar day — trivially parallelizable.

Judge token spend. Average judge input ~800 tokens (prompt + case + ground truth + LLM output) and 100 tokens output. At GPT-4o pricing ($2.50 / $10.00 per 1M tokens):

per_case_judge_cost: (800 * 2.50 + 100 * 10.00) / 1_000_000 = $0.003
daily_total:         500_000 cases * $0.003 = $1,500/day = $45k/month
peak_burst:          2.5M * $0.003 = $7,500/day at burst

Mitigations baked into capacity: judge-model tier (Haiku for cheap pre-filter, GPT-4o only for borderline), prompt-cache the static parts of the judge prompt, sample (don't judge every online interaction).

Storage. Each result row: ~4 KB (input + output + judgment + metadata). 500k/day × 4 KB ≈ 2 GB/day, so 90 days hot ≈ 180 GB in Postgres. Cold tier (Parquet on S3): ~1.5 TB of raw rows over 2 years; budget ~4 TB to leave room for burst days and growth.

Compute for the runner. Most LLM call latency is network plus remote inference, so workers are I/O-bound. 50 concurrent workers at ~1.5s per case ≈ 33 cases/sec sustained, ~2.9M/day — comfortably above the 500k baseline. Burst is absorbed by horizontal scale on Fargate or a K8s HPA.


4. High-Level Architecture

  +------------------+       +-----------------+
  |  Team Repos      |       |  Web Dashboard  |
  |  (GH Actions)    |       |  (Next.js)      |
  +--------+---------+       +--------+--------+
           | trigger                  | reads
           v                          v
  +-----------------------+   +----------------------+
  |  Eval API (FastAPI)   |<->|  Auth (OIDC + RBAC)  |
  +----------+------------+   +----------------------+
             |
             | enqueue run
             v
  +-----------------------+         +-------------------------+
  |  Run Orchestrator     |-------->|  Job Queue (SQS / Redis)|
  +----------+------------+         +-----------+-------------+
             |                                  |
             |                                  v
             |                       +----------+-----------+
             |                       |  Eval Workers        |
             |                       |  (Fargate/K8s,       |
             |                       |   horizontal)        |
             |                       +----------+-----------+
             |                                  |
             |              +-------------------+-------------------+
             |              |                   |                   |
             v              v                   v                   v
  +------------------+ +-----------+    +-----------+    +-------------------+
  |  Postgres (OLTP) | |  LLM under|    |  Judge    |    |  Cost Tracker     |
  |  datasets, runs, | |  test     |    |  Models   |    |  (Redis counters, |
  |  results, judges | |  (team's) |    |  (Bedrock/|    |   nightly rollup) |
  +--------+---------+ +-----------+    |   OpenAI) |    +-------------------+
           |                            +-----------+
           |  CDC
           v
  +------------------+        +-------------------+
  |  S3 Parquet Lake |<------>|  Trino / Snowflake|  (analytics, regression)
  +------------------+        +-------------------+

One-liners on each:

  - Eval API (FastAPI): the single entry point for CI and humans; authorizes the caller's team, estimates cost, creates the run row, and enqueues work.
  - Auth (OIDC + RBAC): maps tokens to teams and roles; datasets, prompts, and runs are team-scoped.
  - Run Orchestrator: tracks per-run progress, marks runs completed, and computes the aggregate metrics dashboards read.
  - Job Queue (SQS / Redis): carries batches of case IDs; cases are independent, so standard-queue semantics suffice.
  - Eval Workers (Fargate / K8s): stateless and horizontally scaled; call the LLM under test, run the judges, write results and judgments.
  - Judge Models (Bedrock / OpenAI): LLM-as-judge plus deterministic checkers, tiered by cost.
  - Cost Tracker (Redis counters, nightly rollup): per-team spend, enforced at run creation and mid-run.
  - Postgres (OLTP): datasets, prompts, runs, results, judgments for the hot 90-day window.
  - S3 Parquet Lake + Trino/Snowflake: CDC copy of everything; powers regression detection, cross-team analytics, and billing rollups.
  - Web Dashboard (Next.js): run pages, trend charts, and alert deeplinks; refreshes off LISTEN/NOTIFY.

5. Data Model & Storage

Postgres schema. Six core tables, one transaction boundary, normalized enough to query cleanly without becoming a join nightmare.

CREATE TABLE teams (
  team_id      UUID PRIMARY KEY,
  slug         TEXT UNIQUE NOT NULL,
  monthly_budget_usd NUMERIC(10,2) NOT NULL DEFAULT 1000,
  created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE datasets (
  dataset_id   UUID PRIMARY KEY,
  team_id      UUID NOT NULL REFERENCES teams,
  name         TEXT NOT NULL,
  version      INT NOT NULL,
  case_count   INT NOT NULL,
  s3_uri       TEXT NOT NULL,            -- jsonl in S3
  schema_hash  BYTEA NOT NULL,           -- detect schema drift across versions
  created_at   TIMESTAMPTZ DEFAULT now(),
  UNIQUE (team_id, name, version)
);

CREATE TABLE prompts (
  prompt_id    UUID PRIMARY KEY,
  team_id      UUID NOT NULL REFERENCES teams,
  name         TEXT NOT NULL,
  version      INT NOT NULL,
  body         TEXT NOT NULL,
  created_at   TIMESTAMPTZ DEFAULT now(),
  UNIQUE (team_id, name, version)
);

CREATE TABLE runs (
  run_id       UUID PRIMARY KEY,
  team_id      UUID NOT NULL REFERENCES teams,
  dataset_id   UUID NOT NULL REFERENCES datasets,
  prompt_id    UUID REFERENCES prompts,
  llm_model    TEXT NOT NULL,            -- e.g. "claude-sonnet-4-7"
  judge_config JSONB NOT NULL,           -- which judges, weights, thresholds
  status       TEXT NOT NULL,            -- queued, running, completed, failed, budget_blocked
  triggered_by TEXT,                     -- "ci:commit_sha" | "manual:user@" | "schedule"
  commit_sha   TEXT,
  started_at   TIMESTAMPTZ,
  finished_at  TIMESTAMPTZ,
  case_count   INT NOT NULL,
  cost_usd     NUMERIC(10,4) DEFAULT 0
);
CREATE INDEX runs_team_started ON runs (team_id, started_at DESC);
CREATE INDEX runs_dataset      ON runs (dataset_id, started_at DESC);

CREATE TABLE results (
  result_id    UUID PRIMARY KEY,
  run_id       UUID NOT NULL REFERENCES runs ON DELETE CASCADE,
  case_id      TEXT NOT NULL,            -- stable ID from the dataset
  input        JSONB NOT NULL,
  ground_truth JSONB,
  model_output TEXT NOT NULL,
  latency_ms   INT,
  tokens_in    INT,
  tokens_out   INT,
  cost_usd     NUMERIC(10,6),
  status       TEXT NOT NULL,            -- ok | model_error | timeout
  created_at   TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX results_run ON results (run_id);

CREATE TABLE judgments (
  judgment_id  UUID PRIMARY KEY,
  result_id    UUID NOT NULL REFERENCES results ON DELETE CASCADE,
  judge_name   TEXT NOT NULL,            -- e.g. "factual_accuracy", "hallucination_v2"
  judge_model  TEXT NOT NULL,            -- "gpt-4o" or "exact_match" for non-LLM judges
  score        NUMERIC(6,4),             -- 0..1 normalized
  rationale    TEXT,                     -- judge's explanation, optional
  cost_usd     NUMERIC(10,6),
  created_at   TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX judgments_result ON judgments (result_id);
CREATE INDEX judgments_judge_score ON judgments (judge_name, score);

Why JSONB for input / ground_truth. Cases are heterogeneous across datasets — multi-turn conversations, RAG queries with retrieved chunks, agent traces. JSONB keeps the schema flat without a column explosion; specific dashboards extract the fields they need with jsonb_path_query.

Why a separate judgments table. A single result can have multiple judgments (factuality + harmlessness + format-compliance) and the set of judges evolves. One-row-per-result with N judge columns becomes painful within a quarter.

CDC to lake. Debezium on Postgres → Kafka → Iceberg tables in S3. Trino reads cross-team aggregates; Snowflake or DuckDB also work. The lake powers regression detection (today's mean score per judge vs the trailing 30-day mean, per dataset) and billing rollups.


6. Critical Path: An Eval Run, End to End

A team pushes a commit to a pull request. CI calls the eval API:

# .github/workflows/eval.yml
name: eval-on-pr
on: [pull_request]
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger eval
        env:
          EVAL_TOKEN: ${{ secrets.EVAL_TOKEN }}
        run: |
          RUN_ID=$(curl -s -X POST https://eval.internal/api/runs \
            -H "Authorization: Bearer $EVAL_TOKEN" \
            -H "Content-Type: application/json" \
            -d '{
              "dataset": "rag-smoke@v3",
              "prompt": "answer-prompt@v12",
              "llm_model": "claude-sonnet-4-7",
              "judge_config": {"judges": ["factuality", "format"], "judge_model": "gpt-4o-mini"},
              "commit_sha": "$",
              "triggered_by": "ci:$"
            }' | jq -r .run_id)
          echo "Run: https://eval.internal/runs/$RUN_ID"
          # Block PR until done; polls every 10s, fails if regression detected.
          ./scripts/wait-for-eval.sh "$RUN_ID"

Inside the API, on POST /api/runs:

  1. Authorize. Token resolves to a team; reject if the dataset / prompt belongs to another team and is not in the shared registry.
  2. Estimate cost. case_count × (avg_llm_cost + avg_judge_cost). Reject if it would breach the team's monthly budget (a sketch of this gate follows the list).
  3. Create run row. status='queued'. Return run_id.
  4. Enqueue. Push N batch messages to SQS, each with a slice of case IDs.
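
A minimal sketch of the budget gate in step 2, assuming an asyncpg-style connection and the per-case cost figures from Sections 3 and 9; the helper names and BudgetExceeded exception are illustrative, not the platform's actual API:

class BudgetExceeded(Exception):
    pass

# Assumed per-case costs, taken from the capacity and cost sections.
AVG_LLM_COST_USD = 0.0036     # Claude Sonnet, ~600 in / 200 out tokens
AVG_JUDGE_COST_USD = 0.003    # GPT-4o judge, ~800 in / 100 out tokens

def estimate_run_cost(case_count: int) -> float:
    return case_count * (AVG_LLM_COST_USD + AVG_JUDGE_COST_USD)

async def check_budget(db, team_id: str, case_count: int) -> float:
    """Reject the run at create time if it would push the team past its budget."""
    estimate = estimate_run_cost(case_count)
    row = await db.fetchrow(
        """
        SELECT t.monthly_budget_usd,
               COALESCE(SUM(r.cost_usd), 0) AS spent_this_month
        FROM teams t
        LEFT JOIN runs r
               ON r.team_id = t.team_id
              AND r.started_at >= date_trunc('month', now())
        WHERE t.team_id = $1
        GROUP BY t.monthly_budget_usd
        """,
        team_id,
    )
    if float(row["spent_this_month"]) + estimate > float(row["monthly_budget_usd"]):
        raise BudgetExceeded(f"run would exceed monthly budget for team {team_id}")
    return estimate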

The worker, per case:

async def process_case(run: Run, case: Case):
    # 1. Call the team's LLM endpoint with retries
    t0 = time.time()
    try:
        out = await llm_client.invoke(
            model=run.llm_model,
            prompt=render(run.prompt, case.input),
            timeout=60,
        )
    except (TimeoutError, ProviderError) as e:
        await record_result(run, case, status="model_error", error=str(e))
        return

    # 2. Persist the model result row
    result_id = await record_result(
        run, case,
        model_output=out.text,
        latency_ms=int((time.time()-t0)*1000),
        tokens_in=out.usage.input,
        tokens_out=out.usage.output,
        cost_usd=out.cost,
        status="ok",
    )

    # 3. Run judges; some are deterministic (exact_match), some are LLM-as-judge
    for judge in run.judge_config["judges"]:
        score, rationale, cost = await JUDGES[judge].score(
            case=case, model_output=out.text,
            judge_model=run.judge_config["judge_model"],
        )
        await record_judgment(result_id, judge, score, rationale, cost)
        await cost_tracker.add(run.team_id, cost)

    # 4. Bump per-run progress
    await runs.increment_progress(run.run_id)

When the orchestrator sees progress == case_count, it transitions the run to completed, computes aggregate metrics (mean per judge, p95 latency, cost), and writes them to the run_metrics rollup. Dashboards listen on a Postgres LISTEN/NOTIFY channel for sub-second refresh.
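
One way to wire that refresh path, sketched with asyncpg; the run_completed channel name and the push_to_dashboard callback are assumptions, not part of the schema above:

import asyncio
import json

import asyncpg

async def notify_run_completed(conn: asyncpg.Connection, run_id: str) -> None:
    # Called by the orchestrator right after the run flips to 'completed'.
    await conn.execute("SELECT pg_notify('run_completed', $1)",
                       json.dumps({"run_id": run_id}))

async def listen_for_completions(dsn: str, push_to_dashboard) -> None:
    conn = await asyncpg.connect(dsn)

    def on_notify(connection, pid, channel, payload):
        # payload is the JSON string passed to pg_notify above.
        push_to_dashboard(json.loads(payload)["run_id"])

    await conn.add_listener("run_completed", on_notify)
    await asyncio.Future()   # keep the listener task alive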

Regression detection runs on the lake side as a scheduled Trino query:

-- Per (team, dataset, judge), compare today's mean to trailing 30-day mean & stdev.
WITH baseline AS (
  SELECT team_id, dataset_id, judge_name,
         AVG(score) AS mu, STDDEV(score) AS sigma
  FROM v_judgments
  WHERE created_at >= current_date - INTERVAL '30' DAY
    AND created_at <  current_date - INTERVAL '1'  DAY
  GROUP BY team_id, dataset_id, judge_name
),
today AS (
  SELECT team_id, dataset_id, judge_name, AVG(score) AS today_mean, COUNT(*) AS n
  FROM v_judgments
  WHERE created_at >= current_date
  GROUP BY team_id, dataset_id, judge_name
)
SELECT t.*, b.mu, b.sigma,
       (b.mu - t.today_mean) / NULLIF(b.sigma, 0) AS z_drop
FROM today t JOIN baseline b USING (team_id, dataset_id, judge_name)
WHERE t.n >= 100                       -- enough samples for the test to mean anything
  AND (b.mu - t.today_mean) / NULLIF(b.sigma, 0) > 2.5   -- 2.5-sigma drop
ORDER BY z_drop DESC;

Anything above 2.5σ with N ≥ 100 fires a Slack alert to the team's eval channel with deeplinks to the offending run.


7. Scaling & Bottlenecks

  1. Judge-model rate limits. A burst of 50k cases hits OpenAI's per-minute token limit instantly. Mitigations: a per-team token bucket in front of the judge call (sketched after this list); OpenAI Batch API for non-time-critical runs (50% discount, 24h SLA); fall back to Claude Haiku or a self-hosted judge when the primary is throttled.
  2. Postgres write hot spot on the judgments table. 2M inserts/day at burst, mostly JSONB. Use COPY-style bulk insert in batches of 500 from each worker; partition judgments by month with attach/detach for rolling retention.
  3. Dashboard query latency over the lake. Trino over 4 TB of Parquet is fine at p95 ~3–8s for typical aggregates; use materialized rollups (`run_metrics`, `daily_team_metrics`) refreshed on completion to keep dashboards under 1s.
  4. Worker fan-out vs ordering. Cases within a run are independent; no ordering required, so an SQS standard queue is fine. Size the visibility timeout to cover a whole batch message, not just the 60s per-case LLM timeout, or have workers extend it with ChangeMessageVisibility as they progress.
  5. The team that runs eval on every commit. One team can monopolize judge spend. The cost tracker enforces budget at run-create time, but a single huge run that under-estimates can still burn through. Mid-run budget check every 1k cases pauses the run if cumulative cost > 1.5x estimate.
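
A minimal in-process version of the token bucket from item 1; the production version would keep the counters in Redis so every worker draws from the same per-team budget (class and parameter names are illustrative):

import asyncio
import time

class JudgeTokenBucket:
    """Per-team rate limit on judge-model tokens. In-process for clarity; the
    real counters would live in Redis so all workers share one bucket."""

    def __init__(self, tokens_per_minute: int, burst: int | None = None):
        self.rate = tokens_per_minute / 60.0          # tokens refilled per second
        self.capacity = burst or tokens_per_minute
        self.tokens = float(self.capacity)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: int) -> None:
        """Block until this team may spend `tokens` judge-model tokens."""
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return
                await asyncio.sleep((tokens - self.tokens) / self.rate)

# Usage in the worker, just before the judge call:
#   await judge_buckets[run.team_id].acquire(estimated_judge_tokens)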

8. Failure Modes & Resilience

Eval is asynchronous and off the serving path, so the goal is graceful degradation rather than high availability:

  1. LLM-under-test timeout or provider error: the worker records the case as model_error and moves on; the run completes with partial coverage and the error rate is shown next to the scores.
  2. Worker crash mid-batch: the SQS message becomes visible again after the visibility timeout and another worker retries that slice of cases.
  3. Judge throttling or outage: the per-team token bucket sheds load first; workers then fall back to a secondary judge model or defer the run to the Batch API (Section 7).
  4. Budget breach mid-run: the cost tracker pauses the run as budget_blocked rather than silently truncating results.
  5. Platform outage: queued runs wait; CI callers fail with a platform error, never a fake regression. The 99.5% availability target reflects this posture.

9. Cost Analysis

Per 1,000 cases evaluated (one LLM call + one judge call):

| Component | Cost / 1k cases |
|---|---|
| LLM under test (Claude Sonnet, ~600 in / 200 out) | $3.60 |
| Judge (GPT-4o-mini, 800 in / 100 out) | $0.18 |
| Judge (GPT-4o, 800 in / 100 out) | $3.00 |
| Worker compute (Fargate, ~1.5s @ 0.5 vCPU) | $0.02 |
| Postgres write + storage (90d hot) | $0.05 |
| S3 lake storage (2y cold) | $0.01 |
| Total / 1k (Sonnet + 4o-mini judge) | ~$3.86 |
| Total / 1k (Sonnet + GPT-4o judge) | ~$6.68 |

The judge model choice swings total cost by ~70%. Tier the judges: cheap deterministic checks (regex, JSON-parse, exact-match) for free; mini-class LLM judge for the 80% obvious cases; GPT-4o only for borderline scores or as the periodic calibration baseline. This brings effective per-case cost back near the LLM-under-test floor.
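
A sketch of that tiering for a single judge; the borderline band, judge names, and the .score() return shape are assumptions, and the tier-0 check only applies to datasets whose outputs must be JSON:

import json

BORDERLINE = (0.4, 0.7)   # assumed band where the cheap judge's verdict is not trusted

async def tiered_judge(case, model_output: str, judges: dict):
    # Tier 0: deterministic checks are free and catch hard failures outright.
    try:
        json.loads(model_output)
    except ValueError:
        return {"score": 0.0, "judge_model": "json_parse", "cost_usd": 0.0}

    # Tier 1: mini-class judge (gpt-4o-mini / Haiku) handles the bulk of cases.
    mini = await judges["mini"].score(case=case, model_output=model_output)
    if not (BORDERLINE[0] <= mini.score <= BORDERLINE[1]):
        return {"score": mini.score, "judge_model": mini.model, "cost_usd": mini.cost}

    # Tier 2: escalate only borderline verdicts to the expensive judge.
    full = await judges["full"].score(case=case, model_output=model_output)
    return {"score": full.score, "judge_model": full.model,
            "cost_usd": mini.cost + full.cost}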


10. Tradeoffs & Alternatives

| Option | Wins on | Loses on |
|---|---|---|
| This DIY platform | Custom judges, integration with internal CI, no per-trace pricing, full data ownership. | Build cost; maintenance; needs a dedicated team. |
| LangSmith | Polished UI, built-in evaluators, tight LangChain integration. | Per-trace pricing scales painfully at 50 teams; vendor data residency; coupling to LangChain idioms. |
| Arize Phoenix | Open source, OpenTelemetry-native, runs on your own infra. | Less opinionated — you still build the regression detector and team-budget governance. |
| Promptfoo | YAML-driven, runs locally and in CI, no infra to host. | No central dashboards across teams; no historical store; no online A/B. |
| Braintrust | SaaS, strong dataset versioning + diff UI, fast onboarding. | Pricing model; data leaves the building; less customization. |
| OpenAI Evals | Free, transparent, large eval registry. | Single-tenant tooling; no online A/B; very limited governance. |

Decision rule. Below ~5 teams or < 10k evals/day total, buy (LangSmith, Braintrust). Above that, the per-trace pricing crosses build cost within a year, and the customization need (internal CI, per-team budgets, custom judges) starts to outweigh polished UX. Phoenix is a reasonable middle ground: open-source core, build the orchestration and governance on top.


11. Common Interview Q&A

How do you keep judge cost from running away?

Three layers. (1) Per-team monthly budget enforced at run-create time; estimate cost up front from case count and reject runs that would breach. (2) Mid-run circuit breaker: every 1k cases, compare actual cost to the estimate; pause if > 1.5x. (3) Tier the judges — deterministic checks (regex, JSON-parse, exact-match) for free, mini-class LLM (gpt-4o-mini, Haiku) for the bulk of cases, full GPT-4o only for borderline scores or periodic calibration. Layer (3) typically cuts judge spend 5–10x with negligible quality loss.
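
Layer (2) as a sketch; the 1,000-case interval and 1.5x multiplier come from Section 7, while the Redis key names and client (redis.asyncio with decode_responses=True) are assumptions:

CHECK_EVERY = 1_000          # cases between budget checks
OVERRUN_MULTIPLIER = 1.5     # pause if actual spend exceeds 1.5x the estimate

async def maybe_pause_run(run_id: str, cases_done: int, redis, db) -> bool:
    """Return True if the run was paused for blowing past its cost estimate."""
    if cases_done % CHECK_EVERY != 0:
        return False
    actual = float(await redis.get(f"run_cost:{run_id}") or 0.0)
    estimate = float(await redis.get(f"run_estimate:{run_id}") or 0.0)
    if estimate and actual > OVERRUN_MULTIPLIER * estimate:
        await db.execute(
            "UPDATE runs SET status = 'budget_blocked' WHERE run_id = $1", run_id
        )
        return True
    return False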

Why split OLTP (Postgres) from analytical (Trino on Parquet)?

Different access patterns, different optimizers. Postgres handles the write-heavy hot path — transactional inserts of results and judgments, indexed lookups for "show me this run" dashboard pages — with predictable sub-second latency. Cross-team trend queries ("90-day metric over all teams, all datasets") are columnar workloads that Postgres does poorly above 100 GB; pushing them to Parquet on S3 with Trino keeps OLTP fast and isolates analytical load. CDC keeps the lake within seconds of OLTP without a separate ETL pipeline.

How do you avoid alert fatigue from regression detection?

Three knobs. Set a minimum sample size (N ≥ 100 today vs trailing 30 days) so single noisy runs don't fire. Use z-score against the trailing window's stdev (2.5σ threshold), not absolute deltas — some metrics are inherently noisy. Per-team mute list with required justification and expiry; muted alerts surface in a weekly digest so they don't get forgotten. The combination keeps the false-positive rate near 5%, which is what teams will actually act on.

How do you handle a judge model upgrade (e.g., gpt-4o-2024-08 to gpt-4o-2024-11)?

Pin the judge model with a date (gpt-4o-2024-11-20, never "gpt-4o" alias). For an upgrade, run both old and new judge on a fixed calibration set (~500 representative cases per judge family); compute correlation, mean-shift, and per-bucket disagreement. If correlation is > 0.9 and mean-shift is small, swap. If not, either keep the old judge or treat the new one as a separate metric and migrate teams individually. Never silently swap — baselines drift, alerts fire, trust evaporates.
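
The calibration comparison, sketched with Spearman correlation from scipy; score_with and the calibration-set loader are assumed helpers, and the exact mean-shift ceiling is a per-team choice:

from scipy.stats import spearmanr

CORRELATION_FLOOR = 0.9      # from the swap rule above
MEAN_SHIFT_CEILING = 0.02    # assumed definition of a "small" shift, in score units

async def compare_judge_revisions(calibration_cases, old_judge: str, new_judge: str) -> dict:
    """Score the same calibration set with both judge revisions and compare."""
    old_scores, new_scores = [], []
    for case in calibration_cases:
        old_scores.append(await score_with(old_judge, case))   # assumed helper
        new_scores.append(await score_with(new_judge, case))

    corr, _ = spearmanr(old_scores, new_scores)
    mean_shift = (sum(new_scores) - sum(old_scores)) / len(calibration_cases)
    return {
        "spearman": corr,
        "mean_shift": mean_shift,
        "ok_to_swap": corr > CORRELATION_FLOOR and abs(mean_shift) < MEAN_SHIFT_CEILING,
    }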

What's the right way to do online A/B in this system?

Two pieces. First, traffic splitting happens in the team's serving layer (their feature flag service, e.g., LaunchDarkly or Statsig), not the eval platform — we don't sit in the request path. Teams tag each interaction with variant_id in the log payload they ship to us. Second, post-hoc judging: a sampler picks K% of logged interactions per variant per day, drops them through the same judge pipeline as offline, writes to results + judgments with triggered_by='online'. Dashboards group by variant_id, show metric deltas with confidence intervals from a Mann-Whitney U test (scores are not normal). This keeps the eval platform out of the critical serving path while reusing all the judge and dashboard machinery.
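
The per-variant comparison, sketched with scipy's Mann-Whitney U test; loading the two score lists from the judgments table is left out:

from scipy.stats import mannwhitneyu

def compare_variants(scores_a: list[float], scores_b: list[float], alpha: float = 0.05) -> dict:
    """Two-sided Mann-Whitney U test on per-interaction judge scores.

    Used instead of a t-test because judge scores are bounded in [0, 1]
    and rarely normally distributed."""
    _, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    delta = sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a)
    return {"mean_delta": delta, "p_value": p_value, "significant": p_value < alpha}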

How do you make eval runs reproducible months later?

Pin everything. Dataset version + content hash, prompt version + content hash, LLM model with a date-stamped revision, judge model with a date-stamped revision, and the platform's own git SHA logged on the run row. Store the raw model output and judge rationale in results / judgments so you can re-judge later without re-invoking the LLM. The S3 lake is append-only; old judgments are never overwritten, so a run from six months ago is queryable in full. The only thing that's not perfectly reproducible is API-side determinism — setting temperature=0 and seed=N helps but providers don't all honor it, so for "is this regression real?" you re-run the eval rather than trusting old numbers verbatim.
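
What "pin everything" can look like as a single record attached to the run; fields beyond those already in the runs table are illustrative:

from dataclasses import dataclass

@dataclass(frozen=True)
class RunManifest:
    """Everything needed to reproduce, or at least re-judge, a run later."""
    dataset: str              # "rag-smoke@v3"
    dataset_sha256: str       # content hash of the jsonl in S3
    prompt: str               # "answer-prompt@v12"
    prompt_sha256: str
    llm_model: str            # date-stamped revision, never a floating alias
    judge_model: str          # e.g. "gpt-4o-2024-11-20"
    platform_git_sha: str     # the eval platform's own revision
    temperature: float = 0.0  # best effort: providers do not all honor determinism
    seed: int | None = None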
