RAG Evaluation with RAGAS

RAG systems fail in interesting ways: the retriever returns the wrong chunks, the model ignores the chunks it did get, the model invents citations, the chunks are right but split mid-sentence, the user's question is ambiguous. A single end-to-end "is the answer good?" score collapses all of these into one number that tells you nothing about which dial to turn. RAGAS gives you four numbers instead, each isolating one part of the pipeline.

This page walks through the four core RAGAS metrics, how to build a usable eval dataset, how to run LLM-as-judge without fooling yourself, and how to wire eval into CI.



1. Why RAG Eval Is Hard


2. The Four RAGAS Metrics

RAGAS (Es et al., 2023) decomposes RAG quality into four orthogonal axes. Two are reference-free (faithfulness, answer_relevancy); context_recall requires a ground-truth reference answer, and context_precision can use one when available.

| Metric | Question it answers | Inputs | What a low score means |
| --- | --- | --- | --- |
| faithfulness | Is every claim in the answer supported by the retrieved context? | question, answer, contexts | The model is hallucinating or going beyond the docs. |
| answer_relevancy | Does the answer actually address the question (not just say something true)? | question, answer | The answer is on-topic-ish but doesn't answer what was asked. |
| context_precision | Of the retrieved chunks, how many are actually relevant, and are the relevant ones ranked high? | question, contexts (and ground_truth if available) | Retriever is dragging in noise; reranker is mis-ordering. |
| context_recall | Did the retriever fetch all the chunks needed to answer? | question, contexts, ground_truth | The right chunks are not in your top-K; tune K, embeddings, or chunking. |

Mental model: faithfulness + answer_relevancy grade the generator. context_precision + context_recall grade the retriever. If faithfulness is low but context_recall is high, the model is ignoring good chunks — fix the prompt. If both context metrics are low, fix the retriever first.
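That routing logic can be written down as a tiny triage helper. A minimal sketch — the 0.7 threshold and the return labels are illustrative, not RAGAS conventions:

```python
# Hypothetical triage helper: map a RAGAS score dict to the dial to turn first.
def triage(scores: dict[str, float], threshold: float = 0.7) -> str:
    low = {metric for metric, score in scores.items() if score < threshold}
    if {"context_precision", "context_recall"} & low:
        return "retriever"            # fix K, embeddings, chunking, reranker first
    if "faithfulness" in low:
        return "generator-grounding"  # good chunks are being ignored; tighten the prompt
    if "answer_relevancy" in low:
        return "generator-focus"      # grounded but off-question
    return "ok"

print(triage({"faithfulness": 0.55, "answer_relevancy": 0.90,
              "context_precision": 0.95, "context_recall": 0.92}))
# -> generator-grounding
```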


3. Building a Gold/Eval Dataset

Two complementary sources:

  1. Synthetic generation: ask Claude to read a document and produce (question, answer, source span) triples. Fast; covers the corpus uniformly; biased toward the kinds of questions LLMs find easy.
  2. Real user logs: the questions you actually get in production. Slow to label; biased toward what users ask, which is what you actually need to be good at.

Aim for ~50 hand-curated cases at minimum — enough to see metric movement, small enough to label well. Tag each case with a category (factual lookup, multi-hop, summarization, edge case) so you can break out scores per category.
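Once cases carry a category tag, the per-category breakout is a small aggregation. A sketch, assuming each scored case is a dict with a "category" tag and its metric scores:

```python
from collections import defaultdict

def scores_by_category(cases: list[dict], metric: str = "faithfulness") -> dict[str, float]:
    """Mean score per category tag, so one bad slice can't hide in the average."""
    buckets = defaultdict(list)
    for case in cases:
        buckets[case["category"]].append(case[metric])
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}
```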


# Synthetic eval generation with Claude.
import anthropic, json

client = anthropic.Anthropic()

PROMPT = """Read the document below and generate {n} diverse evaluation questions.
For each, return a JSON object with:
  question     - a question a user might actually ask
  ground_truth - the answer, taken verbatim or paraphrased from the document
  source_span  - the exact substring of the document that supports the answer
  difficulty   - "easy" (single sentence), "medium" (multi-sentence), "hard" (multi-section synthesis)

Return a JSON array. Document:
---
{doc}
---"""

def gen_eval_cases(doc: str, n: int = 10) -> list[dict]:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT.format(n=n, doc=doc)}],
    )
    return json.loads(resp.content[0].text)  # assumes the model returns a bare JSON array
  

Always have a human review the synthetic set before treating its scores as ground truth. The model that wrote the questions cannot grade itself — use a different judge model where possible.


4. LLM-as-Judge Done Properly

Almost every RAGAS metric is implemented as an LLM call. The judge is the most overlooked failure point in any eval pipeline.


# Minimal pairwise judge with position randomization.
import random, json, anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are an evaluator. Given a question and two answers (A and B), decide which
is more faithful to the provided context. Respond with a JSON object:
  {{"winner": "A" or "B"}}

Question: {q}
Context: {ctx}
Answer A: {a}
Answer B: {b}"""

def judge(question, context, ans1, ans2):
    if random.random() < 0.5:
        a, b, swap = ans1, ans2, False
    else:
        a, b, swap = ans2, ans1, True
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, ctx=context, a=a, b=b)}],
    )
    out = json.loads(resp.content[0].text)
    if swap and out["winner"] in ("A", "B"):
        out["winner"] = "B" if out["winner"] == "A" else "A"
    return out
  

5. Offline vs Online Evaluation

Both are necessary. Offline alone tells you "we didn't break the cases we already had." Online alone is too slow and noisy to gate deployments. Run offline as a CI gate, online as a continuous monitor.
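The online half can start as simple as sampling a small fraction of production traffic into a queue for asynchronous scoring with the two reference-free metrics. A minimal sketch — the queue and the scoring worker are assumed infrastructure, not shown:

```python
import random

def maybe_enqueue_for_eval(question: str, answer: str, contexts: list[str],
                           queue: list, rate: float = 0.05) -> bool:
    """Stateless sampling: score roughly `rate` of production traffic.

    A worker drains `queue` and runs faithfulness + answer_relevancy,
    the two metrics that need no ground truth.
    """
    if random.random() < rate:
        queue.append({"question": question, "answer": answer, "contexts": contexts})
        return True
    return False
```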


6. Tools: RAGAS, TruLens, DeepEval, LangSmith, Phoenix

| Tool | Strengths | Best for |
| --- | --- | --- |
| RAGAS | Pure-Python metrics library, framework-agnostic, plays well with HF datasets. | Offline eval scripts and CI. |
| TruLens | Trace recording + metric "feedback functions" you write or pick from a library. | Local dev iteration with rich traces. |
| DeepEval | pytest-style API, large built-in metric set, GitHub Actions integration. | Engineers who want unit-test-style eval ergonomics. |
| LangSmith | Hosted traces + datasets + automated eval; tightly integrated with LangChain/LangGraph. | Teams already on the LangChain stack. |
| Phoenix / Arize | OpenTelemetry-based traces, in-notebook eval UI, online monitoring. | Production observability with eval overlay. |

7. End-to-End RAGAS Code Example


pip install ragas datasets langchain-openai
  

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Each row: question, the answer your RAG system produced, the contexts it retrieved,
# and the human-written ground_truth answer.
data = {
    "question": [
        "What is our 2026 parental leave policy?",
        "How many vacation days do new hires get?",
    ],
    "answer": [
        "16 weeks of paid leave for primary caregivers and 8 weeks for secondary caregivers.",
        "New hires get 15 days of PTO in their first year.",
    ],
    "contexts": [
        ["Effective Jan 2026, primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks."],
        ["All new hires accrue 15 PTO days during year 1, increasing to 20 in year 2."],
    ],
    "ground_truth": [
        "Primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks.",
        "15 days in year 1.",
    ],
}

dataset = Dataset.from_dict(data)

# Strong judge model + same-team embeddings.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-large"))

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge,
    embeddings=emb,
)

print(result)
# {'faithfulness': 1.000, 'answer_relevancy': 0.962,
#  'context_precision': 1.000, 'context_recall': 1.000}

df = result.to_pandas()
df.to_csv("eval_results.csv", index=False)
  

8. Common Failure Modes


9. Sample Eval CI Workflow

Run RAGAS on every PR that touches the RAG pipeline and fail the build on regressions.


# .github/workflows/rag-eval.yml
name: rag-eval
on:
  pull_request:
    paths: ["rag/**", "prompts/**", "eval/**"]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt

      - name: Run RAG over eval set
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/run_rag.py --in eval/golden.jsonl --out eval/preds.jsonl

      - name: Score with RAGAS
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/score.py --preds eval/preds.jsonl --out eval/scores.json

      - name: Compare to baseline
        run: |
          python eval/compare.py \
            --baseline eval/baseline_scores.json \
            --current  eval/scores.json \
            --threshold 0.02   # fail if any metric drops > 2 points

      - uses: actions/upload-artifact@v4
        with: { name: eval-results, path: eval/scores.json }
  

Keep eval/baseline_scores.json in the repo and update it deliberately when you accept a regression (e.g., trading 1 point of faithfulness for 5 points of context_recall). The diff in the PR is the conversation.
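The compare step might look like the sketch below. eval/compare.py is this page's own script name; the scores files are assumed to be flat metric-to-score JSON maps, which is an assumption, not a RAGAS output contract:

```python
import argparse, json, sys

def regressions(baseline: dict, current: dict, threshold: float) -> list[str]:
    """One line per metric that dropped more than `threshold` vs the baseline."""
    failed = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0.0)
        if base - cur > threshold:
            failed.append(f"{metric}: {base:.3f} -> {cur:.3f}")
    return failed

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--baseline", required=True)
    ap.add_argument("--current", required=True)
    ap.add_argument("--threshold", type=float, default=0.02)
    args = ap.parse_args()
    failed = regressions(json.load(open(args.baseline)),
                         json.load(open(args.current)), args.threshold)
    if failed:
        print("Metric regressions past threshold:\n" + "\n".join(failed))
        sys.exit(1)  # fail the CI check
    print("All metrics within threshold.")
```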



Common Interview Questions:

What are the four core RAGAS metrics and what does each catch?

Faithfulness measures whether claims in the answer are grounded in the retrieved context — catches hallucinations. Answer relevance measures whether the answer addresses the question (uses reverse-question generation: ask the LLM to write the question implied by the answer, then cosine-compare to the original). Context precision measures whether retrieved chunks are actually relevant (signal-to-noise of retrieval). Context recall measures whether the chunks covered everything needed to construct the ground-truth answer. Faithfulness + answer relevance evaluate the generator; precision + recall evaluate the retriever.
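The reverse-question trick reduces to a mean cosine similarity. A simplified sketch of the idea — the LLM step that generates the implied questions is elided, embeddings are passed in as plain vectors, and RAGAS's real implementation differs in details:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def answer_relevancy_score(q_emb: list[float], implied_q_embs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding and the
    embeddings of questions the LLM reverse-engineered from the answer."""
    return sum(cosine(q_emb, e) for e in implied_q_embs) / len(implied_q_embs)
```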

How do you guard against judge-LLM bias?

Three techniques. First, use a different model family as judge than as generator — a Claude-judge on a GPT-generated answer, or vice versa — because models prefer their own style. Second, randomize position when the metric is pairwise (LLMs have a strong "first answer wins" bias). Third, calibrate periodically: have a human label 50–100 examples and check the judge's agreement with humans (Cohen's kappa); if it drops below ~0.6, re-prompt or switch judge. For high-stakes deployments, ensemble multiple judge models and take majority vote.
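Cohen's kappa is simple enough to compute inline. A sketch, assuming categorical labels and less-than-perfect chance agreement:

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between human and judge labels, corrected for chance.

    Assumes len(human) == len(judge) and chance agreement < 1.
    """
    n = len(human)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    hc, jc = Counter(human), Counter(judge)
    p_chance = sum(hc[k] * jc[k] for k in set(hc) | set(jc)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)
```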

How do you build a gold dataset without spending months?

Bootstrap with a synthesizer: feed RAGAS or a custom prompt your corpus and have an LLM generate (question, answer, source_chunks) tuples, then hand-review and edit. 200–500 high-quality examples beat 5,000 noisy ones. Stratify across question types (factoid, multi-hop, comparison, "no answer") and document types so a single bad slice can't dominate the score. Keep a separate "regression set" of every real-world bad answer a customer reported — those are the cases that actually matter.

Where in CI does the eval run, and what does it block?

I run RAGAS as a GitHub Actions job on every PR that touches the prompts/, retrieval/, or models config directory — gated by path filters so docs PRs don't trigger it. The job loads the gold set, runs the new pipeline, scores all four metrics, and diffs against eval/baseline_scores.json in the repo. Any metric dropping more than 1.5 points fails the check; the dev either fixes the regression or commits a new baseline with a justification in the PR description. Results are uploaded as an artifact so the diff is reviewable.

Why is faithfulness low even when the right context was retrieved?

Usually one of: the prompt doesn't tell the model "answer only from the provided context", so it pads with prior knowledge; the chunks contain conflicting information and the model picks the more popular fact rather than the cited one; or the answer paraphrases so loosely the claim-extractor can't match it back to the source. Fixes in order of effort: tighten the system prompt with explicit "if not in context, say I don't know"; add a re-ranker so the most-supporting chunk is at the top; switch to a larger model — small models hallucinate more even with good context.

How do you handle cost when the eval set grows?

Each RAGAS run is N samples × ~5 LLM calls per sample, so 500 samples on GPT-4o is real money per CI run. Mitigations: use a cheaper judge (GPT-4o-mini or Claude Haiku) for nightly runs, reserve the strong judge for release gates. Sample stratified subsets for PR-level checks (~50 examples) and run the full set on main. Cache judge responses by (sample_id, pipeline_hash) so re-runs on the same code are free. Track $/run as its own CI metric so cost regressions get caught.
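The cache can be as simple as files keyed by a digest. A sketch, where pipeline_hash is assumed to be a digest of prompts, retrieval config, and model ids, computed elsewhere:

```python
import hashlib, json, pathlib

CACHE_DIR = pathlib.Path(".eval_cache")

def cached_judge(sample_id: str, pipeline_hash: str, call_judge):
    """Return the cached judge response for (sample_id, pipeline_hash),
    calling `call_judge` (the expensive LLM call) only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{sample_id}:{pipeline_hash}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call_judge()
    path.write_text(json.dumps(result))
    return result
```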
