LangGraph and DSPy

Two ways to build LLM apps in 2026 beyond linear LangChain chains: LangGraph, which models your app as a stateful graph with conditional edges and checkpointing; and DSPy, which models your app as declarative LM programs with optimizers that compile better prompts for you. They solve different problems. Most teams need one, neither, or — occasionally — both.



1. Why Graphs Beat Chains

A chain runs A → B → C in order. Real LLM apps don't look like that: they loop (draft → critique → redraft until the critique passes), they branch (an easy question takes a different path than one that needs tools), and they pause (waiting for a human to approve an action mid-flight).

You can express all of this with if-statements and a while loop — and for simple cases you should. LangGraph helps when the state machine gets large enough that you need it to be inspectable, checkpointable, and resumable.
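For a sense of scale, here is that draft-critique loop as plain Python — ask is a hypothetical stand-in for any str → str call into your model client:

```python
def answer_with_critique(ask, question: str, max_iters: int = 3) -> str:
    """Draft, critique, and redraft until the critique passes or we give up.

    ask: any callable str -> str, e.g. a thin wrapper over your LLM client.
    """
    feedback = "none"
    draft = ""
    for _ in range(max_iters):
        draft = ask(f"Question: {question}\nFeedback so far: {feedback}\nWrite an answer.")
        verdict = ask(f"Question: {question}\nAnswer: {draft}\n"
                      "Reply PASS if accurate and complete, else FAIL: <reason>.")
        if verdict.startswith("PASS"):
            return draft
        feedback = verdict  # feed the critique back into the next draft
    return draft  # best effort after max_iters
```

Perfectly serviceable — until you also want to persist state between steps, resume after a crash, or pause for human approval. That's where the graph earns its keep.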


2. LangGraph: Nodes, Edges, State

LangGraph is built around three concepts: a typed state (a TypedDict or Pydantic model that every step reads and writes), nodes (plain functions that take the state and return a partial update), and edges (fixed or conditional transitions that decide which node runs next).


pip install langgraph langchain-anthropic
  

from typing import TypedDict, Literal
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    draft: str
    feedback: str
    iterations: int
    final: str

def draft_node(state: State) -> dict:
    prompt = f"Question: {state['question']}\nFeedback so far: {state.get('feedback', 'none')}\nWrite an answer."
    return {"draft": llm.invoke(prompt).content,
            "iterations": state.get("iterations", 0) + 1}

def critique_node(state: State) -> dict:
    prompt = (f"Question: {state['question']}\nAnswer: {state['draft']}\n"
              "Reply with PASS if the answer is accurate and complete; otherwise reply "
              "with FAIL: <one-sentence reason>.")
    verdict = llm.invoke(prompt).content
    if verdict.startswith("PASS"):
        return {"final": state["draft"]}
    return {"feedback": verdict}

def route(state: State) -> Literal["draft", "done"]:
    if state.get("final"):
        return "done"
    if state.get("iterations", 0) >= 3:
        return "done"
    return "draft"

graph = StateGraph(State)
graph.add_node("draft", draft_node)
graph.add_node("critique", critique_node)
graph.add_edge(START, "draft")
graph.add_edge("draft", "critique")
graph.add_conditional_edges("critique", route, {"draft": "draft", "done": END})

app = graph.compile()

result = app.invoke({"question": "Explain CAP theorem in 4 sentences."})
print(result.get("final") or result["draft"])  # "final" is unset if we hit max iterations
  

3. Checkpointing and Human-in-the-Loop

Compile the graph with a checkpointer and the entire state at every step is persisted by thread_id. You get three things for free: durable execution (a crash or restart resumes from the last checkpoint instead of starting over), human-in-the-loop (pause mid-graph, wait for approval, resume days later), and time-travel debugging (replay the run from any prior checkpoint).


import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite
from langgraph.types import interrupt, Command

# from_conn_string is a context manager in recent versions; construct directly instead
memory = SqliteSaver(sqlite3.connect("checkpoints.db", check_same_thread=False))

def approval_node(state: State) -> dict:
    decision = interrupt({"prompt": "Approve sending this email?", "draft": state["draft"]})
    if decision != "yes":
        return {"final": "[user declined]"}
    return {"final": state["draft"]}

graph.add_node("approval", approval_node)
# Route finished drafts through approval (this replaces the earlier "done" → END mapping;
# a node can't have both a conditional edge and a plain edge to two different targets).
graph.add_conditional_edges("critique", route, {"draft": "draft", "done": "approval"})
graph.add_edge("approval", END)

app = graph.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "user-42"}}
state = app.invoke({"question": "Draft a follow-up email about invoice A-482."}, config=config)
# graph paused at the interrupt; later, after the human approves:
state = app.invoke(Command(resume="yes"), config=config)
  

4. DSPy: Signatures and Modules

DSPy (Khattab et al., Stanford) takes a different angle: instead of writing prompts, you declare what your program does (input/output types), and DSPy generates and improves the prompt automatically. The unit of programming is the signature; the runtime unit is the module (Predict, ChainOfThought, ReAct, ...).


pip install dspy   # older releases were published as dspy-ai
  

import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7", max_tokens=1024))

class AnswerWithCitations(dspy.Signature):
    """Answer the question using only the provided context. Cite chunks by index."""
    context: list[str] = dspy.InputField()
    question: str      = dspy.InputField()
    answer: str        = dspy.OutputField(desc="answer including [1], [2] citation markers")

answerer = dspy.ChainOfThought(AnswerWithCitations)

result = answerer(
    context=["[1] Effective 2026, primary caregivers receive 16 weeks paid leave.",
             "[2] Secondary caregivers receive 8 weeks paid leave.",
             "[3] Pet bereavement leave is 3 days."],
    question="How long is paid parental leave for primary caregivers?",
)
print(result.answer)
  

Notice you wrote no prompt. The signature plus DSPy's prompt template is the prompt — dspy.inspect_history(n=1) prints the exact messages that were sent — and it can be re-templated and re-optimized as the LM changes.


5. DSPy Optimizers (BootstrapFewShot, MIPROv2)

DSPy's superpower is its optimizers (formerly "teleprompters"). They take your DSPy program, a metric, and a small training set, and produce a better version of the program — typically by mining few-shot examples from runs that scored well, and/or by rewriting instructions.


import dspy
from dspy.teleprompt import MIPROv2

# Training set: list of dspy.Example with the same fields as the signature.
trainset = [
    dspy.Example(context=[...], question="...", answer="...").with_inputs("context", "question"),
    # ...30-200 examples
]

def citation_match(example, pred, trace=None) -> float:
    # Custom metric: 1.0 if the predicted answer contains every cited chunk index
    # that the ground-truth answer contains. extract_citations is your own helper
    # that parses the [n] markers out of a string.
    return float(set(extract_citations(pred.answer)) >= set(extract_citations(example.answer)))

optimizer = MIPROv2(metric=citation_match, auto="medium")   # auto sets the trial budget
compiled = optimizer.compile(student=answerer, trainset=trainset)

compiled.save("answerer_compiled.json")     # ship this with your app
  

The output of compile is the same module with new internal state — better instructions, better demos. You call it identically to the original; at serve time, answerer.load("answerer_compiled.json") restores that state without re-running the optimizer.
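The metric above leans on an extract_citations helper the snippet doesn't define; a minimal sketch, assuming citations are bracketed integers like [1] (the name and format are this example's convention, not a DSPy API):

```python
import re

def extract_citations(text: str) -> list[int]:
    # Pull every [n] marker out of an answer: "see [1] and [3]" -> [1, 3]
    return [int(m) for m in re.findall(r"\[(\d+)\]", text)]
```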


6. Side-by-Side: Same Task in LangGraph and DSPy

Task: a small RAG agent that retrieves, drafts an answer, and self-critiques once before returning.

LangGraph version — orchestration is explicit; you control every transition.


from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    contexts: list[str]
    draft: str
    final: str

# my_retriever, make_prompt, critique_prompt, revise: your own helpers
def retrieve(state):  return {"contexts": my_retriever(state["question"])}
def draft(state):     return {"draft": llm.invoke(make_prompt(state)).content}
def critique(state):
    verdict = llm.invoke(critique_prompt(state)).content
    return {"final": state["draft"]} if verdict.startswith("PASS") else {"draft": revise(state, verdict)}

g = StateGraph(State)
for name, fn in [("retrieve", retrieve), ("draft", draft), ("critique", critique)]:
    g.add_node(name, fn)
g.add_edge(START, "retrieve"); g.add_edge("retrieve", "draft")
g.add_edge("draft", "critique"); g.add_edge("critique", END)
app = g.compile()
  

DSPy version — orchestration is just Python; the prompts are declared and optimizable.


import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7"))

class Answer(dspy.Signature):
    """Answer the question using only the provided context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class Critique(dspy.Signature):
    """Reply PASS if the answer is accurate and grounded, else FAIL: <reason>."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    verdict: str = dspy.OutputField()

class RagWithCritique(dspy.Module):
    def __init__(self, retrieve):
        super().__init__()   # required so DSPy can track the sub-modules
        self.retrieve = retrieve
        self.answerer = dspy.ChainOfThought(Answer)
        self.critic = dspy.Predict(Critique)

    def forward(self, question: str):
        ctx = self.retrieve(question)
        ans = self.answerer(context=ctx, question=question).answer
        verdict = self.critic(context=ctx, question=question, answer=ans).verdict
        if verdict.startswith("FAIL"):
            ans = self.answerer(context=ctx, question=f"{question}\nFix: {verdict}").answer
        return dspy.Prediction(answer=ans)

program = RagWithCritique(retrieve=my_retriever)
print(program(question="...").answer)
  

The DSPy version is shorter and the prompts can be optimized end-to-end. The LangGraph version is more transparent about state transitions and gives you checkpointing for free. Choose the one whose pain you'd rather have.


7. When to Choose Which (or Neither)

Choose LangGraph when:

- control flow has cycles, conditional branches, or human-in-the-loop pauses;
- you need durable, resumable state across restarts or long waits;
- you want every transition inspectable and replayable.

Choose DSPy when:

- you have (or will build) an eval set and a metric worth optimizing against;
- you swap models often and don't want to re-tune prompts by hand each time;
- prompt quality, not orchestration, is your bottleneck.

Choose neither when:

- the workflow is a single API call with retries — three lines of Python beat a graph;
- you have no eval set yet, so an optimizer has nothing to score;
- latency is tight enough that framework overhead shows up in your profiler.

Common pattern in production: a single Bedrock or Anthropic call for 80% of requests, LangGraph for the 20% that need a real agent loop, and DSPy in an offline pipeline that compiles the prompts both of those use. None of these is mutually exclusive.
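That split is just a router in front of two code paths; a sketch with hypothetical stand-ins for the real classifier and backends:

```python
def needs_agent(question: str) -> bool:
    # Hypothetical heuristic; in practice a cheap classifier call or intent model.
    return any(w in question.lower() for w in ("then", "step by step", "and also"))

def single_llm_call(question: str) -> str:
    return f"[direct] {question}"   # stand-in for one Anthropic/Bedrock call

def agent_loop(question: str) -> str:
    return f"[agent] {question}"    # stand-in for app.invoke(...) on the LangGraph app

def handle(question: str) -> str:
    # Most traffic takes the cheap path; the rest gets the full agent loop.
    return agent_loop(question) if needs_agent(question) else single_llm_call(question)
```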



Common Interview Questions:

When does LangGraph beat raw chains?

LangGraph wins when your control flow has cycles, conditional branches, or human-in-the-loop checkpoints — anything a DAG-shaped LCEL chain can't express. Concrete examples: an agent that loops "plan → act → reflect" until done, a RAG pipeline that re-queries when faithfulness is low, a multi-step approval workflow that pauses for human input mid-flight. For pure linear pipelines (load → chunk → embed → store) LangGraph is overkill; a function with three calls is clearer. The graph abstraction earns its weight when you're drawing arrows on a whiteboard, not boxes.

What does DSPy actually compile?

DSPy compiles a program (defined as Modules with typed Signatures) by running an optimizer that searches over few-shot examples and prompt phrasings to maximize a metric you provide. The "compiled" artifact is a set of optimized prompts — nothing magical, just text — bound to each Module. The optimizer (BootstrapFewShot, MIPRO, COPRO) treats prompt engineering as a learning problem: given a small training set and a metric (exact-match, F1, RAGAS faithfulness), it iteratively proposes and evaluates prompts. The point is that your code stays in terms of "what" the program should do, and the optimizer fills in the "how" each time you swap models.
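The search at the heart of that, reduced to a toy — exhaustive where real optimizers sample and bootstrap, but the shape is the same (all names here are illustrative, not DSPy APIs):

```python
from itertools import combinations

def toy_compile(program, demo_pool, metric, devset, k=2):
    """Pick the k few-shot demos that maximize the average metric on a dev set."""
    best_demos, best_score = [], -1.0
    for demos in combinations(demo_pool, k):
        score = sum(metric(ex, program(ex, demos=list(demos))) for ex in devset) / len(devset)
        if score > best_score:
            best_demos, best_score = list(demos), score
    return best_demos, best_score   # the "compiled artifact": just chosen prompt content
```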

How does LangGraph checkpointing work?

LangGraph persists graph state to a checkpointer (in-memory, SQLite, Postgres, Redis) at every node transition. State is keyed by thread_id, so a single conversation across many user turns reads/writes the same checkpoint chain. Two concrete uses: long-running agents survive process restarts because state is durable; human-in-the-loop pauses by interrupting before a node, persisting state, and resuming when the human approves. Postgres is the production default — you get time-travel debugging (replay from any prior checkpoint) and multi-pod horizontal scaling for free.
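Mechanically, a checkpointer is little more than an append-only log of state snapshots keyed by thread_id. A toy version (not the real LangGraph interface, which also tracks pending writes and metadata):

```python
class ToyCheckpointer:
    """Append-only snapshot log per thread; enough for resume and time-travel."""

    def __init__(self):
        self.log: dict[str, list[dict]] = {}

    def put(self, thread_id: str, state: dict) -> None:
        self.log.setdefault(thread_id, []).append(dict(state))  # copy = snapshot

    def latest(self, thread_id: str) -> dict:
        return self.log[thread_id][-1]      # where a resumed run picks up

    def replay_from(self, thread_id: str, step: int) -> dict:
        return self.log[thread_id][step]    # time-travel debugging
```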

Why would I use DSPy instead of just writing prompts in a YAML file?

YAML prompts work fine until you swap models — the prompt that was tuned for GPT-4 underperforms on Claude or Llama, and you re-tune by hand for every migration. DSPy decouples the program structure from the prompt strings: you re-run the optimizer against the new model and the metric tells you when you've matched the prior baseline. It also forces you to write an evaluation metric, which you should be doing anyway. The investment pays off when you have multiple models in rotation, frequent prompt regressions, or a real eval set you trust.

When should you skip both frameworks and write raw code?

Skip both when: the workflow is a single API call with retries (three lines of Python beat a graph definition); your team doesn't yet have an eval set, so DSPy has nothing to optimize against; or extreme latency control matters and the framework call stack is in the way of your profiler. The dirty secret of both is that they shine in demos and complicate debugging in production. Common production pattern: a single Anthropic call for 80% of requests, LangGraph for the 20% that need a real loop, DSPy in an offline pipeline that compiles the prompts both of those use.

How do you debug a LangGraph that's behaving unexpectedly?

Three tools. First, LangSmith tracing — every node invocation, state mutation, and LLM call is logged with input/output, and you can replay from any step. Second, the checkpointer itself: dump the persisted state for a thread_id and inspect what the graph thought the world looked like when it made its bad decision. Third, run with a synchronous in-memory checkpointer in a notebook and step through node-by-node. The common failure mode is shared-state mutation: two parallel branches writing the same key in the state dict and the merge reducer picking the wrong one. Make every state field a Pydantic model with explicit reducers.
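That last recommendation in code: LangGraph reads a field's merge behavior from Annotated metadata on the state schema (shown on a TypedDict here; the same annotation works on Pydantic models):

```python
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    question: str                               # plain field: last write wins
    notes: Annotated[list[str], operator.add]   # reducer: parallel writes concatenated
```

With the reducer declared, two branches appending to notes merge deterministically instead of racing for the key.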
