LangGraph and DSPy

Two ways to build LLM apps in 2026 beyond linear LangChain chains: LangGraph, which models your app as a stateful graph with conditional edges and checkpointing; and DSPy, which models your app as declarative LM programs with optimizers that compile better prompts for you. They solve different problems. Most teams need one, neither, or — occasionally — both.



1. Why Graphs Beat Chains

A chain runs A → B → C in order. Real LLM apps don't look like that: they loop (draft → critique → redraft until the critique passes), they branch (an easy question takes a different path than one that needs tools), and they pause (waiting for a human to approve an action mid-flight).

You can express all of this with if-statements and a while loop — and for simple cases you should. LangGraph helps when the state machine gets large enough that you need it to be inspectable, checkpointable, and resumable.
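For a sense of scale, here is that draft-critique loop as plain Python — ask is a hypothetical stand-in for any str → str call into your model client:

```python
def answer_with_critique(ask, question: str, max_iters: int = 3) -> str:
    """Draft, critique, and redraft until the critique passes or we give up.

    ask: any callable str -> str, e.g. a thin wrapper over your LLM client.
    """
    feedback = "none"
    draft = ""
    for _ in range(max_iters):
        draft = ask(f"Question: {question}\nFeedback so far: {feedback}\nWrite an answer.")
        verdict = ask(f"Question: {question}\nAnswer: {draft}\n"
                      "Reply PASS if accurate and complete, else FAIL: <reason>.")
        if verdict.startswith("PASS"):
            return draft
        feedback = verdict  # feed the critique back into the next draft
    return draft  # best effort after max_iters
```

Perfectly serviceable — until you also want to persist state between steps, resume after a crash, or pause for human approval. That's where the graph earns its keep.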


2. LangGraph: Nodes, Edges, State

LangGraph is built around three concepts: a typed state (a TypedDict or Pydantic model that every step reads and writes), nodes (plain functions that take the state and return a partial update), and edges (fixed or conditional transitions that decide which node runs next).


pip install langgraph langchain-anthropic
  

from typing import TypedDict, Literal
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    draft: str
    feedback: str
    iterations: int
    final: str

def draft_node(state: State) -> dict:
    prompt = f"Question: {state['question']}\nFeedback so far: {state.get('feedback', 'none')}\nWrite an answer."
    return {"draft": llm.invoke(prompt).content,
            "iterations": state.get("iterations", 0) + 1}

def critique_node(state: State) -> dict:
    prompt = (f"Question: {state['question']}\nAnswer: {state['draft']}\n"
              "Reply with PASS if the answer is accurate and complete; otherwise reply "
              "with FAIL: <one-sentence reason>.")
    verdict = llm.invoke(prompt).content
    if verdict.startswith("PASS"):
        return {"final": state["draft"]}
    return {"feedback": verdict}

def route(state: State) -> Literal["draft", "done"]:
    if state.get("final"):
        return "done"
    if state.get("iterations", 0) >= 3:
        return "done"
    return "draft"

graph = StateGraph(State)
graph.add_node("draft", draft_node)
graph.add_node("critique", critique_node)
graph.add_edge(START, "draft")
graph.add_edge("draft", "critique")
graph.add_conditional_edges("critique", route, {"draft": "draft", "done": END})

app = graph.compile()

result = app.invoke({"question": "Explain CAP theorem in 4 sentences."})
print(result.get("final") or result["draft"])  # "final" is unset if we hit max iterations
  

3. Checkpointing and Human-in-the-Loop

Compile the graph with a checkpointer and the entire state at every step is persisted by thread_id. You get three things for free: durable execution (a crash or restart resumes from the last checkpoint instead of starting over), human-in-the-loop (pause mid-graph, wait for approval, resume days later), and time-travel debugging (replay the run from any prior checkpoint).


import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite
from langgraph.types import interrupt, Command

# from_conn_string is a context manager in recent versions; construct directly instead
memory = SqliteSaver(sqlite3.connect("checkpoints.db", check_same_thread=False))

def approval_node(state: State) -> dict:
    decision = interrupt({"prompt": "Approve sending this email?", "draft": state["draft"]})
    if decision != "yes":
        return {"final": "[user declined]"}
    return {"final": state["draft"]}

graph.add_node("approval", approval_node)
# Route finished drafts through approval (this replaces the earlier "done" → END mapping;
# a node can't have both a conditional edge and a plain edge to two different targets).
graph.add_conditional_edges("critique", route, {"draft": "draft", "done": "approval"})
graph.add_edge("approval", END)

app = graph.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "user-42"}}
state = app.invoke({"question": "Draft a follow-up email about invoice A-482."}, config=config)
# graph paused at the interrupt; later, after the human approves:
state = app.invoke(Command(resume="yes"), config=config)
  

4. DSPy: Signatures and Modules

DSPy (Khattab et al., Stanford) takes a different angle: instead of writing prompts, you declare what your program does (input/output types), and DSPy generates and improves the prompt automatically. The unit of programming is the signature; the runtime unit is the module (Predict, ChainOfThought, ReAct, ...).


pip install dspy   # older releases were published as dspy-ai
  

import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7", max_tokens=1024))

class AnswerWithCitations(dspy.Signature):
    """Answer the question using only the provided context. Cite chunks by index."""
    context: list[str] = dspy.InputField()
    question: str      = dspy.InputField()
    answer: str        = dspy.OutputField(desc="answer including [1], [2] citation markers")

answerer = dspy.ChainOfThought(AnswerWithCitations)

result = answerer(
    context=["[1] Effective 2026, primary caregivers receive 16 weeks paid leave.",
             "[2] Secondary caregivers receive 8 weeks paid leave.",
             "[3] Pet bereavement leave is 3 days."],
    question="How long is paid parental leave for primary caregivers?",
)
print(result.answer)
  

Notice you wrote no prompt. The signature plus DSPy's prompt template is the prompt — dspy.inspect_history(n=1) prints the exact messages that were sent — and it can be re-templated and re-optimized as the LM changes.


5. DSPy Optimizers (BootstrapFewShot, MIPROv2)

DSPy's superpower is its optimizers (formerly "teleprompters"). They take your DSPy program, a metric, and a small training set, and produce a better version of the program — typically by mining few-shot examples from runs that scored well, and/or by rewriting instructions.


import dspy
from dspy.teleprompt import MIPROv2

# Training set: list of dspy.Example with the same fields as the signature.
trainset = [
    dspy.Example(context=[...], question="...", answer="...").with_inputs("context", "question"),
    # ...30-200 examples
]

def citation_match(example, pred, trace=None) -> float:
    # Custom metric: 1.0 if the predicted answer contains every cited chunk index
    # that the ground-truth answer contains. extract_citations is your own helper
    # that parses the [n] markers out of a string.
    return float(set(extract_citations(pred.answer)) >= set(extract_citations(example.answer)))

optimizer = MIPROv2(metric=citation_match, auto="medium")   # auto sets the trial budget
compiled = optimizer.compile(student=answerer, trainset=trainset)

compiled.save("answerer_compiled.json")     # ship this with your app
  

The output of compile is the same module with new internal state — better instructions, better demos. You call it identically to the original; at serve time, answerer.load("answerer_compiled.json") restores that state without re-running the optimizer.
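The metric above leans on an extract_citations helper the snippet doesn't define; a minimal sketch, assuming citations are bracketed integers like [1] (the name and format are this example's convention, not a DSPy API):

```python
import re

def extract_citations(text: str) -> list[int]:
    # Pull every [n] marker out of an answer: "see [1] and [3]" -> [1, 3]
    return [int(m) for m in re.findall(r"\[(\d+)\]", text)]
```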


6. Side-by-Side: Same Task in LangGraph and DSPy

Task: a small RAG agent that retrieves, drafts an answer, and self-critiques once before returning.

LangGraph version — orchestration is explicit; you control every transition.


from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    contexts: list[str]
    draft: str
    final: str

# my_retriever, make_prompt, critique_prompt, revise: your own helpers
def retrieve(state):  return {"contexts": my_retriever(state["question"])}
def draft(state):     return {"draft": llm.invoke(make_prompt(state)).content}
def critique(state):
    verdict = llm.invoke(critique_prompt(state)).content
    return {"final": state["draft"]} if verdict.startswith("PASS") else {"draft": revise(state, verdict)}

g = StateGraph(State)
for name, fn in [("retrieve", retrieve), ("draft", draft), ("critique", critique)]:
    g.add_node(name, fn)
g.add_edge(START, "retrieve"); g.add_edge("retrieve", "draft")
g.add_edge("draft", "critique"); g.add_edge("critique", END)
app = g.compile()
  

DSPy version — orchestration is just Python; the prompts are declared and optimizable.


import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7"))

class Answer(dspy.Signature):
    """Answer the question using only the provided context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class Critique(dspy.Signature):
    """Reply PASS if the answer is accurate and grounded, else FAIL: <reason>."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    verdict: str = dspy.OutputField()

class RagWithCritique(dspy.Module):
    def __init__(self, retrieve):
        super().__init__()   # required so DSPy can track the sub-modules
        self.retrieve = retrieve
        self.answerer = dspy.ChainOfThought(Answer)
        self.critic = dspy.Predict(Critique)

    def forward(self, question: str):
        ctx = self.retrieve(question)
        ans = self.answerer(context=ctx, question=question).answer
        verdict = self.critic(context=ctx, question=question, answer=ans).verdict
        if verdict.startswith("FAIL"):
            ans = self.answerer(context=ctx, question=f"{question}\nFix: {verdict}").answer
        return dspy.Prediction(answer=ans)

program = RagWithCritique(retrieve=my_retriever)
print(program(question="...").answer)
  

The DSPy version is shorter and the prompts can be optimized end-to-end. The LangGraph version is more transparent about state transitions and gives you checkpointing for free. Choose the one whose pain you'd rather have.


7. When to Choose Which (or Neither)

Choose LangGraph when:

- control flow has cycles, conditional branches, or human-in-the-loop pauses;
- you need durable, resumable state across restarts or long waits;
- you want every transition inspectable and replayable.

Choose DSPy when:

- you have (or will build) an eval set and a metric worth optimizing against;
- you swap models often and don't want to re-tune prompts by hand each time;
- prompt quality, not orchestration, is your bottleneck.

Choose neither when:

- the workflow is a single API call with retries — three lines of Python beat a graph;
- you have no eval set yet, so an optimizer has nothing to score;
- latency is tight enough that framework overhead shows up in your profiler.

Common pattern in production: a single Bedrock or Anthropic call for 80% of requests, LangGraph for the 20% that need a real agent loop, and DSPy in an offline pipeline that compiles the prompts both of those use. None of these is mutually exclusive.
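That split is just a router in front of two code paths; a sketch with hypothetical stand-ins for the real classifier and backends:

```python
def needs_agent(question: str) -> bool:
    # Hypothetical heuristic; in practice a cheap classifier call or intent model.
    return any(w in question.lower() for w in ("then", "step by step", "and also"))

def single_llm_call(question: str) -> str:
    return f"[direct] {question}"   # stand-in for one Anthropic/Bedrock call

def agent_loop(question: str) -> str:
    return f"[agent] {question}"    # stand-in for app.invoke(...) on the LangGraph app

def handle(question: str) -> str:
    # Most traffic takes the cheap path; the rest gets the full agent loop.
    return agent_loop(question) if needs_agent(question) else single_llm_call(question)
```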



Common Interview Questions:

When does LangGraph beat raw chains?

LangGraph wins when your control flow has cycles, conditional branches, or human-in-the-loop checkpoints — anything a DAG-shaped LCEL chain can't express. Concrete examples: an agent that loops "plan → act → reflect" until done, a RAG pipeline that re-queries when faithfulness is low, a multi-step approval workflow that pauses for human input mid-flight. For pure linear pipelines (load → chunk → embed → store) LangGraph is overkill; a function with three calls is clearer. The graph abstraction earns its weight when you're drawing arrows on a whiteboard, not boxes.

What does DSPy actually compile?

DSPy compiles a program (defined as Modules with typed Signatures) by running an optimizer that searches over few-shot examples and prompt phrasings to maximize a metric you provide. The "compiled" artifact is a set of optimized prompts — nothing magical, just text — bound to each Module. The optimizer (BootstrapFewShot, MIPRO, COPRO) treats prompt engineering as a learning problem: given a small training set and a metric (exact-match, F1, RAGAS faithfulness), it iteratively proposes and evaluates prompts. The point is that your code stays in terms of "what" the program should do, and the optimizer fills in the "how" each time you swap models.
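The search at the heart of that, reduced to a toy — exhaustive where real optimizers sample and bootstrap, but the shape is the same (all names here are illustrative, not DSPy APIs):

```python
from itertools import combinations

def toy_compile(program, demo_pool, metric, devset, k=2):
    """Pick the k few-shot demos that maximize the average metric on a dev set."""
    best_demos, best_score = [], -1.0
    for demos in combinations(demo_pool, k):
        score = sum(metric(ex, program(ex, demos=list(demos))) for ex in devset) / len(devset)
        if score > best_score:
            best_demos, best_score = list(demos), score
    return best_demos, best_score   # the "compiled artifact": just chosen prompt content
```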

How does LangGraph checkpointing work?

LangGraph persists graph state to a checkpointer (in-memory, SQLite, Postgres, Redis) at every node transition. State is keyed by thread_id, so a single conversation across many user turns reads/writes the same checkpoint chain. Two concrete uses: long-running agents survive process restarts because state is durable; human-in-the-loop pauses by interrupting before a node, persisting state, and resuming when the human approves. Postgres is the production default — you get time-travel debugging (replay from any prior checkpoint) and multi-pod horizontal scaling for free.
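Mechanically, a checkpointer is little more than an append-only log of state snapshots keyed by thread_id. A toy version (not the real LangGraph interface, which also tracks pending writes and metadata):

```python
class ToyCheckpointer:
    """Append-only snapshot log per thread; enough for resume and time-travel."""

    def __init__(self):
        self.log: dict[str, list[dict]] = {}

    def put(self, thread_id: str, state: dict) -> None:
        self.log.setdefault(thread_id, []).append(dict(state))  # copy = snapshot

    def latest(self, thread_id: str) -> dict:
        return self.log[thread_id][-1]      # where a resumed run picks up

    def replay_from(self, thread_id: str, step: int) -> dict:
        return self.log[thread_id][step]    # time-travel debugging
```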

Why would I use DSPy instead of just writing prompts in a YAML file?

YAML prompts work fine until you swap models — the prompt that was tuned for GPT-4 underperforms on Claude or Llama, and you re-tune by hand for every migration. DSPy decouples the program structure from the prompt strings: you re-run the optimizer against the new model and the metric tells you when you've matched the prior baseline. It also forces you to write an evaluation metric, which you should be doing anyway. The investment pays off when you have multiple models in rotation, frequent prompt regressions, or a real eval set you trust.

When should you skip both frameworks and write raw code?

Skip both when: the workflow is a single API call with retries (three lines of Python beat a graph definition); your team doesn't yet have an eval set, so DSPy has nothing to optimize against; or extreme latency control matters and the framework call stack is in the way of your profiler. The dirty secret of both is that they shine in demos and complicate debugging in production. Common production pattern: a single Anthropic call for 80% of requests, LangGraph for the 20% that need a real loop, DSPy in an offline pipeline that compiles the prompts both of those use.

How do you debug a LangGraph that's behaving unexpectedly?

Three tools. First, LangSmith tracing — every node invocation, state mutation, and LLM call is logged with input/output, and you can replay from any step. Second, the checkpointer itself: dump the persisted state for a thread_id and inspect what the graph thought the world looked like when it made its bad decision. Third, run with a synchronous in-memory checkpointer in a notebook and step through node-by-node. The common failure mode is shared-state mutation: two parallel branches writing the same key in the state dict and the merge reducer picking the wrong one. Make every state field a Pydantic model with explicit reducers.
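That last recommendation in code: LangGraph reads a field's merge behavior from Annotated metadata on the state schema (shown on a TypedDict here; the same annotation works on Pydantic models):

```python
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    question: str                               # plain field: last write wins
    notes: Annotated[list[str], operator.add]   # reducer: parallel writes concatenated
```

With the reducer declared, two branches appending to notes merge deterministically instead of racing for the key.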
