Function Calling and Structured Output

Most production LLM bugs are not "the model is dumb" — they're "the model returned {"price": "twelve dollars"} when downstream code expected a number." This page covers the four lines of defense: native function calling per provider, JSON-Schema-constrained outputs, Pydantic-typed wrappers (instructor), and constrained decoding (Outlines, vLLM grammar mode) when you really cannot tolerate a parse failure.



1. Why "Just Ask for JSON" Fails

Three failure modes appear with naked "respond in JSON" prompts:

1. Prose wrapping: "Sure! Here's the JSON you asked for:" before the object, or commentary after it.
2. Markdown fences: the payload arrives wrapped in a code fence and a strict parser chokes on the backticks.
3. Schema drift: syntactically valid JSON but the wrong shape (missing fields, strings where numbers belong, invented keys).

Native function-calling APIs solve the first two by construction (the API never returns prose around a tool call). Strict schema modes (OpenAI Structured Outputs, Gemini's response_schema) and constrained decoding solve the third.
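A toy reproduction of the fenced-output failure, to make the problem concrete (the raw string here is made up, but it is the shape models commonly emit):

```python
import json

# The model wrapped its JSON in a markdown fence, so a strict parse blows up.
raw = '```json\n{"price": 1200}\n```'

parsed = None
try:
    parsed = json.loads(raw)
except json.JSONDecodeError:
    pass  # this is the branch you hit in production

assert parsed is None  # the fenced payload never parsed
```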


2. Native Function Calling Across Providers

2.1 OpenAI tools and Structured Outputs

OpenAI's tools field with strict: true guarantees the response matches your JSON Schema, subject to schema constraints: every object must set additionalProperties: false, every property must be listed in required (mark optional fields as nullable instead), and only a subset of JSON Schema keywords is supported.


from openai import OpenAI

client = OpenAI()

extract_invoice = {
    "type": "function",
    "function": {
        "name": "extract_invoice",
        "description": "Extract structured fields from an invoice document.",
        "strict": True,
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "required": ["vendor", "invoice_number", "total_cents", "currency", "line_items"],
            "properties": {
                "vendor": {"type": "string"},
                "invoice_number": {"type": "string"},
                "total_cents": {"type": "integer"},
                "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "additionalProperties": False,
                        "required": ["description", "quantity", "unit_cents"],
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "integer"},
                            "unit_cents": {"type": "integer"},
                        },
                    },
                },
            },
        },
    },
}

import json

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    tools=[extract_invoice],
    tool_choice={"type": "function", "function": {"name": "extract_invoice"}},
    messages=[{"role": "user", "content": INVOICE_TEXT}],
)
data = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
  

OpenAI also exposes response_format={"type": "json_schema", "json_schema": {...}} for the same guarantees without a tool wrapper.
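As a sketch of that variant, this builds only the request fragment (no API call), with the invoice schema trimmed to two fields for brevity:

```python
# Same guarantees as the tool-call version, expressed as a response_format.
invoice_schema = {
    "type": "object",
    "additionalProperties": False,
    "required": ["vendor", "total_cents"],
    "properties": {
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extract_invoice",
        "strict": True,
        "schema": invoice_schema,
    },
}
# Pass response_format=response_format to client.chat.completions.create(...)
# and read json.loads(resp.choices[0].message.content) instead of a tool call.
```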

2.2 Anthropic tool_use

Setting tool_choice={"type": "tool", "name": "..."} forces the model to respond with that tool call. The input generally follows your input_schema, but Anthropic does not constrain decoding token-by-token, so validate the payload downstream.


import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    tools=[{
        "name": "extract_invoice",
        "description": "Extract structured invoice fields.",
        "input_schema": {
            "type": "object",
            "required": ["vendor", "invoice_number", "total_cents", "currency"],
            "properties": {
                "vendor": {"type": "string"},
                "invoice_number": {"type": "string"},
                "total_cents": {"type": "integer", "minimum": 0},
                "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
            },
        },
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": INVOICE_TEXT}],
)

data = next(b.input for b in resp.content if b.type == "tool_use")
  

2.3 Amazon Bedrock toolConfig

Bedrock's Converse API normalizes tool use across Claude, Llama, Mistral, Cohere, and Nova — same shape, different modelId.


import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

resp = bedrock.converse(
    modelId="us.anthropic.claude-opus-4-1-20250805-v1:0",
    messages=[{"role": "user", "content": [{"text": INVOICE_TEXT}]}],
    toolConfig={
        "tools": [{
            "toolSpec": {
                "name": "extract_invoice",
                "description": "Extract structured invoice fields.",
                "inputSchema": {"json": {
                    "type": "object",
                    "required": ["vendor", "total_cents"],
                    "properties": {
                        "vendor": {"type": "string"},
                        "total_cents": {"type": "integer"},
                    },
                }},
            }
        }],
        "toolChoice": {"tool": {"name": "extract_invoice"}},
    },
)
for block in resp["output"]["message"]["content"]:
    if "toolUse" in block:
        data = block["toolUse"]["input"]
  

2.4 Google Gemini function declarations

Gemini supports both function declarations and a stricter response_schema with response_mime_type="application/json".


from google import genai
from google.genai import types

client = genai.Client()

config = types.GenerateContentConfig(
    response_mime_type="application/json",
    response_schema={
        "type": "OBJECT",
        "required": ["vendor", "total_cents"],
        "properties": {
            "vendor": {"type": "STRING"},
            "total_cents": {"type": "INTEGER"},
        },
    },
)

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=INVOICE_TEXT,
    config=config,
)
import json
data = json.loads(resp.text)
  

3. Pydantic and the instructor Library

Writing JSON Schema by hand is tedious and error-prone. instructor patches the OpenAI / Anthropic / Bedrock / Gemini clients so you can pass a Pydantic model as response_model and get a typed object back.


pip install instructor pydantic anthropic openai
  

import instructor, anthropic
from pydantic import BaseModel, Field
from typing import Literal

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=1)
    unit_cents: int = Field(ge=0)

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_cents: int = Field(ge=0)
    currency: Literal["USD", "EUR", "GBP"]
    line_items: list[LineItem]

client = instructor.from_anthropic(anthropic.Anthropic())

invoice: Invoice = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    response_model=Invoice,
    messages=[{"role": "user", "content": INVOICE_TEXT}],
)

# instructor validates with Pydantic; on ValidationError it automatically
# re-prompts the model with the validation errors as context, up to max_retries.
print(invoice.vendor, invoice.total_cents)
  

The retry-with-validation-errors loop is the killer feature — Pydantic's error messages are good enough that the model usually fixes its mistake on the first retry.
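For illustration, the same loop can be hand-rolled in a few lines. This is a minimal sketch, not instructor's implementation: call_model is a placeholder for your actual LLM call, stubbed here so the example runs offline.

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    total_cents: int

def extract_with_retries(call_model, max_retries: int = 2) -> Invoice:
    """call_model(feedback) returns a raw JSON string; feedback is None on the
    first attempt and carries the validation errors on retries."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = call_model(feedback)
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError as exc:
            feedback = (
                "Your previous response failed validation:\n"
                f"{exc}\nRespond again with corrected JSON only."
            )
    raise RuntimeError("no valid Invoice after retries")

# Stubbed model: fails once (missing field), then succeeds on the retry.
_replies = iter(['{"vendor": "Acme"}', '{"vendor": "Acme", "total_cents": 1200}'])
invoice = extract_with_retries(lambda feedback: next(_replies))
```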


4. Constrained Decoding with Outlines and vLLM

For self-hosted models, you can guarantee schema compliance at the token level: at each decoding step, mask out tokens that would make the partial output invalid against a regex, JSON Schema, or context-free grammar. This is "structured generation" or "constrained decoding."
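A toy, character-level sketch of the masking idea (real libraries compile the schema to an automaton over the tokenizer's vocabulary; this hypothetical allowed_next_chars just hard-codes one tiny grammar):

```python
# Grammar: {"total_cents": <one or more digits>}
TEMPLATE = '{"total_cents": '

def allowed_next_chars(prefix: str) -> set[str]:
    if len(prefix) < len(TEMPLATE):
        return {TEMPLATE[len(prefix)]}       # still inside the fixed skeleton
    body = prefix[len(TEMPLATE):]
    if body.endswith("}"):
        return set()                         # output complete; stop decoding
    allowed = set("0123456789")
    if body:                                 # closing brace needs >= 1 digit first
        allowed.add("}")
    return allowed

# "Decode" by masking: whatever the model prefers, only legal chars survive.
out = ""
preferences = '{"total_cents": 42}'          # pretend these are the model's picks
for ch in preferences:
    assert ch in allowed_next_chars(out), "mask would have blocked this"
    out += ch
```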

Outlines is the most popular library; it works with Hugging Face, vLLM, and llama.cpp backends.


import outlines
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_cents: int

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, Invoice)

invoice = generator(INVOICE_TEXT)   # always a valid Invoice — guaranteed by the decoder
  

vLLM exposes the same capability via its OpenAI-compatible API: pass guided_json, guided_regex, or guided_grammar in extra_body.


from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "required": ["vendor", "total_cents"],
    "properties": {
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": INVOICE_TEXT}],
    extra_body={"guided_json": schema, "guided_decoding_backend": "outlines"},
)
  

Constrained decoding eliminates parse errors but it does not eliminate semantic errors — the model can still produce a perfectly-shaped JSON with the wrong vendor name. Always combine with evals.


5. Recovering from Malformed JSON

When you cannot use a strict mode (some providers, older models, free-form responses with embedded JSON), recover instead of crashing:


import json, re
from json_repair import repair_json   # pip install json_repair

def extract_json(text: str) -> dict:
    # 1. Strip markdown fences.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip(), flags=re.M)
    # 2. Try strict parse.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 3. Try the widest {...} span (greedy match; not guaranteed balanced).
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            # 4. Repair (handles trailing commas, single quotes, missing braces).
            return json.loads(repair_json(match.group(0)))
    raise ValueError("no JSON object found")
  

6. Reliability Patterns

Pulling the sections above together into a checklist:

1. Prefer native strict modes (tools with strict: true, response_schema) whenever the provider offers one.
2. Validate every payload with Pydantic even when the provider "guarantees" the shape; types are cheap insurance.
3. On validation failure, re-prompt with the errors as context (instructor does this for you); cap the retries.
4. Keep a repair parser (Section 5) as the last line of defense for free-form responses.
5. Remember that schema-valid is not correct: run field-level evals on a labeled fixture set.

Common Interview Questions:

What does "strict mode" actually do?

Strict mode (OpenAI's strict: true, Gemini's response_schema) constrains the decoder so the next token is always one that keeps the output a valid prefix of the JSON Schema; the model literally cannot emit an invalid character. (Anthropic's tool use forces the tool call and aims to follow input_schema, but it is not token-level constrained in the same way, so validate its output.) The cost is a slight latency hit from the constrained-decoding mask, plus a one-time schema-compilation cost cached server-side. Without strict mode you get "JSON mode," which guarantees valid JSON but not adherence to your schema: missing fields, wrong types, and extra fields all sneak through.

Why not just always use strict mode?

Strict mode supports a subset of JSON Schema — no oneOf, no $ref in some providers, no pattern regex on strings, all properties must be required (you mark optionality with type: ["string", "null"] instead). Deeply-nested or recursive schemas may fail to compile or hurt accuracy. So you flatten and normalize schemas to fit, and for genuinely complex shapes you fall back to JSON mode plus Pydantic validation with a repair loop.
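For example, an optional purchase-order number (po_number is an illustrative field, not from the schemas above) still has to be listed in required under strict mode, with null standing in for absence:

```python
# Strict mode: every property is required, so "optional" means nullable.
strict_schema = {
    "type": "object",
    "additionalProperties": False,
    "required": ["vendor", "po_number"],            # po_number must be listed...
    "properties": {
        "vendor": {"type": "string"},
        "po_number": {"type": ["string", "null"]},  # ...but may come back null
    },
}
```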

How do you repair a malformed JSON response?

Three layers. First, try a permissive parser like json5 or dirty-json — that fixes trailing commas, single quotes, unquoted keys. Second, if Pydantic validation still fails, feed the validation error back to the model as a tool-result-style message: "your previous response failed validation: {error}. respond again with valid output." One round-trip fixes most cases. Third, if that fails twice, fall back to a smaller, cheaper model with strict mode and a simplified schema, or surface the error to the caller. Never silently drop fields — that's how data corruption ships.

When would you reach for Outlines or Instructor?

Instructor is a thin wrapper that gives you Pydantic-typed responses across providers with auto-retry on validation failures — a good default for application code where you want types and don't want to hand-write the repair loop. Outlines goes deeper: it does constrained decoding locally for open-source models (Llama, Mistral) where you don't have a server-side strict mode, using regex/CFG to mask the logits at each step. Reach for Outlines when you're self-hosting and need strict-mode-equivalent guarantees on a model that doesn't ship one. For frontier models, native strict mode + Instructor is usually enough.

How do you design a schema that the model gets right the first time?

Keep it shallow — two levels of nesting is the sweet spot, three is the limit. Use enums everywhere a field has bounded values; "status: pending|approved|rejected" is way more reliable than a free-form string. Add docstrings (the description field) on every property — the model reads them and uses them as in-context guidance. Mark every property required in strict mode and use nullable types for optionality. Avoid additionalProperties: true; the model fills it with garbage. Test the schema with 20 representative inputs before shipping — the failures show you which fields need clearer descriptions.
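Those tips map directly onto Pydantic: Literal becomes an enum and Field(description=...) becomes the per-property guidance the model reads. Ticket here is an illustrative model, not one from this page:

```python
from typing import Literal
from pydantic import BaseModel, Field

class Ticket(BaseModel):
    status: Literal["pending", "approved", "rejected"] = Field(
        description="Current review state of the ticket."
    )
    assignee: str = Field(description="Login of the person responsible.")

# The generated schema carries the enum and descriptions; hand it to any of
# the strict modes above (after adding additionalProperties: false for OpenAI).
schema = Ticket.model_json_schema()
```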

How do you evaluate a structured-output pipeline?

Validation pass-rate is the floor, not the ceiling — a response can be schema-valid and semantically wrong. I keep a labeled fixture set of (input, expected_structured_output) and score with field-level precision/recall: did extracted_amount match? Did parties contain the right names? For free-text fields inside the structure I use LLM-as-judge with a rubric. The validation failures themselves are gold — I dump them into the eval set so the next prompt iteration has to handle them. CI fails the build if pass-rate or field-F1 drops below baseline.
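A minimal field-level scorer over such a fixture set might look like this (the fixtures are made-up toy data; real sets should be much larger):

```python
def field_accuracy(fixtures: list[tuple[dict, dict]]) -> dict[str, float]:
    """Per-field exact-match rate across (expected, predicted) pairs."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for expected, predicted in fixtures:
        for field, want in expected.items():
            totals[field] = totals.get(field, 0) + 1
            hits[field] = hits.get(field, 0) + (predicted.get(field) == want)
    return {field: hits[field] / totals[field] for field in totals}

fixtures = [
    ({"vendor": "Acme", "total_cents": 1200},
     {"vendor": "Acme", "total_cents": 1200}),   # fully correct
    ({"vendor": "Globex", "total_cents": 500},
     {"vendor": "Globex", "total_cents": 50}),   # schema-valid, wrong amount
]
scores = field_accuracy(fixtures)
```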
