Frontier Model Comparison (2026)

Picking a model in 2026 is no longer a single decision — it is a routing decision. Most production systems use two or three models: a small fast one for classification and reformulation, a mid-tier one for the bulk of traffic, and a frontier model for the hard 5% that needs deep reasoning, long context, or careful agentic tool use. This page lays out the current landscape, side-by-side specs and prices, where each model wins, what hosting options exist, and how to wire a router that escalates and falls back without rewriting your application.

Numbers below are hedged ("starts at", "as of 2026") because list prices, context windows, and even model names drift quarterly. Verify against the provider's pricing page before signing a vendor contract.



1. Model Landscape

One more framing point before the table: model choice is not a one-time decision. The frontier reshuffles every quarter (a new release here, a price cut there, an open-weight model crossing a quality threshold), and what was the obvious answer in January often is not the obvious answer by July. Treat your model selection as versioned, not static — the same way you treat your database engine version or your runtime version.

The 2026 frontier sits in three buckets. Frontier closed-weight models — Claude Opus 4.7, GPT-5, Gemini 2.x Pro — lead on multi-step reasoning, tool use, and long-context recall, served only via API or hyperscaler resale. Frontier open-weight models — Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, DeepSeek V3 — have closed most of the quality gap on knowledge tasks and code, and you can self-host them on your own GPUs (or rent inference from Together, Fireworks, Groq, Bedrock). Smaller fast models — Claude Haiku 4.5, Gemini 2.x Flash, Llama 3.3 8B, GPT-5 mini — exist for high-throughput, latency-sensitive, or per-token cost-sensitive workloads where 90% quality at 5% cost is the right tradeoff.

The practical implication: a serious LLM application picks at least one model from bucket 1 or 2 for hard requests, plus one from bucket 3 for everything else. Routing between them is now an architectural concern, not a future optimization.

Two background trends frame everything below. First, the price-per-quality curve has dropped roughly 5x year-over-year since 2023; what cost $30 per million tokens for frontier output two years ago is now $3–$15. Second, open-weight models are within ~5–10 points of closed-weight on most benchmarks, which means the question "should we self-host?" now turns on operational and compliance economics rather than capability gaps.


2. Side-by-Side Comparison Table

Pricing is per million tokens, list price as of early 2026. Context and output windows are the published maximums; sustained throughput is usually lower.

| Model | Provider | Context | Output | In $/1M | Out $/1M | Strengths | Hosting | Cutoff |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 1M | 64K | $15 | $75 | Long-doc reasoning, agentic tool use, code | API, Bedrock, Vertex | Jan 2026 |
| Claude Sonnet 4.6 | Anthropic | 1M | 64K | $3 | $15 | Best price/perf for production RAG and chat | API, Bedrock, Vertex | Late 2025 |
| Claude Haiku 4.5 | Anthropic | 200K | 16K | $0.80 | $4 | Low-latency classification, query rewrite | API, Bedrock, Vertex | Mid 2025 |
| GPT-5 | OpenAI | 400K | 128K | starts at $10 | starts at $40 | Code, math, structured output, vision | API, Azure | Late 2025 |
| GPT-4o (legacy) | OpenAI | 128K | 16K | $2.50 | $10 | Mature ecosystem, strong general purpose | API, Azure | Oct 2023 |
| Gemini 2.x Pro | Google | 2M | 64K | starts at $7 | starts at $21 | Largest context, native video, grounded search | Vertex, AI Studio | Late 2025 |
| Gemini 2.x Flash | Google | 1M | 64K | $0.30 | $2.50 | Cheap long-context, batch summarization | Vertex, AI Studio | Late 2025 |
| Llama 3.3 70B | Meta (open) | 128K | 8K | ~$0.70 hosted | ~$0.80 hosted | Open weights, fine-tunable, strong code | Self-host, Bedrock, Together, Fireworks, Groq | Dec 2023 |
| Llama 3.3 8B | Meta (open) | 128K | 8K | ~$0.10 hosted | ~$0.10 hosted | Edge, on-device, embedded routing | Self-host, Bedrock, Together, Groq | Dec 2023 |
| Mistral Large 2 | Mistral (open weights) | 128K | 16K | $2 | $6 | European data residency, strong multilingual | La Plateforme, Bedrock, Azure, self-host | Mid 2024 |
| Qwen 2.5 72B | Alibaba (open) | 128K | 8K | ~$0.90 hosted | ~$0.90 hosted | Strongest open Chinese, solid code, math | Self-host, DashScope, Together | Mid 2024 |
| DeepSeek V3 | DeepSeek (open) | 128K | 8K | $0.27 | $1.10 | MoE, very cheap, strong code, reasoning variant | API, self-host (heavy) | Mid 2024 |

Two things this table hides: cached input (Anthropic, Google, and OpenAI all charge roughly 10% of the normal input price on cache hits, which dominates the economics of long system prompts) and batch pricing (50% off on most providers if you can wait up to 24 hours). For high-volume offline workloads, batch plus cache is often a 5–10x cost reduction over the headline number.

A second view, useful for quick mental anchoring — the rough price ratio of each model relative to the cheapest in the table (Llama 3.3 8B):

| Model | Output cost vs Llama 8B |
|---|---|
| Llama 3.3 8B | 1x (baseline) |
| Llama 3.3 70B (hosted) | ~8x |
| DeepSeek V3 | ~11x |
| Gemini 2.x Flash | ~25x |
| Claude Haiku 4.5 | ~40x |
| Mistral Large 2 | ~60x |
| GPT-4o (legacy) | ~100x |
| Claude Sonnet 4.6 | ~150x |
| Gemini 2.x Pro | ~210x |
| GPT-5 | ~400x |
| Claude Opus 4.7 | ~750x |

This is the easiest way to argue for routing in a design review. Even if the smart model is "only" used on 5% of requests, that 5% can dominate cost if it's the most expensive tier — which is exactly why the escalation logic and the validator that triggers it deserve real engineering attention.
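
The arithmetic is worth making explicit in that design review. A two-line sketch, using the output-price multiples from the ratio table above and an assumed 5% escalation rate:

```python
# blended_cost.py — blended cost multiple for a two-tier router.
# Multiples are "x cheapest model output price" from the ratio table above.

def blended_multiple(escalation_rate: float, cheap_x: float, smart_x: float) -> float:
    """Average cost per request, as a multiple of the cheapest model's price."""
    return (1 - escalation_rate) * cheap_x + escalation_rate * smart_x

# 95% of traffic on Gemini Flash (~25x), 5% escalated to Opus (~750x):
m = blended_multiple(0.05, cheap_x=25, smart_x=750)
# The escalated 5% contributes 37.5x of the ~61x blended total —
# the expensive tier dominates even at a 1-in-20 escalation rate.
```
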


3. Capability Strengths by Use Case

Capability rankings turn over faster than the spec table does. The honest framing per workload, as of early 2026 (drawn from the strengths column above):

  - Code and structured output: GPT-5 and Claude Opus 4.7 lead; Llama 3.3 70B and DeepSeek V3 are the strongest open-weight options.
  - Long-document analysis: Claude Opus/Sonnet (1M context) and Gemini 2.x Pro (2M) are the only real choices above 200K tokens.
  - Agentic tool use: Claude Opus 4.7 for long loops; Sonnet covers shorter ones at a fifth of the price.
  - High-throughput classification and query rewriting: Haiku 4.5, Gemini 2.x Flash, or Llama 3.3 8B.
  - Multilingual and residency-sensitive work: Mistral Large 2 (European), Qwen 2.5 72B (Chinese).
  - Native video and grounded search: Gemini 2.x Pro.


4. Cost and Latency Tradeoffs (Worked Example)

Quality is hard to compare on a single axis; cost and latency are not. Once you have established that two models meet your quality bar, the choice between them collapses to total cost-of-ownership and SLA fit. The worked example below uses a representative RAG workload — most production systems sit somewhere within an order of magnitude of these inputs.

Scenario: a RAG service answers 100,000 questions. Each request averages 4,000 input tokens (system prompt + retrieved context) and 500 output tokens. Total per run: 400M input tokens and 50M output tokens. Cost per model at list price, no caching, no batch discount:

| Model | Input cost | Output cost | Total | ~Latency p50 |
|---|---|---|---|---|
| Claude Opus 4.7 | $6,000 | $3,750 | $9,750 | 4–8s |
| Claude Sonnet 4.6 | $1,200 | $750 | $1,950 | 2–4s |
| Claude Haiku 4.5 | $320 | $200 | $520 | 0.5–1.5s |
| GPT-5 | $4,000 | $2,000 | $6,000 | 3–7s |
| Gemini 2.x Pro | $2,800 | $1,050 | $3,850 | 3–6s |
| Gemini 2.x Flash | $120 | $125 | $245 | 0.5–1.5s |
| Llama 3.3 70B (Together) | $280 | $40 | $320 | 1–3s |
| Llama 3.3 8B (Groq) | $40 | $5 | $45 | 0.1–0.4s |
| DeepSeek V3 | $108 | $55 | $163 | 2–5s |

The 200x gap between Opus and Llama 8B is precisely why routing matters. Most user questions can be answered by Sonnet or Flash; only a small fraction actually need Opus or GPT-5. Two cost levers crush the absolute number: prompt caching (the 4K system prompt gets cached, dropping input cost by ~85% on the second request) and batch (50% off if the answer can wait).

Latency numbers above are p50 for a non-streaming completion. Streaming changes the user-perceived number — time-to-first-token on Sonnet is around 400ms, on Haiku around 200ms, on Groq-hosted Llama 8B under 100ms. If your UX is conversational, optimize for TTFT and stream; if it is batch (summarize a folder of PDFs overnight), optimize for total tokens-per-second and use the batch API. These are different optimizations and they sometimes pull in opposite directions on model choice.

A simple cost calculator you can drop into a notebook to size a workload:


```python
# cost.py — back-of-envelope LLM workload pricing.
PRICING = {
    # ($ per 1M input tok, $ per 1M output tok)
    "claude-opus-4-7":     (15.00, 75.00),
    "claude-sonnet-4-6":   ( 3.00, 15.00),
    "claude-haiku-4-5":    ( 0.80,  4.00),
    "gpt-5":               (10.00, 40.00),
    "gemini-2-pro":        ( 7.00, 21.00),
    "gemini-2-flash":      ( 0.30,  2.50),
    "llama-3-3-70b":       ( 0.70,  0.80),
    "llama-3-3-8b":        ( 0.10,  0.10),
    "deepseek-v3":         ( 0.27,  1.10),
}

def estimate(model, n_requests, in_tok, out_tok, cache_hit_rate=0.0, batch=False):
    p_in, p_out = PRICING[model]
    eff_in_price = p_in * (1 - 0.9 * cache_hit_rate)   # 90% off cached input
    cost = (n_requests * in_tok / 1e6) * eff_in_price \
         + (n_requests * out_tok / 1e6) * p_out
    return cost * (0.5 if batch else 1.0)

for m in PRICING:
    c = estimate(m, n_requests=100_000, in_tok=4000, out_tok=500,
                 cache_hit_rate=0.8, batch=False)
    print(f"{m:24s} ${c:8,.0f}")
```

Streaming-first applications should also track tokens per second, not just total latency — a 5s response that streams smoothly feels different from a 5s wall of silence. Most providers expose streaming via the same SDK with a stream=True flag; the cost is identical to non-streaming.


```python
# stream.py — measure TTFT and tokens/sec.
import time
from anthropic import Anthropic

client = Anthropic()

t0 = time.time()
ttft = None
tokens = 0
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain CRDTs in 200 words."}],
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.time() - t0
        tokens += len(text.split())   # rough proxy: whitespace words, not true tokens
total = time.time() - t0
print(f"TTFT: {ttft*1000:.0f}ms  total: {total:.2f}s  rate: {tokens/total:.1f} tok/s")
```

Two operational warnings on cost modeling:

  - List prices drift quarterly. A cost model built on today's pricing page is stale within a release cycle; re-verify before any commitment, and version the pricing assumptions alongside the model choice.
  - The headline list price is rarely what you pay. Cache and batch discounts can move the total by 5–10x on eligible workloads, so model the discounted path you will actually run, not the list-price path.


5. Hosting Options

Where you can actually run each model matters as much as the model itself. Procurement, data-residency, and existing cloud commits often decide before quality does.

| Model | Native API | AWS Bedrock | Azure | Vertex AI | Self-host (vLLM / TGI) |
|---|---|---|---|---|---|
| Claude (all) | Yes | Yes | No | Yes | No |
| GPT-5 / GPT-4o | Yes | No | Yes (Azure OpenAI) | No | No |
| Gemini 2.x | AI Studio | No | No | Yes | No |
| Llama 3.3 | Together / Fireworks / Groq | Yes | Yes (MaaS) | Yes (MaaS) | Yes |
| Mistral Large 2 | La Plateforme | Yes | Yes (MaaS) | Yes (MaaS) | Yes (open weights) |
| Qwen 2.5 | DashScope | No | No | No | Yes |
| DeepSeek V3 | Yes | No | No | No | Yes (heavy: 671B params) |

Practical notes: Bedrock is the only place you get Claude, Llama, and Mistral on the same control plane (useful for compliance teams that want one audit surface). Azure OpenAI is the only enterprise route to GPT-5 and the path of least resistance if the rest of your stack is on Azure. Vertex AI is the only managed Gemini Pro option. For self-hosted Llama or Mistral, vLLM is the de-facto serving layer; for very high QPS on small models, Groq's LPU hardware is a category of its own on latency.

Hosting choice has second-order effects that show up months later — quota negotiation surfaces, provisioned-capacity commitments, and feature lag behind the native API among them.

A note on procurement: every hyperscaler offers provisioned throughput options (Bedrock Provisioned, Azure PTU, Vertex committed capacity) where you pre-purchase model capacity at a discount in exchange for a multi-month commitment. These look attractive in spreadsheets but are usually a trap unless you have firm baseline traffic — unused PTUs do not refund. Most teams should run on-demand for the first 3–6 months, get a real load profile, then commit only to the steady-state floor.
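
A back-of-envelope check for that decision. All numbers here are hypothetical — substitute your negotiated commit price, purchased capacity, and on-demand rate:

```python
# ptu_breakeven.py — when does provisioned capacity beat on-demand?

def breakeven_utilization(committed_monthly_usd: float,
                          capacity_tokens_per_month: float,
                          on_demand_usd_per_token: float) -> float:
    """Fraction of the purchased capacity you must actually use for the
    commitment to cost less than paying on-demand for the same tokens."""
    on_demand_cost_at_full = capacity_tokens_per_month * on_demand_usd_per_token
    return committed_monthly_usd / on_demand_cost_at_full

# e.g. a $20k/month commitment covering 2B tokens, vs $15/1M on-demand:
u = breakeven_utilization(20_000, 2e9, 15 / 1e6)
# u ≈ 0.67 — below ~67% sustained utilization, on-demand is cheaper.
```

If your measured load profile does not clear the break-even utilization with margin, stay on-demand.
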

If you are evaluating Bedrock vs native API for Claude specifically, three subtle differences matter: (1) Bedrock applies its own region-specific quota that is separately negotiable from Anthropic's; (2) Bedrock supports cross-region inference profiles that auto-fail-over between regions for higher availability at no extra cost; (3) some Anthropic features (like extended thinking on the latest model) appear on the native API a few weeks before Bedrock. None of these are deal-breakers — just things to verify against your specific workload.


6. Routing Patterns

Two patterns cover most production needs: quality escalation (start on a cheap model, retry on a stronger one when a validator rejects the answer) and availability fallback (same tier, different provider, triggered by rate limits or outages).

LiteLLM is the simplest router. It exposes an OpenAI-compatible endpoint, fronts ~100 providers, and supports both fallback and per-key budget tracking out of the box.


```yaml
# litellm-config.yaml — router config with escalation + fallback.
model_list:
  - model_name: cheap
    litellm_params:
      model: groq/llama-3.3-8b
      api_key: os.environ/GROQ_API_KEY
  - model_name: mid
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-backup
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - smart: ["smart-backup"]      # availability fallback
    - mid:   ["smart"]             # quality escalation handled in app code
  num_retries: 2
  timeout: 30
  routing_strategy: simple-shuffle
  redis_host: localhost            # for cross-process rate limit + budget state
  cache_responses: true
```

Application code can then call any provider through one OpenAI-shaped endpoint:


```python
# router.py — escalate from mid to smart on validator failure.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")

def answer(question: str) -> str:
    for tier in ("mid", "smart"):
        resp = client.chat.completions.create(
            model=tier,
            messages=[
                {"role": "system", "content": "Answer concisely. End with CONFIDENCE: HIGH or LOW."},
                {"role": "user",   "content": question},
            ],
        )
        text = resp.choices[0].message.content
        if "CONFIDENCE: HIGH" in text:
            return text.replace("CONFIDENCE: HIGH", "").strip()
    return text.replace("CONFIDENCE: LOW", "").strip()
```

Self-reported confidence is a weak signal on its own — pair it with a structural validator (was a citation produced? did the SQL parse? did the JSON validate?) for anything that matters.
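
A minimal sketch of such a validator. The [doc-N] citation marker and the JSON expectation are illustrative conventions for this sketch, not a fixed contract:

```python
# validators.py — structural checks that gate escalation.
import json
import re

def valid_json(text: str) -> bool:
    """Does the response parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, ValueError):
        return False

def has_citation(text: str) -> bool:
    """Did the answer cite at least one retrieved source, e.g. [doc-3]?"""
    return re.search(r"\[doc-\d+\]", text) is not None

def should_escalate(text: str, expects_json: bool, expects_citation: bool) -> bool:
    """Structural failures trigger escalation; self-report is the last resort."""
    if expects_json and not valid_json(text):
        return True
    if expects_citation and not has_citation(text):
        return True
    return "CONFIDENCE: LOW" in text
```

The structural checks are cheap, deterministic, and hard for the model to game — which is exactly what you want from an escalation trigger.
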

For agentic systems specifically, a stronger escalation signal is tool-call disagreement: if two cheap models, given the same conversation history, choose different next tool calls, that is a high-information disagreement and worth a smart-model arbitration. Cheap on agreement (the common case), expensive only on actual ambiguity.


```python
# escalate_on_tool_disagreement.py — sketch; call_cheap and call_smart are
# placeholders for your own provider wrappers.
def next_action(state):
    a = call_cheap("haiku-4-5",  state)   # returns {"tool": ..., "args": ...}
    b = call_cheap("gemini-flash", state)
    if a["tool"] == b["tool"] and a["args"] == b["args"]:
        return a                          # cheap path, ~95% of the time
    return call_smart("opus-4-7", state)  # arbitration on disagreement
```

One last consideration on routing: budget caps per request, per user, per tenant. The router is the right place to enforce them because the application doesn't always know the cost in advance. LiteLLM supports a max_budget on each virtual key; pair it with per-tenant Redis counters so a runaway loop on one customer cannot drain the shared budget for the rest.


```python
# budget_guard.py — refuse the call if the tenant is at cap.
import time

import redis

r = redis.Redis()

def check_and_charge(tenant_id: str, est_cost_usd: float, daily_cap: float) -> bool:
    key = f"budget:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    spent = float(r.get(key) or 0)
    if spent + est_cost_usd > daily_cap:
        return False
    r.incrbyfloat(key, est_cost_usd)
    r.expire(key, 86_400 * 2)   # keep one day of slack for late reconciliation
    return True
```

Estimate cost with the cheaper-model price before the call (input tokens are knowable; output is bounded by max_tokens); reconcile with actual usage after the response arrives. The pre-check is what prevents a single request from landing 10x over budget — even an estimate that is off by 30% is enough to keep cost incidents bounded.
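
A sketch of that pre-check, assuming the common rough heuristic of ~4 characters per token (swap in the provider's real tokenizer when precision matters):

```python
# precheck.py — worst-case cost estimate before the call is made.

CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer

def estimate_request_cost(prompt: str, max_tokens: int,
                          in_usd_per_1m: float, out_usd_per_1m: float) -> float:
    """Upper bound: real output is usually shorter than max_tokens, so this
    over-estimates — which is the safe direction for a budget gate."""
    in_tok = len(prompt) / CHARS_PER_TOKEN
    return in_tok / 1e6 * in_usd_per_1m + max_tokens / 1e6 * out_usd_per_1m

# A 16k-char prompt at Sonnet-class pricing ($3 in / $15 out), capped at 1024 out:
cost = estimate_request_cost("x" * 16_000, 1024, 3.00, 15.00)
# ≈ $0.012 input + $0.015 output: under three cents worst case for this request.
```

Feed this estimate into the budget guard above the call, then overwrite the charge with actual usage from the response's usage field.
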

One subtler routing pattern worth knowing: cascade with majority vote. Send a request to two cheap models in parallel; if they agree, return immediately; if they disagree, escalate to the smart model as tiebreaker. On classification-like tasks with a small label space this often beats a single mid-tier call on both quality and cost. The math: you pay 2x cheap on every request, plus 1x smart on the disagreement rate. If cheap-model agreement is 85% and the smart model is 25x the price of cheap, your effective cost is 2 + 0.15 * 25 = 5.75x cheap, well under a single mid-tier call at ~10x cheap. Worth modeling on your own data before adopting.
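
The cascade math from the paragraph above, as a function to run against your own agreement rate (the 25x figure is the smart-to-cheap price ratio assumed in the text):

```python
# cascade_cost.py — effective cost of a two-cheap-plus-tiebreaker cascade.

def cascade_multiple(agreement_rate: float, smart_x: float) -> float:
    """Cost in multiples of one cheap call: two cheap calls on every request,
    plus one smart call on the disagreements."""
    return 2 + (1 - agreement_rate) * smart_x

# Worked numbers from the text: 85% agreement, smart model at 25x cheap → ~5.75x.
# Sensitivity check: at 60% agreement the cascade costs ~12x cheap —
# the pattern stops paying off as disagreement rises.
```
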

Also worth modeling: caching as a routing signal. If your system prompt is large and stable, the cache hit rate determines whether escalation is even worth it — escalating to a fresh model means re-paying for cache misses on the smart side. Some routers (LiteLLM included) support cache-aware fallback: prefer the model that already has the cache warm unless quality demands otherwise.


7. How to Choose

The decision is rarely "which model is best?" — it is "which model satisfies my hardest constraint, and which fallback covers the next-hardest?" Constraints come in this rough priority order in most enterprises: data residency > latency SLA > per-request cost ceiling > quality on hard requests > ecosystem integration. Lock the top constraint first.

A decision tree that fits on one screen:

  1. Is the data legally allowed to leave your network? No → self-host Llama 3.3 70B (or Qwen 2.5 / DeepSeek V3 if you have the GPU budget). Stop. (If "yes but only US-region", a hyperscaler-resold Claude or Llama on AWS/Azure/GCP US works too.)
  2. Does a single request exceed 200K tokens of context? Yes → Claude Opus/Sonnet (1M) or Gemini 2.x Pro (2M). Stop.
  3. Is the task agentic (≥3 tool calls per request)? Yes → Claude Opus 4.7 or GPT-5; use Sonnet for shorter loops.
  4. Is the task latency-critical (sub-second user-facing)? Yes → Haiku 4.5, Gemini Flash, or Llama 3.3 8B on Groq.
  5. Is this a high-volume background job (millions/day, no user waiting)? Yes → cheapest capable model + batch API. Usually Gemini Flash or DeepSeek V3.
  6. None of the above → default to Claude Sonnet 4.6 or GPT-5; this is the right starting point for ~80% of new applications.
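
For teams that want the tree executable next to the router config, a sketch — the returned strings are illustrative labels from this page, not exact API model identifiers:

```python
# choose_model.py — the decision tree above as a function.

def choose(data_can_leave_network: bool, max_context_tokens: int,
           tool_calls_per_request: int, needs_subsecond: bool,
           offline_high_volume: bool) -> str:
    if not data_can_leave_network:
        return "self-host: llama-3.3-70b"               # step 1
    if max_context_tokens > 200_000:
        return "claude-opus/sonnet or gemini-2-pro"     # step 2
    if tool_calls_per_request >= 3:
        return "claude-opus-4-7 or gpt-5"               # step 3
    if needs_subsecond:
        return "haiku-4-5 / gemini-flash / llama-8b on groq"   # step 4
    if offline_high_volume:
        return "gemini-flash or deepseek-v3 + batch"    # step 5
    return "claude-sonnet-4-6 or gpt-5"                 # step 6: default
```

The ordering matters: constraints are checked hardest-first, so a residency requirement short-circuits everything below it.
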

Three patterns to avoid as starting points: (a) defaulting to GPT-4o because of "ecosystem familiarity" — the model is a generation behind on reasoning, and its modest price edge over Sonnet does not cover the quality gap; (b) starting on Opus because "best is best" — you'll burn cash for months before realizing Sonnet covered the same workload at 1/5 the cost; (c) starting on a self-hosted Llama before you have measured demand — GPU sunk cost is hard to walk back, and a managed API gives you six months of free elasticity.

Whatever you pick, build the routing seam from day one. Hardcoding a single model name in your application is the most expensive shortcut in LLM engineering — re-pricing or re-availability happens on the provider's schedule, not yours.

One last piece of pragmatism: the same code path should work across at least two providers from launch. The OpenAI-compatible chat completions schema is the de-facto lingua franca; LiteLLM, Together, Fireworks, Groq, vLLM, and Anthropic (via a thin adapter) all speak it. Stay on this shape and provider migration becomes a config change, not a refactor.


```python
# providers.py — one client interface, three providers, one swap.
import os

from openai import OpenAI

PROVIDERS = {
    # Anthropic exposes an OpenAI-compatible endpoint; the other two are native.
    "anthropic": OpenAI(base_url="https://api.anthropic.com/v1/",
                        api_key=os.environ["ANTHROPIC_API_KEY"]),
    "openai":    OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
    "together":  OpenAI(base_url="https://api.together.xyz/v1",
                        api_key=os.environ["TOGETHER_API_KEY"]),
}

def chat(provider: str, model: str, messages: list[dict]) -> str:
    resp = PROVIDERS[provider].chat.completions.create(
        model=model, messages=messages, max_tokens=1024,
    )
    return resp.choices[0].message.content

# Same call shape, three different backends:
chat("anthropic", "claude-sonnet-4-6", msgs)
chat("openai",    "gpt-5",             msgs)
chat("together",  "meta-llama/Llama-3.3-70B-Instruct-Turbo", msgs)
```

8. Common Pitfalls

  - Hardcoding one model name with no routing seam — re-pricing and deprecation happen on the provider's schedule, not yours.
  - Modeling cost at list price and ignoring cache and batch discounts, which routinely move the total by 5–10x on eligible workloads.
  - Treating the headline context window as a recall guarantee; accuracy degrades well before the published maximum.
  - Committing to provisioned throughput before you have a measured load profile; unused capacity does not refund.
  - Escalating on self-reported confidence alone instead of pairing it with a structural validator.


9. Common Interview Q&A

When would you choose Claude Sonnet over Claude Opus in production?

For the bulk of RAG and chat traffic. Sonnet is roughly 1/5 the price and about 2x faster, with quality close enough to Opus on summarization, classification, and short-loop tool use. Reserve Opus for the requests that fail a Sonnet validator — long-document analysis, agent loops beyond ~5 steps, or complex code edits across multiple files.

What is the cheapest way to cut your Anthropic or OpenAI bill in half without changing models?

Prompt caching plus batch. Caching a 4K system prompt drops repeated-input cost by ~85%; the batch API takes another 50% off everything if you can wait up to 24 hours. Together those routinely produce 5–10x reductions on workloads with stable system prompts and no real-time SLA. Only after that should you consider routing to a smaller model.

Why might you self-host Llama 3.3 70B even when Claude Sonnet is cheaper per token?

Three reasons: data cannot leave your network (regulatory or contractual), unit economics at very high QPS where amortized GPU cost beats per-token pricing, or the need to fine-tune on proprietary data. None of these is "Llama is better" — they are operational constraints that closed APIs cannot satisfy.

How would you set up a fallback when your primary provider rate-limits you?

Put LiteLLM (or an equivalent router) in front of two providers — for example Claude Sonnet 4.6 on Anthropic primary, GPT-5 on Azure backup. Configure automatic retry on 429 and 5xx with the second provider. Keep prompts and tool schemas neutral enough to work on both; track per-provider success rate and cost so you can shift the default if one degrades.

You need to summarize 50,000 1-page PDFs nightly. Which model and why?

Gemini 2.x Flash or DeepSeek V3 with the batch API. The job is offline (no latency SLA), per-document cost dominates, and 1-page PDFs do not need frontier reasoning. Either model at batch pricing comes in well under $50 for the run; Flash gets you native PDF/image input out of the box.

Why can context window not be compared on the spec sheet alone?

Published max context is the input limit, not the recall guarantee. Models degrade differently as context fills — needle-in-a-haystack accuracy at 800K is much lower than at 80K for most models, and hallucination on multi-document synthesis rises with input size. Always benchmark on your own retrieval-augmented documents at the size you actually expect to send, not at the headline number.

