vLLM and Quantization

Self-hosting open-weights LLMs in 2026 is a different problem than it was in 2023. Llama 3.x and Qwen 2.x match or beat hosted GPT-3.5-class models on most tasks, vLLM is the de facto serving stack for GPUs, and four quantization formats (GGUF, AWQ, GPTQ, FP8) cover almost every deployment from a Mac mini to an H100 cluster. This page covers what each layer does, how to launch a serving stack, and how to choose between them.



1. vLLM Architecture

vLLM (Kwon et al., SOSP 2023) brought two ideas to LLM serving that the rest of the field then copied: PagedAttention, which manages the KV cache in fixed-size blocks instead of contiguous per-request allocations, and continuous batching, which admits new requests into the running batch at every decode step instead of waiting for the whole batch to drain.

Add to that: tensor parallelism (split a model's weights across multiple GPUs), pipeline parallelism (split layers across nodes), chunked prefill (interleave prompt processing with generation for other requests), prefix caching (reuse KV state for shared system prompts), and speculative decoding (a small draft model proposes tokens, the big model verifies in parallel).
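Continuous batching, the scheduling half of the story, fits in a few lines. A toy sketch (the scheduler below is invented for illustration and tracks only how many tokens each request still owes, not real model state):

```python
# Illustrative sketch of continuous batching (not vLLM's real scheduler):
# finished requests leave the batch and waiting ones join at every decode
# step, so GPU slots never sit idle waiting for the longest request.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}    # request_id -> tokens still to generate
    steps = []      # batch composition at each decode step, for inspection

    while waiting or running:
        # Admit waiting requests whenever a slot is free (the key idea).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps.append(sorted(running))
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]    # leaves immediately, no batch drain
    return steps

steps = continuous_batching([("a", 1), ("b", 3), ("c", 2), ("d", 2)], max_batch=2)
```

Note how "c" starts the moment "a" finishes; a static batcher would leave that slot idle until "b" also completed.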


2. Launching vLLM with Llama 3 or Qwen 2


# Single-GPU launch — Llama 3.1 8B Instruct, fp16, on one A10G or 4090.
pip install "vllm>=0.6"

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching
  

# Larger model, two GPUs with tensor parallelism — Qwen 2.5 72B AWQ on 2x A100 80GB.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill
  

Useful flags worth knowing: --max-num-seqs (cap concurrent sequences), --dtype (force bf16 or fp16), --served-model-name (alias the model id clients send), --api-key (require a bearer token), and --swap-space (CPU swap area for preempted KV blocks).


3. OpenAI-Compatible API

vLLM exposes the OpenAI Chat Completions API on the same port. Any client that talks to OpenAI talks to vLLM unchanged — point base_url at your server.


from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a SQL query to find duplicate emails in users."}],
    temperature=0.2,
    max_tokens=400,
)
print(resp.choices[0].message.content)
  

Tool use (function calling) works too if you launch with --enable-auto-tool-choice --tool-call-parser llama3_json (or hermes, mistral, depending on the model's tuning).


vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json
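
On the client side, tools are passed in the standard OpenAI schema and the model's tool_calls are dispatched to local functions. A sketch with an invented get_weather tool, run here against a hand-built example payload rather than a live server:

```python
# Client side of tool use with a made-up get_weather tool. The schema
# below is the standard OpenAI tools format that vLLM accepts; the
# dispatch runs against a hand-written example payload so it is
# self-contained.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call):
    """Route a tool call from the model to a local function."""
    args = json.loads(tool_call["function"]["arguments"])
    if tool_call["function"]["name"] == "get_weather":
        return {"city": args["city"], "temp_c": 21}   # stub implementation
    raise ValueError("unknown tool: " + tool_call["function"]["name"])

# Shape of what comes back in resp.choices[0].message.tool_calls:
example_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
}
result = dispatch(example_call)
```

In a real loop you would append the tool result as a "tool"-role message and call the model again.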
  

4. Quantization Formats: GGUF, AWQ, GPTQ, FP8

| Format | Bits | Runtime | Strengths | Notes |
|---|---|---|---|---|
| GGUF | 2-8 (Q4_K_M most common) | llama.cpp, Ollama, LM Studio | CPU + GPU + Apple Silicon; tiny memory footprint; single-file distribution | Quality drop noticeable below Q4. Best for laptops and small servers. |
| AWQ | 4 (activation-aware) | vLLM, TGI, AutoAWQ | Preserves quality of "important" weights; ~3x speedup vs fp16 on Ampere/Hopper | Requires a calibration dataset; pre-quantized variants on HF for popular models. |
| GPTQ | 3-4 (post-training) | vLLM, ExLlama, TGI | Mature; many community quants; works on most GPUs | Slightly worse quality than AWQ at the same bit-width on most evals. |
| FP8 (E4M3/E5M2) | 8 | vLLM, TensorRT-LLM | Near-lossless; native Hopper (H100/H200) and Ada (L40, 4090) support; ~2x throughput vs bf16 | Pre-Hopper GPUs emulate FP8 in software — not a win on A100/A10G. |
| INT8 (SmoothQuant, LLM.int8) | 8 | vLLM, bitsandbytes | Wide hardware support; modest memory savings | Largely superseded by FP8 on Hopper and AWQ on Ampere. |
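To make 4-bit weight-only quantization concrete, here is a toy group-wise quantizer (absmax scale per group; real AWQ chooses scales from activation statistics and GPTQ solves a layer-wise reconstruction, but the storage story, 4-bit integers plus one scale per group, is the same):

```python
# Toy group-wise 4-bit weight-only quantization (absmax per group).
# Each group of weights becomes small signed integers plus one float
# scale, which is where the ~3-4x memory saving comes from.

def quantize_group(weights, bits=4):
    """Map floats to signed ints in [-(2^(b-1)-1), 2^(b-1)-1] plus a scale."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.42, -1.31, 0.07, 0.88]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))   # <= scale / 2
```

The round-to-nearest error is bounded by half the scale, which is why outlier weights (a single large value inflating the scale for its whole group) are what activation-aware methods work to protect.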

5. Picking a Format by Hardware

A short decision rule, consistent with the table above. H100/H200 (Hopper): FP8, native and near-lossless. A100/A10G (Ampere): AWQ 4-bit (FP8 is emulated there, not a win), with GPTQ as a fallback when no AWQ quant exists. L40S/4090 (Ada): FP8 runs natively, but AWQ 4-bit halves the weight footprint again, so choose by whether the model fits. Laptops, CPUs, and Apple Silicon: GGUF Q4_K_M via llama.cpp or Ollama.

6. Throughput vs Quality Numbers

Approximate, drawn from public vLLM and llama.cpp benchmarks on Llama 3.1 8B, A100 80GB, batch 32, prompt 512 / generation 256:

| Format | Throughput (tok/s/GPU, output) | Memory for weights | MMLU drop vs bf16 |
|---|---|---|---|
| bf16 | ~3,500 | ~16 GB | 0.0 |
| FP8 (H100) | ~6,500 | ~8 GB | ~0.1-0.3 pp |
| AWQ 4-bit | ~5,500 | ~5.5 GB | ~0.5-1.0 pp |
| GPTQ 4-bit | ~5,000 | ~5.5 GB | ~0.7-1.5 pp |
| GGUF Q4_K_M | ~150 (CPU) / ~1,800 (GPU offload) | ~5 GB | ~1-2 pp |

Quality differences below ~1 percentage point are often invisible in domain-specific tasks; for general benchmarks they matter at the margins. Always re-run your own eval suite (see RAG Evaluation) — published deltas are starting points, not destinations.


7. Speculative Decoding

A small "draft" model proposes k tokens; the target model verifies them in a single forward pass. Accepted tokens are kept; at the first rejection, the target's own token replaces the draft's and the rest of the proposal is discarded. Net result: 1.5-3x lower latency at identical quality (verification is exact).
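The accept/verify loop can be sketched in a few lines. This is a toy version with greedy verification (draft_fn and target_fn are stand-ins for the two models; the real algorithm checks all k positions in one batched forward pass and matches full sampling distributions, which is what keeps the output exact):

```python
# Toy speculative decoding with greedy verification. draft_fn and
# target_fn stand in for the two models: each maps a token sequence
# to its next token.

def speculative_step(prefix, draft_fn, target_fn, k=4):
    """Return the tokens accepted this step (always at least one)."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_fn(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks each position; in practice this is a
    #    single batched forward pass, not a Python loop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        correct = target_fn(ctx)
        if t == correct:
            accepted.append(t)        # draft agreed: token is free
            ctx.append(t)
        else:
            accepted.append(correct)  # disagreement: take the target's token
            break                     # and discard the rest of the draft
    else:
        accepted.append(target_fn(ctx))  # bonus token when all k accepted
    return accepted

# Stand-in models: both continue an arithmetic sequence, but the
# draft starts making mistakes after token 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else ctx[-1] + 2

out = speculative_step([0], draft, target, k=4)
```

One target pass here yields four tokens instead of one, which is the entire speedup mechanism: latency drops in proportion to the draft's acceptance rate.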


# Llama 3.1 70B as target, 8B as draft.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager
  

Speculation pays off most for code, structured output, and other token-predictable workloads where the small model agrees with the big one often. On free-form creative writing the win is smaller.


8. vLLM vs TGI vs SGLang vs llama.cpp

| Server | Best at | Tradeoffs |
|---|---|---|
| vLLM | General-purpose GPU serving with OpenAI-compatible API; widest model coverage; strong AWQ/FP8/GPTQ kernels; the default choice | Heavier dependency stack; CPU-only inference is not its goal. |
| HF TGI | Tight Hugging Face integration, used inside Inference Endpoints; production-grade telemetry | Model coverage and feature velocity have lagged vLLM in 2025; Apache-2 license restored after the brief license change. |
| SGLang | Highest throughput on structured / agentic workloads; RadixAttention prefix caching is excellent; first-class JSON-grammar decoding | Smaller community than vLLM; APIs evolve quickly. |
| llama.cpp / Ollama | CPU, Apple Silicon, and small-GPU deployments; dead-simple distribution; GGUF ecosystem | Single-request throughput is fine; many-concurrent-request throughput is far below vLLM. |
| TensorRT-LLM | Squeezing the last 10-20% of throughput out of NVIDIA hardware in production | Build complexity; per-model engine compilation; less flexible than vLLM. |

9. Production Tuning

The levers that matter most in practice: size --max-model-len to your real traffic rather than the model maximum, since the KV cache is provisioned against it; leave headroom in --gpu-memory-utilization for activation spikes; enable prefix caching whenever requests share a system prompt; cap request size at the gateway so one pathological prompt cannot evict everyone else's KV blocks; alert on KV cache usage above 90%, because preemption is where tail latency explodes; and keep an fp16 fallback warm when serving a quantized model.

Common Interview Questions:

What is PagedAttention and why does it matter?

PagedAttention treats the KV cache like virtual memory: instead of allocating a contiguous block per request sized for max-context (which wastes >60% of GPU memory on padding), it allocates fixed-size blocks (typically 16 tokens) and maintains a per-request page table mapping logical token positions to physical blocks. The win: near-zero internal fragmentation, so you can hold 2–4x more concurrent requests in the same GPU memory. It also enables cheap prefix sharing — two requests with the same system prompt point to the same physical pages. PagedAttention is the core innovation that lets vLLM hit 2–24x throughput vs naive HuggingFace serving.
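A toy allocator makes the page-table idea concrete (block size 16 as above; the class and its free-list policy are invented for illustration):

```python
# Toy PagedAttention-style KV block allocator: fixed-size blocks, a
# free list, and a per-request page table mapping logical block index
# to physical block id. Blocks are allocated on demand as a request
# generates tokens, never reserved upfront for max-context.
BLOCK = 16   # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}                     # request_id -> [physical ids]

    def append_token(self, rid, pos):
        """Ensure request rid has a block covering token position pos."""
        table = self.tables.setdefault(rid, [])
        if pos // BLOCK >= len(table):       # crossed a block boundary
            table.append(self.free.pop())    # allocate on demand
        return table[pos // BLOCK]           # physical block for this token

    def release(self, rid):
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(rid))

alloc = BlockAllocator(num_blocks=8)
for pos in range(20):                        # a 20-token request
    alloc.append_token("req-1", pos)
blocks_used = len(alloc.tables["req-1"])     # ceil(20 / 16) = 2 blocks
```

A contiguous allocator sized for an 8k context would have reserved 512 blocks for this 20-token request; here it holds exactly 2, and prefix sharing is just two page tables pointing at the same physical ids.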

Walk me through AWQ vs GPTQ vs FP8 vs GGUF.

AWQ (activation-aware) and GPTQ are both 4-bit weight-only quantization for GPU inference; AWQ tends to preserve quality better on instruction-tuned models, GPTQ is older and broadly supported. Both keep activations in fp16, so memory savings are on weights only. FP8 is 8-bit floating-point (vs int8) supported natively on H100/H200/Blackwell — less aggressive compression than 4-bit but near-lossless and fast because the hardware has dedicated FP8 tensor cores. GGUF is the llama.cpp format, primarily for CPU and Apple Silicon inference, supporting 2–8 bit mixed-precision with k-quants; not used in vLLM/GPU server stacks. Rule of thumb: H100 in production = FP8; A100/L40S = AWQ 4-bit; laptop = GGUF Q4_K_M.

How do you compute throughput for a vLLM deployment?

Two numbers matter: prefill throughput (input tokens/sec, bottlenecked by compute) and decode throughput (output tokens/sec, bottlenecked by memory bandwidth). Decode is the constraint for long generations. Rough math: peak decode tokens/sec ≈ (HBM_bandwidth_GB/s) / (model_size_GB) per concurrent stream, so a 70B model at fp16 (140 GB) on an H100 (3.35 TB/s) caps around 24 tok/s per stream — quantize to AWQ 4-bit (~35 GB) and you hit ~95 tok/s per stream. Continuous batching multiplies aggregate throughput by the number of concurrent streams the KV cache fits, often 16–64x.
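That arithmetic as a function, using the constants from the answer above (H100 HBM ≈ 3.35 TB/s, 70B fp16 ≈ 140 GB, AWQ 4-bit ≈ 35 GB). It ignores KV-cache reads and kernel overheads, so treat it as an upper bound:

```python
# Memory-bandwidth ceiling on decode: every generated token must read
# all the weights once, so tok/s per stream ~= bandwidth / weight bytes.

def decode_tps_per_stream(hbm_gb_per_s, weights_gb):
    return hbm_gb_per_s / weights_gb

fp16_tps = decode_tps_per_stream(3350, 140)   # 70B at fp16 on H100
awq_tps = decode_tps_per_stream(3350, 35)     # same model, AWQ 4-bit
```

Quantization's 4x throughput gain here falls straight out of the formula: same bandwidth, a quarter of the bytes per token.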

When does it make sense to self-host vs use a frontier API?

Self-host wins on three axes: data residency (regulated data that can't leave your VPC), unit economics at scale (~>1B tokens/month is the rough crossover where API spend exceeds GPU rental), and customization (you've fine-tuned and need that exact checkpoint). Frontier API wins on quality (Sonnet/GPT-5 still beat any open model on hard reasoning), capacity elasticity (no GPU waitlists), and ops burden (no node failures at 3am). Most teams should start on the API and only self-host the specific high-volume, low-complexity paths once the bill justifies the engineering investment.

What metrics do you watch in production vLLM?

From the /metrics endpoint: vllm:num_requests_running and vllm:num_requests_waiting tell you queue pressure; vllm:gpu_cache_usage_perc warns you before the KV cache OOMs and starts preempting; vllm:time_to_first_token_seconds P99 catches prefill regressions; vllm:time_per_output_token_seconds catches decode regressions. Layer on standard infra metrics (GPU util, HBM util, NCCL errors on multi-GPU). Set an alert on KV cache >90% — that's where tail latency explodes.
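As a sketch of wiring up the KV-cache alert, here is a minimal parser for the Prometheus text format that /metrics emits, run against a hand-written sample scrape (the metric names are the real ones above; the sample values are invented, and in production you would fetch the live endpoint instead):

```python
# Minimal parser for the Prometheus text exposition format, plus the
# KV-cache >90% alert. SAMPLE stands in for a real scrape of
# http://host:8000/metrics.

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="llama"} 12.0
vllm:num_requests_waiting{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.93
"""

def parse_metrics(text):
    """Return {metric_name: value}, ignoring comments and labels."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        name = name_and_labels.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
kv_alert = m["vllm:gpu_cache_usage_perc"] > 0.90   # the >90% alert
```

A real deployment would let Prometheus scrape the endpoint directly and express the same threshold as an alerting rule; the point here is only what the raw data looks like.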

How do you safely roll out a new quantized model?

Quantization is "near-lossless on average" — your specific task may regress more. The rollout plan: run your task-specific eval suite (RAGAS for RAG, exact-match for extraction, etc.) on the quantized model and compare to fp16 baseline; ship to a canary serving 1–5% of traffic with side-by-side logging; compare latency and quality metrics for 24–48 hours; ramp gradually. Cap request size at the gateway because a single 100k-token prompt can trash the KV cache for everyone; reject pathological inputs upstream. Always keep the fp16 model warm on a fallback pod for instant rollback.
