vLLM and Quantization

Self-hosting open-weights LLMs in 2026 is a different problem than it was in 2023. Llama 3.x and Qwen 2.x match or beat hosted GPT-3.5-class models on most tasks, vLLM is the de facto serving stack for GPUs, and four quantization formats (GGUF, AWQ, GPTQ, FP8) cover almost every deployment from a Mac mini to an H100 cluster. This page covers what each layer does, how to launch a serving stack, and how to choose between them.



1. vLLM Architecture

vLLM (Kwon et al., SOSP 2023) brought two ideas to LLM serving that the rest of the field then copied: PagedAttention, which manages the KV cache in fixed-size blocks instead of contiguous per-request allocations, and continuous batching, which admits new requests into the running batch at every decode step instead of waiting for the whole batch to drain.

Add to that: tensor parallelism (split a model's weights across multiple GPUs), pipeline parallelism (split layers across nodes), chunked prefill (interleave prompt processing with generation for other requests), prefix caching (reuse KV state for shared system prompts), and speculative decoding (a small draft model proposes tokens, the big model verifies in parallel).
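Continuous batching, the scheduling half of the story, fits in a few lines. A toy sketch (the scheduler below is invented for illustration and tracks only how many tokens each request still owes, not real model state):

```python
# Illustrative sketch of continuous batching (not vLLM's real scheduler):
# finished requests leave the batch and waiting ones join at every decode
# step, so GPU slots never sit idle waiting for the longest request.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}    # request_id -> tokens still to generate
    steps = []      # batch composition at each decode step, for inspection

    while waiting or running:
        # Admit waiting requests whenever a slot is free (the key idea).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps.append(sorted(running))
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]    # leaves immediately, no batch drain
    return steps

steps = continuous_batching([("a", 1), ("b", 3), ("c", 2), ("d", 2)], max_batch=2)
```

Note how "c" starts the moment "a" finishes; a static batcher would leave that slot idle until "b" also completed.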


2. Launching vLLM with Llama 3 or Qwen 2


# Single-GPU launch — Llama 3.1 8B Instruct, fp16, on one A10G or 4090.
pip install "vllm>=0.6"

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching
  

# Larger model, two GPUs with tensor parallelism — Qwen 2.5 72B AWQ on 2x A100 80GB.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill
  

Useful flags worth knowing: --max-num-seqs (cap concurrent sequences), --dtype (force bf16 or fp16), --served-model-name (alias the model id clients send), --api-key (require a bearer token), and --swap-space (CPU swap area for preempted KV blocks).


3. OpenAI-Compatible API

vLLM exposes the OpenAI Chat Completions API on the same port. Any client that talks to OpenAI talks to vLLM unchanged — point base_url at your server.


from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a SQL query to find duplicate emails in users."}],
    temperature=0.2,
    max_tokens=400,
)
print(resp.choices[0].message.content)
  

Tool use (function calling) works too if you launch with --enable-auto-tool-choice --tool-call-parser llama3_json (or hermes, mistral, depending on the model's tuning).


vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json
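
On the client side, tools are passed in the standard OpenAI schema and the model's tool_calls are dispatched to local functions. A sketch with an invented get_weather tool, run here against a hand-built example payload rather than a live server:

```python
# Client side of tool use with a made-up get_weather tool. The schema
# below is the standard OpenAI tools format that vLLM accepts; the
# dispatch runs against a hand-written example payload so it is
# self-contained.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call):
    """Route a tool call from the model to a local function."""
    args = json.loads(tool_call["function"]["arguments"])
    if tool_call["function"]["name"] == "get_weather":
        return {"city": args["city"], "temp_c": 21}   # stub implementation
    raise ValueError("unknown tool: " + tool_call["function"]["name"])

# Shape of what comes back in resp.choices[0].message.tool_calls:
example_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
}
result = dispatch(example_call)
```

In a real loop you would append the tool result as a "tool"-role message and call the model again.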
  

4. Quantization Formats: GGUF, AWQ, GPTQ, FP8

| Format | Bits | Runtime | Strengths | Notes |
|---|---|---|---|---|
| GGUF | 2-8 (Q4_K_M most common) | llama.cpp, Ollama, LM Studio | CPU + GPU + Apple Silicon; tiny memory footprint; single-file distribution | Quality drop noticeable below Q4. Best for laptops and small servers. |
| AWQ | 4 (activation-aware) | vLLM, TGI, AutoAWQ | Preserves quality of "important" weights; ~3x speedup vs fp16 on Ampere/Hopper | Requires a calibration dataset; pre-quantized variants on HF for popular models. |
| GPTQ | 3-4 (post-training) | vLLM, ExLlama, TGI | Mature; many community quants; works on most GPUs | Slightly worse quality than AWQ at the same bit-width on most evals. |
| FP8 (E4M3/E5M2) | 8 | vLLM, TensorRT-LLM | Near-lossless; native Hopper (H100/H200) and Ada (L40, 4090) support; ~2x throughput vs bf16 | Pre-Hopper GPUs emulate FP8 in software — not a win on A100/A10G. |
| INT8 (SmoothQuant, LLM.int8) | 8 | vLLM, bitsandbytes | Wide hardware support; modest memory savings | Largely superseded by FP8 on Hopper and AWQ on Ampere. |
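To make 4-bit weight-only quantization concrete, here is a toy group-wise quantizer (absmax scale per group; real AWQ chooses scales from activation statistics and GPTQ solves a layer-wise reconstruction, but the storage story, 4-bit integers plus one scale per group, is the same):

```python
# Toy group-wise 4-bit weight-only quantization (absmax per group).
# Each group of weights becomes small signed integers plus one float
# scale, which is where the ~3-4x memory saving comes from.

def quantize_group(weights, bits=4):
    """Map floats to signed ints in [-(2^(b-1)-1), 2^(b-1)-1] plus a scale."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.42, -1.31, 0.07, 0.88]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))   # <= scale / 2
```

The round-to-nearest error is bounded by half the scale, which is why outlier weights (a single large value inflating the scale for its whole group) are what activation-aware methods work to protect.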

5. Picking a Format by Hardware

A short decision rule, consistent with the table above. H100/H200 (Hopper): FP8, native and near-lossless. A100/A10G (Ampere): AWQ 4-bit (FP8 is emulated there, not a win), with GPTQ as a fallback when no AWQ quant exists. L40S/4090 (Ada): FP8 runs natively, but AWQ 4-bit halves the weight footprint again, so choose by whether the model fits. Laptops, CPUs, and Apple Silicon: GGUF Q4_K_M via llama.cpp or Ollama.

6. Throughput vs Quality Numbers

Approximate, drawn from public vLLM and llama.cpp benchmarks on Llama 3.1 8B, A100 80GB, batch 32, prompt 512 / generation 256:

| Format | Throughput (tok/s/GPU, output) | Memory for weights | MMLU drop vs bf16 |
|---|---|---|---|
| bf16 | ~3,500 | ~16 GB | 0.0 |
| FP8 (H100) | ~6,500 | ~8 GB | ~0.1-0.3 pp |
| AWQ 4-bit | ~5,500 | ~5.5 GB | ~0.5-1.0 pp |
| GPTQ 4-bit | ~5,000 | ~5.5 GB | ~0.7-1.5 pp |
| GGUF Q4_K_M | ~150 (CPU) / ~1,800 (GPU offload) | ~5 GB | ~1-2 pp |

Quality differences below ~1 percentage point are often invisible in domain-specific tasks; for general benchmarks they matter at the margins. Always re-run your own eval suite (see RAG Evaluation) — published deltas are starting points, not destinations.


7. Speculative Decoding

A small "draft" model proposes k tokens; the target model verifies them in a single forward pass. Accepted tokens are kept; at the first rejection, the target's own token replaces the draft's and the rest of the proposal is discarded. Net result: 1.5-3x lower latency at identical quality (verification is exact).
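The accept/verify loop can be sketched in a few lines. This is a toy version with greedy verification (draft_fn and target_fn are stand-ins for the two models; the real algorithm checks all k positions in one batched forward pass and matches full sampling distributions, which is what keeps the output exact):

```python
# Toy speculative decoding with greedy verification. draft_fn and
# target_fn stand in for the two models: each maps a token sequence
# to its next token.

def speculative_step(prefix, draft_fn, target_fn, k=4):
    """Return the tokens accepted this step (always at least one)."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_fn(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks each position; in practice this is a
    #    single batched forward pass, not a Python loop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        correct = target_fn(ctx)
        if t == correct:
            accepted.append(t)        # draft agreed: token is free
            ctx.append(t)
        else:
            accepted.append(correct)  # disagreement: take the target's token
            break                     # and discard the rest of the draft
    else:
        accepted.append(target_fn(ctx))  # bonus token when all k accepted
    return accepted

# Stand-in models: both continue an arithmetic sequence, but the
# draft starts making mistakes after token 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else ctx[-1] + 2

out = speculative_step([0], draft, target, k=4)
```

One target pass here yields four tokens instead of one, which is the entire speedup mechanism: latency drops in proportion to the draft's acceptance rate.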


# Llama 3.1 70B as target, 8B as draft.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager
  

Speculation pays off most for code, structured output, and other token-predictable workloads where the small model agrees with the big one often. On free-form creative writing the win is smaller.


8. vLLM vs TGI vs SGLang vs llama.cpp

| Server | Best at | Tradeoffs |
|---|---|---|
| vLLM | General-purpose GPU serving with OpenAI-compatible API; widest model coverage; strong AWQ/FP8/GPTQ kernels; the default choice | Heavier dependency stack; CPU-only inference is not its goal. |
| HF TGI | Tight Hugging Face integration, used inside Inference Endpoints; production-grade telemetry | Model coverage and feature velocity have lagged vLLM in 2025; Apache-2 license restored after the brief license change. |
| SGLang | Highest throughput on structured / agentic workloads; RadixAttention prefix caching is excellent; first-class JSON-grammar decoding | Smaller community than vLLM; APIs evolve quickly. |
| llama.cpp / Ollama | CPU, Apple Silicon, and small-GPU deployments; dead-simple distribution; GGUF ecosystem | Single-request throughput is fine; many-concurrent-request throughput is far below vLLM. |
| TensorRT-LLM | Squeezing the last 10-20% of throughput out of NVIDIA hardware in production | Build complexity; per-model engine compilation; less flexible than vLLM. |

9. Production Tuning

The levers that matter most in practice: size --max-model-len to your real traffic rather than the model maximum, since the KV cache is provisioned against it; leave headroom in --gpu-memory-utilization for activation spikes; enable prefix caching whenever requests share a system prompt; cap request size at the gateway so one pathological prompt cannot evict everyone else's KV blocks; alert on KV cache usage above 90%, because preemption is where tail latency explodes; and keep an fp16 fallback warm when serving a quantized model.

Common Interview Questions:

What is PagedAttention and why does it matter?

PagedAttention treats the KV cache like virtual memory: instead of allocating a contiguous block per request sized for max-context (which wastes >60% of GPU memory on padding), it allocates fixed-size blocks (typically 16 tokens) and maintains a per-request page table mapping logical token positions to physical blocks. The win: near-zero internal fragmentation, so you can hold 2–4x more concurrent requests in the same GPU memory. It also enables cheap prefix sharing — two requests with the same system prompt point to the same physical pages. PagedAttention is the core innovation that lets vLLM hit 2–24x throughput vs naive HuggingFace serving.
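A toy allocator makes the page-table idea concrete (block size 16 as above; the class and its free-list policy are invented for illustration):

```python
# Toy PagedAttention-style KV block allocator: fixed-size blocks, a
# free list, and a per-request page table mapping logical block index
# to physical block id. Blocks are allocated on demand as a request
# generates tokens, never reserved upfront for max-context.
BLOCK = 16   # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}                     # request_id -> [physical ids]

    def append_token(self, rid, pos):
        """Ensure request rid has a block covering token position pos."""
        table = self.tables.setdefault(rid, [])
        if pos // BLOCK >= len(table):       # crossed a block boundary
            table.append(self.free.pop())    # allocate on demand
        return table[pos // BLOCK]           # physical block for this token

    def release(self, rid):
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(rid))

alloc = BlockAllocator(num_blocks=8)
for pos in range(20):                        # a 20-token request
    alloc.append_token("req-1", pos)
blocks_used = len(alloc.tables["req-1"])     # ceil(20 / 16) = 2 blocks
```

A contiguous allocator sized for an 8k context would have reserved 512 blocks for this 20-token request; here it holds exactly 2, and prefix sharing is just two page tables pointing at the same physical ids.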

Walk me through AWQ vs GPTQ vs FP8 vs GGUF.

AWQ (activation-aware) and GPTQ are both 4-bit weight-only quantization for GPU inference; AWQ tends to preserve quality better on instruction-tuned models, GPTQ is older and broadly supported. Both keep activations in fp16, so memory savings are on weights only. FP8 is 8-bit floating-point (vs int8) supported natively on H100/H200/Blackwell — less aggressive compression than 4-bit but near-lossless and fast because the hardware has dedicated FP8 tensor cores. GGUF is the llama.cpp format, primarily for CPU and Apple Silicon inference, supporting 2–8 bit mixed-precision with k-quants; not used in vLLM/GPU server stacks. Rule of thumb: H100 in production = FP8; A100/L40S = AWQ 4-bit; laptop = GGUF Q4_K_M.

How do you compute throughput for a vLLM deployment?

Two numbers matter: prefill throughput (input tokens/sec, bottlenecked by compute) and decode throughput (output tokens/sec, bottlenecked by memory bandwidth). Decode is the constraint for long generations. Rough math: peak decode tokens/sec ≈ (HBM_bandwidth_GB/s) / (model_size_GB) per concurrent stream, so a 70B model at fp16 (140 GB) on an H100 (3.35 TB/s) caps around 24 tok/s per stream — quantize to AWQ 4-bit (~35 GB) and you hit ~95 tok/s per stream. Continuous batching multiplies aggregate throughput by the number of concurrent streams the KV cache fits, often 16–64x.
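That arithmetic as a function, using the constants from the answer above (H100 HBM ≈ 3.35 TB/s, 70B fp16 ≈ 140 GB, AWQ 4-bit ≈ 35 GB). It ignores KV-cache reads and kernel overheads, so treat it as an upper bound:

```python
# Memory-bandwidth ceiling on decode: every generated token must read
# all the weights once, so tok/s per stream ~= bandwidth / weight bytes.

def decode_tps_per_stream(hbm_gb_per_s, weights_gb):
    return hbm_gb_per_s / weights_gb

fp16_tps = decode_tps_per_stream(3350, 140)   # 70B at fp16 on H100
awq_tps = decode_tps_per_stream(3350, 35)     # same model, AWQ 4-bit
```

Quantization's 4x throughput gain here falls straight out of the formula: same bandwidth, a quarter of the bytes per token.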

When does it make sense to self-host vs use a frontier API?

Self-host wins on three axes: data residency (regulated data that can't leave your VPC), unit economics at scale (~>1B tokens/month is the rough crossover where API spend exceeds GPU rental), and customization (you've fine-tuned and need that exact checkpoint). Frontier API wins on quality (Sonnet/GPT-5 still beat any open model on hard reasoning), capacity elasticity (no GPU waitlists), and ops burden (no node failures at 3am). Most teams should start on the API and only self-host the specific high-volume, low-complexity paths once the bill justifies the engineering investment.

What metrics do you watch in production vLLM?

From the /metrics endpoint: vllm:num_requests_running and vllm:num_requests_waiting tell you queue pressure; vllm:gpu_cache_usage_perc warns you before the KV cache OOMs and starts preempting; vllm:time_to_first_token_seconds P99 catches prefill regressions; vllm:time_per_output_token_seconds catches decode regressions. Layer on standard infra metrics (GPU util, HBM util, NCCL errors on multi-GPU). Set an alert on KV cache >90% — that's where tail latency explodes.
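As a sketch of wiring up the KV-cache alert, here is a minimal parser for the Prometheus text format that /metrics emits, run against a hand-written sample scrape (the metric names are the real ones above; the sample values are invented, and in production you would fetch the live endpoint instead):

```python
# Minimal parser for the Prometheus text exposition format, plus the
# KV-cache >90% alert. SAMPLE stands in for a real scrape of
# http://host:8000/metrics.

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="llama"} 12.0
vllm:num_requests_waiting{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.93
"""

def parse_metrics(text):
    """Return {metric_name: value}, ignoring comments and labels."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        name = name_and_labels.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
kv_alert = m["vllm:gpu_cache_usage_perc"] > 0.90   # the >90% alert
```

A real deployment would let Prometheus scrape the endpoint directly and express the same threshold as an alerting rule; the point here is only what the raw data looks like.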

How do you safely roll out a new quantized model?

Quantization is "near-lossless on average" — your specific task may regress more. The rollout plan: run your task-specific eval suite (RAGAS for RAG, exact-match for extraction, etc.) on the quantized model and compare to fp16 baseline; ship to a canary serving 1–5% of traffic with side-by-side logging; compare latency and quality metrics for 24–48 hours; ramp gradually. Cap request size at the gateway because a single 100k-token prompt can trash the KV cache for everyone; reject pathological inputs upstream. Always keep the fp16 model warm on a fallback pod for instant rollback.
