Week 11: vLLM High-Throughput Inference Optimization

Phase 4 | Week 11 | Advanced | Lecture: 2026-05-12

Theory

Learning Objectives

Concepts

Explain in one sentence the bottleneck each optimization solves: PagedAttention, prefix caching, chunked prefill, speculative decoding, and disaggregated prefill.

Design

Diagnose a workload (chat / batch / RAG / agent) and decide, in a decision table, which option to enable first and why.

Implementation

Bring up single-GPU, multi-GPU, and MIG-isolated servers with vllm serve, capture latency and throughput from an OpenAI-compatible client, and implement an admission-control queue in code.

Operations

Expose the five core indicators (TTFT, TPOT, queue time, cache hit ratio, error rate) to Prometheus/Grafana and report quantitative differences before and after each tuning change.

The inference server is the OS of the model

Deploying an LLM is not the same as starting the model file. The serving layer is what actually defines the user-visible quality.

Operational problem	What the inference server does
GPU memory pressure	KV cache management, quantization, tensor parallelism
Request latency	continuous batching, chunked prefill, speculative decoding
Repeated prompt cost	automatic prefix caching
Long-input flooding	prefill/decode separation, max-context policy
Failure analysis	metrics, traces, request logs, token accounting

vLLM wraps these problems behind an OpenAI-compatible API so the agent harness keeps the same interface even when models are swapped.

PagedAttention: vLLM’s starting point

Traditional inference reserves one contiguous KV-cache region per request, sized for the maximum sequence length. When request lengths differ, the unused regions become dead memory; when long inputs mix with short ones, fragmentation grows quickly. vLLM’s PagedAttention treats KV cache like OS virtual memory: it splits the cache into fixed-size blocks and links them as needed.

Traditional: [KV cache: reserve max length] → short requests still pay full cost

PagedAttention: [block] [block] [block] → link only the blocks you need

Memory mapping at a glance

PagedAttention KV Cache Mapping

Logical KV cache (per-sequence)

seq A: t0..t31, two blocks (block 0, 1)

seq B: t0..t47, three blocks (block 2, 3, 4)

seq C: t0..t15, one block (block 5)

Physical GPU memory blocks (16-token pages)

block 0 → seq A

block 1 → seq A

block 2 → seq B

block 3 → seq B

block 4 → seq B

block 5 → seq C

block 6 → free

A page table maps each sequence's logical blocks to whichever physical blocks are free. External fragmentation effectively disappears because new requests can take any free block; the only waste left is the unfilled tail of each sequence's last block.
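The page-table idea can be sketched in a few lines of Python. This is a toy model for intuition only, not vLLM's implementation: the `BlockAllocator` class and its method names are hypothetical, and the 16-token block size is taken from the diagram.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical block, as in the diagram


@dataclass
class BlockAllocator:
    """Toy page table: maps each sequence id to a list of physical block ids."""
    num_blocks: int
    free: list[int] = field(init=False)
    table: dict[str, list[int]] = field(default_factory=dict)
    lengths: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def append_tokens(self, seq: str, n: int) -> None:
        """Grow a sequence by n tokens, taking free blocks only when needed."""
        blocks = self.table.get(seq, [])
        new_len = self.lengths.get(seq, 0) + n
        needed = -(-new_len // BLOCK_SIZE) - len(blocks)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("no free KV blocks; the request has to wait")
        # Any free block will do: physical blocks need not be contiguous.
        self.table[seq] = blocks + [self.free.pop(0) for _ in range(needed)]
        self.lengths[seq] = new_len

    def release(self, seq: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.table.pop(seq, []))
        self.lengths.pop(seq, None)
```

Appending 32, 48, and 10 tokens to sequences A, B, and C on a 7-block pool yields blocks 0-1, 2-4, and 5, with block 6 left free. A sequence pays for a new block only when it crosses a 16-token boundary.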

PagedAttention is the foundation. The 2026 operations question is not “do we use PagedAttention” but which optimization to enable for which workload.

Five operational levers

Automatic Prefix Caching

Reuses the KV cache of repeated system prompts, tool schemas, and AGENTS.md/CLAUDE.md prefixes. Effective for workloads with long static prefixes such as Ralph loops.

Chunked Prefill

Splits long prefills into small chunks and interleaves them with decode requests, preventing one long input from blocking the queue.

Speculative Decoding

A small draft model or n-gram speculation proposes candidate tokens that the large model verifies in a single pass. The goal is lower per-token latency (TPOT).

Disaggregated Prefill

Separates prefill and decode into different workers/GPUs. Helpful for production workloads mixing long inputs with short outputs.

Structured Outputs

Enforces JSON schema, tool calls, and reasoning outputs at the serving layer, reducing parser failures in the agent harness.
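Of the five levers, chunked prefill is the easiest to see in a toy scheduler. The sketch below is a simplification under stated assumptions: every scheduler step costs the same time, each step can process at most `token_budget` tokens, and there is one long prefill competing with one short decode request. The function name and parameters are hypothetical.

```python
def steps_until_short_done(long_prefill: int, short_decode: int,
                           token_budget: int, chunked: bool) -> int:
    """Return the scheduler step at which the short request emits its last token."""
    remaining_prefill = long_prefill
    remaining_decode = short_decode
    step = 0
    while remaining_decode > 0:
        step += 1
        budget = token_budget
        # Without chunking, the long prefill owns the engine until it completes.
        # With chunking, the decode request gets its one token every step and
        # the prefill only fills whatever budget is left.
        if chunked or remaining_prefill == 0:
            remaining_decode -= 1
            budget -= 1
        remaining_prefill = max(0, remaining_prefill - budget)
    return step
```

With an 8192-token prefill, a 16-token decode, and a 2048-token budget, the short request finishes at step 20 without chunking and at step 16 with it. The gap is the head-of-line blocking that chunked prefill removes.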

A comparison of the five levers

Lever	Primary effect	Biggest risk	Indicator
Prefix Caching	lower TTFT, less prefill cost	hit rate ~0% if prompts vary slightly	cache hit ratio, TTFT
Chunked Prefill	mitigates head-of-line blocking from long inputs	overhead grows with overly small chunks	p95 latency, queue time
Speculative Decoding	lower average TPOT	weak draft model causes rejects and slowdown	acceptance rate, TPOT
Disaggregated Prefill	higher throughput on mixed long/short workloads	extra inter-node communication cost	throughput, network util
Structured Outputs	fewer parser failures	some schemas hurt sampling quality	invalid JSON rate, score

Prefix caching scenarios

Scenario	Static prefix	Dynamic part	Expected hit	TTFT impact
Ralph loop with the same PROMPT.md	long (3-8K)	small turn diff	70-95%	very large
Multi-tenant chatbot with shared system prompt	medium (500-2K)	per-user message	30-60%	moderate
RAG with new chunks each time	tiny	most of the prompt	0-15%	almost none
Repeated analysis on the same codebase	very long (10K+)	only the diff	80-99%	very large
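The hit rates in the table follow from how block-level prefix reuse works in general: a cached block is only reusable if every token before it is identical. A minimal sketch of that idea, assuming 16-token blocks and a hash chain over blocks (the function names are hypothetical, and this is not a copy of vLLM's internal hashing):

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size for this sketch


def block_keys(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with the hash of everything before it."""
    keys: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_tokens(cached: set[str], token_ids: list[int]) -> int:
    """Count leading tokens whose blocks are already in the cache."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cached:
            break  # the chain is broken; nothing after this point can match
        hits += 1
    return hits * BLOCK_SIZE
```

Changing a single token at the start of the prompt (a timestamp, a request id) invalidates every block after it, which is why prompt assembly should put static content first and per-turn content last.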

Speculative decoding sequence diagram

Speculative Decoding Sequence

Participants: Client, Target (large), Draft (small)

① Client → Target: prompt

Repeat ② to ④ until a stop token:

② Target → Draft: speculate next k tokens

③ Draft → Target: candidate tokens [t1..tk]

④ Target verifies all k tokens (one forward pass)

If all k are accepted: Target → Client: emit k tokens

If the first mismatch is at position j: emit the accepted prefix [t1..t(j-1)] + 1 corrected token

The more accurate the draft, the more tokens are accepted at once and the lower the TPOT. A poor draft turns verification into pure overhead and can make things slower.
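The accept/reject step can be sketched for the greedy (temperature 0) case. This is an illustration under assumptions, not vLLM code: `target_next` is a hypothetical stand-in for the large model's next-token choice, and the per-position calls below would be a single batched forward pass in a real server. Standard speculative decoding also yields one extra target token when every draft token is accepted, which the sketch includes.

```python
from typing import Callable, Sequence


def verify_draft(context: list[int], draft: Sequence[int],
                 target_next: Callable[[list[int]], int]) -> list[int]:
    """Greedy verification: keep draft tokens while they match the target.

    Returns the tokens to emit: the accepted prefix of the draft followed by
    one target token (a correction on mismatch, a bonus token on full accept).
    """
    emitted: list[int] = []
    for tok in draft:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)  # replace the first wrong draft token
            return emitted
        emitted.append(tok)
    emitted.append(target_next(context + emitted))  # all k accepted
    return emitted
```

Each verification step emits at least one token, so a bad draft never produces wrong output. It only wastes the draft and verification work, which is the slowdown described above.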

Disaggregated prefill architecture

Disaggregated Prefill Architecture

Frontend (Load Balancer): request ingress, response egress

▼ fan out to prefill

Prefill Worker 1: long-input KV generation

Prefill Worker 2: long-input KV generation

▼ KV cache handoff

KV Cache Store: short-lived cache, fast lookup for decode workers

▼ route to decode

Decode Worker 1: short-output token generation

Decode Worker 2: short-output token generation

Decode Worker 3: short-output token generation

▲ responses back to LB

Long inputs land on prefill workers; only the KV cache is forwarded to decode workers. This raises utilization in production where long inputs and short outputs mix.

Choosing optimizations by workload

Turning everything on at once is not optimization. Diagnose the bottleneck first, then enable one lever at a time.

Workload	Primary bottleneck	First lever to enable	What to measure
Long repository context + small edits	prefill cost	prefix caching, chunked prefill	TTFT, cache hit ratio
Many short queries	batching efficiency	continuous batching, tuned max_num_seqs	throughput, queue time
Mixed long-doc summarization and short coding	long prefill blocking	chunked prefill, disaggregated prefill	p95 latency, queue time
Interactive coding assistant	per-token decode latency	speculative decoding, smaller max_tokens	TPOT, perceived latency
Strict JSON / tool output	parser failure	structured outputs, tool parser settings	invalid JSON rate
Ralph-loop capstone	repeated prefill	prefix caching + chunked prefill	TTFT, cache hit

Lab reports must capture not “we enabled X” but how the bottleneck indicator changed before and after.
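A before/after comparison can be as small as the following sketch. The helper names are hypothetical, and the nearest-rank percentile is one reasonable choice among several; the samples would be per-request TTFT or queue-time measurements collected under the same load with one lever flipped.

```python
import math


def p95(samples: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def compare(before: list[float], after: list[float]) -> dict:
    """Summarize how the p95 of one indicator moved after a tuning change."""
    b, a = p95(before), p95(after)
    return {
        "before_p95": b,
        "after_p95": a,
        "change_pct": round(100 * (a - b) / b, 1),  # negative means faster
    }
```

Report one such row per indicator and per lever, so a reviewer can tell which change moved which number.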

Example vLLM server configurations

vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000

vllm serve Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --port 8000

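# --enable-auto-tool-choice must be paired with --tool-call-parser.
# <parser-name> is a placeholder: use the parser that matches the model's tool-call format.
vllm serve zai-org/GLM-5.1 \
  --served-model-name glm-5.1 \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser <parser-name> \
  --port 8000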

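# Recent vLLM releases take speculative settings as one JSON --speculative-config value;
# older releases used --speculative-model and --num-speculative-tokens instead.
# The draft model must share the target's tokenizer, so verify this pairing before use.
vllm serve Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder \
  --speculative-config '{"model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", "num_speculative_tokens": 5}' \
  --port 8000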

Throughput and latency are read together

A common tuning mistake is to look only at tokens/sec. Agent systems require the following indicators side by side.

Indicator	Meaning	Bad signal
TTFT (Time to First Token)	latency to the first response token	long prefills blocking the queue
TPOT (Time per Output Token)	inter-token interval	low decode-stage GPU utilization
Throughput	tokens generated per second	poor batching or memory fragmentation
Queue time	how long requests waited	excess concurrency, missing admission control
Cache hit ratio	prefix reuse rate	prompt assembly differs each turn
Error rate	timeouts or failures	max_model_len, OOM, parser failure

Prometheus/Grafana panel examples

vLLM exposes /metrics by default. Build the headline panels for the capstone with the following PromQL.

# Panel 1: TTFT p95 by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

# Panel 2: throughput (tokens/sec)
sum by (model_name) (rate(vllm:generation_tokens_total[1m]))

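# Panel 3: prefix cache hit ratio
# (counter names differ between vLLM versions; confirm them in your server's /metrics output)
sum(rate(vllm:prefix_cache_hits_total[5m]))
  /
sum(rate(vllm:prefix_cache_queries_total[5m]))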

# Panel 4: queue depth (active vs waiting)
vllm:num_requests_running
vllm:num_requests_waiting

Annotate each option-flip event on the panel so reviewers can see the before/after directly.
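When Grafana is not set up yet, the same ratio can be read straight from the /metrics text. The sketch below parses the Prometheus text exposition format and assumes samples carry no trailing timestamp. The function names are hypothetical, and the counter names are passed in as arguments because they depend on the vLLM version.

```python
def sum_metric(exposition: str, name: str) -> float:
    """Sum all samples of one metric across label sets in Prometheus text format."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def hit_ratio(exposition: str, hits: str, queries: str) -> float:
    """Cumulative hit ratio since server start; 0.0 when nothing was queried."""
    q = sum_metric(exposition, queries)
    return sum_metric(exposition, hits) / q if q else 0.0
```

The counters are cumulative, so for a before/after comparison take two scrapes around the test run and compute the ratio of the differences rather than the ratio of the totals.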

Admission control and context budget

Accepting more requests when the server slows down breaks every team’s lab. Production serving makes a decision before each request is admitted.

Control	Example policy
Max input length	course-level cap below `max_model_len`
Concurrent requests	per-team queue plus a global queue
Priority	demo and replay outrank experimental batches
Timeout	log prefill timeout and decode timeout separately
Fallback	reroute to a commercial API or a smaller model on failure

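# admission.py: per-team concurrency limit and prompt-token budget
import asyncio
from dataclasses import dataclass

@dataclass
class Budget:
    max_concurrent: int = 4         # requests a team may have in flight
    max_prompt_tokens: int = 16000  # longer prompts are rejected outright
    timeout_s: float = 60.0         # longest wait for a free slot

# One semaphore per team, created on first use.
team_locks: dict[str, asyncio.Semaphore] = {}

async def admit(team: str, prompt_tokens: int, budget: Budget) -> bool:
    """Return True once the team holds a slot, False if the request is rejected."""
    if prompt_tokens > budget.max_prompt_tokens:
        return False
    if team not in team_locks:
        team_locks[team] = asyncio.Semaphore(budget.max_concurrent)
    try:
        await asyncio.wait_for(team_locks[team].acquire(), timeout=budget.timeout_s)
        return True
    except asyncio.TimeoutError:
        return False

async def release(team: str) -> None:
    """Give the slot back; call exactly once for every admit() that returned True."""
    team_locks[team].release()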

In Agent OS Runtime terms, admission control is also a policy gate. A rejection is not always a failure — it can be a successful boundary-protection event.
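One way to make rejections visible as events is to wrap the same semaphore pattern in a context manager that counts them by reason. This is a sketch under assumptions: `TeamGate`, `AdmissionRejected`, and the reason strings are hypothetical names, and the counters live in a plain dict rather than a real metrics client.

```python
import asyncio
from contextlib import asynccontextmanager


class AdmissionRejected(Exception):
    """Raised when a request is turned away at the gate."""


class TeamGate:
    def __init__(self, max_concurrent: int, max_prompt_tokens: int, timeout_s: float):
        self.max_concurrent = max_concurrent
        self.max_prompt_tokens = max_prompt_tokens
        self.timeout_s = timeout_s
        self._sems: dict[str, asyncio.Semaphore] = {}
        self.rejected: dict[str, int] = {}  # reason -> count

    def _reject(self, reason: str) -> None:
        self.rejected[reason] = self.rejected.get(reason, 0) + 1
        raise AdmissionRejected(reason)

    @asynccontextmanager
    async def slot(self, team: str, prompt_tokens: int):
        """Hold one of the team's slots for the duration of the with-block."""
        if prompt_tokens > self.max_prompt_tokens:
            self._reject("prompt_too_long")
        if team not in self._sems:
            self._sems[team] = asyncio.Semaphore(self.max_concurrent)
        sem = self._sems[team]
        try:
            await asyncio.wait_for(sem.acquire(), timeout=self.timeout_s)
        except asyncio.TimeoutError:
            self._reject("queue_timeout")
        try:
            yield
        finally:
            sem.release()  # released even if the request itself fails
```

A caller writes `async with gate.slot(team, prompt_tokens):` around the model call and treats `AdmissionRejected` as a logged policy outcome. Because the release sits in a `finally` block, a crashed request cannot leak a slot.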

Multi-tenant serving design

In a DGX H100 lab shared by student teams, isolating resources beats a single shared model.

# Team A: 3g.40gb slice, practice model (about 16B parameters, roughly 32 GB in BF16, so a 1g.10gb slice is too small)
CUDA_VISIBLE_DEVICES=MIG-GPU-a vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name team-a-coder \
  --max-model-len 16384 \
  --port 8001

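# Team B: larger model (about 30B parameters, roughly 60 GB in BF16): use a 7g.80gb instance, or a quantized checkpoint on a 3g.40gb slice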
CUDA_VISIBLE_DEVICES=MIG-GPU-b vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --served-model-name team-b-coder \
  --max-model-len 32768 \
  --port 8002
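Before assigning a model to a MIG slice, a back-of-envelope check helps. The sketch below only estimates the weight footprint (parameters times bytes per parameter) and reserves a fixed allowance for KV cache. The function names and the 2 GB `min_kv_gb` default are assumptions for illustration, and activations and runtime overhead are ignored.

```python
def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1e9 bytes); BF16/FP16 use 2 bytes."""
    return params_billion * bytes_per_param


def fits(params_billion: float, slice_gb: float,
         gpu_memory_utilization: float = 0.90,
         bytes_per_param: float = 2.0,
         min_kv_gb: float = 2.0) -> bool:
    """True if weights plus a minimal KV-cache allowance fit the usable slice."""
    usable = slice_gb * gpu_memory_utilization
    return weight_gb(params_billion, bytes_per_param) + min_kv_gb <= usable
```

By this estimate a 16B-parameter model in BF16 does not fit a 10 GB slice but does fit a 40 GB one, and a 30B-parameter model needs either an 80 GB instance or roughly 1 byte per parameter (8-bit quantization) to fit in 40 GB. The slices also leave little room for long contexts, so treat a pass here as necessary, not sufficient.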

Observability first

vLLM ships Prometheus/Grafana and OpenTelemetry examples. Week 12 covers them in depth, but Week 11 labs should already log at minimum:

{
  "request_id": "run-20260512-001",
  "model": "local-coder",
  "prompt_tokens": 1842,
  "completion_tokens": 419,
  "ttft_ms": 820,
  "tpot_ms": 34,
  "cache_hit": true,
  "finish_reason": "stop"
}

This format aligns with the Agent OS Runtime .events.jsonl. The serving layer records tokens and latency; the agent runtime records tool calls, approvals, test results, and replay state.
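A minimal way to produce such records is one JSON object per line. The sketch below mirrors the fields of the example record; the `RequestEvent` and `write_event` names are hypothetical, and the target stream would typically be a file opened in append mode.

```python
import json
from dataclasses import asdict, dataclass
from typing import TextIO


@dataclass
class RequestEvent:
    request_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    ttft_ms: float
    tpot_ms: float
    cache_hit: bool
    finish_reason: str


def write_event(stream: TextIO, event: RequestEvent) -> None:
    """Append one event as a single JSON line (JSONL)."""
    stream.write(json.dumps(asdict(event)) + "\n")
```

Keeping one record per line means a crashed run still leaves every completed request readable, and the file can be filtered with line-oriented tools.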

Practicum

Bring up the baseline server

vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --max-model-len 32768 \
  --port 8000

Compare prefix caching on/off

Run 20 requests sharing the same system prompt and repository summary. Compare TTFT, throughput, and cache hit ratio with prefix caching enabled (--enable-prefix-caching) and disabled (--no-enable-prefix-caching on releases where it is on by default).

Compare chunked prefill on/off

Mix long file summarization (20K+ tokens) with short code-generation requests. Measure how much the long requests delay the short ones.

Write an OpenAI-compatible client

from openai import OpenAI
import time

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

started = time.perf_counter()
response = client.chat.completions.create(
    model="local-coder",
    messages=[{"role": "user", "content": "Write a sample pytest fixture."}],
    temperature=0.2,
)
elapsed = time.perf_counter() - started
print(elapsed, response.choices[0].message.content[:200])
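The client above only measures total elapsed time. To separate TTFT from TPOT, the response has to be streamed and each chunk timed. The helper below is a sketch: `measure_stream` is a hypothetical name, and it counts streamed chunks, which only approximates tokens because a chunk may carry more than one token. It takes any iterable of text pieces, so it can be tested without a server.

```python
import time
from typing import Callable, Iterable


def measure_stream(pieces: Iterable[str], started: float,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a stream of text pieces; `started` is the clock value taken
    just before the request was sent."""
    first = last = None
    count = 0
    for piece in pieces:
        if not piece:
            continue  # skip empty deltas such as the initial role-only chunk
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return {"ttft_s": None, "tpot_s": None, "chunks": 0}
    tpot = (last - first) / (count - 1) if count > 1 else None
    return {"ttft_s": first - started, "tpot_s": tpot, "chunks": count}


# Usage with the OpenAI client from the example above (requires a running server):
# started = time.perf_counter()
# stream = client.chat.completions.create(
#     model="local-coder",
#     messages=[{"role": "user", "content": "Write a sample pytest fixture."}],
#     stream=True,
# )
# texts = (c.choices[0].delta.content or "" for c in stream if c.choices)
# print(measure_stream(texts, started))
```

Taking the start time before `create` is called matters: with streaming, the call returns once the response begins, so a clock started afterwards would hide queue time and most of the prefill.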

Add an admission-control queue

Apply the admission.py pattern to limit per-team concurrency. Count rejected requests in a separate metric (admission.rejected).

Prepare the dashboard

Log ttft_ms, tokens/sec, error_rate, and cache_hit to Prometheus/Grafana or a simple CSV. Week 12 wires these values to OpenTelemetry traces.

Operations checklist

Item	Pass criterion
Model loading	cold start and warm restart times recorded
API compatibility	OpenAI client can complete a chat call
Stability	at most 2 timeouts/OOMs out of 100 requests
Cost	comparison table of GPU-hour cost vs. API cost
Observability	at least request_id, model, tokens, latency logged
Security	external network/file access permissions separated
Admission control	per-team queue isolated, rejection events logged

Looking ahead

Week 12 builds telemetry, an event store, and the LLM-as-Judge gate on top of this vLLM server. Week 11 is “how to generate fast.” Week 12 is “how to run the generated results so you can trust them.”

Key Takeaways

The inference server is the model’s OS: model weights alone do not produce a service. KV cache, batching, queues, and observability must be co-designed.
PagedAttention is the floor: it nearly eliminates fragmentation, but the real operational decision is which optimization sits on top.
Evaluate the five levers separately: prefix caching, chunked prefill, speculative decoding, disaggregated prefill, and structured outputs each address a different bottleneck.
Diagnose the workload first: do not flip switches blindly — measure TTFT, TPOT, queue time, and cache hit before tuning.
Throughput alone is a trap: the agent UX is governed by TTFT, p95 latency, and queue time.
Admission control is a policy gate: a rejected request can be a successful boundary-protection event rather than a failure.
Multi-tenancy is solved by boundaries: MIG slices, separate ports, and per-team model names keep one team’s mistake from breaking the lab.