Concepts
Explain in one sentence the bottleneck each optimization solves: PagedAttention, prefix caching, chunked prefill, speculative decoding, and disaggregated prefill.
Design
Diagnose a workload (chat / batch / RAG / agent) and decide, in a decision table, which option to enable first and why.
Implementation
Bring up single-GPU, multi-GPU, and MIG-isolated servers with vllm serve, capture latency and throughput from an OpenAI-compatible client, and implement an admission-control queue in code.
Operations
Expose the five core indicators (TTFT, TPOT, queue time, cache hit ratio, error rate) to Prometheus/Grafana and report quantitative differences before and after each tuning change.
Deploying an LLM is more than loading a model file. The serving layer determines the quality users actually see.
| Operational problem | What the inference server does |
|---|---|
| GPU memory pressure | KV cache management, quantization, tensor parallelism |
| Request latency | continuous batching, chunked prefill, speculative decoding |
| Repeated prompt cost | automatic prefix caching |
| Long-input flooding | prefill/decode separation, max-context policy |
| Failure analysis | metrics, traces, request logs, token accounting |
vLLM handles these problems behind an OpenAI-compatible API, so the agent harness keeps the same interface even when models are swapped.
Traditional inference reserves a large contiguous block of KV cache. When request lengths differ, the unused regions become dead memory; when long inputs mix with short ones, fragmentation grows quickly. vLLM’s PagedAttention treats KV cache like OS virtual memory: it splits the cache into blocks and links them as needed.
A page table maps logical sequences to whichever physical blocks are free. Fragmentation effectively disappears because new requests can grab any free block.
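The page-table idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only, not vLLM's implementation; the class name `BlockAllocator`, the block size of 16 tokens, and the pool of 8 blocks are all assumptions chosen for the example.

```python
# Toy illustration of paged KV-cache bookkeeping (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration


class BlockAllocator:
    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))       # free physical block ids
        self.tables: dict[str, list[int]] = {}    # sequence id -> block table
        self.lengths: dict[str, int] = {}         # sequence id -> token count

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Grow a sequence by n tokens, taking free blocks only when needed."""
        table = self.tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n
        needed = -(-new_len // BLOCK_SIZE)  # ceiling division
        if needed - len(table) > len(self.free):
            raise MemoryError("no free KV blocks")
        while len(table) < needed:
            table.append(self.free.pop())   # any free block will do
        self.lengths[seq_id] = new_len

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8)
alloc.append_tokens("a", 20)   # needs 2 blocks
alloc.append_tokens("b", 5)    # needs 1 block
alloc.release("a")             # both blocks return to the pool
alloc.append_tokens("c", 40)   # needs 3 blocks, scattered anywhere in the pool
print(alloc.tables)
```

Because a sequence's blocks do not have to be contiguous, the blocks released by `"a"` are immediately usable by `"c"`, which is the property the paragraph above describes.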
PagedAttention is the foundation. The 2026 operations question is not “do we use PagedAttention” but which optimization to enable for which workload.
Automatic Prefix Caching
Reuses the KV cache of repeated system prompts, tool schemas, and AGENTS.md/CLAUDE.md prefixes. Effective for workloads with long static prefixes such as Ralph loops.
Chunked Prefill
Splits long prefills into small chunks and interleaves them with decode requests, preventing one long input from blocking the queue.
Speculative Decoding
A small draft model or n-gram speculation proposes candidate tokens that the large model verifies. The goal is lower latency.
Disaggregated Prefill
Separates prefill and decode into different workers/GPUs. Helpful for production workloads mixing long inputs with short outputs.
Structured Outputs
Enforces JSON schema, tool calls, and reasoning outputs at the serving layer, reducing parser failures in the agent harness.
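Chunked prefill, described above, can be pictured as a per-step token budget: decode tokens are scheduled first, and whatever budget remains goes to the next slice of a long prefill. The sketch below is a simplified model under that assumption, not vLLM's scheduler; `schedule_steps` and its parameters are hypothetical names, and `token_budget` only plays a role similar to a per-step batched-token limit.

```python
def schedule_steps(prefill_tokens: int, decode_seqs: int, token_budget: int) -> list[dict]:
    """Plan per-step batches: decode tokens first, leftover budget goes to prefill."""
    if token_budget <= decode_seqs:
        raise ValueError("token_budget must exceed the number of decoding sequences")
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        # Each running sequence decodes one token per step; the rest is prefill.
        chunk = min(remaining, token_budget - decode_seqs)
        steps.append({"decode_tokens": decode_seqs, "prefill_chunk": chunk})
        remaining -= chunk
    return steps


# A 20K-token prompt arrives while 8 sequences are already decoding.
plan = schedule_steps(prefill_tokens=20000, decode_seqs=8, token_budget=2048)
print(len(plan), plan[0], plan[-1])
```

The running sequences keep producing a token every step while the long prompt is absorbed over several steps, instead of stalling until one monolithic prefill finishes.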
| Lever | Primary effect | Biggest risk | Indicator |
|---|---|---|---|
| Prefix Caching | lower TTFT, less prefill cost | hit rate ~0% if prompts vary slightly | cache hit ratio, TTFT |
| Chunked Prefill | mitigates head-of-line blocking from long inputs | overhead grows with overly small chunks | p95 latency, queue time |
| Speculative Decoding | lower average TPOT | weak draft model causes rejects and slowdown | acceptance rate, TPOT |
| Disaggregated Prefill | higher throughput on mixed long/short workloads | extra inter-node communication cost | throughput, network util |
| Structured Outputs | fewer parser failures | some schemas hurt sampling quality | invalid JSON rate, score |
| Scenario | Static prefix | Dynamic part | Expected hit | TTFT impact |
|---|---|---|---|---|
| Ralph loop with the same PROMPT.md | long (3-8K) | small turn diff | 70-95% | very large |
| Multi-tenant chatbot with shared system prompt | medium (500-2K) | per-user message | 30-60% | moderate |
| RAG with new chunks each time | tiny | most of the prompt | 0-15% | almost none |
| Repeated analysis on the same codebase | very long (10K+) | only the diff | 80-99% | very large |
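Why the hit rate depends so strongly on where the prompt changes can be shown with a toy block-hash cache. This is a sketch assuming each block's hash covers the block and everything before it; the block size of 4 tokens and the helper names `block_hashes` and `cached_tokens` are illustrative, not vLLM's API.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny on purpose for the demo


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with the hash of everything before it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


def cached_tokens(prompt: list[int], cache: set[str]) -> int:
    """Count the leading tokens whose blocks are already in the cache."""
    hit = 0
    for h in block_hashes(prompt):
        if h not in cache:
            break
        hit += BLOCK_SIZE
    return hit


first = list(range(100, 112)) + [1, 2, 3, 4]        # 12-token static prefix + turn
cache = set(block_hashes(first))

same_prefix = list(range(100, 112)) + [9, 9, 9, 9]  # only the tail differs
changed_head = [0] + list(range(101, 112)) + [1, 2, 3, 4]  # first token differs

print(cached_tokens(same_prefix, cache))   # prefix blocks are reused
print(cached_tokens(changed_head, cache))  # nothing is reused
```

A change at the end of the prompt keeps every earlier block reusable, while a change at the very start (a timestamp or a reordered tool list, for example) invalidates all blocks after it. Keeping the static part first and byte-identical is what produces the high hit rates in the table.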
The more accurate the draft, the more tokens are accepted at once and the lower the TPOT. A poor draft turns verification into pure overhead and can make things slower.
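The relationship between draft accuracy and accepted tokens can be made concrete with the standard expected-value formula from the speculative decoding literature. The sketch assumes every draft token is accepted independently with the same probability, which real workloads only approximate; `expected_tokens_per_step` is a hypothetical helper name.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the large model.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; the pass always yields at least one token.
    """
    a, k = acceptance_rate, num_draft_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if a == 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


for rate in (0.3, 0.6, 0.8):
    print(rate, round(expected_tokens_per_step(rate, num_draft_tokens=5), 2))
```

With five draft tokens, an acceptance rate of 0.8 yields roughly 3.7 tokens per verification pass, while 0.3 yields about 1.4. The second case barely beats ordinary decoding, and once the draft model's own cost is added it can end up slower.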
Long inputs land on prefill workers; only the KV cache is forwarded to decode workers. This raises utilization in production where long inputs and short outputs mix.
Turning everything on at once is not optimization. Diagnose the bottleneck first, then enable one lever at a time.
| Workload | Primary bottleneck | First lever to enable | What to measure |
|---|---|---|---|
| Long repository context + small edits | prefill cost | prefix caching, chunked prefill | TTFT, cache hit ratio |
| Many short queries | batching efficiency | continuous batching, tuned max_num_seqs | throughput, queue time |
| Mixed long-doc summarization and short coding | long prefill blocking | chunked prefill, disaggregated prefill | p95 latency, queue time |
| Interactive coding assistant | first-token delay | speculative decoding, smaller max_tokens | TTFT, perceived latency |
| Strict JSON / tool output | parser failure | structured outputs, tool parser settings | invalid JSON rate |
| Ralph-loop capstone | repeated prefill | prefix caching + chunked prefill | TTFT, cache hit |
Lab reports must capture not “we enabled X” but how the bottleneck indicator changed before and after.
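One way to report a before/after change is to compare the p95 of the bottleneck indicator across two runs. The helper below is a minimal sketch using only the standard library; the function names and the sample TTFT values are made up for illustration and are not measurements.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile using the standard library's inclusive method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """Summarize how an indicator's p95 moved between two runs."""
    b, a = p95(before), p95(after)
    return {"before_p95": b, "after_p95": a, "change_pct": (a - b) / b * 100}


# Illustrative TTFT samples in milliseconds (invented numbers).
ttft_caching_off = [820, 790, 910, 1200, 850, 880, 940, 805, 990, 870]
ttft_caching_on = [310, 295, 340, 620, 300, 330, 315, 290, 410, 305]

report = compare(ttft_caching_off, ttft_caching_on)
print({k: round(v, 1) for k, v in report.items()})
```

A negative `change_pct` on a latency indicator means the lever helped. Reporting the percentile rather than the mean keeps a few slow requests from being averaged away.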
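```bash
# Single GPU with prefix caching
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000
```

```bash
# Two GPUs (tensor parallel) with prefix caching and chunked prefill
vllm serve Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --port 8000
```

```bash
# Four GPUs with automatic tool choice
# (vLLM normally also needs a matching --tool-call-parser; the value depends on the model)
vllm serve zai-org/GLM-5.1 \
  --served-model-name glm-5.1 \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --port 8000
```

```bash
# Speculative decoding with a smaller draft model
# (flag names depend on the vLLM version; recent releases use --speculative-config instead)
vllm serve Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder \
  --speculative-model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --num-speculative-tokens 5 \
  --port 8000
```

A common tuning mistake is to look only at tokens/sec. Agent systems need the following indicators side by side.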
| Indicator | Meaning | Bad signal |
|---|---|---|
| TTFT (Time to First Token) | latency to the first response token | long prefills blocking the queue |
| TPOT (Time per Output Token) | inter-token interval | low decode-stage GPU utilization |
| Throughput | tokens generated per second | poor batching or memory fragmentation |
| Queue time | how long requests waited | excess concurrency, missing admission control |
| Cache hit ratio | prefix reuse rate | prompt assembly differs each turn |
| Error rate | timeouts or failures | max_model_len, OOM, parser failure |
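The definitions in the table translate directly into arithmetic over token arrival times. The function below is a sketch assuming the client records a timestamp when the request is sent and one per streamed token; `latency_metrics` and the sample timestamps are illustrative.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, TPOT, and per-request throughput from arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    n = len(token_times)
    ttft = token_times[0] - request_sent
    # TPOT is the mean gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_sent
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": n / total}


# Request sent at t=0; four tokens arrive over the following 0.89 seconds.
m = latency_metrics(0.0, [0.80, 0.83, 0.86, 0.89])
print({k: round(v, 3) for k, v in m.items()})
```

In this example almost all of the wall-clock time is TTFT, so a tokens/sec figure alone would hide where the delay comes from.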
vLLM exposes /metrics by default. Build the headline panels for the capstone with the following PromQL.
```promql
# Panel 1: TTFT p95 by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

# Panel 2: throughput (tokens/sec)
sum by (model_name) (rate(vllm:generation_tokens_total[1m]))

# Panel 3: prefix cache hit ratio
sum(rate(vllm:cache_hit_tokens_total[5m]))
  / sum(rate(vllm:prompt_tokens_total[5m]))

# Panel 4: queue depth (active vs waiting)
vllm:num_requests_running
vllm:num_requests_waiting
```

Annotate each option-flip event on the panel so reviewers can see the before/after directly.
Accepting more requests when the server slows down breaks every team’s lab. Production serving makes a decision before each request is admitted.
| Control | Example policy |
|---|---|
| Max input length | course-level cap below max_model_len |
| Concurrent requests | per-team queue plus a global queue |
| Priority | demo and replay outrank experimental batches |
| Timeout | log prefill timeout and decode timeout separately |
| Fallback | reroute to a commercial API or a smaller model on failure |
```python
# admission.py: per-team queue and token budget
import asyncio
from dataclasses import dataclass


@dataclass
class Budget:
    max_concurrent: int = 4
    max_prompt_tokens: int = 16000
    timeout_s: float = 60.0


# One semaphore per team caps that team's in-flight requests.
team_locks: dict[str, asyncio.Semaphore] = {}


async def admit(team: str, prompt_tokens: int, budget: Budget) -> bool:
    # Reject oversized prompts before they reach the queue.
    if prompt_tokens > budget.max_prompt_tokens:
        return False
    sem = team_locks.setdefault(team, asyncio.Semaphore(budget.max_concurrent))
    try:
        # Wait for a free slot, but give up after timeout_s.
        await asyncio.wait_for(sem.acquire(), timeout=budget.timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def release(team: str):
    # Call exactly once for every admit() that returned True.
    team_locks[team].release()
```

In Agent OS Runtime terms, admission control is also a policy gate. A rejection is not always a failure; it can be a successful boundary-protection event.
In a DGX H100 lab shared by student teams, isolating resources beats a single shared model.
```bash
# Team A: 1g.10gb slice, training-grade model
CUDA_VISIBLE_DEVICES=MIG-GPU-a vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name team-a-coder \
  --max-model-len 16384 \
  --port 8001

# Team B: 3g.40gb slice, larger model
CUDA_VISIBLE_DEVICES=MIG-GPU-b vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --served-model-name team-b-coder \
  --max-model-len 32768 \
  --port 8002
```

vLLM ships Prometheus/Grafana and OpenTelemetry examples. Week 12 covers them in depth, but Week 11 labs should already log at minimum:
```json
{
  "request_id": "run-20260512-001",
  "model": "local-coder",
  "prompt_tokens": 1842,
  "completion_tokens": 419,
  "ttft_ms": 820,
  "tpot_ms": 34,
  "cache_hit": true,
  "finish_reason": "stop"
}
```

This format aligns with the Agent OS Runtime .events.jsonl. The serving layer records tokens and latency; the agent runtime records tool calls, approvals, test results, and replay state.
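Writing one such record per line makes the log easy to append to and easy to summarize later. The sketch below assumes a JSON Lines layout with the fields shown above; `write_event` and `summarize` are hypothetical helpers, and an in-memory buffer stands in for the real log file.

```python
import io
import json


def write_event(stream, **fields) -> None:
    """Append one request record as a single JSON line."""
    stream.write(json.dumps(fields, sort_keys=True) + "\n")


def summarize(lines) -> dict[str, float]:
    """Aggregate request count, mean TTFT, and cache hit ratio from JSONL lines."""
    events = [json.loads(line) for line in lines if line.strip()]
    n = len(events)
    return {
        "requests": n,
        "mean_ttft_ms": sum(e["ttft_ms"] for e in events) / n,
        "cache_hit_ratio": sum(1 for e in events if e["cache_hit"]) / n,
    }


log = io.StringIO()  # stands in for an open log file
write_event(log, request_id="run-001", model="local-coder",
            prompt_tokens=1842, completion_tokens=419,
            ttft_ms=820, tpot_ms=34, cache_hit=True, finish_reason="stop")
write_event(log, request_id="run-002", model="local-coder",
            prompt_tokens=1850, completion_tokens=388,
            ttft_ms=400, tpot_ms=33, cache_hit=False, finish_reason="stop")

summary = summarize(log.getvalue().splitlines())
print(summary)
```

Swapping the `io.StringIO` buffer for a file opened in append mode gives a log that can be tailed during a lab run and summarized afterwards.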
Bring up the baseline server
```bash
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --max-model-len 32768 \
  --port 8000
```

Compare prefix caching on/off
Run 20 requests sharing the same system prompt and repository summary. Compare TTFT, throughput, and cache hit between --enable-prefix-caching off and on.
Compare chunked prefill on/off
Mix long file summarization (20K+ tokens) with short code-generation requests. Measure how much the long requests delay the short ones.
Write an OpenAI-compatible client
```python
from openai import OpenAI
import time

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

started = time.perf_counter()
response = client.chat.completions.create(
    model="local-coder",
    messages=[{"role": "user", "content": "Write a sample pytest fixture."}],
    temperature=0.2,
)
elapsed = time.perf_counter() - started
print(elapsed, response.choices[0].message.content[:200])
```

Add an admission-control queue
Apply the admission.py pattern to limit per-team concurrency. Count rejected requests in a separate metric (admission.rejected).
Prepare the dashboard
Log ttft_ms, tokens/sec, error_rate, and cache_hit to Prometheus/Grafana or a simple CSV. Week 12 wires these values to OpenTelemetry traces.
| Item | Pass criterion |
|---|---|
| Model loading | cold start and warm restart times recorded |
| API compatibility | OpenAI client can complete a chat call |
| Stability | at most 2 timeouts or OOMs per 100 requests |
| Cost | comparison table of GPU-hour cost vs. API cost |
| Observability | at least request_id, model, tokens, latency logged |
| Security | external network/file access permissions separated |
| Admission control | per-team queue isolated, rejection events logged |
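For the cost row, GPU-hour pricing has to be converted to the same unit as API pricing before the two can be compared. The function below is a sketch of that conversion, assuming the GPU is billed per hour and runs at a steady measured throughput; the price and throughput in the example are placeholders, not quotes or benchmarks.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Placeholder inputs: substitute your measured throughput and actual GPU rate.
cost = self_hosted_cost_per_million(gpu_hourly_usd=2.0, tokens_per_second=1000.0)
print(round(cost, 4))
```

The result only holds while the server is busy. Idle GPU hours still cost money, so the comparison table should state the utilization the throughput figure was measured at.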
Week 12 builds telemetry, an event store, and the LLM-as-Judge gate on top of this vLLM server. Week 11 is “how to generate fast.” Week 12 is “how to run the generated results so you can trust them.”