
Week 11: vLLM High-Throughput Inference Optimization

Phase 4 · Week 11 · Advanced · Lecture: 2026-05-12

Concepts

Explain in one sentence the bottleneck each optimization solves: PagedAttention, prefix caching, chunked prefill, speculative decoding, and disaggregated prefill.

Design

Diagnose a workload (chat / batch / RAG / agent) and decide, in a decision table, which option to enable first and why.

Implementation

Bring up single-GPU, multi-GPU, and MIG-isolated servers with vllm serve, capture latency and throughput from an OpenAI-compatible client, and implement an admission-control queue in code.

Operations

Expose the five core indicators (TTFT, TPOT, queue time, cache hit ratio, error rate) to Prometheus/Grafana and report quantitative differences before and after each tuning change.


The inference server is the OS of the model


Deploying an LLM is more than loading a model file. The serving layer is what determines user-visible quality.

| Operational problem | What the inference server does |
| --- | --- |
| GPU memory pressure | KV cache management, quantization, tensor parallelism |
| Request latency | continuous batching, chunked prefill, speculative decoding |
| Repeated prompt cost | automatic prefix caching |
| Long-input flooding | prefill/decode separation, max-context policy |
| Failure analysis | metrics, traces, request logs, token accounting |

vLLM handles these concerns behind an OpenAI-compatible API, so the agent harness keeps the same interface even when models are swapped.

Traditional inference reserves a large contiguous block of KV cache. When request lengths differ, the unused regions become dead memory; when long inputs mix with short ones, fragmentation grows quickly. vLLM’s PagedAttention treats KV cache like OS virtual memory: it splits the cache into blocks and links them as needed.
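
To put a number on that dead memory, the arithmetic below estimates how much KV cache one token occupies and how much of a max-length reservation a short request leaves unused. It is a rough sketch: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical chosen for round numbers, not the shape of any model named in this lecture.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per head, per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def reserved_vs_used(max_len: int, actual_len: int, per_token: int) -> tuple[int, int]:
    """Bytes reserved by a max-length allocation versus bytes actually filled."""
    return max_len * per_token, actual_len * per_token


# Hypothetical model shape, used only to make the arithmetic concrete.
PER_TOKEN = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
reserved, used = reserved_vs_used(max_len=32768, actual_len=512, per_token=PER_TOKEN)
wasted_fraction = 1 - used / reserved

print(f"{PER_TOKEN // 1024} KiB per token")
print(f"reserved {reserved / 2**30:.0f} GiB, used {used / 2**20:.0f} MiB")
print(f"{wasted_fraction:.1%} of the reservation is never touched")
```

Under these assumptions a 512-token request that reserves a 32K-token slot fills 64 MiB of a 4 GiB reservation. That gap is what block-level allocation recovers.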

Traditional: [KV cache: reserve max length] → short requests still pay full cost
PagedAttention: [block] [block] [block] → link only the blocks you need
Figure: PagedAttention KV cache mapping (schematic: block counts are not to scale; with 16-token pages, sequences of 512, 1024, and 128 tokens would span 32, 64, and 8 blocks)

Logical KV cache (per sequence):
  seq A, t0..t511: two blocks (block 0, 1)
  seq B, t0..t1023: three blocks (block 2, 3, 4)
  seq C, t0..t127: one block (block 5)

Physical GPU memory blocks (16-token pages):
  block 0 → seq A
  block 1 → seq A
  block 2 → seq B
  block 3 → seq B
  block 4 → seq B
  block 5 → seq C
  block 6: free

A page table maps each sequence's logical blocks to whichever physical blocks are free. External fragmentation effectively disappears because a new request can take any free block; the only waste left is the unfilled tail of each sequence's last block.
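
The page-table idea can be sketched as a toy allocator. This is an illustration of the mapping only, not vLLM's block manager: the class name `BlockAllocator` and its methods are hypothetical, and real block managers also handle sharing, eviction, and preemption.

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # tokens per physical block, matching the 16-token pages above


@dataclass
class BlockAllocator:
    """Toy page table: maps each sequence to a list of physical block ids."""
    num_blocks: int
    free: list[int] = field(init=False)
    table: dict[str, list[int]] = field(default_factory=dict)
    lengths: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def append_tokens(self, seq: str, n: int) -> None:
        """Grow a sequence by n tokens, taking free blocks only when needed."""
        new_len = self.lengths.get(seq, 0) + n
        needed = -(-new_len // BLOCK_TOKENS)  # ceiling division
        held = self.table.get(seq, [])
        extra = needed - len(held)
        if extra > len(self.free):
            raise MemoryError("no free KV blocks; the request has to wait")
        self.table[seq] = held + [self.free.pop(0) for _ in range(extra)]
        self.lengths[seq] = new_len

    def release(self, seq: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.table.pop(seq, []))
        self.lengths.pop(seq, None)


alloc = BlockAllocator(num_blocks=8)
alloc.append_tokens("A", 20)   # 20 tokens need 2 blocks
alloc.append_tokens("B", 40)   # 40 tokens need 3 blocks
alloc.append_tokens("C", 5)    # 5 tokens need 1 block
alloc.release("A")             # A finishes; its blocks become reusable
alloc.append_tokens("D", 48)   # D takes 3 blocks, including ones A gave back
print(alloc.table)
```

Sequence D ends up on blocks 6, 7, and 0, which are not adjacent in memory. Nothing had to be compacted or moved to admit it, which is the property the page table buys.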

PagedAttention is the foundation. The 2026 operations question is not “do we use PagedAttention” but which optimization to enable for which workload.

Automatic Prefix Caching

Reuses the KV cache of repeated system prompts, tool schemas, and AGENTS.md/CLAUDE.md prefixes. Effective for workloads with long static prefixes such as Ralph loops.

Chunked Prefill

Splits long prefills into small chunks and interleaves them with decode requests, preventing one long input from blocking the queue.

Speculative Decoding

A small draft model or n-gram speculation proposes candidate tokens that the large model verifies. The goal is lower per-token latency.

Disaggregated Prefill

Separates prefill and decode into different workers/GPUs. Helpful for production workloads mixing long inputs with short outputs.

Structured Outputs

Enforces JSON schema, tool calls, and reasoning outputs at the serving layer, reducing parser failures in the agent harness.
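
Of the levers above, chunked prefill is the easiest to picture as a scheduling loop. The sketch below assumes a fixed token budget per engine step, spends it on decode tokens first, and gives the remainder to prefill chunks. The `Request` class, the `schedule` function, and the budget of 512 are hypothetical simplifications; a real scheduler also accounts for KV memory, priorities, and preemption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_left: int   # prompt tokens still to prefill
    decode_left: int   # output tokens still to generate


def schedule(requests: list[Request], token_budget: int) -> list[dict[str, int]]:
    """Return, for each engine step, how many tokens every request processed."""
    steps = []
    while any(r.prompt_left or r.decode_left for r in requests):
        budget, step = token_budget, {}
        # Decode first: each request that finished prefill emits one token.
        for r in requests:
            if r.prompt_left == 0 and r.decode_left > 0 and budget > 0:
                r.decode_left -= 1
                budget -= 1
                step[r.name] = 1
        # Spend the remaining budget on prefill, one chunk per request.
        for r in requests:
            if r.prompt_left > 0 and budget > 0:
                chunk = min(r.prompt_left, budget)
                r.prompt_left -= chunk
                budget -= chunk
                step[r.name] = chunk
        steps.append(step)
    return steps


# A chat request already decoding, plus a long document that still needs prefill.
steps = schedule([Request("chat", 0, 3), Request("doc", 1000, 1)], token_budget=512)
for i, step in enumerate(steps, start=1):
    print(i, step)
```

The chat request emits a token in every step while the 1000-token prompt is absorbed in two chunks (511 and 489 tokens). Without chunking, the chat request would wait for the whole prefill to finish.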

| Lever | Primary effect | Biggest risk | Indicator |
| --- | --- | --- | --- |
| Prefix Caching | lower TTFT, less prefill cost | hit rate ~0% if prompts vary slightly | cache hit ratio, TTFT |
| Chunked Prefill | mitigates head-of-line blocking from long inputs | overhead grows with overly small chunks | p95 latency, queue time |
| Speculative Decoding | lower average TPOT | weak draft model causes rejects and slowdown | acceptance rate, TPOT |
| Disaggregated Prefill | higher throughput on mixed long/short workloads | extra inter-node communication cost | throughput, network util |
| Structured Outputs | fewer parser failures | some schemas hurt sampling quality | invalid JSON rate, score |

Expected prefix-cache hit rate by scenario:

| Scenario | Static prefix | Dynamic part | Expected hit | TTFT impact |
| --- | --- | --- | --- | --- |
| Ralph loop with the same PROMPT.md | long (3-8K) | small turn diff | 70-95% | very large |
| Multi-tenant chatbot with shared system prompt | medium (500-2K) | per-user message | 30-60% | moderate |
| RAG with new chunks each time | tiny | most of the prompt | 0-15% | almost none |
| Repeated analysis on the same codebase | very long (10K+) | only the diff | 80-99% | very large |

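
The spread in expected hit rates follows from how a prefix cache identifies reusable blocks. The sketch below assumes each block's key is a hash of its own tokens chained with the key of the block before it, so a block is reusable only if everything in front of it is identical. This is a toy model under that assumption; the functions `block_keys` and `hit_ratio` are hypothetical, and vLLM's actual hashing differs in detail.

```python
import hashlib

BLOCK_TOKENS = 16


def block_keys(tokens: list[int]) -> list[str]:
    """Key each full block by its tokens plus the key of everything before it."""
    keys, parent = [], ""
    full = len(tokens) - len(tokens) % BLOCK_TOKENS  # ignore the partial tail block
    for start in range(0, full, BLOCK_TOKENS):
        block = tokens[start:start + BLOCK_TOKENS]
        parent = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        keys.append(parent)
    return keys


def hit_ratio(cached: list[int], incoming: list[int]) -> float:
    """Fraction of the incoming prompt's full blocks already present in the cache."""
    cache = set(block_keys(cached))
    keys = block_keys(incoming)
    hits = 0
    for key in keys:
        if key not in cache:
            break  # chained keys: once one block misses, all later blocks miss too
        hits += 1
    return hits / len(keys) if keys else 0.0


base = list(range(160))                      # a 10-block prompt already in the cache
changed_tail = base[:144] + [999] * 16       # same first 9 blocks, new final block
changed_head = [999] + base[:159]            # one extra token at the very start

print(hit_ratio(base, changed_tail))  # 0.9
print(hit_ratio(base, changed_head))  # 0.0
```

Changing the tail keeps nine of ten blocks warm, while a single token inserted at the front (a timestamp, a request ID) shifts every block and drops the hit rate to zero. For prompt assembly this means: static content first, volatile content last.
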
Figure: Speculative decoding sequence (participants: Client, Target (large), Draft (small))

① Client → Target: prompt
Repeat until a stop token:
  ② Target → Draft: speculate next k tokens
  ③ Draft → Target: candidate tokens [t1..tk]
  ④ Target verifies all k tokens (one forward pass)
     accept all k: Target → Client: emit the k tokens (the same pass also yields one extra target token)
     reject after j accepted tokens: emit [t1..tj] + 1 corrected token

The more accurate the draft, the more tokens are accepted at once and the lower the TPOT. A poor draft turns verification into pure overhead and can make things slower.
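
The accept/reject rule can be sketched for the greedy-decoding case, where a draft token is accepted only if it equals the token the target would have picked. This is a simplified illustration: the sampling variant in Leviathan et al. uses a probabilistic acceptance test instead of exact match, and the toy `draft` and `target` functions below are hypothetical stand-ins for real models.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(context: list[int], draft: NextToken, target: NextToken, k: int) -> list[int]:
    """One draft-then-verify round under greedy decoding; returns the emitted tokens."""
    # The draft proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The target checks each position. A real server scores all k positions in
    # one batched forward pass; this loop only mimics the accept/reject logic.
    emitted, ctx = [], list(context)
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            emitted.append(expected)  # first mismatch: emit the correction and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target(ctx))       # all accepted: the pass yields one extra token
    return emitted


def target(ctx: Sequence[int]) -> int:
    return ctx[-1] + 1                # toy "large model": always counts up by one


def draft(ctx: Sequence[int]) -> int:
    # Toy "small model": agrees with the target except after a token ending a group of 4.
    return ctx[-1] + 2 if ctx[-1] % 4 == 3 else ctx[-1] + 1


partial = speculative_step([0], draft, target, k=4)
full = speculative_step([0], target, target, k=4)
print(partial)  # [1, 2, 3, 4]
print(full)     # [1, 2, 3, 4, 5]
```

In the first call the draft gets three tokens right and the fourth wrong, so the round still emits four tokens for one target pass. If the draft were wrong at the first position, the round would emit a single token and the draft's work would be wasted, which is the slowdown case described above.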

Figure: Disaggregated prefill architecture

Frontend (load balancer): request ingress, response egress
  ▼ fan out to prefill
Prefill Worker 1, Prefill Worker 2: long-input KV generation
  ▼ KV cache handoff
KV Cache Store: short-lived cache, fast lookup for decode workers
  ▼ route to decode
Decode Worker 1, Decode Worker 2, Decode Worker 3: short-output token generation
  ▲ responses back to LB

Long inputs land on prefill workers; only the KV cache is forwarded to decode workers. This raises utilization in production where long inputs and short outputs mix.

Turning everything on at once is not optimization. Diagnose the bottleneck first, then enable one lever at a time.

| Workload | Primary bottleneck | First lever to enable | What to measure |
| --- | --- | --- | --- |
| Long repository context + small edits | prefill cost | prefix caching, chunked prefill | TTFT, cache hit ratio |
| Many short queries | batching efficiency | continuous batching, tuned max_num_seqs | throughput, queue time |
| Mixed long-doc summarization and short coding | long prefill blocking | chunked prefill, disaggregated prefill | p95 latency, queue time |
| Interactive coding assistant | first-token delay | speculative decoding, smaller max_tokens | TTFT, TPOT, perceived latency |
| Strict JSON / tool output | parser failure | structured outputs, tool parser settings | invalid JSON rate |
| Ralph-loop capstone | repeated prefill | prefix caching + chunked prefill | TTFT, cache hit |

Lab reports must capture not “we enabled X” but how the bottleneck indicator changed before and after.
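
One way to report that change is to compare the median and p95 of the same indicator across two runs. The helper below is a minimal sketch using only the standard library. The sample values are synthetic placeholders, not measurements, and the `compare` function is a hypothetical name.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile using the statistics module's inclusive method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """Median and p95 for both runs, plus the relative change in percent."""
    b50, a50 = statistics.median(before), statistics.median(after)
    b95, a95 = p95(before), p95(after)
    return {
        "p50_before": b50, "p50_after": a50, "p50_change_pct": 100 * (a50 - b50) / b50,
        "p95_before": b95, "p95_after": a95, "p95_change_pct": 100 * (a95 - b95) / b95,
    }


# Synthetic TTFT samples in milliseconds, for illustration only.
ttft_off = [float(ms) for ms in range(0, 101)]
ttft_on = [ms / 2 for ms in ttft_off]

report = compare(ttft_off, ttft_on)
for key, value in report.items():
    print(f"{key}: {value:.1f}")
```

Reporting p95 next to the median matters because a lever such as chunked prefill can leave the median nearly unchanged while removing the long tail.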

vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--served-model-name local-coder \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--port 8000

A common tuning mistake is to look only at tokens/sec. Agent systems require the following indicators side by side.

| Indicator | Meaning | Bad signal |
| --- | --- | --- |
| TTFT (Time to First Token) | latency to the first response token | long prefills blocking the queue |
| TPOT (Time per Output Token) | inter-token interval | low decode-stage GPU utilization |
| Throughput | tokens generated per second | poor batching or memory fragmentation |
| Queue time | how long requests waited | excess concurrency, missing admission control |
| Cache hit ratio | prefix reuse rate | prompt assembly differs each turn |
| Error rate | timeouts or failures | max_model_len, OOM, parser failure |
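
The three latency indicators can be derived from per-token arrival timestamps. The sketch below assumes the client records a request start time and one timestamp per output token, all in seconds; the `Timing` and `summarize` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    ttft_s: float        # request start to first token
    tpot_s: float        # mean gap between consecutive output tokens
    tokens_per_s: float  # output tokens over the whole request


def summarize(request_start: float, token_times: list[float]) -> Timing:
    """Derive TTFT, TPOT, and throughput from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were produced")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    tpot = (last - first) / (n - 1) if n > 1 else 0.0
    return Timing(
        ttft_s=first - request_start,
        tpot_s=tpot,
        tokens_per_s=n / (last - request_start),
    )


timing = summarize(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)
```

Here the first token arrives after 0.5 s and later tokens every 0.25 s, so TTFT is twice TPOT. Throughput alone (about 3.3 tokens/s) would hide that split between prefill wait and decode speed.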

vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics by default. Build the headline panels for the capstone with the following PromQL.

# Panel 1: TTFT p95 by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

# Panel 2: throughput (generated tokens/sec)
sum by (model_name) (rate(vllm:generation_tokens_total[1m]))

# Panel 3: prefix cache hit ratio
# Prefix-cache counter names vary by vLLM version; confirm them against your server's /metrics output.
sum(rate(vllm:prefix_cache_hits_total[5m]))
/
sum(rate(vllm:prefix_cache_queries_total[5m]))

# Panel 4: queue depth (two series on one panel: running vs. waiting)
vllm:num_requests_running
vllm:num_requests_waiting

Annotate each option-flip event on the panel so reviewers can see the before/after directly.

Accepting more requests when the server slows down breaks every team’s lab. Production serving makes a decision before each request is admitted.

| Control | Example policy |
| --- | --- |
| Max input length | course-level cap below max_model_len |
| Concurrent requests | per-team queue plus a global queue |
| Priority | demo and replay outrank experimental batches |
| Timeout | log prefill timeout and decode timeout separately |
| Fallback | reroute to a commercial API or a smaller model on failure |

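# admission.py: per-team queue and token budget
import asyncio
from dataclasses import dataclass


@dataclass
class Budget:
    max_concurrent: int = 4          # in-flight requests allowed per team
    max_prompt_tokens: int = 16000   # cap on prompt length
    timeout_s: float = 60.0          # longest a request may wait for a slot


team_locks: dict[str, asyncio.Semaphore] = {}


async def admit(team: str, prompt_tokens: int, budget: Budget) -> bool:
    """Return True once the request holds a slot; False means it was rejected."""
    if prompt_tokens > budget.max_prompt_tokens:
        return False
    # Create the team's semaphore on first use only.
    if team not in team_locks:
        team_locks[team] = asyncio.Semaphore(budget.max_concurrent)
    try:
        await asyncio.wait_for(team_locks[team].acquire(), timeout=budget.timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def release(team: str) -> None:
    """Give the slot back; call exactly once for every admit() that returned True."""
    team_locks[team].release()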

In Agent OS Runtime terms, admission control is also a policy gate. A rejection is not always a failure — it can be a successful boundary-protection event.
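
A gate like this is easier to use safely when the slot is held by a context manager, so it is released even if the request raises. The sketch below is one possible shape, not part of admission.py: the `slot` helper, the `Rejected` exception, and the module-level `rejected` counter are hypothetical, and the sleeps stand in for real vLLM calls.

```python
import asyncio
from contextlib import asynccontextmanager

rejected = 0  # hypothetical counter; a real deployment would export it as a metric


class Rejected(Exception):
    """Raised when a request is turned away at the gate."""


@asynccontextmanager
async def slot(sem: asyncio.Semaphore, timeout_s: float):
    """Hold one concurrency slot for the duration of the block, releasing it on any exit."""
    global rejected
    try:
        await asyncio.wait_for(sem.acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        rejected += 1
        raise Rejected("queue wait exceeded the timeout") from None
    try:
        yield
    finally:
        sem.release()


async def demo() -> list[str]:
    sem = asyncio.Semaphore(1)  # a single slot, so the second caller has to wait

    async def call(name: str, work_s: float, timeout_s: float) -> str:
        try:
            async with slot(sem, timeout_s):
                await asyncio.sleep(work_s)  # stands in for the model request
                return f"{name}: served"
        except Rejected:
            return f"{name}: rejected"

    # "a" takes the slot for 0.2 s; "b" is willing to wait only 0.05 s.
    return await asyncio.gather(call("a", 0.2, 1.0), call("b", 0.01, 0.05))


results = asyncio.run(demo())
print(results, "rejected:", rejected)
```

The second request is rejected rather than queued indefinitely, and the counter records it as a boundary-protection event instead of an error.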

In a DGX H100 lab shared by student teams, isolating resources beats a single shared model.

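# The MIG device IDs below are placeholders; list the real ones with `nvidia-smi -L`.
# Size each model to its slice: BF16 weights take about 2 bytes per parameter, so a
# 16B-parameter model needs roughly 32 GB and a 30B-parameter model roughly 60 GB
# before any KV cache. Use quantized checkpoints or larger slices if the weights do not fit.

# Team A: 1g.10gb slice, smaller model
CUDA_VISIBLE_DEVICES=MIG-GPU-a vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name team-a-coder \
  --max-model-len 16384 \
  --port 8001

# Team B: 3g.40gb slice, larger model
CUDA_VISIBLE_DEVICES=MIG-GPU-b vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --served-model-name team-b-coder \
  --max-model-len 32768 \
  --port 8002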

vLLM ships Prometheus/Grafana and OpenTelemetry examples. Week 12 covers them in depth, but Week 11 labs should already log at minimum:

{
  "request_id": "run-20260512-001",
  "model": "local-coder",
  "prompt_tokens": 1842,
  "completion_tokens": 419,
  "ttft_ms": 820,
  "tpot_ms": 34,
  "cache_hit": true,
  "finish_reason": "stop"
}

This format aligns with the Agent OS Runtime .events.jsonl. The serving layer records tokens and latency; the agent runtime records tool calls, approvals, test results, and replay state.
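
Writing these records as one JSON object per line keeps the log appendable and easy to replay. The sketch below is a minimal, assumption-laden version: the `append_event` and `load_events` helpers and the `serving.events.jsonl` file name are hypothetical, and the required-field list simply mirrors the example record above.

```python
import json
from pathlib import Path

REQUIRED = ("request_id", "model", "prompt_tokens", "completion_tokens", "ttft_ms", "tpot_ms")


def append_event(path: Path, record: dict) -> None:
    """Append one request record as a single JSON line, refusing incomplete records."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_events(path: Path) -> list[dict]:
    """Read the log back, one dict per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def example(log_path: Path = Path("serving.events.jsonl")) -> list[dict]:
    """Append the sample record from the text and read the whole log back."""
    append_event(log_path, {
        "request_id": "run-20260512-001", "model": "local-coder",
        "prompt_tokens": 1842, "completion_tokens": 419,
        "ttft_ms": 820, "tpot_ms": 34, "cache_hit": True, "finish_reason": "stop",
    })
    return load_events(log_path)
```

Rejecting incomplete records at write time is a deliberate choice here: a log line without `ttft_ms` or token counts cannot support the before/after comparisons the lab report needs.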

  1. Bring up the baseline server

    vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
    --served-model-name local-coder \
    --max-model-len 32768 \
    --port 8000
  2. Compare prefix caching on/off

    Run 20 requests sharing the same system prompt and repository summary. Compare TTFT, throughput, and cache hit between --enable-prefix-caching off and on.

  3. Compare chunked prefill on/off

    Mix long file summarization (20K+ tokens) with short code-generation requests. Measure how much the long requests delay the short ones.

  4. Write an OpenAI-compatible client

    from openai import OpenAI
    import time

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
    started = time.perf_counter()
    response = client.chat.completions.create(
        model="local-coder",
        messages=[{"role": "user", "content": "Write a sample pytest fixture."}],
        temperature=0.2,
    )
    elapsed = time.perf_counter() - started  # end-to-end latency in seconds, not TTFT
    print(elapsed, response.choices[0].message.content[:200])
  5. Add an admission-control queue

    Apply the admission.py pattern to limit per-team concurrency. Count rejected requests in a separate metric (admission.rejected).

  6. Prepare the dashboard

    Log ttft_ms, tokens/sec, error_rate, and cache_hit to Prometheus/Grafana or a simple CSV. Week 12 wires these values to OpenTelemetry traces.
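
The client in step 4 measures end-to-end time only. To fill the `ttft_ms` field, the request has to be streamed so the first chunk's arrival can be timed. The sketch below is one way to do that with the same OpenAI-compatible endpoint; the `measure` and `run_live` helpers are hypothetical, and because one streamed chunk may carry more than one token, the mean gap is only an approximation of TPOT.

```python
import time
from typing import Callable, Iterable


def measure(stream: Iterable, started: float,
            clock: Callable[[], float] = time.perf_counter) -> dict[str, float]:
    """Time the arrival of every streamed chunk that carries text."""
    arrivals = []
    for chunk in stream:
        if not chunk.choices:
            continue                      # some chunks carry only metadata
        if chunk.choices[0].delta.content:
            arrivals.append(clock() - started)
    if not arrivals:
        raise RuntimeError("stream produced no content")
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft_ms": 1000 * arrivals[0],
        "mean_gap_ms": 1000 * sum(gaps) / len(gaps) if gaps else 0.0,
        "chunks": float(len(arrivals)),
    }


def run_live() -> dict[str, float]:
    """Send one streamed request to the baseline server from step 1."""
    from openai import OpenAI  # imported here so measure() needs no server or SDK

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
    started = time.perf_counter()  # start the clock before the request is sent
    stream = client.chat.completions.create(
        model="local-coder",
        messages=[{"role": "user", "content": "Write a sample pytest fixture."}],
        temperature=0.2,
        stream=True,
    )
    return measure(stream, started)
```

Calling `run_live()` with the server up returns a dictionary that can be merged into the request log record. Starting the clock before `create()` matters, since otherwise the time spent waiting for the response to begin would be excluded from TTFT.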

| Item | Pass criterion |
| --- | --- |
| Model loading | cold start and warm restart times recorded |
| API compatibility | OpenAI client can complete a chat call |
| Stability | 0-2 timeouts/OOMs out of 100 requests |
| Cost | comparison table of GPU-hour cost vs. API cost |
| Observability | at least request_id, model, tokens, latency logged |
| Security | external network/file access permissions separated |
| Admission control | per-team queue isolated, rejection events logged |

Week 12 builds telemetry, an event store, and the LLM-as-Judge gate on top of this vLLM server. Week 11 is “how to generate fast.” Week 12 is “how to run the generated results so you can trust them.”

  1. The inference server is the model’s OS: model weights alone do not produce a service. KV cache, batching, queues, and observability must be co-designed.
  2. PagedAttention is the floor: it nearly eliminates fragmentation, but the real operational decision is which optimization sits on top.
  3. Evaluate the five levers separately: prefix caching, chunked prefill, speculative decoding, disaggregated prefill, and structured outputs each address a different bottleneck.
  4. Diagnose the workload first: do not flip switches blindly — measure TTFT, TPOT, queue time, and cache hit before tuning.
  5. Throughput alone is a trap: the agent UX is governed by TTFT, p95 latency, and queue time.
  6. Admission control is a policy gate: a rejected request is a boundary-protection event, not a failure.
  7. Multi-tenancy is solved by boundaries: MIG slices, separate ports, and per-team model names keep one team’s mistake from breaking the lab.

Papers / reports

  • Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023)
  • Leviathan et al., “Fast Inference from Transformers via Speculative Decoding” (ICML 2023)
  • Patel et al., “Splitwise: Efficient Generative LLM Inference Using Phase Splitting” (ISCA 2024)