Week 11: High-Throughput Inference Optimization with vLLM

Phase 4 · Week 11 · Advanced · Lecture: 2026-05-12

Traditional LLM inference reserves the KV cache (Key-Value cache) for each request as one contiguous memory block sized for the maximum sequence length, which wastes most of the reservation. vLLM's PagedAttention applies the operating system's virtual-memory paging concept to the KV cache, allocating fixed-size blocks on demand instead.

Traditional approach: KV cache 100% reserved up front; actual usage ~30% → ~70% wasted.
PagedAttention: blocks ([Block 1] [Block 2] [Block 3] …) allocated dynamically as needed → waste around 4%.
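The savings can be sketched with a back-of-the-envelope calculation. The numbers below are illustrative, not measurements; the 16-token block size matches vLLM's default:

```python
import math

def contiguous_waste(max_len: int, used: int) -> float:
    """Contiguous allocation reserves max_len slots up front; the rest is wasted."""
    return (max_len - used) / max_len

def paged_waste(used: int, block: int = 16) -> float:
    """Paged allocation only wastes the unfilled tail of the last block."""
    blocks = math.ceil(used / block)
    return (blocks * block - used) / (blocks * block)

# A request that actually uses ~30% of a 4096-token budget:
print(f"contiguous waste: {contiguous_waste(4096, 1229):.0%}")  # ~70%
print(f"paged waste:      {paged_waste(1229):.1%}")             # well under 4%
```

Per-request internal fragmentation drops to at most one partially filled block, which is where the single-digit waste figure comes from.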
```python
# Production vLLM configuration
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    # Memory settings
    gpu_memory_utilization=0.90,   # use 90% of GPU memory
    max_model_len=32768,
    # Performance optimization
    enable_prefix_caching=True,    # cache repeated prompt prefixes
    enable_chunked_prefill=True,   # split long prefills into chunks
    # MIG environment
    tensor_parallel_size=1,        # one MIG slice, no tensor parallelism
)

# Batch processing
requests = [
    "Implement quicksort in Python",
    "Implement binary search in JavaScript",
    "Implement an HTTP server in Go",
]
sampling_params = SamplingParams(
    temperature=0.1,  # low temperature for code generation
    top_p=0.9,
    max_tokens=1024,
)
outputs = llm.generate(requests, sampling_params)
```
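Throughput for a batch run follows from the generated token counts and the wall-clock time. A minimal helper (the commented lines assume the `llm`, `requests`, and `sampling_params` objects from the snippet above, and vLLM's `RequestOutput` shape where each result carries `.outputs[0].token_ids`):

```python
import time

def throughput(token_counts: list[int], elapsed_s: float) -> float:
    """Aggregate generation throughput in tokens per second."""
    return sum(token_counts) / elapsed_s

# Against a live LLM instance (sketch):
#   t0 = time.perf_counter()
#   outputs = llm.generate(requests, sampling_params)
#   counts = [len(o.outputs[0].token_ids) for o in outputs]
#   print(f"{throughput(counts, time.perf_counter() - t0):.1f} tok/s")

# Illustrative numbers, not a measurement:
print(f"{throughput([512, 480, 1024], 4.0):.1f} tok/s")  # 504.0 tok/s
```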
Each student serves their model on a dedicated MIG slice through vLLM's OpenAI-compatible server:

```shell
# Student A: Qwen3-Coder (MIG slice 1, 3g.40gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-xxx python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Coder-32B-Instruct --port 8001

# Student B: GLM-4-9B (MIG slice 2, 3g.40gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-yyy python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat --port 8002

# Student C: DeepSeek-Coder-V2-Lite (MIG slice 3, 2g.20gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-zzz python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 8003
```
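Each server exposes the standard OpenAI-style `/v1/chat/completions` endpoint, so any HTTP client works. A minimal stdlib sketch (port and model name taken from Student C's command above; the actual POST is shown as a comment since it requires a running server):

```python
import json
import urllib.request

def chat_request(port: int, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request(8003, "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
                   "Implement quicksort in Python")
# With the server running: reply = json.load(urllib.request.urlopen(req))
print(req.full_url)
```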
  1. Apply vLLM Optimization Settings — Compare configurations for gpu_memory_utilization and enable_prefix_caching

  2. Implement Batch Processing — Handle concurrent requests and measure throughput

  3. Verify Prefix Caching Effect — Measure cache hit rate using a Ralph Loop simulation

  4. Performance Profiling — Monitor GPU utilization with nvidia-smi dmon
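The prefix-caching exercise can be approximated offline before touching the GPU. A Ralph Loop resends the same system prompt every iteration, so a block-level cache simulation (a hypothetical helper; vLLM's real prefix cache also works on 16-token blocks, but hashes them internally) predicts the hit rate:

```python
def prefix_hit_rate(prompts: list[list[str]], block: int = 16) -> float:
    """Fraction of prompt tokens served from a block-level prefix cache."""
    cached: set[tuple[str, ...]] = set()
    hits = total = 0
    for tokens in prompts:
        for i in range(0, len(tokens), block):
            key = tuple(tokens[: i + block])   # the full prefix identifies a block
            chunk = len(tokens[i : i + block])
            total += chunk
            if key in cached:
                hits += chunk
            else:
                cached.add(key)
    return hits / total

# Ralph Loop: identical 64-token system prompt + a varying 16-token task, 10 iterations.
system = [f"s{i}" for i in range(64)]
loops = [system + [f"t{n}_{i}" for i in range(16)] for n in range(10)]
print(f"expected hit rate: {prefix_hit_rate(loops):.0%}")  # 72%
```

Only the shared system-prompt blocks hit the cache; the per-iteration task block always misses, which is exactly the behavior to verify against the live server in exercise 3.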

In Week 12, we will build a telemetry system on top of this vLLM server and implement an LLM-as-Judge evaluation framework.