Week 11: High-Throughput Inference Optimization with vLLM
Theory
PagedAttention: vLLM’s Core Innovation

Traditional LLM inference allocates each request’s KV cache (key-value cache) as one contiguous block of GPU memory, reserved up front for the maximum sequence length, which results in severe memory waste. vLLM’s PagedAttention applies the operating-system concept of virtual memory paging to the KV cache: the cache is split into fixed-size blocks that are allocated on demand.
Traditional approach:  [KV cache: 100% reserved]       actual usage 30%, wasted ~70%
PagedAttention:        [Block 1] [Block 2] [Block 3]   dynamic allocation as needed, wasted ~4%
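The waste figures above can be made concrete with a back-of-the-envelope calculation. The model dimensions below are illustrative assumptions, not values taken from any specific model; the 16-token block size matches vLLM's default PagedAttention block size.

```python
# Illustrative KV-cache sizing (model dimensions are assumed for the example).
# Per-token KV cache = 2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_elem
num_layers, num_kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * 2  # fp16 = 2 bytes

max_model_len = 32768   # contiguous allocation reserves the full context window
actual_tokens = 9830    # ~30% of the window actually used by this request

contiguous_mb = max_model_len * bytes_per_token / 2**20

BLOCK_SIZE = 16                                   # vLLM's default block size (tokens)
blocks_needed = -(-actual_tokens // BLOCK_SIZE)   # ceiling division
paged_mb = blocks_needed * BLOCK_SIZE * bytes_per_token / 2**20

print(f"contiguous: {contiguous_mb:.0f} MiB, paged: {paged_mb:.0f} MiB")
print(f"saved vs. contiguous: {1 - paged_mb / contiguous_mb:.0%}")
```

With these assumptions the contiguous reservation costs 4096 MiB while paged allocation needs only 1230 MiB: roughly the 70% waste quoted above, with paging's own overhead limited to the partially filled final block.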
vLLM Optimization Configuration
Section titled “vLLM Optimization Configuration”# Production vLLM configurationfrom vllm import LLM, SamplingParams
llm = LLM( model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", # Memory settings gpu_memory_utilization=0.90, # Use 90% of GPU memory max_model_len=32768, # Performance optimization enable_prefix_caching=True, # Cache repeated prompts enable_chunked_prefill=True, # Chunked prefill # MIG environment tensor_parallel_size=1, # 1 MIG slice)
# Batch processingrequests = [ "Implement quicksort in Python", "Implement binary search in JavaScript", "Implement an HTTP server in Go",]
sampling_params = SamplingParams( temperature=0.1, # Low temperature for code generation top_p=0.9, max_tokens=1024,)
outputs = llm.generate(requests, sampling_params)Multi-Model Serving Across MIG Slices
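Each server started below exposes vLLM's OpenAI-compatible API on its own port, so a client targets a model simply by choosing the port. A minimal standard-library sketch (the port and model name are examples that must match the server you actually started; the payload follows the OpenAI chat completions schema):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, temperature: float = 0.1) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": 1024,
    }


def query(port: int, payload: dict) -> str:
    """Send the request to a local vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "Implement quicksort in Python",
)
# query(8003, payload)  # requires the matching vLLM server to be running
```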
```bash
# Student A: Qwen3-Coder (MIG slice 1, 3g.40gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-xxx python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Coder-32B-Instruct --port 8001

# Student B: GLM-4-9B-Chat (MIG slice 2, 3g.40gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-yyy python -m vllm.entrypoints.openai.api_server \
    --model THUDM/glm-4-9b-chat --port 8002

# Student C: DeepSeek-Coder-V2-Lite (MIG slice 3, 2g.20gb)
CUDA_VISIBLE_DEVICES=MIG-GPU-zzz python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 8003
```

Practicum
- Apply vLLM Optimization Settings — Compare configurations for gpu_memory_utilization and enable_prefix_caching
- Implement Batch Processing — Handle concurrent requests and measure throughput
- Verify Prefix Caching Effect — Measure the cache hit rate using a Ralph Loop simulation
- Performance Profiling — Monitor GPU utilization with nvidia-smi dmon
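For the batch-processing exercise, throughput is usually reported as generated tokens per second across the whole batch. A minimal sketch; the commented vLLM calls show where the measurement would hook in, while the helper itself is plain arithmetic:

```python
import time


def throughput_tokens_per_sec(token_counts: list[int], elapsed_sec: float) -> float:
    """Aggregate generation throughput across a batch of requests."""
    return sum(token_counts) / elapsed_sec


# With vLLM (sketch, assuming the llm/requests/sampling_params from this section):
# start = time.perf_counter()
# outputs = llm.generate(requests, sampling_params)
# elapsed = time.perf_counter() - start
# counts = [len(o.outputs[0].token_ids) for o in outputs]
# print(throughput_tokens_per_sec(counts, elapsed))

# Offline example: 3 requests generating 1024/512/768 tokens in 4 seconds
rate = throughput_tokens_per_sec([1024, 512, 768], 4.0)
print(f"{rate:.1f} tokens/s")  # 576.0 tokens/s
```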
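The prefix-caching exercise can also be approximated offline before touching the GPU. vLLM caches KV blocks keyed by the full token prefix up to each block; the toy simulator below mimics that at the character level. The block size and the repeated agent-style system prompt are assumptions chosen to resemble a Ralph Loop workload, where the same long prefix recurs across iterations:

```python
def simulate_prefix_cache(prompts: list[str], block_size: int = 16) -> float:
    """Return the block-level cache hit rate across a sequence of prompts.

    Mimics prefix caching: a block counts as a hit only if the entire
    prefix up to and including that block has been seen before.
    """
    cache: set[str] = set()
    hits = total = 0
    for prompt in prompts:
        for end in range(block_size, len(prompt) + 1, block_size):
            prefix = prompt[:end]  # a block is identified by its full prefix
            total += 1
            if prefix in cache:
                hits += 1
            else:
                cache.add(prefix)
    return hits / total if total else 0.0


# Ralph-Loop-style workload: one long shared system prompt, varying tasks
system = "You are a coding agent. Follow the plan and fix failing tests. " * 4
prompts = [system + task for task in ("task A", "task B", "task C")]
rate = simulate_prefix_cache(prompts)
print(f"block hit rate: {rate:.0%}")  # high, since the long prefix repeats
```

The first prompt populates the cache; later prompts hit on every block of the shared system prefix and miss only after they diverge, which is exactly the behavior enable_prefix_caching exploits.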
Preview of Next Week
In Week 12, we will build a telemetry system on top of this vLLM server and implement an LLM-as-Judge evaluation framework.