Week 10: Open-Weight Coding LLMs and Local Deployment

Phase 4 · Week 10 · Advanced · Lecture: 2026-05-05

Theory

Learning Objectives

Concepts

Distinguish open-weight from open-source, and place the major 2026 coding LLM families (Qwen3-Coder, DeepSeek-V4, GLM-5.1, MiniMax-M2.7, DeepSeek-Coder-V2-Lite) on the landscape.

Design

Compare commercial API, cloud agent, and local inference along data-boundary, cost, and operational-complexity axes, and map the choices into a decision tree.

Implementation

Bring up an OpenAI-compatible server with vllm serve on single GPU, multi-GPU, and MIG configurations, and unify alias / tool parser / token accounting through a model adapter.

Operations

Run identical tasks on a commercial model and a local model under the same harness, and quantitatively compare throughput, latency, failure rate, and projected cost.

2026 open-weight coding LLM landscape

The defining shift of 2026 is that open-weight models have become candidates that can be compared with commercial models inside real agent harnesses. Coding performance is no longer the only axis: tool calling, long context, inference-server compatibility, license, and operating cost are now compared together.

Family tree

Open-weight Coding LLM (2026)

Qwen: Qwen3-Coder-Next (80B MoE · 3B active · 256K), Qwen3-Coder-30B-A3B
DeepSeek: V4-Pro (1.6T MoE · 1M ctx), V4-Flash (284B MoE), Coder-V2-Lite (16B)
GLM: GLM-5.1 (agentic engineering)
MiniMax: MiniMax-M2.7 (agent + office)

Major models compared (2026-05)

Model	Release form	Parameters	Context	Strengths	Operational notes
Qwen3-Coder-Next	Apache 2.0	80B MoE, 3B active	256K	agentic coding, tool calling, local development	More lightweight coding-agent candidate than the larger previous generation
DeepSeek-V4-Pro / Flash	Public model card	1.6T / 284B MoE	1M	long-context reasoning, coding, agentic tasks	Pro for quality, Flash for efficiency
GLM-5.1	Public weights	public GLM family	long-context agent work	agentic engineering, coding, tool use	GLM-family comparison baseline
MiniMax-M2.7	Public model card	230B-class MoE	long agentic workflow	coding, search, office work, long-horizon agents	MiniMax-family agentic candidate
DeepSeek-Coder-V2-Lite	DeepSeek Model License	16B / 236B family	128K	single-GPU labs, code edit/completion	Suitable for teaching labs, but not as a high-performance baseline

Reading model cards in order

Read operational constraints before benchmark tables. Evaluate candidates in this order.

Step	Question	Disqualifier
License	Can it be used in class, research, and commercial PoCs?	Redistribution / service terms unclear
Serving support	Is vLLM, SGLang, or Transformers officially supported?	Only community patches, no official examples
Tool format	How are tool calls, JSON mode, and reasoning outputs represented?	Parser must be patched ad hoc each time
Context policy	Do max context and recommended context differ?	“1M context” claim without GPU budget
Failure behavior	How do OOM, invalid JSON, refusal, and timeout surface?	Failures unstructured, retry policy hard to design

The entries in the table are candidates as of the lecture date. The capstone selection criterion is not the model name but the reproducible result of running the same task packet with the same harness.
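The screening order above can be sketched as a small helper that reports the first step at which a candidate is disqualified. This is a minimal illustration: the step keys, `CardReview`, and `first_disqualifier` are hypothetical names, not part of any course tooling.

```python
from dataclasses import dataclass
from typing import Optional

# Screening order from the table above; the key names are illustrative.
SCREENING_ORDER = [
    "license",
    "serving_support",
    "tool_format",
    "context_policy",
    "failure_behavior",
]


@dataclass
class CardReview:
    model: str
    # Maps each step to True (passes) or False (hits the disqualifier).
    checks: dict


def first_disqualifier(review: CardReview) -> Optional[str]:
    """Return the first failed step in screening order, or None if all pass."""
    for step in SCREENING_ORDER:
        # A step that was never reviewed is treated as a failure.
        if not review.checks.get(step, False):
            return step
    return None
```

Stopping at the first failed step mirrors the intent of the table: a model with unclear license terms is not worth benchmarking yet.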

Qwen3-Coder

Qwen3-Coder-Next is an 80B-total / 3B-active MoE model with 256K context, foregrounding agentic coding and tool calling. It is a smaller and faster coding-agent candidate than its larger predecessors and serves as our Qwen-family baseline.

Operationally, Qwen3-Coder matters for more than raw coding performance: long repository context, its function-calling format, and vLLM/SGLang deployment support together make it a viable candidate for building agentic coding harnesses on local or institutional servers.

DeepSeek-V4

DeepSeek-V4-Pro and DeepSeek-V4-Flash are comparison candidates from the DeepSeek V4-family model cards, both listed with 1M context. V4-Pro (1.6T MoE) targets quality and long-context reasoning; V4-Flash (284B MoE) targets efficiency and throughput.

Family member	Course usage
DeepSeek-Coder-V2-Lite	Easy-to-run educational model
DeepSeek-V4-Pro	DeepSeek high-capability comparison baseline
DeepSeek-V4-Flash	Cost / throughput baseline
Earlier DeepSeek generations	Long-context regression baseline

GLM-5.1

GLM-5.1 is a public GLM-family candidate for comparing agentic engineering and coding workflows. The lesson is constant: changing the model= value is not sufficient — the harness must understand each model’s tool/reasoning parser and serving options.

MiniMax-M2.7

MiniMax-M2.7 is a MiniMax-family agentic-workflow candidate, emphasizing search, office work, and long-horizon agentic tasks alongside coding. Provider benchmark numbers are treated as candidate signals, not as final rankings. The Week 10 lab compares them on the same task, harness, budget, and rubric.

Choosing a model is choosing a harness

Adopting open-weight models is not limited to API cost reduction. There are three real options.

Commercial API

Use Claude, GPT, or Gemini APIs directly. High quality and easy to operate, but data boundary, cost predictability, and rate limits must be managed.

Cloud Agents

Use Codex Web/App, Claude Code on the Web, or GitHub Agent HQ — remote sandbox plus GitHub integration. Strong for asynchronous work and review flow.

Local Inference

Serve OpenAI-compatible APIs through vLLM/SGLang. Strong on data boundaries and predictable GPU cost; operational complexity is higher.

Decision matrix

Criterion	Commercial API	Cloud Agent	Local Inference
Time-to-start	very fast	fast	slow
Data control	low–medium	medium	high
Long-term cost	usage-driven	seat / usage	GPU fixed cost
Access to newly released models	fast	fast	after model release
Operational difficulty	low	medium	high
Educational value	API design	workflow design	MLOps / infra understanding

Model-selection decision tree

Q1. Can data leave our boundary?
  ✗ No → Local inference (vLLM/SGLang), continue at Q4
  ✓ Yes → Q2

Q2. Need top-tier model quality?
  ✗ No → Commercial API + cache + budget gate
  ✓ Yes → Q3

Q3. Interactive IDE or async PR?
  Interactive → Commercial API + Claude Code / Codex CLI
  Async PR → Cloud agent (GitHub Agent HQ)

Q4. (Local path) Is one model enough?
  ✓ Yes → Single model + admission control
  ✗ No → Routing gateway

The tree is not gospel — it is the starting point for a written decision. Capture the branch and a one-to-two-sentence rationale in the capstone ADR.
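As a sketch, the four questions can be encoded as one function so that the chosen branch is recorded the same way in every ADR. The function name, parameter names, and default values are assumptions made for illustration; the returned strings are the leaf labels of the tree.

```python
def choose_deployment(
    data_can_leave: bool,
    needs_top_tier: bool = False,
    interactive: bool = True,
    one_model_enough: bool = True,
) -> str:
    """Walk Q1 to Q4 of the decision tree and return the leaf label."""
    # Q1: data that must stay inside the boundary forces the local path.
    if not data_can_leave:
        # Q4 applies only on the local path.
        if one_model_enough:
            return "local inference: single model + admission control"
        return "local inference: routing gateway"
    # Q2: without a top-tier quality requirement, stop at the API branch.
    if not needs_top_tier:
        return "commercial API + cache + budget gate"
    # Q3: interactive work versus asynchronous pull requests.
    if interactive:
        return "commercial API + Claude Code / Codex CLI"
    return "cloud agent (GitHub Agent HQ)"


# Example: private data and several models needed.
print(choose_deployment(data_can_leave=False, one_model_enough=False))
```

The function only reproduces the branches; the one-to-two-sentence rationale still has to be written by the team.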

vLLM / SGLang / TGI compared

Operating an open-weight model needs an inference server. The course uses vLLM by default, but you should be able to compare candidates.

Criterion	vLLM	SGLang	TGI (Hugging Face)
OpenAI-compatible API	yes	yes	yes
Continuous batching	✓	✓	✓
Prefix caching	automatic	RadixAttention	partial
Speculative decoding	✓	✓	some models
Disaggregated prefill	✓	experimental	×
Tool / structured output	✓ (auto-tool, guided JSON)	✓ (constrained decoding)	✓
Operational complexity	mid	mid–high	low
Community	very active	growing fast	stable
Recommended use	default labs / capstone	RadixAttention study	quick PoC

Cost worksheet

To compare commercial APIs and local inference on the same axis, account for both token price and GPU amortization.

Item	Formula
Per-call commercial cost	`(prompt_tokens × in_price) + (completion_tokens × out_price)`
GPU hourly cost	`rental rate or (purchase / depreciation hours) + power`
Per-call local cost	`GPU hourly cost × (run_latency_s / 3600) / concurrent calls`
Break-even calls	`GPU hourly cost / average per-call API cost`

Example
  - Commercial API example: $3/M in, $15/M out → 3K × $3/M + 600 × $15/M = $0.009 + $0.009 ≈ $0.018 per call (3K in, 600 out)
  - DGX H100 ~ $4.5/hr (instructional-rate assumption), 4 concurrent calls, 12 s per run
  - Local cost ≈ 4.5 × (12/3600) / 4 ≈ $0.0038 per call
  - Break-even: 4.5 / 0.018 = 250 calls per hour; above that, local wins

This table can be used in the capstone slide titled “why we chose local.”
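A minimal sketch of the worksheet formulas as plain Python functions, so teams can plug in their own measurements. The function and parameter names are illustrative, prices are expressed per million tokens, and the 12-second run latency is the value implied by the local-cost example rather than a measured number.

```python
def api_cost_per_call(prompt_tokens: int, completion_tokens: int,
                      in_price_per_m: float, out_price_per_m: float) -> float:
    """Per-call commercial cost; prices are USD per million tokens."""
    return (prompt_tokens * in_price_per_m
            + completion_tokens * out_price_per_m) / 1_000_000


def local_cost_per_call(gpu_hourly_cost: float, run_latency_s: float,
                        concurrent_calls: int) -> float:
    """GPU hourly cost x (run_latency_s / 3600) / concurrent calls."""
    return gpu_hourly_cost * (run_latency_s / 3600) / concurrent_calls


def break_even_calls_per_hour(gpu_hourly_cost: float,
                              api_cost: float) -> float:
    """Calls per hour at which API spend equals the GPU hourly cost."""
    return gpu_hourly_cost / api_cost


# Inputs taken from the example above.
api = api_cost_per_call(3_000, 600, in_price_per_m=3.0, out_price_per_m=15.0)
local = local_cost_per_call(gpu_hourly_cost=4.5, run_latency_s=12,
                            concurrent_calls=4)
print(f"API per call:   ${api:.4f}")
print(f"Local per call: ${local:.4f}")
print(f"Break-even:     {break_even_calls_per_hour(4.5, api):.0f} calls/hour")
```

The break-even figure only matters if the server can sustain that call rate, so compare it against the measured throughput from the lab.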

Agentic coding tool ecosystem

AGENTIC CODING TOOL ECOSYSTEM (2026)

Agent Harness

Claude Code: subagents, hooks, skills, MCP
Codex CLI/App/Web: sandbox, AGENTS.md, MCP, subagents
Gemini CLI: 1M context, Google Search grounding, MCP
GitHub Agent HQ: choose Claude or Codex inside GitHub

Open-Weight Models

Qwen3-Coder-Next
DeepSeek-V4-Pro / Flash
GLM-5.1
MiniMax-M2.7
DeepSeek-Coder-V2-Lite

Serving + Control Plane

vLLM / SGLang: OpenAI-compatible serving
LiteLLM / gateway: routing, budget, policy
OpenTelemetry: trace, metrics, logs
Agent OS Runtime: event store, contracts, replay

See the AI coding tool selection guide for terminal-based AI CLIs.

Why OpenAI-compatible APIs become the standard interface

vLLM and SGLang both expose OpenAI-compatible /v1/chat/completions (or Responses-style) endpoints, so any OpenAI client can talk to them. Using this interface decouples the harness from any single model provider.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local-dev-token",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[
        {"role": "system", "content": "You are a careful coding agent."},
        {"role": "user", "content": "Add tests for the parser edge cases."},
    ],
    temperature=0.2,
)

The point is the abstraction stack.

Layer	Responsibility
Agent CLI	file I/O, command execution, user approval, workflow
Gateway	model routing, budget, rate limits, audit
Serving	batching, KV cache, speculative decoding, GPU scheduling
Model	token generation, tool-call JSON, reasoning output

Model adapter checklist

OpenAI compatibility is the starting line. To swap models in production, the adapter must absorb the following.

Item	Implementation criterion
Model alias	Use stable course-level names like `local-coder`, `qwen3-coder`, `glm-5.1`
Message template	system/developer/user/tool messages must not collide with the model’s chat template
Tool parser	Per-model parsing for tool-call JSON, XML-like blocks, and plain-text plans
Retry policy	Distinguish timeout, invalid JSON, context overflow, and refusal as separate failure reasons
Token accounting	Connect prompt/completion/cache tokens with run_id
Evaluation hook	Same task packet must be runnable across commercial and local models

Without this checklist, “model interchangeable” is an unimplemented claim, not an operational property.
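The retry-policy row can be sketched as a classifier that maps one failed call to one of the four reasons. Everything here is an assumption for illustration: the marker strings, the `RETRYABLE` table, and the convention that the caller converts its client's timeout exception into the built-in `TimeoutError` before calling `classify`.

```python
import json
from enum import Enum
from typing import Optional


class FailureReason(Enum):
    TIMEOUT = "timeout"
    INVALID_JSON = "invalid_json"
    CONTEXT_OVERFLOW = "context_overflow"
    REFUSAL = "refusal"


# Hypothetical markers; tune them against the servers and models you run.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")
CONTEXT_MARKER = "maximum context length"

# Illustrative policy: retrying cannot fix an oversized prompt or a refusal.
RETRYABLE = {
    FailureReason.TIMEOUT: True,
    FailureReason.INVALID_JSON: True,
    FailureReason.CONTEXT_OVERFLOW: False,
    FailureReason.REFUSAL: False,
}


def classify(error: Optional[Exception], text: Optional[str],
             expect_json: bool = False) -> Optional[FailureReason]:
    """Return the failure reason for one call, or None if it succeeded."""
    if error is not None:
        if isinstance(error, TimeoutError):
            return FailureReason.TIMEOUT
        if CONTEXT_MARKER in str(error).lower():
            return FailureReason.CONTEXT_OVERFLOW
        raise error  # unknown errors are surfaced, not silently retried
    body = (text or "").strip()
    if any(marker in body.lower() for marker in REFUSAL_MARKERS):
        return FailureReason.REFUSAL
    if expect_json:
        try:
            json.loads(body)
        except json.JSONDecodeError:
            return FailureReason.INVALID_JSON
    return None
```

Keeping the reasons separate is what lets the failure-rate table in the lab report say why a model failed, not only how often.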

# adapter.py: a thin layer that unifies alias, tool parsing, and token accounting
from dataclasses import dataclass
from typing import Callable

from openai import OpenAI

@dataclass
class ModelSpec:
    alias: str                                # stable course-level name
    backend_url: str                          # OpenAI-compatible base URL
    backend_model: str                        # name the server expects in model=
    tool_parser: Callable[[str], list[dict]]  # per-model tool-call parser
    chat_template: str | None = None
    api_key: str = "local"                    # set a real key for commercial backends

REGISTRY: dict[str, ModelSpec] = {}

def register(spec: ModelSpec) -> None:
    REGISTRY[spec.alias] = spec

def call(alias: str, messages: list[dict], **kw) -> dict:
    spec = REGISTRY[alias]  # KeyError means the alias was never registered
    client = OpenAI(base_url=spec.backend_url, api_key=spec.api_key)
    resp = client.chat.completions.create(
        model=spec.backend_model, messages=messages, **kw
    )
    # content is None when the server returns only structured tool calls
    text = resp.choices[0].message.content or ""
    tools = spec.tool_parser(text)
    return {
        "alias": alias,
        "text": text,
        "tools": tools,
        # usage may be absent on some responses; keep the key present anyway
        "usage": resp.usage.model_dump() if resp.usage else {},
    }
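A value for the `tool_parser` field might look like the sketch below. It assumes the model wraps each call in `<tool_call>...</tool_call>` tags around a JSON object with `name` and `arguments` keys; that format and the `parse_tool_calls` name are assumptions, so check the model card for the format your model actually emits.

```python
import json
import re
from typing import Optional

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: Optional[str]) -> list:
    """Extract tool calls from model text; malformed blocks are skipped."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text or ""):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            # The harness can count this as an invalid-JSON failure instead.
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


sample = (
    "I will run the tests first.\n"
    '<tool_call>\n{"name": "run_tests", "arguments": {"path": "tests/"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```

Because the parser accepts `None` and returns an empty list when nothing matches, the adapter can call it on every response without special cases.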

Practicum

Deploy open-weight models with vLLM

Verify the environment

nvidia-smi
python --version
uv venv .venv
source .venv/bin/activate
uv pip install vllm openai

Pick a lab model
# Single GPU: educational model
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --max-model-len 32768 \
  --port 8000

# Multi-GPU: tensor parallel across 2 GPUs
vllm serve Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --port 8000

# Multi-GPU: tensor parallel across 4 GPUs, automatic tool choice
# --enable-auto-tool-choice also needs a matching --tool-call-parser value;
# take the parser name from the model card.
vllm serve zai-org/GLM-5.1 \
  --served-model-name glm-5.1 \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --port 8000

# MIG: one server per slice of GPU 0.
# Treat MIG-0/0 and MIG-0/1 as placeholders; use the MIG device UUIDs
# printed by `nvidia-smi -L`.
# GPU 0 slice a: educational model
CUDA_VISIBLE_DEVICES=MIG-0/0 vllm serve \
  deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder --port 8001 &
# GPU 0 slice b: evaluation model
CUDA_VISIBLE_DEVICES=MIG-0/1 vllm serve \
  Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder --port 8002 &

Test the OpenAI-compatible API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-coder",
    "messages": [
      {"role": "user", "content": "Implement a stable topological sort in Python."}
    ],
    "temperature": 0.2
  }'

Run the same five tasks on two models

Each team runs five identical tasks against one commercial API and one local model.

Task	Evaluation criterion
Bug fix	Tests pass
Refactoring	Behavior preserved + readability
Test generation	Coverage and edge cases
Documentation update	Faithful to the change
Add a CLI feature	Requirements met + UX
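One way to keep the comparison fair is to drive both models from the same loop and record every run in the same shape. The sketch below assumes a team-supplied `run_task(alias, task)` callable that returns True when the task's criterion is met; `RunRecord`, `run_packet`, and the task keys are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Task keys mirror the table above; the identifiers are illustrative.
TASKS = [
    "bug_fix",
    "refactoring",
    "test_generation",
    "documentation_update",
    "cli_feature",
]


@dataclass
class RunRecord:
    alias: str
    task: str
    ok: bool
    latency_s: float
    error: Optional[str]


def run_packet(aliases: list, tasks: list,
               run_task: Callable[[str, str], bool]) -> list:
    """Run every task against every alias and record one row per run."""
    records = []
    for alias in aliases:
        for task in tasks:
            start = time.perf_counter()
            try:
                ok, error = bool(run_task(alias, task)), None
            except Exception as exc:
                # A crash counts as a failed run; keep the reason for the report.
                ok, error = False, type(exc).__name__
            elapsed = time.perf_counter() - start
            records.append(RunRecord(alias, task, ok, elapsed, error))
    return records
```

Passing the runner in as an argument keeps the loop identical for the commercial and the local model; only the alias changes.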

Measure cost and throughput

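# --model names the tokenizer source; --served-model-name is the alias the server exposes
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --served-model-name local-coder \
  --num-prompts 100 \
  --request-rate 4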
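For the task-level comparison, the same quantities can be computed from your own run records. This is a minimal sketch that assumes each record is a dict with `latency_s`, `completion_tokens`, and `ok` keys; those key names and the `summarize` function are hypothetical, and the p95 uses the simple nearest-rank method.

```python
import math
import statistics


def summarize(records: list, wall_clock_s: float) -> dict:
    """Aggregate throughput, latency, and failure rate for one model."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(r["latency_s"] for r in records)
    # Only tokens from successful runs count toward throughput.
    tokens = sum(r["completion_tokens"] for r in records if r["ok"])
    failures = sum(1 for r in records if not r["ok"])
    # Nearest-rank percentile: smallest value covering 95% of the runs.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "throughput_tok_s": tokens / wall_clock_s,
        "latency_p50_s": statistics.median(latencies),
        "latency_p95_s": latencies[p95_index],
        "failure_rate": failures / len(records),
    }
```

With only five tasks per model the percentiles are coarse, so report the raw latencies alongside the summary.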

Register a model adapter

Use the adapter.py pattern to register two or three aliases (e.g., local-coder, qwen3-coder, gpt-baseline) and run the same task packet by changing only the alias.

Assignment

Lab 10: vLLM deployment practicum

Due: 2026-05-12 23:59

Requirements:

Logs of a vLLM server running on DGX or local GPU
Three or more OpenAI-compatible API call results
Five identical tasks compared between one commercial model and one open-weight model
Tables for throughput (tokens/sec), latency, failure rate, and cost estimate
Final selection: which model for which task, with rationale
Model adapter code (adapter.py) with at least two registered aliases

Key Takeaways

Open-weight ≠ open-source: only weights are public; training data and code may not be. Read the license first.
Choosing a model is choosing a harness: the same model performs differently across harnesses. Hold the task packet and rubric constant for fair comparison.
Three options coexist: commercial API, cloud agent, and local inference each fit different work types.
vLLM, SGLang, and TGI are complementary: vLLM as default, SGLang for RadixAttention study, TGI for fast PoC.
Cost lives at the break-even line: do not look at token price alone — fold in GPU-hour cost and concurrency.
The model adapter is an operational asset: aliases, tool parsers, token accounting, and retry policies decide whether “swap models” is real.
Reproducible rationale, not model names: the capstone decision document is built from your lab evidence, not vendor benchmarks.