Concepts
Distinguish open-weight from open-source, and place the major 2026 coding LLM families (Qwen3-Coder, DeepSeek-V4, GLM-5.1, MiniMax-M2.7, DeepSeek-Coder-V2-Lite) on the landscape.
Design
Compare commercial API, cloud agent, and local inference along data-boundary, cost, and operational-complexity axes, and map the choices into a decision tree.
Implementation
Bring up an OpenAI-compatible server with vllm serve on single GPU, multi-GPU, and MIG configurations, and unify alias / tool parser / token accounting through a model adapter.
Operations
Run identical tasks on a commercial model and a local model under the same harness, and quantitatively compare throughput, latency, failure rate, and projected cost.
The defining shift of 2026 is that open-weight models have become candidates that can be compared with commercial models inside real agent harnesses. Coding performance is no longer the only axis: tool calling, long context, inference-server compatibility, license, and operating cost are now compared together.
| Model | Release form | Parameters | Context | Strengths | Operational notes |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | Apache 2.0 | 80B MoE, 3B active | 256K | agentic coding, tool calling, local development | More lightweight coding-agent candidate than the larger previous generation |
| DeepSeek-V4-Pro / Flash | Public model card | 1.6T / 284B MoE | 1M | long-context reasoning, coding, agentic tasks | Pro for quality, Flash for efficiency |
| GLM-5.1 | Public weights | public GLM family | long-context agent work | agentic engineering, coding, tool use | GLM-family comparison baseline |
| MiniMax-M2.7 | Public model card | 230B-class MoE | long agentic workflow | coding, search, office work, long-horizon agents | MiniMax-family agentic candidate |
| DeepSeek-Coder-V2-Lite | DeepSeek | 16B / 236B family | 128K | single-GPU labs, code edit/completion | Suitable for teaching labs, but not for a high-performance baseline |
Read operational constraints before benchmark tables. Evaluate candidates in this order.
| Step | Question | Disqualifier |
|---|---|---|
| License | Can it be used in class, research, and commercial PoCs? | Redistribution / service terms unclear |
| Serving support | Is vLLM, SGLang, or Transformers officially supported? | Only community patches, no official examples |
| Tool format | How are tool calls, JSON mode, and reasoning outputs represented? | Parser must be patched ad hoc each time |
| Context policy | Do max context and recommended context differ? | “1M context” claim without GPU budget |
| Failure behavior | How do OOM, invalid JSON, refusal, and timeout surface? | Failures unstructured, retry policy hard to design |
The entries in the table are candidates as of the lecture date. The capstone selection criterion is not the model name but the reproducible result of running the same task packet with the same harness.
Qwen3-Coder-Next is an 80B-total / 3B-active MoE model with 256K context, foregrounding agentic coding and tool calling. It is a smaller and faster coding-agent candidate than its larger predecessors and serves as our Qwen-family baseline.
Operationally, Qwen3-Coder matters for more than raw coding performance: long repository context, its function-calling format, and vLLM/SGLang deployment support together make it a viable candidate for building agentic coding harnesses on local or institutional servers.
DeepSeek-V4-Pro and DeepSeek-V4-Flash are comparison candidates from the DeepSeek V4-family model cards. Both list a 1M context. V4-Pro (1.6T MoE) targets quality and long-context reasoning; V4-Flash (284B MoE) targets efficiency and throughput.
| Family member | Course usage |
|---|---|
| DeepSeek-Coder-V2-Lite | Easy-to-run educational model |
| DeepSeek-V4-Pro | DeepSeek high-capability comparison baseline |
| DeepSeek-V4-Flash | Cost / throughput baseline |
| Earlier DeepSeek generations | Long-context regression baseline |
GLM-5.1 is a public GLM-family candidate for comparing agentic engineering and coding workflows. The lesson is constant: changing the model= value is not sufficient — the harness must understand each model’s tool/reasoning parser and serving options.
MiniMax-M2.7 is a MiniMax-family agentic-workflow candidate, emphasizing search, office work, and long-horizon agentic tasks alongside coding. Provider benchmark numbers are treated as candidate signals, not as final rankings. The Week 10 lab compares them on the same task, harness, budget, and rubric.
Adopting open-weight models is not limited to API cost reduction. There are three real options.
Commercial API
Use Claude, GPT, or Gemini APIs directly. High quality and easy to operate, but data boundary, cost predictability, and rate limits must be managed.
Cloud Agents
Use Codex Web/App, Claude Code on the Web, or GitHub Agent HQ — remote sandbox plus GitHub integration. Strong for asynchronous work and review flow.
Local Inference
Serve OpenAI-compatible APIs through vLLM/SGLang. Strong on data boundaries and predictable GPU cost; operational complexity is higher.
| Criterion | Commercial API | Cloud Agent | Local Inference |
|---|---|---|---|
| Time-to-start | very fast | fast | slow |
| Data control | low–medium | medium | high |
| Long-term cost | usage-driven | seat / usage | GPU fixed cost |
| Access to newly released models | fast | fast | after model release |
| Operational difficulty | low | medium | high |
| Educational value | API design | workflow design | MLOps / infra understanding |
The tree is not gospel — it is the starting point for a written decision. Capture the branch and a one-to-two-sentence rationale in the capstone ADR.
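One way to make the branch explicit is to encode it as a small function that returns both the option and the rationale for the ADR. The sketch below is an assumption, not the lecture's tree: the three inputs and the branch order are hypothetical and are derived only from the criteria in the comparison table.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    # All three fields are hypothetical inputs chosen for illustration.
    data_must_stay_on_site: bool   # code or data may not leave the institution
    has_gpu_and_ops_time: bool     # a GPU server exists and someone can operate it
    wants_async_pr_workflow: bool  # remote sandbox plus GitHub review flow is a priority


def choose_option(c: Constraints) -> tuple[str, str]:
    """Return (option, rationale); the rationale is meant to be pasted into the ADR."""
    if c.data_must_stay_on_site:
        if c.has_gpu_and_ops_time:
            return ("local inference", "strict data boundary and staffed GPU operations")
        return ("unresolved", "strict data boundary but no GPU operations capacity")
    if c.wants_async_pr_workflow:
        return ("cloud agent", "asynchronous work and review flow outweigh data control")
    return ("commercial API", "fastest time-to-start with low operational difficulty")
```

Returning the rationale alongside the choice keeps the written decision and the branch taken in one place.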
Operating an open-weight model needs an inference server. The course uses vLLM by default, but you should be able to compare candidates.
| Criterion | vLLM | SGLang | TGI (Hugging Face) |
|---|---|---|---|
| OpenAI-compatible API | yes | yes | yes |
| Continuous batching | ✓ | ✓ | ✓ |
| Prefix caching | automatic | RadixAttention | partial |
| Speculative decoding | ✓ | ✓ | some models |
| Disaggregated prefill | ✓ | experimental | × |
| Tool / structured output | ✓ (auto-tool, guided JSON) | ✓ (constrained decoding) | ✓ |
| Operational complexity | mid | mid–high | low |
| Community | very active | growing fast | stable |
| Recommended use | default labs / capstone | RadixAttention study | quick PoC |
To compare commercial APIs and local inference on the same axis, account for both token price and GPU amortization.
| Item | Formula |
|---|---|
| Per-call commercial cost | (prompt_tokens × in_price) + (completion_tokens × out_price) |
| GPU hourly cost | rental rate or (purchase / depreciation hours) + power |
| Per-call local cost | GPU hourly cost × (run_latency_s / 3600) / concurrent calls |
| Break-even calls per hour | GPU hourly cost / average per-call API cost |
Example

- Commercial API: $3/M in, $15/M out → ~$0.018 per call (3K in, 600 out)
- DGX H100 ~ $4.5/hr (instructional-rate assumption), 4 concurrent calls, 12 s per run
- Local cost ≈ 4.5 / 4 / (3600/12) ≈ $0.0038 per call
- Break-even: 4.5 / 0.018 = 250, so above ~250 calls per hour, local wins

This table can be used in the capstone slide titled “why we chose local.”
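The three formulas in the table can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, prices are per million tokens, and the 12 s run latency is the value implied by the `3600/12` term in the example.

```python
def api_cost_per_call(prompt_tokens: int, completion_tokens: int,
                      in_price_per_m: float, out_price_per_m: float) -> float:
    """Per-call commercial cost; prices are USD per million tokens."""
    return (prompt_tokens * in_price_per_m
            + completion_tokens * out_price_per_m) / 1_000_000


def local_cost_per_call(gpu_hourly_cost: float, run_latency_s: float,
                        concurrent_calls: int) -> float:
    """Per-call local cost: the GPU-hour share used by one run, split across concurrent calls."""
    return gpu_hourly_cost * (run_latency_s / 3600) / concurrent_calls


def break_even_calls_per_hour(gpu_hourly_cost: float, api_cost: float) -> float:
    """Calls per hour above which the fixed GPU cost is cheaper than paying per call."""
    return gpu_hourly_cost / api_cost


# Inputs from the example: 3K in, 600 out, $3/M in, $15/M out, $4.5/hr, 4 concurrent, 12 s.
api = api_cost_per_call(3_000, 600, 3.0, 15.0)
local = local_cost_per_call(4.5, 12.0, 4)
print(f"api=${api:.4f} local=${local:.4f} "
      f"break-even={break_even_calls_per_hour(4.5, api):.0f} calls/hr")
```

Keeping the inputs as parameters makes it easy to rerun the comparison with each team's measured latency and token counts instead of the instructional assumptions.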
See the AI coding tool selection guide for terminal-based AI CLIs.
vLLM and SGLang both expose OpenAI-compatible /v1/chat/completions (or Responses-style) endpoints, so a standard OpenAI client can talk to them. Using this interface decouples the harness from any single model provider.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="local-dev-token",
    )

    response = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-Next",
        messages=[
            {"role": "system", "content": "You are a careful coding agent."},
            {"role": "user", "content": "Add tests for the parser edge cases."},
        ],
        temperature=0.2,
    )

The point is the abstraction stack.
| Layer | Responsibility |
|---|---|
| Agent CLI | file I/O, command execution, user approval, workflow |
| Gateway | model routing, budget, rate limits, audit |
| Serving | batching, KV cache, speculative decoding, GPU scheduling |
| Model | token generation, tool-call JSON, reasoning output |
OpenAI compatibility is the starting line. To swap models in production, the adapter must absorb the following.
| Item | Implementation criterion |
|---|---|
| Model alias | Use stable course-level names like local-coder, qwen3-coder, glm-5.1 |
| Message template | system/developer/user/tool messages must not collide with the model’s chat template |
| Tool parser | Per-model parsing for tool-call JSON, XML-like blocks, and plain-text plans |
| Retry policy | Distinguish timeout, invalid JSON, context overflow, and refusal as separate failure reasons |
| Token accounting | Connect prompt/completion/cache tokens with run_id |
| Evaluation hook | Same task packet must be runnable across commercial and local models |
Without this checklist, “model interchangeable” is an unimplemented claim, not an operational property.
    # adapter.py: a thin layer that unifies alias, tool parsing, and token accounting
    from dataclasses import dataclass
    from typing import Callable

    from openai import OpenAI


    @dataclass
    class ModelSpec:
        alias: str          # stable course-level name, e.g. "local-coder"
        backend_url: str    # OpenAI-compatible base URL of the serving backend
        backend_model: str  # model name the backend expects
        tool_parser: Callable[[str], list[dict]]  # per-model tool-call parser
        chat_template: str | None = None


    REGISTRY: dict[str, ModelSpec] = {}


    def register(spec: ModelSpec) -> None:
        REGISTRY[spec.alias] = spec


    def call(alias: str, messages: list[dict], **kw) -> dict:
        spec = REGISTRY[alias]
        client = OpenAI(base_url=spec.backend_url, api_key="local")
        resp = client.chat.completions.create(
            model=spec.backend_model, messages=messages, **kw
        )
        # content may be None (for example, when the reply carries no plain text)
        text = resp.choices[0].message.content or ""
        tools = spec.tool_parser(text)
        # usage is optional in the response type, so guard before dumping it
        usage = resp.usage.model_dump() if resp.usage else {}
        return {"alias": alias, "text": text, "tools": tools, "usage": usage}

Verify the environment
    nvidia-smi
    python --version
    uv venv .venv
    source .venv/bin/activate
    uv pip install vllm openai

Pick a lab model
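    # The first three commands are alternatives: each binds port 8000, so run one at a time.

    # Single GPU: educational model
    vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
      --served-model-name local-coder \
      --max-model-len 32768 \
      --port 8000

    # Multi-GPU: tensor parallelism across 2 GPUs
    vllm serve Qwen/Qwen3-Coder-Next \
      --served-model-name qwen3-coder \
      --tensor-parallel-size 2 \
      --max-model-len 65536 \
      --port 8000

    # Multi-GPU: tensor parallelism across 4 GPUs with automatic tool choice.
    # vLLM also expects a --tool-call-parser value with this flag; take the parser name from the model card.
    vllm serve zai-org/GLM-5.1 \
      --served-model-name glm-5.1 \
      --tensor-parallel-size 4 \
      --enable-auto-tool-choice \
      --port 8000

    # MIG: two slices of GPU 0 served side by side on separate ports.
    # If MIG-0/0 and MIG-0/1 are not accepted, use the MIG device UUIDs listed by `nvidia-smi -L`.

    # GPU 0 slice a: educational model
    CUDA_VISIBLE_DEVICES=MIG-0/0 vllm serve \
      deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
      --served-model-name local-coder --port 8001 &

    # GPU 0 slice b: evaluation model
    CUDA_VISIBLE_DEVICES=MIG-0/1 vllm serve \
      Qwen/Qwen3-Coder-Next \
      --served-model-name qwen3-coder --port 8002 &

Test the OpenAI-compatible API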
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "local-coder",
        "messages": [
          {"role": "user", "content": "Implement a stable topological sort in Python."}
        ],
        "temperature": 0.2
      }'

Run the same five tasks on two models
Each team runs five identical tasks against one commercial API and one local model.
| Task | Evaluation criterion |
|---|---|
| Bug fix | Tests pass |
| Refactoring | Behavior preserved + readability |
| Test generation | Coverage and edge cases |
| Documentation update | Faithful to the change |
| Add a CLI feature | Requirements met + UX |
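To compare the two models quantitatively, each run can be stored as one record and summarized per alias. The sketch below is one possible shape, not a required format: `RunResult`, its fields, and `summarize` are hypothetical names, and throughput is approximated as completion tokens divided by summed run latency.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    alias: str              # model alias, e.g. "local-coder"
    task: str               # one of the five task names
    passed: bool            # did the run meet the task's evaluation criterion?
    latency_s: float        # wall-clock time for the run
    completion_tokens: int  # tokens generated by the model
    cost_usd: float         # API cost or projected local cost for the run


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Aggregate failure rate, latency, throughput, and cost per alias."""
    summary: dict[str, dict[str, float]] = {}
    for alias in sorted({r.alias for r in results}):
        runs = [r for r in results if r.alias == alias]
        total_latency = sum(r.latency_s for r in runs)
        total_tokens = sum(r.completion_tokens for r in runs)
        summary[alias] = {
            "runs": len(runs),
            "failure_rate": sum(not r.passed for r in runs) / len(runs),
            "mean_latency_s": mean(r.latency_s for r in runs),
            "tokens_per_s": total_tokens / total_latency if total_latency else 0.0,
            "total_cost_usd": sum(r.cost_usd for r in runs),
        }
    return summary
```

Recording the same fields for the commercial and the local model keeps the comparison on one axis, whichever harness produced the runs.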
Measure cost and throughput
    python -m vllm.benchmarks.benchmark_serving \
      --backend openai-chat \
      --base-url http://localhost:8000 \
      --model local-coder \
      --num-prompts 100 \
      --request-rate 4

Register a model adapter
Use the adapter.py pattern to register two or three aliases (e.g., local-coder, qwen3-coder, gpt-baseline) and run the same task packet by changing only the alias.
Due: 2026-05-12 23:59
Requirements:
adapter.py) with at least two registered aliases

Model cards
Serving infrastructure
Agent tools