Week 10: Open-Weight Coding LLMs and Local Deployment

Phase 4 · Week 10 · Advanced · Lecture: 2026-05-05

Concepts

Distinguish open-weight from open-source, and place the major 2026 coding LLM families (Qwen3-Coder, DeepSeek-V4, GLM-5.1, MiniMax-M2.7, DeepSeek-Coder-V2-Lite) on the landscape.

Design

Compare commercial API, cloud agent, and local inference along data-boundary, cost, and operational-complexity axes, and map the choices into a decision tree.

Implementation

Bring up an OpenAI-compatible server with vllm serve on single GPU, multi-GPU, and MIG configurations, and unify alias / tool parser / token accounting through a model adapter.

Operations

Run identical tasks on a commercial model and a local model under the same harness, and quantitatively compare throughput, latency, failure rate, and projected cost.


The defining shift of 2026 is that open-weight models have become candidates that can be compared with commercial models inside real agent harnesses. Coding performance is no longer the only axis: tool calling, long context, inference-server compatibility, license, and operating cost are now compared together.

Open-weight Coding LLM (2026)
  • Qwen: Qwen3-Coder-Next (80B MoE · 3B active · 256K), Qwen3-Coder-30B-A3B
  • DeepSeek: V4-Pro (1.6T MoE · 1M ctx), V4-Flash (284B MoE), Coder-V2-Lite (16B)
  • GLM: GLM-5.1 (agentic engineering)
  • MiniMax: MiniMax-M2.7 (agent + office)

| Model | Release form | Parameters | Context | Strengths | Operational notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3-Coder-Next | Apache 2.0 | 80B MoE, 3B active | 256K | agentic coding, tool calling, local development | More lightweight coding-agent candidate than the larger previous generation |
| DeepSeek-V4-Pro / Flash | Public model card | 1.6T / 284B MoE | 1M | long-context reasoning, coding, agentic tasks | Pro for quality, Flash for efficiency |
| GLM-5.1 | Public weights | public GLM family | long-context agent work | agentic engineering, coding, tool use | GLM-family comparison baseline |
| MiniMax-M2.7 | Public model card | 230B-class MoE | long agentic workflow | coding, search, office work, long-horizon agents | MiniMax-family agentic candidate |
| DeepSeek-Coder-V2-Lite | DeepSeek | 16B / 236B family | 128K | single-GPU labs, code edit/completion | Suitable for teaching labs, but not for a high-performance baseline |

Read operational constraints before benchmark tables. Evaluate candidates in this order.

| Step | Question | Disqualifier |
| --- | --- | --- |
| License | Can it be used in class, research, and commercial PoCs? | Redistribution / service terms unclear |
| Serving support | Is vLLM, SGLang, or Transformers officially supported? | Only community patches, no official examples |
| Tool format | How are tool calls, JSON mode, and reasoning outputs represented? | Parser must be patched ad hoc each time |
| Context policy | Do max context and recommended context differ? | “1M context” claim without GPU budget |
| Failure behavior | How do OOM, invalid JSON, refusal, and timeout surface? | Failures unstructured, retry policy hard to design |

The entries in the table are candidates as of the lecture date. The capstone selection criterion is not the model name but the reproducible result of running the same task packet with the same harness.

Qwen3-Coder-Next is an 80B-total / 3B-active MoE model with 256K context, foregrounding agentic coding and tool calling. It is a smaller and faster coding-agent candidate than its larger predecessors and serves as our Qwen-family baseline.

Operationally, Qwen3-Coder matters for more than raw coding performance: long repository context, a defined function-calling format, and vLLM/SGLang deployment support together make it a viable candidate for agentic coding harnesses on local or institutional servers.

DeepSeek-V4-Pro and DeepSeek-V4-Flash are comparison candidates from the DeepSeek V4-family model cards, both listed with 1M context. V4-Pro (1.6T MoE) targets quality and long-context reasoning; V4-Flash (284B MoE) targets efficiency and throughput.

| Family member | Course usage |
| --- | --- |
| DeepSeek-Coder-V2-Lite | Easy-to-run educational model |
| DeepSeek-V4-Pro | DeepSeek high-capability comparison baseline |
| DeepSeek-V4-Flash | Cost / throughput baseline |
| Earlier DeepSeek generations | Long-context regression baseline |

GLM-5.1 is a public GLM-family candidate for comparing agentic engineering and coding workflows. The lesson is constant: changing the model= value is not sufficient — the harness must understand each model’s tool/reasoning parser and serving options.

MiniMax-M2.7 is a MiniMax-family agentic-workflow candidate, emphasizing search, office work, and long-horizon agentic tasks alongside coding. Provider benchmark numbers are treated as candidate signals, not as final rankings. The Week 10 lab compares them on the same task, harness, budget, and rubric.

Adopting open-weight models is not only about cutting API cost. There are three real deployment options.

Commercial API

Use Claude, GPT, or Gemini APIs directly. High quality and easy to operate, but data boundary, cost predictability, and rate limits must be managed.

Cloud Agents

Use Codex Web/App, Claude Code on the Web, or GitHub Agent HQ — remote sandbox plus GitHub integration. Strong for asynchronous work and review flow.

Local Inference

Serve OpenAI-compatible APIs through vLLM/SGLang. Strong on data boundaries and predictable GPU cost; operational complexity is higher.

| Criterion | Commercial API | Cloud Agent | Local Inference |
| --- | --- | --- | --- |
| Time-to-start | very fast | fast | slow |
| Data control | low–medium | medium | high |
| Long-term cost | usage-driven | seat / usage | GPU fixed cost |
| Access to newly released models | fast | fast | after model release |
| Operational difficulty | low | medium | high |
| Educational value | API design | workflow design | MLOps / infra understanding |

Model Selection Decision Tree
Q1. Can data leave our boundary?
  • No → Local inference (vLLM/SGLang); continue at Q4
  • Yes → Q2
Q2. Need top-tier model quality?
  • No → Commercial API + cache + budget gate
  • Yes → Q3
Q3. Interactive IDE or async PR?
  • Interactive → Commercial API + Claude Code / Codex CLI
  • Async PR → Cloud agent (GitHub Agent HQ)
Q4. (Local path) Is one model enough?
  • Yes → Single model + admission control
  • No → Routing gateway

The tree is not gospel — it is the starting point for a written decision. Capture the branch and a one-to-two-sentence rationale in the capstone ADR.
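
One way to make the branch explicit in an ADR is to encode the tree as a function and record its inputs next to the rationale. The sketch below follows one reading of the diagram (Q1 "No" leads to the local path and Q4; Q2 "Yes" leads to Q3); the function name, parameter names, and returned strings are hypothetical.

```python
def choose_deployment(
    data_can_leave: bool,
    need_top_quality: bool = False,
    interactive: bool = True,
    one_model_enough: bool = True,
) -> str:
    """Walk the decision tree and return the recommended starting point."""
    # Q1: data that cannot leave the boundary forces the local path, then Q4.
    if not data_can_leave:
        if one_model_enough:
            return "local inference: single model + admission control"
        return "local inference: routing gateway"
    # Q2: without a top-tier quality requirement, a gated commercial API suffices.
    if not need_top_quality:
        return "commercial API + cache + budget gate"
    # Q3: top-tier quality needed; split by working style.
    if interactive:
        return "commercial API + Claude Code / Codex CLI"
    return "cloud agent (GitHub Agent HQ)"


print(choose_deployment(data_can_leave=False, one_model_enough=False))
```

The returned string is only the branch; the one-to-two-sentence rationale still has to be written by the team.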

Operating an open-weight model needs an inference server. The course uses vLLM by default, but you should be able to compare candidates.

| Criterion | vLLM | SGLang | TGI (Hugging Face) |
| --- | --- | --- | --- |
| OpenAI-compatible API | yes | yes | yes |
| Continuous batching | ✓ | ✓ | ✓ |
| Prefix caching | automatic | RadixAttention | partial |
| Speculative decoding | ✓ | ✓ | some models |
| Disaggregated prefill | experimental | ✓ | × |
| Tool / structured output | ✓ (auto-tool, guided JSON) | ✓ (constrained decoding) | ✓ |
| Operational complexity | mid | mid–high | low |
| Community | very active | growing fast | stable |
| Recommended use | default labs / capstone | RadixAttention study | quick PoC |

To compare commercial APIs and local inference on the same axis, account for both token price and GPU amortization.

| Item | Formula |
| --- | --- |
| Per-call commercial cost | (prompt_tokens × in_price) + (completion_tokens × out_price) |
| GPU hourly cost | rental rate or (purchase / depreciation hours) + power |
| Per-call local cost | GPU hourly cost × (run_latency_s / 3600) / concurrent calls |
| Break-even calls | GPU hourly cost / average per-call API cost |

Example
- Commercial API example: $3/M in, $15/M out → ~$0.018 per call (3K in, 600 out: $0.009 + $0.009)
- DGX H100 ~ $4.5/hr (instructional-rate assumption), 4 concurrent calls, 12 s per run
- Local cost ≈ 4.5 / 4 / (3600/12) ≈ $0.0038 per call
- Break-even: 4.5 / 0.018 = 250, so above ~250 calls per hour, local wins

This table can be used in the capstone slide titled “why we chose local.”
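
The formulas in the cost table translate directly into a few functions, which makes it easy to re-run the comparison with your own measured latency and token counts. This is a sketch under the example's stated assumptions ($3/M in, $15/M out, 3K in / 600 out, $4.5/hr, 4 concurrent calls, and the 12-second run implied by the 3600/12 term); the function names are hypothetical.

```python
def api_cost_per_call(prompt_tokens: int, completion_tokens: int,
                      in_price_per_m: float, out_price_per_m: float) -> float:
    """Per-call commercial cost; prices are given in dollars per million tokens."""
    return (prompt_tokens * in_price_per_m + completion_tokens * out_price_per_m) / 1_000_000


def local_cost_per_call(gpu_hourly: float, run_latency_s: float, concurrent_calls: int) -> float:
    """GPU hourly cost x (run_latency_s / 3600) / concurrent calls."""
    return gpu_hourly * (run_latency_s / 3600) / concurrent_calls


def break_even_calls_per_hour(gpu_hourly: float, api_cost: float) -> float:
    """Hourly call volume above which the fixed GPU cost beats per-call API pricing."""
    return gpu_hourly / api_cost


def local_capacity_per_hour(run_latency_s: float, concurrent_calls: int) -> float:
    """Upper bound on calls per hour that one server can complete at this latency."""
    return concurrent_calls * 3600 / run_latency_s


api = api_cost_per_call(3_000, 600, in_price_per_m=3.0, out_price_per_m=15.0)
local = local_cost_per_call(gpu_hourly=4.5, run_latency_s=12, concurrent_calls=4)
print(f"API ${api:.4f}/call, local ${local:.4f}/call")
print(f"break-even {break_even_calls_per_hour(4.5, api):.0f} calls/h, "
      f"capacity {local_capacity_per_hour(12, 4):.0f} calls/h")
```

The capacity check is worth keeping next to the break-even number: local only wins if the break-even volume is something the server can actually sustain at the measured latency and concurrency.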

AGENTIC CODING TOOL ECOSYSTEM (2026)
Agent Harness
  • Claude Code: subagents, hooks, skills, MCP
  • Codex CLI/App/Web: sandbox, AGENTS.md, MCP, subagents
  • Gemini CLI: 1M context, Google Search grounding, MCP
  • GitHub Agent HQ: choose Claude or Codex inside GitHub
Open-Weight Models
  • Qwen3-Coder-Next
  • DeepSeek-V4-Pro / Flash
  • GLM-5.1
  • MiniMax-M2.7
  • DeepSeek-Coder-V2-Lite
Serving + Control Plane
  • vLLM / SGLang: OpenAI-compatible serving
  • LiteLLM / gateway: routing, budget, policy
  • OpenTelemetry: trace, metrics, logs
  • Agent OS Runtime: event store, contracts, replay

See the AI coding tool selection guide for terminal-based AI CLIs.

Why OpenAI-compatible APIs become the standard interface

vLLM and SGLang both expose OpenAI-compatible /v1/chat/completions (or Responses-style) endpoints, so a standard OpenAI client can talk to either. Using this interface decouples the harness from any single model provider.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local-dev-token",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[
        {"role": "system", "content": "You are a careful coding agent."},
        {"role": "user", "content": "Add tests for the parser edge cases."},
    ],
    temperature=0.2,
)

The point is the abstraction stack.

| Layer | Responsibility |
| --- | --- |
| Agent CLI | file I/O, command execution, user approval, workflow |
| Gateway | model routing, budget, rate limits, audit |
| Serving | batching, KV cache, speculative decoding, GPU scheduling |
| Model | token generation, tool-call JSON, reasoning output |
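
To make the gateway layer's "budget" and "audit" responsibilities concrete, here is a minimal in-process sketch of a budget gate. It is not how LiteLLM or any particular gateway implements budgets; the `BudgetGate` class and its fields are hypothetical, and a real gateway would persist this state and handle concurrency.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGate:
    limit_usd: float
    spent_usd: float = 0.0
    # Every admission decision is kept, allowed or not, for later audit.
    audit_log: list[dict] = field(default_factory=list)

    def admit(self, run_id: str, alias: str, estimated_cost_usd: float) -> bool:
        """Allow the call only if the estimate still fits under the limit."""
        allowed = self.spent_usd + estimated_cost_usd <= self.limit_usd
        self.audit_log.append(
            {"run_id": run_id, "alias": alias,
             "estimated_cost_usd": estimated_cost_usd, "allowed": allowed}
        )
        return allowed

    def record(self, actual_cost_usd: float) -> None:
        """Charge the actual cost once the call has finished."""
        self.spent_usd += actual_cost_usd


gate = BudgetGate(limit_usd=0.05)
for run_id in ("run-1", "run-2", "run-3"):
    if gate.admit(run_id, "local-coder", estimated_cost_usd=0.018):
        gate.record(0.018)
print(f"spent ${gate.spent_usd:.3f}, decisions logged: {len(gate.audit_log)}")
```

Separating `admit` (estimate, before the call) from `record` (actual, after the call) keeps the gate usable even when the final token count is unknown up front.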

OpenAI compatibility is the starting line. To swap models in production, the adapter must absorb the following.

| Item | Implementation criterion |
| --- | --- |
| Model alias | Use stable course-level names like local-coder, qwen3-coder, glm-5.1 |
| Message template | system/developer/user/tool messages must not collide with the model’s chat template |
| Tool parser | Per-model parsing for tool-call JSON, XML-like blocks, and plain-text plans |
| Retry policy | Distinguish timeout, invalid JSON, context overflow, and refusal as separate failure reasons |
| Token accounting | Connect prompt/completion/cache tokens with run_id |
| Evaluation hook | Same task packet must be runnable across commercial and local models |

Without this checklist, “model interchangeable” is an unimplemented claim, not an operational property.

# adapter.py: a thin layer that unifies alias, tool parsing, and token accounting
from dataclasses import dataclass
from typing import Callable

from openai import OpenAI


@dataclass
class ModelSpec:
    alias: str          # stable course-level name, e.g. "local-coder"
    backend_url: str    # OpenAI-compatible base URL, e.g. "http://localhost:8000/v1"
    backend_model: str  # model name the backend actually serves
    tool_parser: Callable[[str], list[dict]]  # model-specific tool-call parser
    chat_template: str | None = None  # optional override; not used by call() below


REGISTRY: dict[str, ModelSpec] = {}


def register(spec: ModelSpec) -> None:
    REGISTRY[spec.alias] = spec


def call(alias: str, messages: list[dict], **kw) -> dict:
    spec = REGISTRY[alias]
    client = OpenAI(base_url=spec.backend_url, api_key="local")
    resp = client.chat.completions.create(
        model=spec.backend_model, messages=messages, **kw
    )
    # content is None when the server returns only structured tool calls
    text = resp.choices[0].message.content or ""
    tools = spec.tool_parser(text)
    return {
        "alias": alias,
        "text": text,
        "tools": tools,
        # usage can be missing on some backends; report an empty dict in that case
        "usage": resp.usage.model_dump() if resp.usage else {},
    }

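
The `tool_parser` field expects a function from response text to a list of tool calls. As an illustration, the sketch below parses XML-like blocks that wrap a JSON object. The `<tool_call>` tag and the `name` / `arguments` keys are assumptions about one possible output format, not the documented format of any model listed in this lecture; check each model card for the real one.

```python
import json
import re

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_BLOCK = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tagged_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls; silently skip blocks that are not valid JSON."""
    calls = []
    for body in TOOL_CALL_BLOCK.findall(text or ""):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed block: the harness can count these separately
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def parse_no_tools(text: str) -> list[dict]:
    """Parser for models or baselines that are used without tool calling."""
    return []


sample = 'Running the tests first.\n<tool_call>{"name": "run_tests", "arguments": {"path": "tests/"}}</tool_call>'
print(parse_tagged_tool_calls(sample))
```

A parser like this would be passed as `tool_parser=parse_tagged_tool_calls` when constructing a `ModelSpec`, so that swapping the alias also swaps the parsing rule.
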
  1. Verify the environment

    nvidia-smi
    python --version
    uv venv .venv
    source .venv/bin/activate
    uv pip install vllm openai
  2. Pick a lab model

    vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
      --served-model-name local-coder \
      --max-model-len 32768 \
      --port 8000
  3. Test the OpenAI-compatible API

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "local-coder",
        "messages": [
          {"role": "user", "content": "Implement a stable topological sort in Python."}
        ],
        "temperature": 0.2
      }'
  4. Run the same five tasks on two models

    Each team runs five identical tasks against one commercial API and one local model.

    | Task | Evaluation criterion |
    | --- | --- |
    | Bug fix | Tests pass |
    | Refactoring | Behavior preserved + readability |
    | Test generation | Coverage and edge cases |
    | Documentation update | Faithful to the change |
    | Add a CLI feature | Requirements met + UX |
  5. Measure cost and throughput

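    # Recent vLLM releases ship the serving benchmark as the `vllm bench serve` subcommand.
    # --model names the tokenizer source; --served-model-name is the alias the server exposes.
    vllm bench serve \
      --backend openai-chat \
      --base-url http://localhost:8000 \
      --endpoint /v1/chat/completions \
      --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
      --served-model-name local-coder \
      --num-prompts 100 \
      --request-rate 4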
  6. Register a model adapter

    Use the adapter.py pattern to register two or three aliases (e.g., local-coder, qwen3-coder, gpt-baseline) and run the same task packet by changing only the alias.
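
Steps 4 to 6 can be tied together with a small runner that sends the same task packet to each alias and summarizes latency, failure rate, and completion throughput. This is a sketch, not the course's reference harness: `run_packet` and the report keys are hypothetical, and the `call` argument is assumed to have the shape of `call(alias, messages)` from adapter.py, returning a dict with a `usage` entry.

```python
import statistics
import time
from typing import Callable


def run_packet(call: Callable[[str, list[dict]], dict],
               aliases: list[str], tasks: list[str]) -> dict[str, dict]:
    """Run every task against every alias and collect comparable metrics."""
    if not tasks:
        raise ValueError("task packet is empty")
    report: dict[str, dict] = {}
    for alias in aliases:
        latencies: list[float] = []
        failures = 0
        completion_tokens = 0
        for task in tasks:
            start = time.perf_counter()
            try:
                result = call(alias, [{"role": "user", "content": task}])
            except Exception:
                failures += 1  # any exception counts as one failed run
                continue
            latencies.append(time.perf_counter() - start)
            completion_tokens += (result.get("usage") or {}).get("completion_tokens") or 0
        busy_s = sum(latencies)
        report[alias] = {
            "runs": len(tasks),
            "failure_rate": failures / len(tasks),
            "median_latency_s": statistics.median(latencies) if latencies else None,
            # Sequential throughput over successful runs only.
            "completion_tokens_per_s": completion_tokens / busy_s if busy_s > 0 else None,
        }
    return report


def fake_call(alias: str, messages: list[dict]) -> dict:
    """Stand-in for adapter.call so the sketch runs without a server."""
    if alias == "flaky" and "refactor" in messages[0]["content"]:
        raise RuntimeError("simulated failure")
    return {"alias": alias, "text": "ok", "tools": [], "usage": {"completion_tokens": 10}}


packet = ["fix the bug", "refactor the parser", "write tests", "update docs", "add a CLI flag"]
summary = run_packet(fake_call, ["steady", "flaky"], packet)
print(summary)
```

Because the runner takes `call` as a parameter, the same packet can be run against the real adapter and against a fake, which keeps the comparison logic testable without a GPU.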

Due: 2026-05-12 23:59

Requirements:

  1. Logs of a vLLM server running on DGX or local GPU
  2. Three or more OpenAI-compatible API call results
  3. Five identical tasks compared between one commercial model and one open-weight model
  4. Tables for throughput (tokens/sec), latency, failure rate, and cost estimate
  5. Final selection: which model for which task, with rationale
  6. Model adapter code (adapter.py) with at least two registered aliases

Key takeaways

  1. Open-weight ≠ open-source: only weights are public; training data and code may not be. Read the license first.
  2. Choosing a model is choosing a harness: the same model performs differently across harnesses. Hold the task packet and rubric constant for fair comparison.
  3. Three options coexist: commercial API, cloud agent, and local inference each fit different work types.
  4. vLLM, SGLang, and TGI are complementary: vLLM as default, SGLang for RadixAttention study, TGI for fast PoC.
  5. Cost lives at the break-even line: do not look at token price alone — fold in GPU-hour cost and concurrency.
  6. The model adapter is an operational asset: aliases, tool parsers, token accounting, and retry policies decide whether “swap models” is real.
  7. Reproducible rationale, not model names: the capstone decision document is built from your lab evidence, not vendor benchmarks.
