
Week 10: Open-Source Coding LLMs and Local Deployment

Phase 4 · Week 10 · Advanced · Lecture: 2026-05-05

Through 2025–2026, open-source coding models reached a level that matches or, on certain benchmarks, surpasses commercial models like Claude and GPT-4o. The MoE (Mixture-of-Experts) architecture has become mainstream, enabling efficient inference by activating only a small fraction of parameters relative to the total.
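
The efficiency claim can be made concrete with a toy sketch of top-k expert routing, the core mechanism of an MoE layer. Everything here (dimensions, the one-layer tanh "experts", the random gate) is invented for illustration; real models use learned routers with load-balancing losses.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy MoE layer: route a token through only its top-k experts."""
    logits = x @ gate_w                      # router score for each expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best-scoring experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                 # softmax over the chosen experts only
    # Only k of len(experts) networks run, so active params << total params.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
# Each "expert" is a tiny one-layer feed-forward network.
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: np.tanh(x @ W) for W in expert_weights]
gate_w = rng.normal(size=(d, n_experts))

out = moe_forward(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (16,)
```

With k=2 of 8 experts active, only a quarter of the expert parameters participate in each token's forward pass, which is the same ratio game the 235B/22B and 685B/37B models in the table below play at scale.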

| Model | Total Params | Active Params | Context | VRAM Requirement | License |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 | 31B (Dense) | Full | 256K | ~20GB (Q4) | Apache 2.0 |
| GLM-5.1 | Undisclosed | Undisclosed | 198K | Cloud-only | MIT |
| Qwen3-Coder | 235B (MoE) | 22B | 128K | ~48GB (Q4) | Apache 2.0 |
| DeepSeek V3 | 685B (MoE) | 37B | 128K | 350GB+ (Q4, multi-GPU) | DeepSeek |
| GLM-4.7 | ~32B (Dense) | Full | 128K | ~24GB (Q4) | Apache 2.0 |
| MiniMax M2.1 | 230B (MoE) | 10B | 128K | 80GB+ | Apache 2.0 |
| DeepSeek-Coder-V2 | 236B (MoE) | 21B | 128K | ~48GB (Q4) | DeepSeek |
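
The Q4 figures in the table roughly follow a back-of-the-envelope rule: about half a byte per weight, plus headroom for the KV cache and runtime buffers. The 25% overhead factor below is an assumption for illustration; actual usage depends on context length and the inference engine.

```python
def vram_estimate_gb(total_params_billion, bits_per_weight=4, overhead=1.25):
    """Rough VRAM estimate for a quantized model.

    overhead=1.25 assumes ~25% extra for KV cache and buffers
    (an illustrative guess, not a measured figure).
    """
    weight_gb = total_params_billion * bits_per_weight / 8  # 1B params at 4-bit = 0.5 GB
    return weight_gb * overhead

print(round(vram_estimate_gb(31), 1))  # Gemma 4 31B at Q4 -> 19.4 GB (~20GB in the table)
print(round(vram_estimate_gb(32), 1))  # GLM-4.7 ~32B at Q4 -> 20.0 GB
```

Note that for MoE models the *full* parameter count must fit in memory even though only the active fraction computes, which is why a 230B MoE still needs 80GB+ despite 10B active parameters.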

Qwen3-Coder

The latest MoE model, fine-tuned specifically for coding and agentic tasks. It delivers performance approaching Claude Sonnet 4 on SWE-bench Verified and provides the strongest coding performance among 32B-class models. Smaller 14B/8B variants deliver practical performance on a single GPU.

Gemma 4

A 31B dense model built on the same research behind Gemini 3, with a 256K-token context window large enough to hold an entire codebase in a single prompt. It achieves 80% on LiveCodeBench v6, the top score among open-source coding models, and its 89.2% on AIME 2026, 2150 Codeforces Elo, and 85.2% on MMLU Pro demonstrate strong reasoning capabilities.

Gemma 4 supports native function calling at the model level, simplifying integration with agentic tool-calling pipelines. Lighter variants include a 26B MoE model and E2B/E4B edge models (128K context). It can be deployed via Ollama on the cloud (NVIDIA Blackwell GPUs) or locally. Licensed under Apache 2.0 with no restrictions on commercial use or fine-tuning.
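
Native function calling means the model emits structured tool calls in the standard OpenAI `tools` schema rather than free text. The `run_tests` tool below is hypothetical, and the assistant message is a canned example of the shape an OpenAI-compatible server (vLLM or Ollama) would return; it is not live output from Gemma 4.

```python
import json

# Hypothetical tool definition in the standard OpenAI "tools" format,
# the shape function-calling models are trained to target.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the results",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# Instead of prose, a function-calling model answers with a structured
# tool call. A canned assistant message, as an OpenAI-compatible server
# would serialize it:
message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "run_tests",
                     "arguments": json.dumps({"path": "src/"})},
    }],
}

for call in message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])
    print(call["function"]["name"], args["path"])  # run_tests src/
```

An agentic pipeline executes the named tool with the parsed arguments and feeds the result back as a `tool` role message, which is what "simplifying integration with agentic tool-calling pipelines" amounts to in practice.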

GLM-5.1

The successor to GLM-4.7, specifically designed for long-horizon agentic tasks. MIT licensed with 198K context support. It scores 58.4 on SWE-Bench Pro, surpassing GPT-5.4 (57.7) and Opus 4.6 (57.3), and achieves 69.0 on Terminal Bench 2.0 and 68.7 on cybersecurity benchmarks, both top-tier results.

Its key differentiator is long-horizon execution capability. It can perform 600+ iterations and 6,000+ tool calls in a single session, with performance improving as runtime increases. Currently available via Ollama Cloud and the Z.AI API.

DeepSeek V3

685B MoE, top-tier in math, reasoning, and coding. However, even quantized it requires 350GB+ of VRAM, making an 8×H100-class cluster essentially mandatory. The strongest open model if you have datacenter-scale infrastructure.

GLM-4.7

Dense model with ~32B parameters, runnable on a single 48GB GPU. Its Interleaved Thinking feature delivers high reasoning quality, and it is evaluated at Claude-level on coding benchmarks. Weights are publicly available on HuggingFace/ModelScope. Its successor GLM-5.1 shows significant performance gains on long-horizon agentic tasks (see the GLM-5.1 section above).

MiniMax M2.1

230B MoE, 10B active. Specifically designed for coding agents and tool use. Fully open weights. While the active parameter count is small thanks to MoE, loading the full model requires 80GB+ of VRAM.

Recommended models by deployment environment:

| Environment | GPU VRAM | Recommended Models |
| --- | --- | --- |
| Personal PC (RTX 4090) | 24GB | Gemma 4 E4B, Qwen3-Coder 14B/8B, GLM-4.7 (Q4) |
| Workstation (A6000/H100 ×1) | 48–80GB | Gemma 4 31B, Qwen3-Coder 32B, GLM-4.7 (FP16) |
| DGX H100 (MIG 2–4 slices) | 160–320GB | DeepSeek-Coder-V2, MiniMax M2.1 |
| DGX H100 (full 8 GPUs) | 640GB | DeepSeek V3 (Q4) |
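
The mapping in the table reduces to a VRAM threshold lookup. The thresholds and labels below are taken from the rows above; the sub-24GB fallback assumes the E2B/E4B edge models mentioned in the Gemma 4 section.

```python
def recommend_models(vram_gb):
    """Pick a model tier from the deployment table by available GPU VRAM."""
    if vram_gb >= 640:
        return "DeepSeek V3 (Q4)"
    if vram_gb >= 160:
        return "DeepSeek-Coder-V2, MiniMax M2.1"
    if vram_gb >= 48:
        return "Gemma 4 31B, Qwen3-Coder 32B, GLM-4.7 (FP16)"
    if vram_gb >= 24:
        return "Gemma 4 E4B, Qwen3-Coder 14B/8B, GLM-4.7 (Q4)"
    return "Gemma 4 E2B/E4B edge models"

print(recommend_models(24))   # RTX 4090 tier
print(recommend_models(640))  # full DGX H100
```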

GLM-5.1 is currently available only via Ollama Cloud and the Z.AI API.

Commercial API (Claude, GPT-4)

  • Pros: Ready to use immediately, no maintenance required
  • Cons: Data privacy concerns, unpredictable costs, API limits
  • Cost: ~$15/1M tokens (input)

Open-Source (DeepSeek + vLLM)

  • Pros: Full control, data kept in-house, predictable costs
  • Cons: Initial setup cost, maintenance required
  • Cost: H100 server costs only (~$0.001 per 1K tokens)
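
The trade-off above can be quantified with a break-even calculation: how many tokens per month must you process before a dedicated server beats the API? The $2,500/month H100 figure below is an assumption for illustration, not a quoted price.

```python
def breakeven_tokens_per_month(api_price_per_million=15.0,
                               server_cost_per_month=2500.0):
    """Monthly token volume at which self-hosting becomes cheaper than the API.

    $15/1M input tokens matches the commercial figure above; the
    $2,500/month server cost is an assumed rental/amortization figure.
    """
    return server_cost_per_month / api_price_per_million * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")  # 166,666,667
```

Below that volume the API is cheaper on raw cost, and the decision hinges on the privacy and control factors listed above rather than price.
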
AGENTIC CODING TOOL ECOSYSTEM (2026)
Commercial
  • Claude Code (Anthropic)
  • Gemini CLI (Google, free)
  • Codex CLI (OpenAI)
  • Cursor (GPT-4o)
  • GitHub Copilot
  • Amazon Q
Open Weights Models
  • Gemma 4 (Google, 31B Dense, 256K)
  • GLM-5.1 (Z.AI, MIT, 198K)
  • Qwen3-Coder (Alibaba, 235B MoE)
  • DeepSeek V3 (685B MoE)
  • GLM-4.7 (Zhipu AI, ~32B Dense)
  • MiniMax M2.1 (230B MoE)
  • DeepSeek-Coder-V2 (236B MoE)
  • Qwen3 14B/8B (lightweight)
Open Tools
  • OpenCode (multi-backend TUI)
  • Roo Code / Cline (VS Code extensions)
  • vLLM / SGLang (inference servers)
  • Ollama (local/cloud deployment)

For a comparison of terminal-based AI coding CLI tools, see the AI Coding Tool Selection Guide.

  1. Install vLLM on the DGX Server

    # Inside a MIG slice
    pip install vllm
  2. Select a Model and Start the Server

    python -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen3-Coder-32B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --port 8000
  3. Test with the OpenAI-Compatible API

    import openai

    client = openai.OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc",  # vLLM accepts any key unless --api-key is set
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-32B-Instruct",  # name of the running model
        messages=[{"role": "user", "content": "Implement quicksort in Python"}],
    )
    print(response.choices[0].message.content)
  4. Use a Local Model in an AI Coding CLI

    vLLM’s OpenAI-compatible API allows tools like Claude Code and OpenCode to use a local model as a backend.

    # Connect local vLLM in OpenCode
    export OPENAI_API_BASE="http://localhost:8000/v1"
    export OPENAI_API_KEY="token-abc"
    opencode
  5. Performance Benchmarking

    # Throughput benchmark
    python -m vllm.benchmarks.benchmark_throughput \
      --model Qwen/Qwen3-Coder-32B-Instruct \
      --num-prompts 100 \
      --input-len 512 \
      --output-len 128
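
Whichever tool reports it, the headline number is the same metric, and it is worth being able to compute it by hand when comparing against a single API request. On an OpenAI-compatible server the generated-token count is available as `response.usage.completion_tokens`; pair it with wall-clock time:

```python
def tokens_per_second(completion_tokens, wall_seconds):
    """Decode throughput: generated tokens divided by wall-clock time."""
    return completion_tokens / wall_seconds

# e.g. a request whose response.usage.completion_tokens was 128,
# measured at 1.6 s of wall time:
print(tokens_per_second(128, 1.6))  # 80.0
```

Batch benchmarks will report much higher aggregate throughput than a single interactive request, since continuous batching overlaps many decodes; compare like with like in the assignment.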

While vLLM is suited for high-throughput production environments, Ollama offers a simpler setup for development and prototyping. Through its partnership with NVIDIA, Ollama can also run models on cloud GPUs, enabling access to large models without local hardware.

  1. Install Ollama

    # macOS
    brew install ollama
    # Linux
    curl -fsSL https://ollama.com/install.sh | sh
    # Verify installation
    ollama --version
  2. Run Cloud Models (No GPU Required)

    Ollama Cloud performs remote inference on NVIDIA Blackwell GPUs. You get full model performance without local GPU hardware.

    # Gemma 4 31B Cloud — 256K context auto-configured
    ollama pull gemma4:31b-cloud
    ollama launch claude --model gemma4:31b-cloud
  3. Run Local Models

    Choose a model size that matches your hardware.

    # Edge model (10GB+ VRAM) — runs on laptops
    ollama pull gemma4:e4b
    ollama launch claude --model gemma4:e4b
    # 26B MoE (18GB+ VRAM)
    ollama pull gemma4:26b
    ollama launch claude --model gemma4:26b
    # 31B Dense (20GB+ VRAM) — maximum quality
    ollama pull gemma4:31b
    ollama launch claude --model gemma4:31b
  4. Connect to AI Coding CLIs

    Ollama provides an OpenAI-compatible API at localhost:11434.

    # Use Ollama backend in OpenCode
    export OPENAI_API_BASE="http://127.0.0.1:11434/v1"
    export OPENAI_API_KEY="ollama"
    opencode
  5. vLLM vs Ollama Comparison

    | Aspect | vLLM | Ollama |
    | --- | --- | --- |
    | Setup complexity | Requires CUDA/Python environment | Single command |
    | Batch processing | High throughput (PagedAttention) | Single-request optimized |
    | Cloud deployment | Manual server configuration | Ollama Cloud (NVIDIA partnership) |
    | Model management | Manual download from HuggingFace | ollama pull auto-management |
    | Best for | Production, high concurrency | Development, prototyping, personal use |

Submission deadline: 2026-05-12 23:59

Requirements:

  1. Screenshot confirming successful deployment of at least one open-source model with vLLM or Ollama
  2. Coding performance comparison between an open-source model (Gemma 4, Qwen3-Coder, or your chosen model) and Claude (5 identical tasks)
  3. Throughput (tokens/sec) benchmark results
  4. Cost analysis: API cost vs DGX operating cost calculation
  5. (Optional) Compare vLLM and Ollama deployment experience — setup difficulty, performance, flexibility