
Week 1: The AI Systems Paradigm Shift

Phase 1 · Week 1 · Beginner · Lecture: 2026-03-03

Conceptual Perspective

Explain the difference between AI models and AI systems using the EU AI Act definition, and apply the “engine vs. automobile” analogy.

Governance Perspective

Distinguish between the three stages of HITL → HOTL → HIC architecture using real-world examples.

Industry Perspective

Understand the landscape of the 2025–2026 agentic AI tools ecosystem and the key benchmarks.

Methodology Perspective

Grasp the core principles of harness engineering and survey the full arc of the 15-week course.


The AI industry reached a fundamental inflection point in 2025–2026. The era of using LLMs as simple text-generation tools has ended, and autonomous agentic systems have become mainstream in software development.

Before (2023–2024)

  • LLM = code suggestion tool (Copilot)
  • Humans review and approve all code
  • AI as a passive “oracle”
  • SWE-bench Verified success rate ~5%

After (2025–2026)

  • LLM = autonomous execution agent
  • Humans as strategic supervisors
  • AI directly modifies files, runs tests, and deploys
  • SWE-bench Verified success rate 79.2% (Claude Opus 4.6 Thinking)

The numbers make the scale of the transition clear:

  • SWE-bench Verified: ~5% in 2023 → 79.2% in 2026. The ability to autonomously resolve real GitHub issues grew 16× in just three years.
  • METR “Moore’s Law”: The time horizon of tasks that AI agents can handle is doubling approximately every seven months.
  • Gartner forecast: By mid-2026, 40% of enterprise applications are projected to embed AI agents.
  • Timeline: GitHub Copilot launch (2021) → autonomous agent era begins (2025) → integration into production pipelines (2026).

AI Models vs AI Systems — The Engine and the Automobile

“AI models” and “AI systems” are different things. Understanding this distinction is the first gate of this course.

EU AI Act (2024) definition: An AI system is “a machine-based system that operates with varying levels of autonomy and that, based on machine- or human-provided inputs, infers from its inputs how to generate outputs such as predictions, recommendations, decisions, or content” (Article 3).

NIST AI RMF 1.0 definition: “An engineered system that generates predictions, recommendations, or decisions for a given set of objectives.”

One analogy to keep in mind: a model is an engine, a system is an automobile. The engine (LLM) provides the power, but it alone cannot reach a destination. You need steering (planning), brakes (safety), navigation (memory), and a dashboard (observability) to make it an automobile.

The 7 components of an AI system:

| # | Component | Role | Analogy |
|---|-----------|------|---------|
| 1 | Foundation Model | Reasoning engine | Engine |
| 2 | Tool Use / APIs | External action capability | Wheels and steering |
| 3 | Memory | Short-term context + long-term knowledge | Black box + GPS history |
| 4 | Planning / Reasoning | Task decomposition, goal ordering | Navigation |
| 5 | Execution Environment | Sandbox, containers | Roads and lanes |
| 6 | Safety / Guardrails | Policies, constraints, human oversight | Brakes and airbags |
| 7 | Observability | Logging, evaluation, feedback loops | Dashboard and dashcam |

This is why the EU AI Act regulates models and systems under separate rules. The safety of an engine alone (model regulation) and the safety of the whole automobile (system regulation) are evaluated on different criteria. No matter how good the engine, a car without brakes is dangerous.

Let’s establish the theoretical roots of this course.

Rich Sutton’s Bitter Lesson (2019): “General methods that leverage computation are ultimately the most effective, and by a large margin.”

This principle explains the two axes of AI progress:

  • Pre-training scaling (GPT-3 → GPT-4): The “learning” side of the Bitter Lesson. More data, larger models.
  • Test-Time Compute Scaling (o1, DeepSeek R1): The “search” side. Making already-trained models think longer at inference time.

The Ralph Loop and autoresearch we cover in Week 4 are external loop implementations of test-time compute. Instead of thinking deeper inside the model, the external harness repeatedly calls the model and verifies results. The principle is the same, but control belongs to us.


HITL → HOTL → HIC Governance Architecture

Detailed Comparison of the Three Architectures

| | HITL (Human-in-the-Loop) | HOTL (Human-on-the-Loop) | HIC (Human-in-Command) |
|---|---|---|---|
| Human role | Sequential gatekeeper | Real-time monitoring, exception intervention | Setting strategy and boundary conditions |
| AI autonomy | Low — approval required at each step | Medium — autonomous execution, alerts on anomalies | High — tactical execution delegated |
| Speed | Low (human is the bottleneck) | High (parallel processing possible) | Maximum (asynchronous work possible) |
| Risk level | Lowest (all actions verified) | Medium (monitoring gaps possible) | Context-dependent |
| Regulatory requirement | EU AI Act high-risk baseline | Telemetry + audit logs mandatory | Documentation of boundary conditions required |
| Real-world example | Production DB migration | CI/CD pipeline, AI code generation | Enterprise AI strategy, this course’s Ralphthon |
| Analogy | Manual transmission | Cruise control | Self-driving level 4 |

The three architectures are not mutually exclusive. Within a single system, different levels apply depending on the risk level of each task. For example, with the same AI coding agent:

  • Reading files → HOTL (autonomous execution, log only)
  • Writing files → HITL (execute after human approval)
  • Deciding overall project architecture → HIC (human sets direction)
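This kind of per-action routing can be captured in a small policy table. The sketch below is a minimal Python rendering of the idea; the action names and the `Oversight` enum are illustrative assumptions for this course, not part of any standard.

```python
from enum import Enum

class Oversight(Enum):
    HITL = "approval-gated"   # human approves each action
    HOTL = "log-only"         # autonomous execution, human monitors the logs
    HIC = "human-directed"    # human sets strategy; AI executes within it

# Illustrative policy table mapping agent action types to oversight levels.
POLICY = {
    "read_file": Oversight.HOTL,
    "write_file": Oversight.HITL,
    "choose_architecture": Oversight.HIC,
}

def oversight_for(action: str) -> Oversight:
    # Unknown actions fall back to the most restrictive level: per-step approval.
    return POLICY.get(action, Oversight.HITL)
```

The fallback direction is a design choice: failing closed (unknown actions require approval) keeps new capabilities from silently bypassing governance.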

The EU AI Act (2024) has become the global standard for AI governance. Let’s cover the key provisions.

Article 14 — Human Oversight obligations:

  1. Preventing automation bias: Design to prevent human supervisors from uncritically accepting AI outputs
  2. Ability to disregard AI output: Supervisors must be able to override or reverse AI decisions at any time
  3. Emergency stop mechanism: High-risk systems must include a means of immediate shutdown
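As a thought sketch of requirement 3, an emergency stop can be modeled as a flag the agent loop re-checks before every action. This is a toy illustration, not a reference implementation of Article 14; the class and function names are invented for this example.

```python
import threading

class EmergencyStop:
    """Toy kill switch: a supervisor can halt the agent between actions."""
    def __init__(self) -> None:
        self._flag = threading.Event()

    def trigger(self) -> None:
        self._flag.set()

    def active(self) -> bool:
        return self._flag.is_set()

def run_agent(actions, stop, on_action=None):
    """Execute actions in order, re-checking the stop flag before each one."""
    done = []
    for action in actions:
        if stop.active():            # immediate shutdown always wins
            break
        done.append(action)
        if on_action:
            on_action(action, stop)  # supervisor hook may trigger the stop
    return done
```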

Related international standards are also being rapidly established:

  • ISO/IEC 42001: International standard for AI Management Systems (AIMS) — the ISO 9001 of AI governance
  • NIST AI Agent Standards Initiative (launched 2025–2026): Standards work dedicated to agent systems

South Korea’s AI Framework Act (effective January 2026) is more innovation-friendly than the EU AI Act, but shares the principle of human oversight for high-risk AI. We’ll compare the implementation-level differences between the two laws in detail in Week 2.

Governance Necessity Seen Through Real Data

Let’s look at the need for governance not through abstract principles, but through data.

METR Study (July 2025, arXiv 2507.09089): A randomized controlled experiment with 16 skilled developers completing 246 real-world tasks. The results were surprising.

  • Using AI tools made task completion 19% slower (while participants perceived themselves as 20% faster)
  • A 39 percentage point gap existed between developers’ self-assessments and actual performance

The core cause of this phenomenon is the “Babysitting Tax.” The cost of reviewing, fixing, and debugging AI-generated code exceeded the cost of writing it directly. A 2025 CodeRabbit analysis also found that AI-generated PRs had ~1.7× the issue rate.

Anthropic’s 2026 report shows a balanced picture: developers use AI for 60% of their work, but fully delegated work (AI from start to finish) accounts for only 0–20%.


The Agentic AI Tools Ecosystem 2025–2026

The AI coding tools market has grown to USD 34.58B as of 2026. It divides into three categories by architecture:

Terminal-native — suited for headless automation:

  • Claude Code (Anthropic) — deep integration with the MCP ecosystem, autonomous loop execution via /loop
  • Gemini CLI (Google) — direct access to Gemini models, OAuth authentication
  • Codex CLI (OpenAI) — built-in sandboxed execution environment
  • OpenCode — open-source, multi-model support

AI-native IDEs — editor integration, visual context:

  • Cursor — VS Code fork, codebase indexing
  • Windsurf — agentic IDE, flow-based tasks

Cloud-native — remote execution, asynchronous tasks:

  • Codex (OpenAI cloud) — asynchronous task execution in cloud sandbox
  • Devin (Cognition) — full-stack autonomous agent

The top 3 tools (GitHub Copilot, Claude Code, Cursor) hold 70%+ market share, and MCP (Model Context Protocol) has emerged as the common protocol for tool integration: 97M+ monthly SDK downloads, 6,400+ registered servers.

Knowing the benchmarks for actual agent capabilities helps you distinguish real performance from inflated marketing claims.

SWE-bench Verified — benchmark for autonomously resolving real GitHub issues:

  • Early 2024: top systems ~15%
  • Late 2025: surpassing 50%
  • Early 2026: 79.2% (Claude Opus 4.6 Thinking)

Real-world data is also accumulating:

  • Devin: Initial 15% success rate (Answer.AI test) → 67% PR merge rate by late 2025
  • Factory.ai: Deployed to 5,000+ EY engineers, validated at production scale
  • Rakuten: 7 hours of autonomous work on a 12.5M line codebase, reporting 99.9% accuracy

A notable 2026 development is that open-source models have approached commercial-level quality. Of the 204 AI coding tools currently tracked, 95% are open-source.

Key open-source coding models:

  • Qwen3-Coder (235B MoE, 22B active) — approaches commercial models on SWE-bench, Apache 2.0
  • DeepSeek V3 (685B MoE, 37B active) — top-tier for math, reasoning, and coding with maximum cost efficiency
  • GLM-4.7 (32B Dense) — can run on a single GPU, Interleaved Thinking

Why we deploy open-source models in this course: cost, privacy (campus data stays local), and customizability (tuned to our learning environment). In Weeks 10–11 we deploy these models on the DGX H100 server.

From “Prompt Engineering” to “Systems Engineering”

The way we work with AI is rapidly evolving:

| Period | Paradigm | Core Skill |
|--------|----------|------------|
| 2023 | Single prompt | Prompt engineering |
| 2024 | RAG pipelines | Retrieval-augmented generation, vector DBs |
| 2025 | Agent systems | Tool use, multi-agent |
| 2026 | Harness engineering | Loops, governance, observability |

Andrej Karpathy’s “Software 3.0” vision summarizes this shift: programmers move from directly writing code to becoming conductors of AI agents. Instead of playing instruments yourself, you lead the orchestra.

The required skills are also shifting:

  • Prompt writing → System architecture (what components to connect and how)
  • Model selection → Evaluation and observability (how do you know the system works correctly)
  • API calls → Context engineering (what to show AI, when, and how much)
  • Result review → Safety engineering (how to respond when the system fails)

Harness Engineering — The Core Hypothesis of This Course

Harness = A Deterministic Shell Around Non-Deterministic AI

Geoffrey Huntley’s core insight: “Build a stronger harness, not a stronger model.”

The Ralph Loop prototype is surprisingly simple:

```shell
while :; do cat PROMPT.md | <ai-coding-cli>; done
```

This one line works because of two mechanisms:

  • Backpressure: Upstream (structured specs, deterministic context) and downstream (tests, linters, type checkers) reject incorrect output. If the agent produces broken code, tests fail, and the loop automatically retries.
  • Garbage Collection: git checkout . completely removes failed attempts. The context is not contaminated by the residue of failures.
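The two mechanisms fit in a few lines of code. The sketch below is a simplified, language-level rendering of the loop above, with a fake agent standing in for the CLI call and a no-op standing in for `git checkout .`; all names are illustrative.

```python
def harness_loop(agent, verify, rollback, max_iters=10):
    """Deterministic shell around a non-deterministic agent.

    Backpressure: verify() (tests, linters, type checkers) rejects bad output.
    Garbage collection: rollback() restores a clean state after each failure.
    """
    for attempt_no in range(1, max_iters + 1):
        output = agent()
        if verify(output):
            return output, attempt_no
        rollback()  # e.g. `git checkout .` in a real harness
    raise RuntimeError("loop budget exhausted without a verified result")

# Toy run: the "agent" fails twice before producing output that passes.
attempts = iter(["broken", "broken", "good"])
result, iters = harness_loop(
    agent=lambda: next(attempts),
    verify=lambda out: out == "good",
    rollback=lambda: None,
)
```

Note that the loop never inspects *how* the agent failed; it only needs a verifier that can say pass or fail, which is what makes the shell deterministic around a non-deterministic core.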

OpenAI applied the same principle at scale. In a case study where Codex agents wrote 1M+ LOC without manual typing, the key was the harness, not the model: specs-as-code to enforce specifications, automatic layer architecture validation, and GC to restore a clean state on failure.

15-Week Course Roadmap — Through the Lens of the Harness

| Phase | Weeks | Topic | Harness Layer |
|-------|-------|-------|---------------|
| Phase 1 | 1–3 | Governance, infrastructure, protocols | Safety layer — brakes and airbags |
| Phase 2 | 4–6 | Loops, context management, instruction tuning | Control loop — engine and transmission |
| Phase 3 | 7–9 | Role assignment, planner, QA | Orchestration — navigation and autonomy |
| Phase 4 | 10–12 | Model deployment, evaluation | Infrastructure and observability — dashboard and maintenance |
| Phase 5 | 13–16 | Ralphthon capstone | Real-world validation — the driving test |

In Phase 1 we build the safety systems first, then in Phase 2 we add the engine. Order matters — accelerating without brakes is an accident.

After 15 weeks, you will have fully implemented the following system:

AGENTIC SDLC PIPELINE

  • Human (HIC): Strategy setting · Week 1
  • Planner Agent: Spec (spec.md) generation · Week 8
  • Coder Agent (Ralph Loop): Code writing + automated testing · Weeks 4–6
  • QA Agent: Independent verification + regression testing · Week 9
  • Deploy Agent: Automated deployment · Weeks 10–11

Discussion Questions

  1. Have you used AI coding tools before? What tasks were they most effective for, and where did they fail?
  2. In the METR study, skilled developers became 19% slower with AI. Why? What kind of system is needed to reduce the “Babysitting Tax”?
  3. Among HITL, HOTL, and HIC, which architecture is most suitable for our course lab environment? How do we minimize student errors while maximizing learning outcomes?
  4. Why does the EU AI Act regulate AI “models” and AI “systems” separately? Is the LLM itself dangerous, or is it the system wrapping the LLM that is dangerous?
  5. “Stronger model vs stronger harness” — if you had the same budget, would you invest in a model upgrade or harness improvement?

Environment Setup

  1. Install Node.js 20 LTS

    ```shell
    # macOS (Homebrew)
    brew install node@20

    # Ubuntu/Debian
    curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
    sudo apt-get install -y nodejs
    ```
  2. Install AI coding CLI tools → Set API keys → Test run

    ```shell
    # 1. Install (choose one)
    brew install claude-code                # Homebrew (recommended)
    pnpm add -g @anthropic-ai/claude-code   # pnpm

    # 2. Set API key (add to ~/.bashrc or ~/.zshrc)
    export ANTHROPIC_API_KEY="sk-ant-..."

    # 3. Test run
    mkdir ~/test-project && cd ~/test-project
    claude "Hello! Please create a simple Python hello world file in this directory."
    ```
  3. Observation exercise: Record AI autonomous decisions

    Open the code the AI generated and observe the following:

    • What did the AI decide on its own? (file names, function structure, variable names, comment language, etc.)
    • Did those decisions match your intent?
    • Does re-running the same prompt produce a different result? (verifying non-determinism)

    This observation is the starting point of harness engineering — how do we control the non-deterministic output of AI in a deterministic way?
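For the third observation, a quick way to check non-determinism is to fingerprint the generated file after each run and compare. A minimal helper (the function names are just for illustration):

```python
import hashlib

def fingerprint(text: str) -> str:
    """Short, stable hash of a file's contents for comparing two runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def same_output(run_a: str, run_b: str) -> bool:
    """True when two runs produced byte-identical code."""
    return fingerprint(run_a) == fingerprint(run_b)
```

In practice you would read the file each run produced, e.g. `same_output(Path("run1/hello.py").read_text(), Path("run2/hello.py").read_text())`; byte-identical output across runs is the exception, not the rule.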

DGX Server Access

The DGX server is protected by Cloudflare Zero Trust. You must install the WARP client and log in before connecting.

  1. Install Cloudflare WARP

    Download the client for your operating system from the Cloudflare WARP download page.

  2. Log in to Zero Trust

    Open WARP → click the settings gear → Preferences → AccountLogin to Cloudflare Zero Trust → enter the team name → log in with your school email

  3. SSH connection

    With WARP connected, open a terminal and connect.

    ```shell
    ssh {USER}@{SERVER_IP} -p {PORT}
    # Initial password is your student ID — change it on first login!
    ```

Assignment

Due: 2026-03-10 23:59

Submission path: assignments/week-01/[student-ID]/ via PR

Requirements:

  1. Screenshot of AI coding CLI tool installed and version output
  2. Screenshot of successful SSH connection to DGX server
  3. hello_agent.py — a simple Python file generated with an AI coding CLI
  4. README.md — document any problems encountered during setup and how you solved them
  5. AI system analysis report (300 words): Judge whether the AI coding tool you used is an “AI model” or an “AI system,” and analyze which of the 7 components are present

Bonus:

  1. Compare the top 5 systems on the SWE-bench Verified leaderboard and analyze the correlation between model size and performance
  2. Propose how EU AI Act Article 14 human oversight requirements could be applied to our course environment

Grading criteria:

  • Environment setup complete (40 points)
  • DGX connection confirmed (20 points)
  • Troubleshooting record (20 points)
  • AI system analysis report (20 points)

Key Takeaways

  1. AI system ≠ AI model: The model is the engine, the system is the automobile. Tool use, memory, planning, execution environment, safety guardrails, and observability make up the system.
  2. SWE-bench 5% → 79%: Agent capability has grown explosively in three years. The problem is not capability — it’s control.
  3. HITL → HOTL → HIC: From manual transmission to self-driving. The human role shifts from “approving every step” to “setting strategy.”
  4. 19% slower: AI tools speed up individual tasks, but review costs (Babysitting Tax) cancel out the gains. The harness resolves this bottleneck.
  5. Harness engineering: A stronger harness, not a stronger model. Backpressure and garbage collection make non-deterministic AI deterministic.
  6. The 15-week arc: Governance → Infrastructure → Loops → Multi-agent → MLOps → Ralphthon. This course builds a complete agentic SDLC pipeline.

Week 2 covers the concrete implementation of HOTL governance and EU AI Act compliance requirements. In particular, we practice “Governance-as-Code” — how to enforce governance policies through code.