Theory Perspective
Explain why test-time compute scaling is the theoretical foundation of the loop paradigm.
Ralph Perspective
Understand and implement the working principles of harness, backpressure, garbage collection, and cumulative learning.
Extension Perspective
Understand that RLM (recursive reasoning) and autoresearch (autonomous experimentation) are different applications of the same loop principle.
Implementation Perspective
Implement a Ralph loop yourself and experience the failure → learning → success cycle firsthand.
A single loop running overnight delivers better results than a complex pipeline.
The decisive edge in 2026 AI engineering is neither a bigger model nor more sophisticated prompts. It is loop iteration. Run the same model hundreds of times, verify the result each time, discard failures, and accumulate successes. Simple — yet this pattern consistently beats complex multi-agent architectures.
The three loops covered this week are proof:
- **Ralph Loop**: generate code, run the tests, and on failure `git checkout .` and retry. The code-quality loop.
- **RLM**: the model recursively calls itself, reading only the parts of a long context it needs. The reasoning loop.
- **autoresearch**: hand `train.py` to an agent, measure results every 5 minutes, commit if improved, reset if not. The research loop.

All three loops rest on the same principle. The governance learned in Week 2 sets the safety boundaries for loops; the MIG isolation and MCP tool standardization learned in Week 3 provide the infrastructure. Today we understand and implement the loops that actually run on top of that infrastructure.
There is a key finding validated by OpenAI o1 in 2024: even without increasing model size, performance improves when more compute is spent at inference time. This is test-time compute scaling.
The traditional AI performance strategy was train-time scaling — larger models, more data, longer training runs. The jump from GPT-3 to GPT-4 is the canonical example. But this approach incurs exponentially increasing costs.
Test-time compute scaling takes a different path. It makes an already-trained model think longer and harder during inference: longer reasoning chains, multiple sampled attempts, and verification of each attempt before accepting an answer.
Ralph, RLM, and autoresearch are all concrete realizations of test-time compute scaling:
| Loop | Application Domain | How Test-Time Compute Is Used |
|---|---|---|
| Ralph Loop | Code generation | Generate code → run tests → retry on failure. 10 attempts = 10× the inference tokens |
| RLM | Long-context understanding | Model recursively calls itself. Inference compute grows with recursion depth |
| autoresearch | ML experimentation | 5-minute budget × N iterations. Compute invested scales with iteration count |
See the pattern? All three call the same model repeatedly while filtering results through deterministic verification conditions. Rather than changing the model itself, they increase the time and number of times the model thinks to ensure quality.
The 2026 SWE-bench benchmark provides empirical proof of this lecture’s central claim:
| Benchmark | Claude Opus 4.5 Score | Description |
|---|---|---|
| SWE-bench Verified | 80.9% | Validated issue set, standard scaffold |
| SWE-bench Pro | 45.9% | More realistic issues, minimal scaffold |
The same model yields a 35 pp gap purely from the difference in harness/scaffold. GPT-5.3-Codex leads on Pro at 56.8%, but that too is thanks to OpenAI’s own agent harness.
The T2 scaling laws published in April 2026 (arxiv 2604.01411) dig one level deeper into the economics of test-time compute. The original Chinchilla work (2022) optimized only training compute costs; T2 recalculates the optimal point including inference costs.
Key findings:
The `effort` Parameter

In Claude 4.x, the old `budget_tokens` setting (where the developer specified the token count directly) was deprecated and replaced by an adaptive `effort` parameter:
| effort level | Behavior | Use case |
|---|---|---|
| low | Short reasoning, fast response | Simple lookups, type checks |
| medium | Standard reasoning | General coding |
| high | Deep reasoning, more tokens | Complex refactoring |
| max | Maximum reasoning depth | Architecture design, debugging |
This reflects the trend of test-time compute allocation moving from developer control to model-autonomous control. Loop paradigm implication: a strategy of dynamically adjusting effort each iteration is possible — explore quickly with low early on, then drill deep with high once a promising direction emerges.
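The per-iteration strategy just described can be sketched as a small scheduling policy. Only the effort level names come from the table above; the policy function itself is an illustrative assumption, not an official API.

```python
def choose_effort(iteration: int, last_score: float, best_score: float) -> str:
    """Heuristic effort schedule for a loop: explore cheaply, exploit deeply."""
    if iteration < 3:
        return "low"      # early iterations: map the territory fast
    if last_score > best_score:
        return "high"     # a promising direction emerged: drill deep
    return "medium"       # steady state: standard reasoning

# Schedule over a toy run where the score jumps at iteration 3
scores = [0.1, 0.1, 0.2, 0.5, 0.4]
best = 0.0
efforts = []
for i, s in enumerate(scores):
    efforts.append(choose_effort(i, s, best))
    best = max(best, s)
```

In this toy run the schedule produces three cheap exploratory iterations, one deep dive at the improvement, and a return to standard effort afterward.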
The Ralph Loop (Ralph Wiggum Loop), popularized by developer Geoffrey Huntley in late 2025, is the core paradigm of agentic software development.
```bash
# The essence of the Ralph loop — a single line
while :; do cat PROMPT.md | claude-code; done

# The same pattern works with other AI coding CLI tools:
# while :; do cat PROMPT.md | gemini; done
# while :; do codex --approval-mode full-auto "$(cat PROMPT.md)"; done
```

Two design decisions run through this infinite loop:
This simplicity is the point. No inter-agent communication, no state DB, no orchestrator. The intelligence of the loop lies not in the loop itself but in the environmental constraints — this is the harness.
The Ralph loop is intentionally monolithic:
(Diagram: a complex microservices architecture contrasted with the simple Ralph loop)
The three pillars of the harness:
- **Backpressure**: failed code does not pass. If `pytest` fails, that code is as good as nonexistent.
- **Garbage collection**: `git checkout .` completely removes failed code. The repository always returns to the last successful state. Not like a Jenga tower where a wrong block topples everything, but like a potter's wheel where you rework the clay if the shape isn't right.
- **Cumulative learning**: each loop records what it learned, so later loops start smarter.

The evolved form of the Ralph loop goes beyond simple repetition to a loop that learns. The key is the AGENTS.md file.
```markdown
# AGENTS.md — Cumulative Learning Record

## Learned Patterns
- The division function in calculator.py requires ZeroDivisionError handling (failed in loop 3)
- pytest fixtures must be placed in conftest.py to prevent import errors (failed in loop 5)

## Forbidden Patterns
- Do not use `eval()` — security risk + bypasses type checker
- Do not share state via global variables — breaks test isolation

## Current Status
- [x] add() implementation complete
- [x] subtract() implementation complete
- [ ] multiply() in progress
```

At the end of each loop the agent records what it learned in this loop in AGENTS.md. The next loop's agent reads this file before starting, so it does not repeat the same mistakes.
This is fundamentally different from fine-tuning: fine-tuning updates model weights through slow, expensive training runs, while AGENTS.md changes only the context. The update is instant, human-readable, and versioned in git.
“Failure itself is information” — an agent’s failed attempt is “deterministically bad,” and this information becomes the input for the next loop. If the same task fails 10 or more times in a row, it is judged stuck, and the task is split into smaller units before retrying.
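The retry-then-split behavior can be sketched in a few lines. Here `attempt_fn` is a stand-in for one agent iteration plus its deterministic verification, and the halving split is a naive illustration; a real harness would split a task description semantically, not by string length.

```python
MAX_RETRIES = 10

def run_with_split(task: str, attempt_fn, depth: int = 0) -> list[str]:
    """Retry a task; after MAX_RETRIES consecutive failures, split it."""
    for _ in range(MAX_RETRIES):
        if attempt_fn(task):
            return [task]            # verified success: keep the result
    if depth >= 3 or len(task) < 2:
        raise RuntimeError(f"stuck even after splitting: {task!r}")
    mid = len(task) // 2             # judged stuck: split into smaller units
    left, right = task[:mid], task[mid:]
    return run_with_split(left, attempt_fn, depth + 1) + \
           run_with_split(right, attempt_fn, depth + 1)
```

The key property: failure is never silent. Either the loop converges on smaller subtasks, or it surfaces a hard error for a human.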
In January 2026, an interesting “showdown” took place between Huntley and Anthropic. Anthropic proposed a stop-hook plugin for the Ralph loop that used a continuous session approach (repeating while maintaining the same context).
| Approach | Method | Advantages | Disadvantages |
|---|---|---|---|
| Original Ralph | New session every loop | Context Rot completely prevented, deterministic | Loses context from previous attempts |
| Anthropic plugin | Iterate within continuous session | Remembers previous attempts, faster convergence | Context Rot risk, non-deterministic |
Huntley chose fresh context. His reasoning: “If you fail 30 times in a continuous session, the context fills up with failure history and useful reasoning becomes impossible. Better to record just the essentials in AGENTS.md and start clean every time.”
This trade-off connects directly to Week 5’s Context Rot — Ralph’s fresh context strategy is the first solution to Context Rot.
/loop — Official Automation Loop

The `/loop` command, officially shipped in Claude Code by Anthropic in 2026, implements the Ralph loop philosophy at the product level. Instead of a manual `while` shell script, a single line spins up a schedule-based autonomous agent.
```bash
# Basic syntax
claude /loop "<instruction>" --every <interval> --for <duration>

# Example: find and fix failing tests every 2 hours, for up to 3 days
claude /loop "check for failing tests and fix them" --every 2h --for 3d
```

Core design principles:
| Element | Description |
|---|---|
| Worktree isolation | Every iteration runs via git worktree without affecting the main branch |
| CLAUDE.md = control plane | CLAUDE.md is read every cycle, so modifying the instruction file changes the behavior of a running loop |
| 3-day expiry | Maximum --for 3d. Intentional design to prevent context drift in forgotten autonomous agents |
| 3 flags | "instruction", --every, --for — this is the entire API surface |
Why the 3-day expiry is a feature: if a loop set on Tuesday runs through Friday, by then 15 PRs merged by the team will conflict with it. An agent confidently patching with stale context creates problems harder to debug than the original bug. Re-evaluating and restarting every 72 hours is the safe approach.
Three validated workflows:
```bash
claude /loop "check open PRs on the current branch. If CI is failing,
read the error logs, fix the issue, and push. If CI passes and the PR
has no requested changes, post a comment saying 'Ready for human review.'
Summarize what you did in the PR description." --every 30m --for 2d
```

CI pipeline monitoring, automatic lint/type error fixes, notification when ready. Cannot handle failures requiring business logic judgment.
```bash
claude /loop "run pnpm audit. If any high or critical vulnerabilities exist,
create a branch, update the affected packages, run the test suite, and
open a PR if tests pass. Include the vulnerability details in the PR body."
--every 4h --for 3d
```

Clear success criteria (tests pass) and minimal business context — an ideal task for /loop.
```bash
claude /loop "summarize all commits merged to main in the last 24 hours.
Include: PR titles, authors, files changed count, and any test coverage
changes. Write the summary as a Markdown standup update and save it to
./reports/standup-$(date +%Y-%m-%d).md" --every 24h --for 3d
```

Automates a manual process. File output provides basic functionality even without Slack MCP integration.
When /loop fails:
Even with a 200K-token context window, when a long document is passed to an LLM, understanding of the middle sections degrades compared to the beginning and end (the “Lost in the Middle” phenomenon). The existing solutions were two:
RLM (Recursive Language Model) takes a fundamentally different approach. It loads the long prompt into Python REPL variables, then has the model write code to extract only the needed portions and recursively call itself.
The core idea: instead of enlarging the context window, let the model decide for itself how to read the context.
```python
# RLM pseudocode — recursive call pattern
def rlm_solve(question: str, documents: list[str]) -> str:
    """The model recursively calls itself to process long documents."""

    # Step 1: load all documents into Python variables
    context_vars = {f"doc_{i}": doc for i, doc in enumerate(documents)}

    # Step 2: ask the model to write code that decides "which parts to read"
    planning_code = llm_call(
        f"Write Python code to determine which parts of {len(documents)} documents "
        f"need to be read to answer the following question.\n"
        f"Question: {question}"
    )

    # Step 3: execute the code → extract only relevant parts
    relevant_parts = execute(planning_code, context_vars)

    # Step 4: recursively call itself with the extracted parts
    if fits_in_context(relevant_parts):
        return llm_call(f"Question: {question}\nContext: {relevant_parts}")
    else:
        # Still too large — recurse again
        return rlm_solve(question, split(relevant_parts))
```

The 2025 results from Zhang et al. are striking: GPT-5-mini + RLM achieved more than 2× the performance of GPT-5 alone on the OOLONG benchmark. A smaller model outperformed a larger model purely through recursive calls.
Why this is possible: the full documents live in REPL variables rather than in the prompt, so each call sees only a small, relevant slice. The slices stay short enough to avoid the Lost in the Middle degradation, and the model itself decides which slice to read next.
The Ralph Loop and RLM are different applications of the same principle:
| Comparison | Ralph Loop | RLM |
|---|---|---|
| Iteration target | Code generation | Context exploration |
| Call pattern | Repeated calls from a shell loop | Model recursively calls itself |
| Verification condition | Test pass/fail | Answer completeness |
| State storage | git + filesystem | Python REPL variables |
Common thread: both call the same model repeatedly to convert test-time compute into reasoning quality.
autoresearch, published by Andrej Karpathy, applies the loop paradigm to ML research. The idea is strikingly simple:
- A single modifiable file: `train.py`
- A single metric: `val_bpb` (validation bits per byte)
- A single instruction file: write only a research direction in `program.md`, and the agent autonomously designs and executes specific experiments.
Released on March 7, 2026, it accumulated 21K GitHub stars and 8.6 million views, becoming the symbol of the loop paradigm. The code is characterized by extreme simplicity: 630 lines, 3 files.
| Metric | Value |
|---|---|
| Total experiments | ~700 (auto-run overnight) |
| Optimizations found | 20 |
| Speed improvement | 11% |
| Code size | 630 lines, 3 files |
Shopify CEO Tobi Lütke applied autoresearch to the company’s ML pipeline and achieved 19% validation improvement from 37 experiments run in a single night. What would take a human researcher a week was done by an agent in 8 hours.
These results are captured in what has become known as “The Karpathy Loop”:
agent + single modifiable file + single metric + fixed time limit = automated research
The key is constraint design. One file, one metric, one time limit — the more you restrict the agent’s degrees of freedom, the higher the loop’s quality. This is exactly the same principle as Ralph loop’s “deterministic verification conditions.”
| Principle | Description | Why It Matters |
|---|---|---|
| Fixed time budget | Each experiment capped at 5 minutes | Fair comparison. If one experiment takes 40 minutes, three others can’t run |
| git branch-based | Success = commit, failure = reset | Failed experiment artifacts don’t contaminate the next experiment |
| Single metric | Only val_bpb is measured | Removes ambiguity. Answers “did it improve?” with a number |
| program.md | Research direction text file | Same role as PROMPT.md. Humans define strategy; agents execute tactics |
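The fixed-time-budget principle from the table can be sketched as a wrapper that scores timeouts and crashes as worst-case, so a broken experiment can never become the new best. The 300-second cap mirrors the table; the helper itself, and the assumption that the benchmark prints one number on stdout, are illustrative.

```python
import subprocess

TIME_BUDGET_S = 300        # fixed 5-minute cap: every experiment gets equal compute
WORST = float("inf")

def run_experiment(cmd: list[str]) -> float:
    """Run one experiment under the budget; return its single metric.

    Assumes the command prints one number (e.g. val_bpb) on stdout.
    Timeouts and crashes score as WORST, so they can never be 'best'.
    """
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=TIME_BUDGET_S, check=True)
        return float(out.stdout.strip())
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, ValueError):
        return WORST
```

Because every experiment returns exactly one number under exactly one budget, "did it improve?" reduces to a single comparison against the current best.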
Karpathy envisions the future of autoresearch as distributed research. Just as SETI@home distributed the search for extraterrestrial signals, hundreds of agents each run different experiments in parallel, and only improved results are merged to a central repository. An improvement discovered by one agent becomes the starting point for another agent’s next experiment.
autoresearch and the Ralph Loop follow the same pattern:
| Comparison | Ralph Loop | autoresearch |
|---|---|---|
| Target | Software code | ML training code |
| Verification condition | Tests pass | val_bpb improvement |
| State management | git checkout . | git reset |
| Instruction file | PROMPT.md | program.md |
| Essence | Deterministic verification + loop = quality | Deterministic metric + loop = performance |
The only difference is the verification condition. Ralph asks “does the code pass tests?”; autoresearch asks “does val_bpb improve?” The rest of the architecture — loop, git-based state management, text-file instructions — is identical.
| Item | Ralph Loop | RLM | autoresearch |
|---|---|---|---|
| Application domain | Software development | Long-context understanding | ML experimentation |
| Verification condition | Compile + test pass | Answer completeness | val_bpb value |
| State storage | git + filesystem | Python REPL variables | git branch |
| Context strategy | New context each loop (wiping) | Recursive context exploration | New context each experiment |
| Failure handling | git checkout . | Recursive splitting | git reset |
| Human role | Write PROMPT.md | Pose the question | Write program.md |
The three essential elements shared by all three loops: a deterministic verification condition, state reset to the last good point on failure, and a plain-text instruction file that defines the goal.
When these three elements are in place, the loop paradigm can be applied to any domain.
In 2026, Martin Fowler / ThoughtWorks classified the harness into two components in their analysis of agentic coding:
| Component | Direction | Role | Examples |
|---|---|---|---|
| Guides | Feedforward | Provide direction before the agent acts | PROMPT.md, CLAUDE.md, linter config, type definitions |
| Sensors | Feedback | Measure results after the agent acts | Test results, token usage, error logs, val_bpb |
Re-analyzing the three loops through this framework: the Ralph loop's guides are PROMPT.md and AGENTS.md and its sensors are the test results; RLM's guide is the question itself and its sensor is answer completeness; autoresearch's guide is program.md and its sensor is val_bpb.
LangChain further distills this as Agent = Model + Harness. The model is replaceable; the harness is a collection of domain-specific design decisions.
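A minimal sketch of the Agent = Model + Harness decomposition, with the harness split into feedforward guides and feedback sensors as in the Fowler/ThoughtWorks framing. All names here are illustrative; the point is that swapping `model` leaves the harness untouched.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """Guides feed forward (direction before acting); sensors feed back."""
    guides: dict[str, str]                     # e.g. {"PROMPT.md": "...", "lint": "..."}
    sensors: dict[str, Callable[[str], bool]]  # e.g. {"tests": run_tests}

@dataclass
class Agent:
    """Agent = Model + Harness. The model is the replaceable part."""
    model: Callable[[str], str]
    harness: Harness

    def step(self, task: str) -> tuple[str, dict[str, bool]]:
        # Feedforward: prepend every guide to the task before the model acts
        prompt = "\n".join(self.harness.guides.values()) + "\n" + task
        output = self.model(prompt)
        # Feedback: measure the result with every sensor after the model acts
        readings = {name: sensor(output)
                    for name, sensor in self.harness.sensors.items()}
        return output, readings
```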
Running loops costs tokens. Infinite iteration is uneconomical. In practice, you need to calculate the break-even point.
Estimated token cost per iteration (Sonnet 4.6, one code modification):
| Item | Tokens | Cost |
|---|---|---|
| Input (system prompt + code + error log) | ~2,000 | $0.03 |
| Output (modified code + explanation) | ~4,000 | $0.30 |
| Total per iteration | ~6,000 | ~$0.33 |
| Scenario | Loop cost | Comparison | Verdict |
|---|---|---|---|
| 10 iterations to fix a bug | $3.3 | Developer 30 min ($25) | Loop is 7.5× cheaper |
| 50 iterations for refactoring | $16.5 | Developer 2 hours ($100) | Loop is 6× cheaper |
| 200 iterations, fails to converge | $66 | Developer 2 hours ($100) | Still cheaper but inefficient |
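The break-even arithmetic above can be captured in a few lines. The $0.33 per-iteration figure comes from the cost table; the $50/hour developer rate is an assumption implied by the "30 min ($25)" row and should be adjusted to your team's actual loaded cost.

```python
COST_PER_ITERATION = 0.33   # from the table above: ~6,000 tokens per attempt
DEV_RATE_PER_HOUR = 50.0    # assumed loaded developer cost; adjust locally

def loop_cost(iterations: int) -> float:
    """Total token cost of running the loop for a given number of iterations."""
    return iterations * COST_PER_ITERATION

def break_even_iterations(dev_minutes: float) -> int:
    """Max iterations before the loop costs more than the human alternative."""
    human_cost = DEV_RATE_PER_HOUR * dev_minutes / 60
    return int(human_cost / COST_PER_ITERATION)
```

For a 30-minute bug fix this yields a budget of about 75 iterations, which is why the 200-iteration non-converging scenario in the table counts as inefficient even though it is nominally cheaper.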
Parts 1–5 covered the principles of the loop. Now we turn to the practical techniques for running loops faster, safer, and at scale. The 42 tips that Claude Code creator Boris Cherny released over January–February 2026 are not individual tricks — they form a stack in which each layer presupposes the one below. Understanding this stack lets you push Ralph loop throughput from single digits to double digits.
Boris runs 5 Claude Code instances in his terminal and 5–10 more on claude.ai/code simultaneously. This is not a multi-agent pipeline. It is one developer supervising multiple loops at the same time — real-world parallelism.
```bash
# Set up parallel sessions with tmux
tmux new-session -s loops -d
for i in 1 2 3 4; do tmux split-window -t loops; done
tmux select-layout -t loops tiled

# Run independent tasks in each pane
# pane 0: claude "Refactor frontend components"
# pane 1: claude "Write API endpoint tests"
# pane 2: claude "Update documentation"
# pane 3: claude "Fix lint errors"
# pane 4: claude "Run performance profiling"
```

The key is that each instance handles an independent task. There is no inter-agent communication. Git is the sole coordination mechanism — a natural extension of the Ralph loop's monolithic philosophy.
The `--worktree` Flag — Native Isolation

In Part 2 we saw that `/loop` automatically creates a worktree. The `--worktree` flag extends this isolation to ad-hoc sessions.
```bash
# Basic: give Claude a dedicated worktree
claude --worktree my_feature

# Also auto-create a tmux session
claude --worktree my_feature --tmux

# Real-world pattern: 10 parallel agents for a migration
for module in auth billing users payments notifications \
    search analytics admin logging config; do
  claude --worktree "migrate-${module}" --tmux \
    "Migrate sync I/O in module ${module} to async. \
    Open a PR once all tests pass."
done
```

```bash
# Dangerous: all agents share the same working directory
claude "Modify auth module"     # ← file collision risk
claude "Modify billing module"  # ← file collision risk
# One agent's git checkout . can delete another agent's work

# Safe: each agent works in an isolated filesystem
claude --worktree auth "Modify auth module"
claude --worktree billing "Modify billing module"
# Each worktree is an independent branch → merge conflicts resolved at PR stage
```

`/loop`'s worktree is optimized for time-based iteration; `--worktree` is optimized for parallel task distribution. The two are complementary.
/sandbox — Trust-Based Isolation

The MIG we learned in Week 3 provided GPU compute isolation — one student's OOM cannot propagate to another. `/sandbox` isolates BashTool's file and network access, providing a different dimension of protection.
| Dimension | MIG (Week 3) | /sandbox (Week 4) |
|---|---|---|
| What is isolated | GPU compute | Filesystem + network |
| Purpose | Resource protection — block one student’s OOM | Trust — accept agent edits faster |
| Mechanism | Hardware partitioning | BashTool file/network restriction |
| Effect | "This process cannot touch my GPU" | "If this agent makes a mistake, the blast radius is clear" |
Boris’s core insight: “When you trust the containment, you accept edits faster. That speeds up the whole loop.” Trusting the isolation reduces human review time per loop cycle, raising the total throughput of the loop.
What makes Boris’s 42 tips pedagogically valuable is that they form a stack, not a menu. Each layer presupposes the one beneath it:
Tracing the order in reverse explains why it matters:
- Parallel sessions and `/batch` → each needs worktree isolation

Using an upper layer without the lower one will fail. Running parallel sessions without worktrees causes file collisions; using worktrees without CLAUDE.md means agents are ignorant of project conventions.
Starting in early 2026, an extreme parallelization strategy called “Agentmaxxing” emerged: deploying multiple AI coding tools simultaneously in a single repo.
```bash
# Terminal 1: Claude Code (architecture design)
claude --worktree arch "Refactor module structure"

# Terminal 2: Codex (test writing)
codex --approval-mode full-auto "Add missing test cases"

# Terminal 3: Gemini CLI (documentation)
gemini "Auto-generate API docs from code"
```

Cursor 2.0 productized this pattern with Background Agents (running in isolated VMs) and Mission Control (a parallel agent dashboard). Codex CLI recorded 1.6 million weekly active users as of March 2026, establishing itself as the reference implementation for open-source harnesses.
We have covered the theory and techniques. The question now: can loops alone build production software? In March 2026, OpenAI’s case of building an internal product using only the Codex agent provides the answer.
The OpenAI engineering team used Codex to build an internal product:
The five patterns derived from this process are a field manual for harness engineering:
| Pattern | Principle | Ralph Loop Perspective |
|---|---|---|
| Repo as System of Record | Code = single source of truth. All decisions reflected in code, not verbal agreements or wikis | PROMPT.md + AGENTS.md serve this role |
| Application Legibility | Write code that agents can read. Clear variable names, types, and comments are prerequisites for correct agent modification | Prerequisite for backpressure — linters and type checkers need readable code to function |
| Layered Domain Architecture | Clearly separate domain layers. Changes to one layer don’t propagate to others | Prerequisite for parallel worktrees — modules must be separated for parallel modification |
| Minimal Merge Gates | Minimize merge gates. Auto-merge on test pass | The key to loop speed — approval wait time determines loop efficiency |
| Entropy Management | Actively manage codebase disorder (entropy). Prevent technical debt accumulation via loops | /simplify pattern — a separate agent periodically cleans up code quality |
Why is `/sandbox`'s purpose trust, not security? Discuss specifically what behavioral changes "isolation increases trust" actually implies.

Set up the project structure
```bash
mkdir ralph-project && cd ralph-project
git init
touch PROMPT.md AGENTS.md
mkdir tests
```

Write PROMPT.md
```markdown
# Current Task
Implement the following items in order:
- [ ] Implement add(a, b) function in calculator.py
- [ ] Implement subtract(a, b) function in calculator.py
- [ ] Implement divide(a, b) function in calculator.py (handle ZeroDivisionError)

# Constraints
- Implement only one function at a time
- Include type hints in all functions
- Do not write code without tests
- Write tests in tests/test_calculator.py before implementing

# State Tracking
- Always read AGENTS.md before starting
- On failure: record the cause in AGENTS.md, then exit
- On success: record the success pattern in AGENTS.md, then proceed to the next task
```

Write harness.sh — backpressure + garbage collection
```bash
#!/bin/bash
# harness.sh — Ralph loop harness

set -e
MAX_RETRIES=10
RETRY_COUNT=0

while true; do
  echo "=== Ralph Loop #$((RETRY_COUNT + 1)) ==="

  # Run the agent (its exit code is not the verdict; the tests below are)
  cat PROMPT.md | claude-code || true

  # Backpressure: type check + test run
  if python -m py_compile calculator.py 2>/dev/null && \
     python -m pytest tests/ -q 2>/dev/null; then
    echo "Tests passed — committing and moving on"
    git add -A && git commit -m "loop $((RETRY_COUNT + 1)): task completed"
    RETRY_COUNT=0
  else
    echo "Tests failed — garbage collection + retry"
    RETRY_COUNT=$((RETRY_COUNT + 1))

    # Stuck detection
    if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
      echo "STUCK: $MAX_RETRIES consecutive failures. Splitting task."
      echo "- STUCK at loop $RETRY_COUNT: $(date)" >> AGENTS.md
      git add AGENTS.md && git commit -m "stuck: recorded failure pattern"
      RETRY_COUNT=0
    fi

    # Garbage collection — remove failed code
    git checkout -- calculator.py tests/ 2>/dev/null || true
    sleep 2
  fi
done
```

Initialize the AGENTS.md cumulative learning structure
```markdown
# AGENTS.md — Cumulative Learning Record

## Learned Patterns
(filled automatically after loop runs)

## Forbidden Patterns
(anti-patterns learned from repeated failures)

## Progress Status
- [ ] add() implementation
- [ ] subtract() implementation
- [ ] divide() implementation
```

Run and observe — minimum 3 loops
```bash
chmod +x harness.sh
./harness.sh
```

Observation points:
- Does `git log --oneline` show only successful commits (no failed code committed)?

Mini autoresearch pattern experiment (optional)
An autoresearch pattern to optimize the execution time of a simple Python function:
```bash
#!/bin/bash
# Mini autoresearch harness — optimize a function's runtime

BEST_TIME=999

while true; do
  # Agent modifies optimize.py
  cat program.md | claude-code

  # Run benchmark with 5-minute time limit
  CURRENT_TIME=$(timeout 300 python benchmark.py 2>/dev/null || echo 999)

  if (( $(echo "$CURRENT_TIME < $BEST_TIME" | bc -l) )); then
    echo "Improved: $BEST_TIME -> $CURRENT_TIME"
    BEST_TIME=$CURRENT_TIME
    git add -A && git commit -m "improved: time=$CURRENT_TIME"
  else
    echo "No improvement ($CURRENT_TIME >= $BEST_TIME) — reset"
    git checkout .
  fi
done
```

Observation points:

- Is failed code fully cleaned up (`git checkout .`)?
- Does `git log` show only successful commits? (No failed code committed?)

Submission deadline: 2026-03-31 23:59
Submission path: assignments/week-04/[student ID]/
Required deliverables (5 items):
- `harness.sh` — Ralph harness with backpressure + garbage collection
- `PROMPT.md` — specification of at least 3 tasks
- `AGENTS.md` — cumulative learning record after loop runs (must show failure-after-failure-then-success across at least 2 failures)
- Execution log (`loop_log.txt`)
- `README.md` — harness design decisions + explanation of the connection to test-time compute scaling

Bonus items (5 items):
- Experiment with the official `/loop` command

Evaluation criteria: