Theory Perspective
Explain why test-time compute scaling is the theoretical foundation of the loop paradigm.
Ralph Perspective
Understand and implement the working principles of harness, backpressure, garbage collection, and cumulative learning.
Extension Perspective
Understand that RLM (recursive reasoning) and autoresearch (autonomous experimentation) are different applications of the same loop principle.
Implementation Perspective
Implement a Ralph loop yourself and experience the failure → learning → success cycle firsthand.
A single loop running overnight delivers better results than a complex pipeline.
The decisive edge in 2026 AI engineering is neither a bigger model nor more sophisticated prompts. It is loop iteration. Run the same model hundreds of times, verify the result each time, discard failures, and accumulate successes. Simple — yet this pattern consistently beats complex multi-agent architectures.
The three loops covered this week are proof:
- **Ralph Loop**: generate code, run the tests, and on failure `git checkout .` and retry. The code-quality loop.
- **RLM**: the model recursively calls itself, reading only the parts of a long context it needs. The reasoning loop.
- **autoresearch**: hand `train.py` to an agent, measure results every 5 minutes, commit if improved, reset if not. The research loop.

All three loops rest on the same principle. The governance learned in Week 2 sets the safety boundaries for loops; the MIG isolation and MCP tool standardization learned in Week 3 provide the infrastructure. Today we understand and implement the loops that actually run on top of that infrastructure.
There is a key finding validated by OpenAI o1 in 2024: even without increasing model size, performance improves when more compute is spent at inference time. This is test-time compute scaling.
The traditional AI performance strategy was train-time scaling — larger models, more data, longer training runs. The jump from GPT-3 to GPT-4 is the canonical example. But this approach incurs exponentially increasing costs.
Test-time compute scaling takes a different path. It makes an already-trained model think longer and harder during inference: longer reasoning chains, multiple sampled attempts, and verification of each attempt before accepting an answer.
Ralph, RLM, and autoresearch are all concrete realizations of test-time compute scaling:
| Loop | Application Domain | How Test-Time Compute Is Used |
|---|---|---|
| Ralph Loop | Code generation | Generate code → run tests → retry on failure. 10 attempts = 10× the inference tokens |
| RLM | Long-context understanding | Model recursively calls itself. Inference compute grows with recursion depth |
| autoresearch | ML experimentation | 5-minute budget × N iterations. Compute invested scales with iteration count |
See the pattern? All three call the same model repeatedly while filtering results through deterministic verification conditions. Rather than changing the model itself, they increase the time and number of times the model thinks to ensure quality.
The 2026 SWE-bench benchmark provides empirical proof of this lecture’s central claim:
| Benchmark | Claude Opus 4.5 Score | Description |
|---|---|---|
| SWE-bench Verified | 80.9% | Validated issue set, standard scaffold |
| SWE-bench Pro | 45.9% | More realistic issues, minimal scaffold |
The same model yields a 35 pp gap purely from the difference in harness/scaffold. GPT-5.3-Codex leads on Pro at 56.8%, but that too is thanks to OpenAI’s own agent harness.
The T2 scaling laws published in April 2026 (arxiv 2604.01411) dig one level deeper into the economics of test-time compute. The original Chinchilla work (2022) optimized only training compute costs; T2 recalculates the optimal point including inference costs.
Key findings:
The `effort` Parameter

In Claude 4.x, the old `budget_tokens` setting (where the developer specified the token count directly) was deprecated and replaced by an adaptive `effort` parameter:
| effort level | Behavior | Use case |
|---|---|---|
| low | Short reasoning, fast response | Simple lookups, type checks |
| medium | Standard reasoning | General coding |
| high | Deep reasoning, more tokens | Complex refactoring |
| max | Maximum reasoning depth | Architecture design, debugging |
This reflects the trend of test-time compute allocation moving from developer control to model-autonomous control. Loop paradigm implication: a strategy of dynamically adjusting effort each iteration is possible — explore quickly with low early on, then drill deep with high once a promising direction emerges.
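The per-iteration strategy just described can be sketched as a small scheduling policy. Only the effort level names come from the table above; the policy function itself is an illustrative assumption, not an official API.

```python
def choose_effort(iteration: int, last_score: float, best_score: float) -> str:
    """Heuristic effort schedule for a loop: explore cheaply, exploit deeply."""
    if iteration < 3:
        return "low"      # early iterations: map the territory fast
    if last_score > best_score:
        return "high"     # a promising direction emerged: drill deep
    return "medium"       # steady state: standard reasoning

# Schedule over a toy run where the score jumps at iteration 3
scores = [0.1, 0.1, 0.2, 0.5, 0.4]
best = 0.0
efforts = []
for i, s in enumerate(scores):
    efforts.append(choose_effort(i, s, best))
    best = max(best, s)
```

In this toy run the schedule produces three cheap exploratory iterations, one deep dive at the improvement, and a return to standard effort afterward.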
The Ralph Loop (Ralph Wiggum Loop), popularized by developer Geoffrey Huntley in late 2025, is the core paradigm of agentic software development.
```bash
# The essence of the Ralph loop — a single line
while :; do cat PROMPT.md | claude-code; done

# The same pattern works with other AI coding CLI tools:
# while :; do cat PROMPT.md | gemini; done
# while :; do codex --approval-mode full-auto "$(cat PROMPT.md)"; done
```

Two design decisions run through this infinite loop:
This simplicity is the point. No inter-agent communication, no state DB, no orchestrator. The intelligence of the loop lies not in the loop itself but in the environmental constraints — this is the harness.
The Ralph loop is intentionally monolithic:
(Diagram: a complex microservices architecture contrasted with the simple Ralph loop)
The three pillars of the harness:
- **Backpressure**: failed code does not pass. If `pytest` fails, that code is as good as nonexistent.
- **Garbage collection**: `git checkout .` completely removes failed code. The repository always returns to the last successful state. Not like a Jenga tower where a wrong block topples everything, but like a potter's wheel where you rework the clay if the shape isn't right.
- **Cumulative learning**: each loop records what it learned, so later loops start smarter.

The evolved form of the Ralph loop goes beyond simple repetition to a loop that learns. The key is the AGENTS.md file.
```markdown
# AGENTS.md — Cumulative Learning Record

## Learned Patterns
- The division function in calculator.py requires ZeroDivisionError handling (failed in loop 3)
- pytest fixtures must be placed in conftest.py to prevent import errors (failed in loop 5)

## Forbidden Patterns
- Do not use `eval()` — security risk + bypasses type checker
- Do not share state via global variables — breaks test isolation

## Current Status
- [x] add() implementation complete
- [x] subtract() implementation complete
- [ ] multiply() in progress
```

At the end of each loop the agent records what it learned in this loop in AGENTS.md. The next loop's agent reads this file before starting, so it does not repeat the same mistakes.
This is fundamentally different from fine-tuning: fine-tuning updates model weights through slow, expensive training runs, while AGENTS.md changes only the context. The update is instant, human-readable, and versioned in git.
“Failure itself is information” — an agent’s failed attempt is “deterministically bad,” and this information becomes the input for the next loop. If the same task fails 10 or more times in a row, it is judged stuck, and the task is split into smaller units before retrying.
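The retry-then-split behavior can be sketched in a few lines. Here `attempt_fn` is a stand-in for one agent iteration plus its deterministic verification, and the halving split is a naive illustration; a real harness would split a task description semantically, not by string length.

```python
MAX_RETRIES = 10

def run_with_split(task: str, attempt_fn, depth: int = 0) -> list[str]:
    """Retry a task; after MAX_RETRIES consecutive failures, split it."""
    for _ in range(MAX_RETRIES):
        if attempt_fn(task):
            return [task]            # verified success: keep the result
    if depth >= 3 or len(task) < 2:
        raise RuntimeError(f"stuck even after splitting: {task!r}")
    mid = len(task) // 2             # judged stuck: split into smaller units
    left, right = task[:mid], task[mid:]
    return run_with_split(left, attempt_fn, depth + 1) + \
           run_with_split(right, attempt_fn, depth + 1)
```

The key property: failure is never silent. Either the loop converges on smaller subtasks, or it surfaces a hard error for a human.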
In January 2026, an interesting “showdown” took place between Huntley and Anthropic. Anthropic proposed a stop-hook plugin for the Ralph loop that used a continuous session approach (repeating while maintaining the same context).
| Approach | Method | Advantages | Disadvantages |
|---|---|---|---|
| Original Ralph | New session every loop | Context Rot completely prevented, deterministic | Loses context from previous attempts |
| Anthropic plugin | Iterate within continuous session | Remembers previous attempts, faster convergence | Context Rot risk, non-deterministic |
Huntley chose fresh context. His reasoning: “If you fail 30 times in a continuous session, the context fills up with failure history and useful reasoning becomes impossible. Better to record just the essentials in AGENTS.md and start clean every time.”
This trade-off connects directly to Week 5’s Context Rot — Ralph’s fresh context strategy is the first solution to Context Rot.
/loop — Official Automation Loop

The `/loop` command, officially shipped in Claude Code by Anthropic in 2026, implements the Ralph loop philosophy at the product level. Instead of a manual `while` shell script, a single line spins up a schedule-based autonomous agent.
```bash
# Basic syntax
claude /loop "<instruction>" --every <interval> --for <duration>

# Example: find and fix failing tests every 2 hours, for up to 3 days
claude /loop "check for failing tests and fix them" --every 2h --for 3d
```

Core design principles:
| Element | Description |
|---|---|
| Worktree isolation | Every iteration runs via git worktree without affecting the main branch |
| CLAUDE.md = control plane | CLAUDE.md is read every cycle, so modifying the instruction file changes the behavior of a running loop |
| 3-day expiry | Maximum --for 3d. Intentional design to prevent context drift in forgotten autonomous agents |
| 3 flags | "instruction", --every, --for — this is the entire API surface |
Why the 3-day expiry is a feature: if a loop set on Tuesday runs through Friday, by then 15 PRs merged by the team will conflict with it. An agent confidently patching with stale context creates problems harder to debug than the original bug. Re-evaluating and restarting every 72 hours is the safe approach.
Three validated workflows:
```bash
claude /loop "check open PRs on the current branch. If CI is failing,
read the error logs, fix the issue, and push. If CI passes and the PR
has no requested changes, post a comment saying 'Ready for human review.'
Summarize what you did in the PR description." --every 30m --for 2d
```

CI pipeline monitoring, automatic lint/type error fixes, notification when ready. Cannot handle failures requiring business logic judgment.
```bash
claude /loop "run pnpm audit. If any high or critical vulnerabilities exist,
create a branch, update the affected packages, run the test suite, and
open a PR if tests pass. Include the vulnerability details in the PR body."
--every 4h --for 3d
```

Clear success criteria (tests pass) and minimal business context — an ideal task for /loop.
```bash
claude /loop "summarize all commits merged to main in the last 24 hours.
Include: PR titles, authors, files changed count, and any test coverage
changes. Write the summary as a Markdown standup update and save it to
./reports/standup-$(date +%Y-%m-%d).md" --every 24h --for 3d
```

Automates a manual process. File output provides basic functionality even without Slack MCP integration.
When /loop fails:
Even with a 200K-token context window, when a long document is passed to an LLM, understanding of the middle sections degrades compared to the beginning and end (the “Lost in the Middle” phenomenon). The existing solutions were two:
RLM (Recursive Language Model) takes a fundamentally different approach. It loads the long prompt into Python REPL variables, then has the model write code to extract only the needed portions and recursively call itself.
The core idea: instead of enlarging the context window, let the model decide for itself how to read the context.
```python
# RLM pseudocode — recursive call pattern
def rlm_solve(question: str, documents: list[str]) -> str:
    """The model recursively calls itself to process long documents."""

    # Step 1: load all documents into Python variables
    context_vars = {f"doc_{i}": doc for i, doc in enumerate(documents)}

    # Step 2: ask the model to write code that decides "which parts to read"
    planning_code = llm_call(
        f"Write Python code to determine which parts of {len(documents)} documents "
        f"need to be read to answer the following question.\n"
        f"Question: {question}"
    )

    # Step 3: execute the code → extract only relevant parts
    relevant_parts = execute(planning_code, context_vars)

    # Step 4: recursively call itself with the extracted parts
    if fits_in_context(relevant_parts):
        return llm_call(f"Question: {question}\nContext: {relevant_parts}")
    else:
        # Still too large — recurse again
        return rlm_solve(question, split(relevant_parts))
```

The 2025 results from Zhang et al. are striking: GPT-5-mini + RLM achieved more than 2× the performance of GPT-5 alone on the OOLONG benchmark. A smaller model outperformed a larger model purely through recursive calls.
Why this is possible: the full documents live in REPL variables rather than in the prompt, so each call sees only a small, relevant slice. The slices stay short enough to avoid the Lost in the Middle degradation, and the model itself decides which slice to read next.
The Ralph Loop and RLM are different applications of the same principle:
| Comparison | Ralph Loop | RLM |
|---|---|---|
| Iteration target | Code generation | Context exploration |
| Call pattern | Repeated calls from a shell loop | Model recursively calls itself |
| Verification condition | Test pass/fail | Answer completeness |
| State storage | git + filesystem | Python REPL variables |
Common thread: both call the same model repeatedly to convert test-time compute into reasoning quality.
autoresearch, published by Andrej Karpathy, applies the loop paradigm to ML research. The idea is strikingly simple:
- A single modifiable file: `train.py`
- A single metric: `val_bpb` (validation bits per byte)
- A single instruction file: write only a research direction in `program.md`, and the agent autonomously designs and executes specific experiments.
Released on March 7, 2026, it accumulated 21K GitHub stars and 8.6 million views, becoming the symbol of the loop paradigm. The code is characterized by extreme simplicity: 630 lines, 3 files.
| Metric | Value |
|---|---|
| Total experiments | ~700 (auto-run overnight) |
| Optimizations found | 20 |
| Speed improvement | 11% |
| Code size | 630 lines, 3 files |
Shopify CEO Tobi Lütke applied autoresearch to the company’s ML pipeline and achieved 19% validation improvement from 37 experiments run in a single night. What would take a human researcher a week was done by an agent in 8 hours.
These results are captured in what has become known as “The Karpathy Loop”:
agent + single modifiable file + single metric + fixed time limit = automated research
The key is constraint design. One file, one metric, one time limit — the more you restrict the agent’s degrees of freedom, the higher the loop’s quality. This is exactly the same principle as Ralph loop’s “deterministic verification conditions.”
| Principle | Description | Why It Matters |
|---|---|---|
| Fixed time budget | Each experiment capped at 5 minutes | Fair comparison. If one experiment takes 40 minutes, three others can’t run |
| git branch-based | Success = commit, failure = reset | Failed experiment artifacts don’t contaminate the next experiment |
| Single metric | Only val_bpb is measured | Removes ambiguity. Answers “did it improve?” with a number |
| program.md | Research direction text file | Same role as PROMPT.md. Humans define strategy; agents execute tactics |
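The fixed-time-budget principle from the table can be sketched as a wrapper that scores timeouts and crashes as worst-case, so a broken experiment can never become the new best. The 300-second cap mirrors the table; the helper itself, and the assumption that the benchmark prints one number on stdout, are illustrative.

```python
import subprocess

TIME_BUDGET_S = 300        # fixed 5-minute cap: every experiment gets equal compute
WORST = float("inf")

def run_experiment(cmd: list[str]) -> float:
    """Run one experiment under the budget; return its single metric.

    Assumes the command prints one number (e.g. val_bpb) on stdout.
    Timeouts and crashes score as WORST, so they can never be 'best'.
    """
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=TIME_BUDGET_S, check=True)
        return float(out.stdout.strip())
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, ValueError):
        return WORST
```

Because every experiment returns exactly one number under exactly one budget, "did it improve?" reduces to a single comparison against the current best.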
Karpathy envisions the future of autoresearch as distributed research. Just as SETI@home distributed the search for extraterrestrial signals, hundreds of agents each run different experiments in parallel, and only improved results are merged to a central repository. An improvement discovered by one agent becomes the starting point for another agent’s next experiment.
autoresearch and the Ralph Loop follow the same pattern:
| Comparison | Ralph Loop | autoresearch |
|---|---|---|
| Target | Software code | ML training code |
| Verification condition | Tests pass | val_bpb improvement |
| State management | git checkout . | git reset |
| Instruction file | PROMPT.md | program.md |
| Essence | Deterministic verification + loop = quality | Deterministic metric + loop = performance |
The only difference is the verification condition. Ralph asks “does the code pass tests?”; autoresearch asks “does val_bpb improve?” The rest of the architecture — loop, git-based state management, text-file instructions — is identical.
| Item | Ralph Loop | RLM | autoresearch |
|---|---|---|---|
| Application domain | Software development | Long-context understanding | ML experimentation |
| Verification condition | Compile + test pass | Answer completeness | val_bpb value |
| State storage | git + filesystem | Python REPL variables | git branch |
| Context strategy | New context each loop (wiping) | Recursive context exploration | New context each experiment |
| Failure handling | git checkout . | Recursive splitting | git reset |
| Human role | Write PROMPT.md | Pose the question | Write program.md |
The three essential elements shared by all three loops: a deterministic verification condition, state reset to the last good point on failure, and a plain-text instruction file that defines the goal.
When these three elements are in place, the loop paradigm can be applied to any domain.
In 2026, Martin Fowler / ThoughtWorks classified the harness into two components in their analysis of agentic coding:
| Component | Direction | Role | Examples |
|---|---|---|---|
| Guides | Feedforward | Provide direction before the agent acts | PROMPT.md, CLAUDE.md, linter config, type definitions |
| Sensors | Feedback | Measure results after the agent acts | Test results, token usage, error logs, val_bpb |
Re-analyzing the three loops through this framework: the Ralph loop's guides are PROMPT.md and AGENTS.md and its sensors are the test results; RLM's guide is the question itself and its sensor is answer completeness; autoresearch's guide is program.md and its sensor is val_bpb.
LangChain further distills this as Agent = Model + Harness. The model is replaceable; the harness is a collection of domain-specific design decisions.
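A minimal sketch of the Agent = Model + Harness decomposition, with the harness split into feedforward guides and feedback sensors as in the Fowler/ThoughtWorks framing. All names here are illustrative; the point is that swapping `model` leaves the harness untouched.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """Guides feed forward (direction before acting); sensors feed back."""
    guides: dict[str, str]                     # e.g. {"PROMPT.md": "...", "lint": "..."}
    sensors: dict[str, Callable[[str], bool]]  # e.g. {"tests": run_tests}

@dataclass
class Agent:
    """Agent = Model + Harness. The model is the replaceable part."""
    model: Callable[[str], str]
    harness: Harness

    def step(self, task: str) -> tuple[str, dict[str, bool]]:
        # Feedforward: prepend every guide to the task before the model acts
        prompt = "\n".join(self.harness.guides.values()) + "\n" + task
        output = self.model(prompt)
        # Feedback: measure the result with every sensor after the model acts
        readings = {name: sensor(output)
                    for name, sensor in self.harness.sensors.items()}
        return output, readings
```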
Running loops costs tokens. Infinite iteration is uneconomical. In practice, you need to calculate the break-even point.
Estimated token cost per iteration (Sonnet 4.6, one code modification):
| Item | Tokens | Cost |
|---|---|---|
| Input (system prompt + code + error log) | ~2,000 | $0.03 |
| Output (modified code + explanation) | ~4,000 | $0.30 |
| Total per iteration | ~6,000 | ~$0.33 |
| Scenario | Loop cost | Comparison | Verdict |
|---|---|---|---|
| 10 iterations to fix a bug | $3.3 | Developer 30 min ($25) | Loop is 7.5× cheaper |
| 50 iterations for refactoring | $16.5 | Developer 2 hours ($100) | Loop is 6× cheaper |
| 200 iterations, fails to converge | $66 | Developer 2 hours ($100) | Still cheaper but inefficient |
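The break-even arithmetic above can be captured in a few lines. The $0.33 per-iteration figure comes from the cost table; the $50/hour developer rate is an assumption implied by the "30 min ($25)" row and should be adjusted to your team's actual loaded cost.

```python
COST_PER_ITERATION = 0.33   # from the table above: ~6,000 tokens per attempt
DEV_RATE_PER_HOUR = 50.0    # assumed loaded developer cost; adjust locally

def loop_cost(iterations: int) -> float:
    """Total token cost of running the loop for a given number of iterations."""
    return iterations * COST_PER_ITERATION

def break_even_iterations(dev_minutes: float) -> int:
    """Max iterations before the loop costs more than the human alternative."""
    human_cost = DEV_RATE_PER_HOUR * dev_minutes / 60
    return int(human_cost / COST_PER_ITERATION)
```

For a 30-minute bug fix this yields a budget of about 75 iterations, which is why the 200-iteration non-converging scenario in the table counts as inefficient even though it is nominally cheaper.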
Parts 1–5 covered the principles of the loop. Now we turn to the practical techniques for running loops faster, safer, and at scale. The 42 tips that Claude Code creator Boris Cherny released over January–February 2026 are not individual tricks — they form a stack in which each layer presupposes the one below. Understanding this stack lets you push Ralph loop throughput from single digits to double digits.
Boris runs 5 Claude Code instances in his terminal and 5–10 more on claude.ai/code simultaneously. This is not a multi-agent pipeline. It is one developer supervising multiple loops at the same time — real-world parallelism.
```bash
# Set up parallel sessions with tmux
tmux new-session -s loops -d
for i in 1 2 3 4; do tmux split-window -t loops; done
tmux select-layout -t loops tiled

# Run independent tasks in each pane
# pane 0: claude "Refactor frontend components"
# pane 1: claude "Write API endpoint tests"
# pane 2: claude "Update documentation"
# pane 3: claude "Fix lint errors"
# pane 4: claude "Run performance profiling"
```

The key is that each instance handles an independent task. There is no inter-agent communication. Git is the sole coordination mechanism — a natural extension of the Ralph loop's monolithic philosophy.
The `--worktree` Flag — Native Isolation

In Part 2 we saw that `/loop` automatically creates a worktree. The `--worktree` flag extends this isolation to ad-hoc sessions.
```bash
# Basic: give Claude a dedicated worktree
claude --worktree my_feature

# Also auto-create a tmux session
claude --worktree my_feature --tmux

# Real-world pattern: 10 parallel agents for a migration
for module in auth billing users payments notifications \
    search analytics admin logging config; do
  claude --worktree "migrate-${module}" --tmux \
    "Migrate sync I/O in module ${module} to async. \
    Open a PR once all tests pass."
done
```

```bash
# Dangerous: all agents share the same working directory
claude "Modify auth module"     # ← file collision risk
claude "Modify billing module"  # ← file collision risk
# One agent's git checkout . can delete another agent's work

# Safe: each agent works in an isolated filesystem
claude --worktree auth "Modify auth module"
claude --worktree billing "Modify billing module"
# Each worktree is an independent branch → merge conflicts resolved at PR stage
```

`/loop`'s worktree is optimized for time-based iteration; `--worktree` is optimized for parallel task distribution. The two are complementary.
/sandbox — Trust-Based Isolation

The MIG we learned in Week 3 provided GPU compute isolation — one student's OOM cannot propagate to another. `/sandbox` isolates BashTool's file and network access, providing a different dimension of protection.
| Dimension | MIG (Week 3) | /sandbox (Week 4) |
|---|---|---|
| What is isolated | GPU compute | Filesystem + network |
| Purpose | Resource protection — block one student’s OOM | Trust — accept agent edits faster |
| Mechanism | Hardware partitioning | BashTool file/network restriction |
| Effect | "This process cannot touch my GPU" | "If this agent makes a mistake, the blast radius is clear" |
Boris’s core insight: “When you trust the containment, you accept edits faster. That speeds up the whole loop.” Trusting the isolation reduces human review time per loop cycle, raising the total throughput of the loop.
What makes Boris’s 42 tips pedagogically valuable is that they form a stack, not a menu. Each layer presupposes the one beneath it:
Tracing the order in reverse explains why it matters:
- Parallel sessions and `/batch` → each needs worktree isolation

Using an upper layer without the lower one will fail. Running parallel sessions without worktrees causes file collisions; using worktrees without CLAUDE.md means agents are ignorant of project conventions.
Starting in early 2026, an extreme parallelization strategy called “Agentmaxxing” emerged: deploying multiple AI coding tools simultaneously in a single repo.
```bash
# Terminal 1: Claude Code (architecture design)
claude --worktree arch "Refactor module structure"

# Terminal 2: Codex (test writing)
codex --approval-mode full-auto "Add missing test cases"

# Terminal 3: Gemini CLI (documentation)
gemini "Auto-generate API docs from code"
```

Cursor 2.0 productized this pattern with Background Agents (running in isolated VMs) and Mission Control (a parallel agent dashboard). Codex CLI recorded 1.6 million weekly active users as of March 2026, establishing itself as the reference implementation for open-source harnesses.
We have covered the theory and techniques. The question now: can loops alone build production software? In March 2026, OpenAI’s case of building an internal product using only the Codex agent provides the answer.
The OpenAI engineering team used Codex to build an internal product:
The five patterns derived from this process are a field manual for harness engineering:
| Pattern | Principle | Ralph Loop Perspective |
|---|---|---|
| Repo as System of Record | Code = single source of truth. All decisions reflected in code, not verbal agreements or wikis | PROMPT.md + AGENTS.md serve this role |
| Application Legibility | Write code that agents can read. Clear variable names, types, and comments are prerequisites for correct agent modification | Prerequisite for backpressure — linters and type checkers need readable code to function |
| Layered Domain Architecture | Clearly separate domain layers. Changes to one layer don’t propagate to others | Prerequisite for parallel worktrees — modules must be separated for parallel modification |
| Minimal Merge Gates | Minimize merge gates. Auto-merge on test pass | The key to loop speed — approval wait time determines loop efficiency |
| Entropy Management | Actively manage codebase disorder (entropy). Prevent technical debt accumulation via loops | /simplify pattern — a separate agent periodically cleans up code quality |
Why is `/sandbox`'s purpose trust, not security? Discuss specifically what behavioral changes "isolation increases trust" actually implies.

Set up the project structure
```bash
mkdir ralph-project && cd ralph-project
git init
touch PROMPT.md AGENTS.md
mkdir tests
```

Write PROMPT.md
```markdown
# Current Task
Implement the following items in order:
- [ ] Implement add(a, b) function in calculator.py
- [ ] Implement subtract(a, b) function in calculator.py
- [ ] Implement divide(a, b) function in calculator.py (handle ZeroDivisionError)

# Constraints
- Implement only one function at a time
- Include type hints in all functions
- Do not write code without tests
- Write tests in tests/test_calculator.py before implementing

# State Tracking
- Always read AGENTS.md before starting
- On failure: record the cause in AGENTS.md, then exit
- On success: record the success pattern in AGENTS.md, then proceed to the next task
```

Write harness.sh — backpressure + garbage collection
```bash
#!/bin/bash
# harness.sh — Ralph loop harness

set -e
MAX_RETRIES=10
RETRY_COUNT=0

while true; do
  echo "=== Ralph Loop #$((RETRY_COUNT + 1)) ==="

  # Run the agent (its exit code is not the verdict; the tests below are)
  cat PROMPT.md | claude-code || true

  # Backpressure: type check + test run
  if python -m py_compile calculator.py 2>/dev/null && \
     python -m pytest tests/ -q 2>/dev/null; then
    echo "Tests passed — committing and moving on"
    git add -A && git commit -m "loop $((RETRY_COUNT + 1)): task completed"
    RETRY_COUNT=0
  else
    echo "Tests failed — garbage collection + retry"
    RETRY_COUNT=$((RETRY_COUNT + 1))

    # Stuck detection
    if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
      echo "STUCK: $MAX_RETRIES consecutive failures. Splitting task."
      echo "- STUCK at loop $RETRY_COUNT: $(date)" >> AGENTS.md
      git add AGENTS.md && git commit -m "stuck: recorded failure pattern"
      RETRY_COUNT=0
    fi

    # Garbage collection — remove failed code
    git checkout -- calculator.py tests/ 2>/dev/null || true
    sleep 2
  fi
done
```

Initialize the AGENTS.md cumulative learning structure
```markdown
# AGENTS.md — Cumulative Learning Record

## Learned Patterns
(filled automatically after loop runs)

## Forbidden Patterns
(anti-patterns learned from repeated failures)

## Progress Status
- [ ] add() implementation
- [ ] subtract() implementation
- [ ] divide() implementation
```

Run and observe — minimum 3 loops
```bash
chmod +x harness.sh
./harness.sh
```

Observation points:
- Does `git log --oneline` show only successful commits (no failed code committed)?

Mini autoresearch pattern experiment (optional)
An autoresearch pattern to optimize the execution time of a simple Python function:
```bash
#!/bin/bash
# Mini autoresearch harness — optimize a function's runtime

BEST_TIME=999

while true; do
  # Agent modifies optimize.py
  cat program.md | claude-code

  # Run benchmark with 5-minute time limit
  CURRENT_TIME=$(timeout 300 python benchmark.py 2>/dev/null || echo 999)

  if (( $(echo "$CURRENT_TIME < $BEST_TIME" | bc -l) )); then
    echo "Improved: $BEST_TIME -> $CURRENT_TIME"
    BEST_TIME=$CURRENT_TIME
    git add -A && git commit -m "improved: time=$CURRENT_TIME"
  else
    echo "No improvement ($CURRENT_TIME >= $BEST_TIME) — reset"
    git checkout .
  fi
done
```

Observation points:

- Is failed code fully cleaned up (`git checkout .`)?
- Does `git log` show only successful commits? (No failed code committed?)

Submission deadline: 2026-03-31 23:59
Submission path: assignments/week-04/[student ID]/
Required deliverables (5 items):
- `harness.sh` — Ralph harness with backpressure + garbage collection
- `PROMPT.md` — specification of at least 3 tasks
- `AGENTS.md` — cumulative learning record after loop runs (must show failure-after-failure-then-success across at least 2 failures)
- Execution log (`loop_log.txt`)
- `README.md` — harness design decisions + explanation of the connection to test-time compute scaling

Bonus items (5 items):
- Experiment with the official `/loop` command

Evaluation criteria: