
Week 4: Loop Paradigm — Iteration Beats Complexity

Phase 2 · Week 4 · Intermediate · Lecture: 2026-03-24

Theory Perspective

Explain why test-time compute scaling is the theoretical foundation of the loop paradigm.

Ralph Perspective

Understand and implement the working principles of harness, backpressure, garbage collection, and cumulative learning.

Extension Perspective

Understand that RLM (recursive reasoning) and autoresearch (autonomous experimentation) are different applications of the same loop principle.

Implementation Perspective

Implement a Ralph loop yourself and experience the failure → learning → success cycle firsthand.


Why Loops — The Decisive Edge in 2026 AI


A single loop running overnight delivers better results than a complex pipeline.

The decisive edge in 2026 AI engineering is neither a bigger model nor more sophisticated prompts. It is loop iteration. Run the same model hundreds of times, verify the result each time, discard failures, and accumulate successes. Simple — yet this pattern consistently beats complex multi-agent architectures.

The three loops covered this week are proof:

  • Ralph Loop — generate code, verify with tests, discard failures with git checkout ., and retry. The code-quality loop.
  • RLM — the model recursively calls itself to extract only the relevant information from long documents. The context-comprehension loop.
  • autoresearch — hand train.py to an agent, measure results every 5 minutes, commit if improved, reset if not. The research loop.

All three loops rest on the same principle. The governance learned in Week 2 sets the safety boundaries for loops; the MIG isolation and MCP tool standardization learned in Week 3 provide the infrastructure. Today we understand and implement the loops that actually run on top of that infrastructure.


Test-Time Compute Scaling — Theoretical Foundation


Boosting Performance Without Growing the Model


A key finding, validated by OpenAI's o1 in 2024: even without increasing model size, performance improves when more compute is spent at inference time. This is test-time compute scaling.

The traditional AI performance strategy was train-time scaling — larger models, more data, longer training runs. The jump from GPT-3 to GPT-4 is the canonical example. But this approach incurs exponentially increasing costs.

Test-time compute scaling takes a different path. It makes an already-trained model think longer and more during inference. Specifically:

  • o1 explores multiple reasoning paths for the same problem and selects the best answer
  • The more tokens invested in reasoning (= the longer it thinks), the higher the accuracy
  • Up to a point, increasing inference compute is more cost-effective than scaling model size
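A toy probability model makes the value of iteration concrete. If a single attempt passes a deterministic verifier with probability p, the chance that at least one of n attempts passes grows quickly with n (an illustrative simplification — real attempts are not fully independent):

```python
# Toy model of test-time compute scaling through verified retries.
# If one attempt passes verification with probability p, then
# P(at least one success in n attempts) = 1 - (1 - p)^n.
def success_rate(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        print(f"{n:>3} attempts: {success_rate(0.2, n):.3f}")
```

Even a model that succeeds only 20% of the time per attempt clears 89% after 10 verified retries — the statistical core of "iteration beats complexity."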

How the Three Loops Leverage Test-Time Compute


Ralph, RLM, and autoresearch are all concrete realizations of test-time compute scaling:

| Loop | Application Domain | How Test-Time Compute Is Used |
| --- | --- | --- |
| Ralph Loop | Code generation | Generate code → run tests → retry on failure. 10 attempts = 10× the inference tokens |
| RLM | Long-context understanding | Model recursively calls itself. Inference compute grows with recursion depth |
| autoresearch | ML experimentation | 5-minute budget × N iterations. Compute invested scales with iteration count |

See the pattern? All three call the same model repeatedly while filtering results through deterministic verification conditions. Rather than changing the model itself, they increase the time and number of times the model thinks to ensure quality.

The 35-Point Harness Swing — The Harness Matters More Than the Model


The 2026 SWE-bench benchmark provides empirical proof of this lecture’s central claim:

| Benchmark | Claude Opus 4.5 Score | Description |
| --- | --- | --- |
| SWE-bench Verified | 80.9% | Validated issue set, standard scaffold |
| SWE-bench Pro | 45.9% | More realistic issues, minimal scaffold |

The same model yields a 35 pp gap purely from the difference in harness/scaffold. GPT-5.3-Codex leads on Pro at 56.8%, but that too is thanks to OpenAI’s own agent harness.

T2 Scaling: Inference Cost Changes Pretraining Strategy


The T2 scaling laws published in April 2026 (arXiv:2604.01411) dig one level deeper into the economics of test-time compute. The original Chinchilla work (2022) optimized only training compute costs; T2 recalculates the optimal point including inference costs.

Key findings:

  • When inference costs are factored in, overtraining beyond the Chinchilla optimum is advantageous — training a smaller model longer reduces the per-token cost at inference time
  • This directly connects to the loop paradigm: running a small model many times can be more economical than running a large model a few times
  • A feedback loop in which test-time compute scaling works backwards to change train-time strategy

In Claude 4.x, the old budget_tokens (where the developer specifies the token count directly) was deprecated and replaced by an adaptive effort parameter:

| effort level | Behavior | Use case |
| --- | --- | --- |
| low | Short reasoning, fast response | Simple lookups, type checks |
| medium | Standard reasoning | General coding |
| high | Deep reasoning, more tokens | Complex refactoring |
| max | Maximum reasoning depth | Architecture design, debugging |

This reflects the trend of test-time compute allocation moving from developer control to model-autonomous control. Loop paradigm implication: a strategy of dynamically adjusting effort each iteration is possible — explore quickly with low early on, then drill deep with high once a promising direction emerges.


The Ralph Loop

The Ralph Loop (Ralph Wiggum Loop), popularized by developer Geoffrey Huntley in late 2025, is the core paradigm of agentic software development.

Terminal window
# The essence of the Ralph loop — a single line
while :; do cat PROMPT.md | claude-code; done
# The same pattern works with other AI coding CLI tools:
# while :; do cat PROMPT.md | gemini; done
# while :; do codex --approval-mode full-auto "$(cat PROMPT.md)"; done

Two design decisions run through this infinite loop:

  1. A Stop Hook blocks exits — even if the agent tries to terminate itself, the shell restarts it. The agent believes it has “finished,” but the next loop re-reads PROMPT.md and finds incomplete tasks.
  2. A fresh context window opens every iteration — failed attempts from previous loops do not contaminate the context. Git history and the filesystem are the only state stores.

This simplicity is the point. No inter-agent communication, no state DB, no orchestrator. The intelligence of the loop lies not in the loop itself but in the environmental constraints — this is the harness.

The Ralph loop is intentionally monolithic:

Microservices (Complex)

  • Communication errors between agents
  • Non-deterministic architectural failures
  • Undebuggable cascading failures

Ralph Loop (Simple)

  • Single process, single repository
  • Exactly 1 task per loop
  • Predictable execution state
[Diagram: the Ralph harness. PROMPT.md feeds the AI agent's code-generation attempt; a backpressure system (run compiler, run type checker, run test suite) evaluates it. Success → record in AGENTS.md and move to the next task. Failure → garbage collection, record the failure in AGENTS.md, and retry.]

The three pillars of the harness:

  1. Backpressure — the compiler, type checker, and test suite automatically reject agent output. No matter how confidently the agent writes code, if pytest fails, that code is as good as nonexistent.
  2. Garbage Collectiongit checkout . completely removes failed code. The repository always returns to the last successful state. Not like a Jenga tower where a wrong block topples everything, but like a potter’s wheel where you rework the clay if the shape isn’t right.
  3. State Tracking — the checklist in PROMPT.md and git history record progress. The agent in the next loop can assess where the previous loop left off.
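The first two pillars — backpressure and garbage collection — fit in a few lines of Python (a sketch; this week's hands-on exercise implements the same ideas as a shell script):

```python
import subprocess
import sys

def backpressure_ok(test_cmd: list[str]) -> bool:
    """Deterministic gate: the agent's output survives only if the command passes."""
    return subprocess.run(test_cmd, capture_output=True).returncode == 0

def garbage_collect() -> None:
    """Return the repository to the last good state, discarding the failed attempt."""
    subprocess.run(["git", "checkout", "."], check=True)

# Demo with trivially passing / failing commands standing in for a real test suite
passed = backpressure_ok([sys.executable, "-c", "pass"])
failed = backpressure_ok([sys.executable, "-c", "raise SystemExit(1)"])
```

Because the gate is an exit code, success and failure are unambiguous — exactly the determinism the harness needs.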

Cumulative Learning Structure — AGENTS.md


The evolved form of the Ralph loop goes beyond simple repetition to a loop that learns. The key is the AGENTS.md file.

# AGENTS.md — Cumulative Learning Record
## Learned Patterns
- The division function in calculator.py requires ZeroDivisionError handling (failed in loop 3)
- pytest fixtures must be placed in conftest.py to prevent import errors (failed in loop 5)
## Forbidden Patterns
- Do not use `eval()` — security risk + bypasses type checker
- Do not share state via global variables — breaks test isolation
## Current Status
- [x] add() implementation complete
- [x] subtract() implementation complete
- [ ] multiply() in progress

At the end of each loop the agent records what it learned in this loop in AGENTS.md. The next loop’s agent reads this file before starting, so it does not repeat the same mistakes.

This is fundamentally different from fine-tuning:

  • Fine-tuning: modifies model weights. Expensive, slow, and hard to reverse.
  • AGENTS.md: written to a text file. Reflected immediately, tracked with git, readable by anyone.

“Failure itself is information” — an agent’s failed attempt is “deterministically bad,” and this information becomes the input for the next loop. If the same task fails 10 or more times in a row, it is judged stuck, and the task is split into smaller units before retrying.
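The AGENTS.md append at the end of each loop is a one-liner in spirit (the file name comes from the text; the helper and its entry format are illustrative):

```python
from datetime import date

def record_learning(pattern: str, loop_no: int, path: str = "AGENTS.md") -> None:
    """Append one learned pattern so the next fresh-context loop can read it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"- {pattern} (failed in loop {loop_no}, "
                f"recorded {date.today().isoformat()})\n")
```

A plain append-only text file is the whole "memory" mechanism — no vector store, no weight update, fully git-trackable.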

Fresh Context vs Continuous Session — Huntley’s Choice


In January 2026, an interesting “showdown” took place between Huntley and Anthropic. Anthropic proposed a stop-hook plugin for the Ralph loop that used a continuous session approach (repeating while maintaining the same context).

| Approach | Method | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Original Ralph | New session every loop | Context Rot completely prevented, deterministic | Loses context from previous attempts |
| Anthropic plugin | Iterate within continuous session | Remembers previous attempts, faster convergence | Context Rot risk, non-deterministic |

Huntley chose fresh context. His reasoning: “If you fail 30 times in a continuous session, the context fills up with failure history and useful reasoning becomes impossible. Better to record just the essentials in AGENTS.md and start clean every time.”

This trade-off connects directly to Week 5’s Context Rot — Ralph’s fresh context strategy is the first solution to Context Rot.

Claude Code /loop — Official Automation Loop


The /loop command officially shipped in Claude Code by Anthropic in 2026 implements the Ralph loop philosophy at the product level. Instead of a manual while shell script, a single line spins up a schedule-based autonomous agent.

Terminal window
# Basic syntax
claude /loop "<instruction>" --every <interval> --for <duration>
# Example: find and fix failing tests every 2 hours, for up to 3 days
claude /loop "check for failing tests and fix them" --every 2h --for 3d

Core design principles:

| Element | Description |
| --- | --- |
| Worktree isolation | Every iteration runs via git worktree without affecting the main branch |
| CLAUDE.md = control plane | CLAUDE.md is read every cycle, so modifying the instruction file changes the behavior of a running loop |
| 3-day expiry | Maximum --for 3d. Intentional design to prevent context drift in forgotten autonomous agents |
| 3 flags | "instruction", --every, --for — this is the entire API surface |

Why the 3-day expiry is a feature: if a loop set on Tuesday runs through Friday, by then 15 PRs merged by the team will conflict with it. An agent confidently patching with stale context creates problems harder to debug than the original bug. Re-evaluating and restarting every 72 hours is the safe approach.

Three validated workflows:

Terminal window
claude /loop "check open PRs on the current branch. If CI is failing,
read the error logs, fix the issue, and push. If CI passes and the PR
has no requested changes, post a comment saying 'Ready for human review.'
Summarize what you did in the PR description." --every 30m --for 2d

CI pipeline monitoring, automatic lint/type error fixes, notification when ready. Cannot handle failures requiring business logic judgment.

When /loop fails:

  • Ambiguous tasks: “refactor this module to be more maintainable” — interprets “maintainable” differently each iteration
  • Context drift: divergence between the worktree branch point and the current main in a fast-moving codebase
  • Cost: 30-minute interval × 3 days = 144 iterations. If each iteration processes substantial context, API costs accumulate
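The cost arithmetic in the last bullet is easy to verify with a tiny interval parser (an illustrative helper, not the actual /loop implementation):

```python
import re

_UNITS = {"m": 60, "h": 3600, "d": 86400}

def parse_interval(spec: str) -> int:
    """Parse '30m' / '2h' / '3d' into seconds."""
    m = re.fullmatch(r"(\d+)([mhd])", spec)
    if m is None:
        raise ValueError(f"bad interval: {spec!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

# 30-minute interval over 3 days → 144 iterations, matching the bullet above
iterations = parse_interval("3d") // parse_interval("30m")
```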

The Problem: LLMs Miss the Middle of Long Documents


Even with a 200K-token context window, when a long document is passed to an LLM, understanding of the middle sections degrades compared to the beginning and end (the “Lost in the Middle” phenomenon). The existing solutions were two:

  1. RAG (Retrieval-Augmented Generation): chunk the document and retrieve only relevant pieces via search. But you need to know in advance which chunks are relevant — information loss is inevitable.
  2. Summarization: summarize the long document before passing it in. But detail is lost in summarization.

RLM’s Solution: The Model Recursively Calls Itself


RLM (Recursive Language Model) takes a fundamentally different approach. It loads the long prompt into Python REPL variables, then has the model write code to extract only the needed portions and recursively call itself.

The core idea: instead of enlarging the context window, let the model decide for itself how to read the context.

# RLM pseudocode — recursive call pattern
# (llm_call, execute, fits_in_context, and split are pseudocode helpers)
def rlm_solve(question: str, documents: list[str]) -> str:
    """The model recursively calls itself to process long documents."""
    # Step 1: load all documents into Python variables
    context_vars = {f"doc_{i}": doc for i, doc in enumerate(documents)}
    # Step 2: ask the model to write code that decides "which parts to read"
    planning_code = llm_call(
        f"Write Python code to determine which parts of {len(documents)} documents "
        f"need to be read to answer the following question.\n"
        f"Question: {question}"
    )
    # Step 3: execute the code → extract only relevant parts
    relevant_parts = execute(planning_code, context_vars)
    # Step 4: recursively call itself with the extracted parts
    if fits_in_context(relevant_parts):
        return llm_call(f"Question: {question}\nContext: {relevant_parts}")
    else:
        # Still too large — recurse again
        return rlm_solve(question, split(relevant_parts))

The 2025 results from Zhang et al. are striking: GPT-5-mini + RLM achieved more than 2× the performance of GPT-5 alone on the OOLONG benchmark. A smaller model outperformed a larger model purely through recursive calls.

Why this is possible:

  • No information loss: unlike RAG, the full original document is in a variable. The model can access it at any time.
  • Traceable: the recursion trajectory is left as code. You can trace why the model read a particular section and what logic led it to the answer.
  • The model decides the context exploration strategy: not a summarization or search algorithm, but the model itself expressing in code “which part of the document should I read to answer this question?”

The Ralph Loop and RLM are different applications of the same principle:

| Comparison | Ralph Loop | RLM |
| --- | --- | --- |
| Iteration target | Code generation | Context exploration |
| Call pattern | Repeated calls from a shell loop | Model recursively calls itself |
| Verification condition | Test pass/fail | Answer completeness |
| State storage | git + filesystem | Python REPL variables |

Common thread: both call the same model repeatedly to convert test-time compute into reasoning quality.

autoresearch — Autonomous Experiment Loop


autoresearch, published by Andrej Karpathy, applies the loop paradigm to ML research. The idea is strikingly simple:

  1. Give the agent a single train.py
  2. The agent modifies the code freely
  3. After a fixed 5-minute time budget, measure val_bpb (validation bits per byte)
  4. If improved, commit; if not, reset
  5. Run overnight — by morning, an improvement history has accumulated in git log

Write only a research direction in program.md, and the agent autonomously designs and executes specific experiments.
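Step 4 — commit if improved, reset if not — reduces to a single gate. A sketch (the git commands follow the description above; the injectable commit/reset hooks are purely for illustration):

```python
import subprocess

def keep_or_reset(new_bpb: float, best_bpb: float,
                  commit=None, reset=None) -> float:
    """Gate one experiment: lower val_bpb is better, so commit on
    improvement and hard-reset the working tree otherwise."""
    commit = commit or (lambda bpb: subprocess.run(
        ["git", "commit", "-am", f"val_bpb {bpb:.4f}"], check=True))
    reset = reset or (lambda: subprocess.run(
        ["git", "reset", "--hard"], check=True))
    if new_bpb < best_bpb:
        commit(new_bpb)
        return new_bpb
    reset()
    return best_bpb
```

By morning, `git log` is the experiment journal: every commit is a strict improvement.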

Released on March 7, 2026, it accumulated 21K GitHub stars and 8.6 million views, becoming the symbol of the loop paradigm. The code is characterized by extreme simplicity: 630 lines, 3 files.

Real-World Results — autoresearch by the Numbers

| Metric | Value |
| --- | --- |
| Total experiments | ~700 (auto-run overnight) |
| Optimizations found | 20 |
| Speed improvement | 11% |
| Code size | 630 lines, 3 files |

Shopify CEO Tobi Lütke applied autoresearch to the company’s ML pipeline and achieved 19% validation improvement from 37 experiments run in a single night. What would take a human researcher a week was done by an agent in 8 hours.

These results are captured in what has become known as “The Karpathy Loop”:

agent + single modifiable file + single metric + fixed time limit = automated research

The key is constraint design. One file, one metric, one time limit — the more you restrict the agent’s degrees of freedom, the higher the loop’s quality. This is exactly the same principle as Ralph loop’s “deterministic verification conditions.”
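The fixed time limit in that formula can be enforced with a subprocess timeout (a sketch; the 5-minute figure comes from the text, while the helper itself is illustrative):

```python
import subprocess
import sys

def run_with_budget(cmd: list[str], budget_s: int = 300) -> bool:
    """Run one experiment under a hard time budget: a run that times out
    or crashes simply counts as a failed experiment."""
    try:
        subprocess.run(cmd, timeout=budget_s, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

# Demo with trivial commands standing in for `python train.py`
ok = run_with_budget([sys.executable, "-c", "pass"], budget_s=60)
crashed = run_with_budget([sys.executable, "-c", "import sys; sys.exit(1)"], budget_s=60)
```

Treating timeouts and crashes identically keeps the verification condition binary — the same determinism principle as Ralph's test gate.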

| Principle | Description | Why It Matters |
| --- | --- | --- |
| Fixed time budget | Each experiment capped at 5 minutes | Fair comparison. If one experiment takes 40 minutes, three others can't run |
| git branch-based | Success = commit, failure = reset | Failed experiment artifacts don't contaminate the next experiment |
| Single metric | Only val_bpb is measured | Removes ambiguity. Answers "did it improve?" with a number |
| program.md | Research direction text file | Same role as PROMPT.md. Humans define strategy; agents execute tactics |

Distributed Research Vision: The SETI@home Pattern


Karpathy envisions the future of autoresearch as distributed research. Just as SETI@home distributed the search for extraterrestrial signals, hundreds of agents each run different experiments in parallel, and only improved results are merged to a central repository. An improvement discovered by one agent becomes the starting point for another agent’s next experiment.

autoresearch and the Ralph Loop follow the same pattern:

| Comparison | Ralph Loop | autoresearch |
| --- | --- | --- |
| Target | Software code | ML training code |
| Verification condition | Tests pass | val_bpb improvement |
| State management | git checkout . | git reset |
| Instruction file | PROMPT.md | program.md |
| Essence | Deterministic verification + loop = quality | Deterministic metric + loop = performance |

The only difference is the verification condition. Ralph asks “does the code pass tests?”; autoresearch asks “does val_bpb improve?” The rest of the architecture — loop, git-based state management, text-file instructions — is identical.


Integrating the Three Loops — A Common Architecture

| Item | Ralph Loop | RLM | autoresearch |
| --- | --- | --- | --- |
| Application domain | Software development | Long-context understanding | ML experimentation |
| Verification condition | Compile + test pass | Answer completeness | val_bpb value |
| State storage | git + filesystem | Python REPL variables | git branch |
| Context strategy | New context each loop (wiping) | Recursive context exploration | New context each experiment |
| Failure handling | git checkout . | Recursive splitting | git reset |
| Human role | Write PROMPT.md | Pose the question | Write program.md |

The three essential elements shared by all three loops:

  1. A clearly defined intent — PROMPT.md, the question, program.md. Humans define the “what.”
  2. A loop with clear verification conditions — tests, answer completeness, val_bpb. Success/failure is determined deterministically.
  3. Sufficient token budget — the more loops run, the more test-time compute accumulates. The cost is time and tokens.

When these three elements are in place, the loop paradigm can be applied to any domain.

Industry Definition of Harness — Guides vs Sensors


In 2026, Martin Fowler / ThoughtWorks classified the harness into two components in their analysis of agentic coding:

| Component | Direction | Role | Examples |
| --- | --- | --- | --- |
| Guides | Feedforward | Provide direction before the agent acts | PROMPT.md, CLAUDE.md, linter config, type definitions |
| Sensors | Feedback | Measure results after the agent acts | Test results, token usage, error logs, val_bpb |

Re-analyzing the three loops through this framework:

  • Ralph Loop: Guides = PROMPT.md + AGENTS.md, Sensors = compiler + test suite
  • RLM: Guides = question prompt, Sensors = answer completeness judgment
  • autoresearch: Guides = program.md, Sensors = val_bpb metric
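The Guides/Sensors split maps naturally onto a tiny data structure (an illustrative sketch of the framework, not ThoughtWorks or LangChain code):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Agent = Model + Harness: guides feed forward, sensors feed back."""
    guides: list[str] = field(default_factory=list)                  # read before acting
    sensors: list[Callable[[], bool]] = field(default_factory=list)  # checked after acting

    def verify(self) -> bool:
        """The output survives only if every sensor passes."""
        return all(check() for check in self.sensors)

# The Ralph loop expressed in this framework
# (the lambdas stand in for compiler + test-suite checks)
ralph = Harness(guides=["PROMPT.md", "AGENTS.md"],
                sensors=[lambda: True, lambda: True])
```

Swapping the model leaves this structure untouched — which is exactly what "the model is replaceable; the harness is domain-specific" means.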

LangChain further distills this as Agent = Model + Harness. The model is replaceable; the harness is a collection of domain-specific design decisions.

Economics of Loops — When to Stop Iterating


Running loops costs tokens. Infinite iteration is uneconomical. In practice, you need to calculate the break-even point.

Estimated token cost per iteration (Sonnet 4.6, one code modification):

| Item | Tokens | Cost |
| --- | --- | --- |
| Input (system prompt + code + error log) | ~2,000 | $0.03 |
| Output (modified code + explanation) | ~4,000 | $0.30 |
| Total per iteration | ~6,000 | ~$0.33 |

| Scenario | Loop cost | Comparison | Verdict |
| --- | --- | --- | --- |
| 10 iterations to fix a bug | $3.30 | Developer 30 min ($25) | Loop is ~7.5× cheaper |
| 50 iterations for refactoring | $16.50 | Developer 2 hours ($100) | Loop is ~6× cheaper |
| 200 iterations, fails to converge | $66 | Developer 2 hours ($100) | Still cheaper but inefficient |
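The break-even reasoning fits in two functions (the per-iteration cost is the ~$0.33 estimate from the text; the human rates are the scenario assumptions):

```python
def loop_cost(iterations: int, usd_per_iteration: float = 0.33) -> float:
    """Token cost of a loop run, using the ~$0.33/iteration estimate."""
    return iterations * usd_per_iteration

def loop_is_cheaper(iterations: int, human_usd: float) -> bool:
    """Break-even check: keep iterating only while tokens undercut human time."""
    return loop_cost(iterations) < human_usd

cheap = loop_is_cheaper(10, 25.0)       # $3.30 vs $25 → loop wins
wasteful = loop_is_cheaper(350, 100.0)  # $115.50 vs $100 → stop iterating
```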
Connections to Other Weeks

  • Week 2 governance: the moment a loop leaves a side effect in the outside world is the Hard Interrupt point. When a Ralph loop creates a PR, or autoresearch saves a model checkpoint to shared storage, the CUD policy from Week 2 activates.
  • Week 3 MIG/MCP: MIG isolates the loop’s compute. If one student’s loop causes an OOM, it doesn’t affect other students. MCP standardizes tool access within the loop.
  • Week 5 Context Rot: Ralph’s context wiping (fresh context every loop) is the first strategy in the Context Rot solution. Week 5 covers this systematically.

Practical Harness Optimization — Parallel Sessions and Worktree Isolation


Parts 1–5 covered the principles of the loop. Now we turn to the practical techniques for running loops faster, safer, and at scale. The 42 tips that Claude Code creator Boris Cherny released over January–February 2026 are not individual tricks — they form a stack in which each layer presupposes the one below. Understanding this stack lets you push Ralph loop throughput from single digits to double digits.

Parallel Sessions — Developer-Level Multiplexing


Boris runs 5 Claude Code instances in his terminal and 5–10 more on claude.ai/code simultaneously. This is not a multi-agent pipeline. It is one developer supervising multiple loops at the same time — real-world parallelism.

Terminal window
# Set up parallel sessions with tmux
tmux new-session -s loops -d
for i in 1 2 3 4; do tmux split-window -t loops; done
tmux select-layout -t loops tiled
# Run independent tasks in each pane
# pane 0: claude "Refactor frontend components"
# pane 1: claude "Write API endpoint tests"
# pane 2: claude "Update documentation"
# pane 3: claude "Fix lint errors"
# pane 4: claude "Run performance profiling"

The key is that each instance handles an independent task. There is no inter-agent communication. Git is the sole coordination mechanism — a natural extension of the Ralph loop’s monolithic philosophy.

In Part 2 we saw that /loop automatically creates a worktree. The --worktree flag extends this isolation to ad-hoc sessions.

Terminal window
# Basic: give Claude a dedicated worktree
claude --worktree my_feature
# Also auto-create a tmux session
claude --worktree my_feature --tmux
# Real-world pattern: 10 parallel agents for a migration
for module in auth billing users payments notifications \
              search analytics admin logging config; do
  claude --worktree "migrate-${module}" --tmux \
    "Migrate sync I/O in module ${module} to async. \
Open a PR once all tests pass."
done
Terminal window
# Dangerous: all agents share the same working directory
claude "Modify auth module" # ← file collision risk
claude "Modify billing module" # ← file collision risk
# One agent's git checkout . can delete another agent's work

/loop’s worktree is optimized for time-based iteration; --worktree is optimized for parallel task distribution. The two are complementary.

/sandbox — Filesystem and Network Isolation

The MIG we learned in Week 3 provided GPU compute isolation — one student’s OOM cannot propagate to another. /sandbox isolates BashTool’s file and network access, providing a different dimension of protection.

| Dimension | MIG (Week 3) | /sandbox (Week 4) |
| --- | --- | --- |
| What is isolated | GPU compute | Filesystem + network |
| Purpose | Resource protection — block one student's OOM | Trust — accept agent edits faster |
| Mechanism | Hardware partitioning | BashTool file/network restriction |
| Effect | “This process cannot touch my GPU” | “If this agent makes a mistake, the blast radius is clear” |

Boris’s core insight: “When you trust the containment, you accept edits faster. That speeds up the whole loop.” Trusting the isolation reduces human review time per loop cycle, raising the total throughput of the loop.

Stack Philosophy — Layers Build on Each Other


What makes Boris’s 42 tips pedagogically valuable is that they form a stack, not a menu. Each layer presupposes the one beneath it:

BORIS STACK — Harness Optimization Layers

  • Layer 1: Plan Mode — plan before executing
  • Layer 2: CLAUDE.md — project rules are injected every session
  • Layer 3: Worktree Isolation — each agent works in an independent filesystem
  • Layer 4: Parallel Sessions + /sandbox — run multiple isolated agents simultaneously
  • Layer 5: /loop + /batch — time-based autonomous iteration + large-scale batch processing

Tracing the order in reverse explains why it matters:

  • To migrate 10 modules simultaneously with /batch → each needs worktree isolation
  • For agents in a worktree to behave correctly → CLAUDE.md must inject project rules
  • For CLAUDE.md to be effective → the agent must form a plan with plan mode before executing

Using an upper layer without the lower one will fail. Running parallel sessions without worktrees causes file collisions; using worktrees without CLAUDE.md means agents are ignorant of project conventions.

Agentmaxxing — Multi-Tool Parallel Execution

Section titled “Agentmaxxing — Multi-Tool Parallel Execution”

Starting in early 2026, an extreme parallelization strategy called “Agentmaxxing” emerged: deploying multiple AI coding tools simultaneously in a single repo.

Terminal window
# Terminal 1: Claude Code (architecture design)
claude --worktree arch "Refactor module structure"
# Terminal 2: Codex (test writing)
codex --approval-mode full-auto "Add missing test cases"
# Terminal 3: Gemini CLI (documentation)
gemini "Auto-generate API docs from code"

Cursor 2.0 productized this pattern with Background Agents (running in isolated VMs) and Mission Control (a parallel agent dashboard). Codex CLI recorded 1.6 million weekly active users as of March 2026, establishing itself as the reference implementation for open-source harnesses.


Real-World Validation — Building a Product Without Writing a Line of Code


We have covered the theory and techniques. The question now: can loops alone build production software? In March 2026, OpenAI’s case of building an internal product using only the Codex agent provides the answer.

The OpenAI engineering team used Codex to build an internal product:

  • 1M+ LOC generated automatically
  • 0 lines written manually
  • Humans performed only requirements definition, architecture review, and PR approval

The five patterns derived from this process are a field manual for harness engineering:

| Pattern | Principle | Ralph Loop Perspective |
| --- | --- | --- |
| Repo as System of Record | Code = single source of truth. All decisions reflected in code, not verbal agreements or wikis | PROMPT.md + AGENTS.md serve this role |
| Application Legibility | Write code that agents can read. Clear variable names, types, and comments are prerequisites for correct agent modification | Prerequisite for backpressure — linters and type checkers need readable code to function |
| Layered Domain Architecture | Clearly separate domain layers. Changes to one layer don't propagate to others | Prerequisite for parallel worktrees — modules must be separated for parallel modification |
| Minimal Merge Gates | Minimize merge gates. Auto-merge on test pass | The key to loop speed — approval wait time determines loop efficiency |
| Entropy Management | Actively manage codebase disorder (entropy). Prevent technical debt accumulation via loops | /simplify pattern — a separate agent periodically cleans up code quality |

Discussion Questions

  1. Ralph’s “fresh context every loop” and RLM’s “recursive calls” solve the context window problem in opposite directions. Can these two be combined? What form would it take?
  2. Why is autoresearch’s 5-minute fixed budget important? What problems arise if you run it “until improvement” with no budget?
  3. What is the difference between recording learnings in AGENTS.md in the Ralph loop and fine-tuning the model? What are the advantages and disadvantages of each?
  4. If you ran 100 loops during your 8 hours of sleep, what task would you apply it to? What would you set as the verification condition?
  5. What are the limits of the loop paradigm? What problems can’t be solved by iteration?
  6. In Boris’s stack, what problems arise if you use only worktrees without CLAUDE.md? Conversely, what if you use only CLAUDE.md without worktrees? Explain the inter-layer dependency with a concrete scenario.
  7. Do you agree with the claim that /sandbox’s purpose is trust, not security? Discuss specifically what behavioral changes “isolation increases trust” actually implies.
  8. A model scoring 80.9% on SWE-bench Verified drops to 45.9% on Pro. Can the 35 pp gap be closed by the model alone, without a harness? If doubling model size doesn’t close the gap, what does that tell us?
  9. If Codex auto-generated 1M lines, what is the programmer’s role? What does “Application Legibility” — one of the five patterns — suggest: is the ability to write code that agents can read becoming the programmer’s new core competency?

Implementing the Ralph Loop with Cumulative Learning

  1. Set up the project structure

    Terminal window
    mkdir ralph-project && cd ralph-project
    git init
    touch PROMPT.md AGENTS.md
    mkdir tests
  2. Write PROMPT.md

    # Current Task
    Implement the following items in order:
    - [ ] Implement add(a, b) function in calculator.py
    - [ ] Implement subtract(a, b) function in calculator.py
    - [ ] Implement divide(a, b) function in calculator.py (handle ZeroDivisionError)
    # Constraints
    - Implement only one function at a time
    - Include type hints in all functions
    - Do not write code without tests
    - Write tests in tests/test_calculator.py before implementing
    # State Tracking
    - Always read AGENTS.md before starting
    - On failure: record the cause in AGENTS.md, then exit
    - On success: record the success pattern in AGENTS.md, then proceed to the next task
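    For reference, a successful run should converge to roughly the following calculator.py. This is a hypothetical sketch of the end state — under the constraints above, the agent writes it (tests first), not you:

```python
# calculator.py — hypothetical end state after the loop completes all three tasks
def add(a: float, b: float) -> float:
    """Return the sum of a and b."""
    return a + b

def subtract(a: float, b: float) -> float:
    """Return a minus b."""
    return a - b

def divide(a: float, b: float) -> float:
    """Return a / b, raising ZeroDivisionError explicitly when b == 0."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    return a / b
```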
  3. Write harness.sh — backpressure + garbage collection

    #!/bin/bash
    # harness.sh — Ralph loop harness
    # No `set -e`: a failed agent run or test must trigger a retry, not abort the loop
    MAX_RETRIES=10
    RETRY_COUNT=0
    while true; do
      echo "=== Ralph Loop #$((RETRY_COUNT + 1)) ==="
      # Run the agent (|| true: an agent crash should not kill the harness)
      cat PROMPT.md | claude-code || true
      # Backpressure: type check + test run
      if python -m py_compile calculator.py 2>/dev/null && \
         python -m pytest tests/ -q 2>/dev/null; then
        echo "Tests passed — committing and moving on"
        git add -A && git commit -m "loop $((RETRY_COUNT + 1)): task completed"
        RETRY_COUNT=0
      else
        echo "Tests failed — garbage collection + retry"
        RETRY_COUNT=$((RETRY_COUNT + 1))
        # Stuck detection
        if [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; then
          echo "STUCK: $MAX_RETRIES consecutive failures. Splitting task."
          echo "- STUCK at loop $RETRY_COUNT: $(date)" >> AGENTS.md
          git add AGENTS.md && git commit -m "stuck: recorded failure pattern"
          RETRY_COUNT=0
        fi
        # Garbage collection — discard failed changes to tracked files
        git checkout -- calculator.py tests/ 2>/dev/null || true
        sleep 2
      fi
    done
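    The control flow above can be simulated deterministically. The following toy sketch (no real agent — `flaky_agent`, `looks_tested`, and the string "code" are hypothetical stand-ins for claude-code, the test suite, and the working tree) shows how backpressure gates commits and garbage collection discards failed attempts:

```python
# Toy simulation of the Ralph harness: backpressure + garbage collection.
def run_harness(agent, verify, max_loops=10):
    workspace, history = None, []            # working tree + "git log"
    for i in range(1, max_loops + 1):
        candidate = agent(i)                 # agent produces a candidate
        if verify(candidate):                # backpressure: tests must pass
            workspace = candidate            # commit the success
            history.append(f"loop {i}: committed")
            break
        history.append(f"loop {i}: discarded")  # GC: git checkout .
    return workspace, history

# Stand-in agent that fails twice, then succeeds
attempts = iter(["syntax error", "wrong answer", "def add(a, b): return a + b"])
flaky_agent = lambda i: next(attempts)
looks_tested = lambda code: code.startswith("def add")

code, log = run_harness(flaky_agent, looks_tested)
print(log)  # two discarded attempts, then one commit
```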
  4. Initialize the AGENTS.md cumulative learning structure

    # AGENTS.md — Cumulative Learning Record
    ## Learned Patterns
    (filled automatically after loop runs)
    ## Forbidden Patterns
    (anti-patterns learned from repeated failures)
    ## Progress Status
    - [ ] add() implementation
    - [ ] subtract() implementation
    - [ ] divide() implementation
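    After a few loops, the file might look like this — hypothetical contents, since the agent writes the learning entries and harness.sh appends the STUCK lines:

```
# AGENTS.md — Cumulative Learning Record
## Learned Patterns
- tests/test_calculator.py must exist before implementing (backpressure rejects untested code)
## Forbidden Patterns
- Do not return None from divide(); raise ZeroDivisionError explicitly
## Progress Status
- [x] add() implementation
- [ ] subtract() implementation
- [ ] divide() implementation
```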
  5. Run and observe — minimum 3 loops

    Terminal window
    chmod +x harness.sh
    ./harness.sh

    Observation points:

    • Does the agent read AGENTS.md before starting?
    • Is learning recorded in AGENTS.md after a failure?
    • Does the next loop avoid repeating the same mistake?
    • Does git log --oneline show only successful commits (no failed code committed)?
  6. Mini autoresearch pattern experiment (optional)

    An autoresearch pattern to optimize the execution time of a simple Python function:

    mini_autoresearch.sh
    #!/bin/bash
    # mini_autoresearch.sh — fixed time budget + single metric + git-based state
    BEST_TIME=999
    while true; do
      # Agent modifies optimize.py
      cat program.md | claude-code
      # Run benchmark with a 5-minute time limit; treat any failure as worst case
      CURRENT_TIME=$(timeout 300 python benchmark.py 2>/dev/null || echo 999)
      if (( $(echo "$CURRENT_TIME < $BEST_TIME" | bc -l) )); then
        echo "Improved: $BEST_TIME -> $CURRENT_TIME"
        BEST_TIME=$CURRENT_TIME
        git add -A && git commit -m "improved: time=$CURRENT_TIME"
      else
        echo "No improvement ($CURRENT_TIME >= $BEST_TIME) — reset"
        git checkout .
      fi
    done
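    For the loop above to work, benchmark.py must print a single number on stdout, since the harness captures it with `$( )`. A minimal hypothetical sketch (`target` is an illustrative workload, not a prescribed one):

```python
# benchmark.py — hypothetical: time the function under optimization and
# print one number; mini_autoresearch.sh compares it against BEST_TIME.
import time

def target() -> int:
    """Stand-in workload for the function being optimized."""
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    start = time.perf_counter()
    target()
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.4f}")  # stdout carries only the metric
```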
  Completion checklist:

  • Does PROMPT.md specify 3 or more tasks?
  • Does AGENTS.md contain at least 2 learning records?
  • Does harness.sh implement both backpressure (test runs) and garbage collection (git checkout .)?
  • Does git log show only successful commits? (No failed code committed?)
  • Is the failure → learning → retry → success process visible in the terminal log?
  • Does stuck detection logic exist? (optional)

Submission deadline: 2026-03-31 23:59

Submission path: assignments/week-04/[student ID]/

Required deliverables (5 items):

  1. harness.sh — Ralph harness with backpressure + garbage collection
  2. PROMPT.md — specification of at least 3 tasks
  3. AGENTS.md — cumulative learning record after loop runs (must document at least 2 failures and the success that followed them)
  4. Loop execution log — terminal log showing the failure → learning → retry process (loop_log.txt)
  5. README.md — harness design decisions + explanation of the connection to test-time compute scaling

Bonus items (5 items):

  1. Mini autoresearch pattern implementation — Python function optimization with fixed time budget
  2. Stuck detection logic — on N consecutive failures, split the task into smaller units and retry
  3. Loop metric collection — record per-iteration token usage, success rate, and elapsed time as CSV/JSON
  4. Analysis report on worktree isolation behavior after running Claude Code /loop
  5. Long-document analysis experiment using RLM principles — include results and reflection in README
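For bonus item 3, per-iteration metrics can be collected with a few lines of Python called from the harness — a sketch with hypothetical file and field names:

```python
# metrics.py — hypothetical sketch for bonus item 3: append one CSV row
# per loop iteration (field names are illustrative, not prescribed).
import csv
import os

def record_iteration(path: str, loop: int, passed: bool,
                     tokens: int, seconds: float) -> None:
    """Append a metrics row; write the header only when the file is new."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["loop", "passed", "tokens", "seconds"])
        writer.writerow([loop, passed, tokens, seconds])

# Example: two iterations, one failure then one success
record_iteration("loop_metrics.csv", 1, False, 1850, 42.3)
record_iteration("loop_metrics.csv", 2, True, 1620, 38.9)
```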

Evaluation criteria:

  • Garbage collection mechanism operates correctly
  • Evidence that AGENTS.md cumulative learning is actually reflected in subsequent loops
  • Error logs are packaged, and the state-tracking file (AGENTS.md) is actually used
Key Takeaways

Section titled “Key Takeaways”
  1. Test-time compute scaling: performance improves by investing more compute at inference time, without growing the model. All three loops are implementations of this principle.
  2. Ralph Loop = harness + loop: the harness (backpressure + garbage collection) controls non-deterministic agents deterministically. AGENTS.md enables cumulative learning.
  3. RLM = recursive context exploration: the model recursively calls itself to process long documents. It explores context without information loss, and the recursion trajectory is preserved as code.
  4. autoresearch = autonomous experiment loop: fixed time budget + single metric + git-based state management. Same pattern as Ralph, differing only in the verification condition.
  5. Three common elements: an instruction file that captures the important thinking + clear verification conditions (the loop) + a sufficient token budget. Once these are in place, the loop paradigm can be applied to any domain.
  6. Infrastructure connections: Week 2 governance sets safety boundaries for loops, Week 3 MIG provides compute isolation and MCP standardizes tools. Ralph’s context wiping is the starting point for the Context Rot solution in Week 5.
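The test-time compute claim in takeaway 1 can be made concrete with a toy calculation: if one attempt succeeds with probability p and a reliable verifier filters the output, the chance that at least one of k attempts passes is 1 − (1 − p)^k. More inference-time compute, better results, same model:

```python
# Sketch: why iteration beats single-shot generation under a verifier.
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1 - (1 - p) ** k

if __name__ == "__main__":
    for k in (1, 10, 100):
        # p = 0.2 per attempt: 1 attempt ~0.2, 10 ~0.893, 100 ~1.0
        print(k, round(pass_at_k(0.2, k), 3))
```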