Skip to content

Week 5: Context Management and Preventing Context Rot

Phase 2Week 5IntermediateLecture: 2026-03-31

Concepts

Define Context Rot and explain why it persists in long-context models in terms of attention dilution and instruction drift.

Design

Compare three strategies — context window wiping, summarization, state-tracking files — and decide which combination fits a Ralph loop.

Implementation

Author tasks.md (or an equivalent state file) that serializes compact progress every turn in a token-efficient way.

Operations

Measure cache hit ratio, prompt tokens, and summarization cost to report the ROI of your context strategy quantitatively.

Why Context Management is Central in Week 5

Section titled “Why Context Management is Central in Week 5”

In Week 4 we learned the power of the loop paradigm — calling the same model repeatedly while using deterministic validation to ensure quality. But for a loop to work, there is a prerequisite: the context must be clean.

Week 4’s Huntley knew about this problem. That’s why he chose a “fresh context” strategy — starting a new session on every loop iteration. But that alone isn’t enough:

  • Even inside a loop, context gets polluted — tool call results, failed attempts, and error messages accumulate
  • State must be passed between sessions — restarting from scratch every time loses the learning from the previous loop
  • The prerequisite for Week 6’s instruction tuning: even if you add constraints in CLAUDE.md, a polluted context will cause the agent to “forget” those constraints

The central question for this week: how do we manage context deterministically?

To answer that question, we’ll first confirm with empirical data why Context Rot happens and how serious it is, then build up solutions from there.


Context Rot is the phenomenon where an agent’s context window becomes increasingly polluted over a long session:

Clean Context (initial)[System Prompt] [Task Spec] [Current Code]
Context Rot (after 30 attempts)[System Prompt] [Task Spec] [Failure #1] [Error #1] [Failure #2] [Error #2] … 128K tokens → hallucinations occur, reasoning quality drops sharply

Empirical Data: Chroma’s Context Rot Research

Section titled “Empirical Data: Chroma’s Context Rot Research”

“Longer context gets worse” is intuitive, but Chroma’s 2026 empirical study provides a representative quantitative measurement of how much worse it gets. Testing 18 frontier models (Claude, GPT, Gemini, Llama, and others):

FindingData
Accuracy drop at mid-window position30%+
Correlation between input length and accuracyNegative across all models — no exceptions
Counter-intuitive resultShuffled documents scored higher accuracy than logically ordered documents

The last finding is especially important. When documents are arranged in logical order, models tend to judge “I already saw this earlier, I can skim the rest.” Shuffling forces attention at every position, which actually raises accuracy.

The 1M Token Era — Is a Larger Window the Solution?

Section titled “The 1M Token Era — Is a Larger Window the Solution?”

Frontier model context windows as of 2026:

ModelOfficial ContextEffective Usage
current Claude frontier family1M-token class~600-700K
GPT-5.41M tokens~600-700K
Gemini 2.5 Pro1M tokens~600-700K

The reason effective usage is 60-70%: the remainder is consumed by the system prompt (~50K), tool schemas (~30K), and safety margin (~200K).

Compaction Strategy — What to Discard, What to Keep

Section titled “Compaction Strategy — What to Discard, What to Keep”

When auto-compaction fires, preservation priority is decisive:

PriorityWhat to KeepWhy
1 (highest)System prompt + CLAUDE.mdThe agent’s “constitution” — losing this erases behavioral rules
2Last 4 messagesImmediate context of the current task
3Tool results for the current taskThe file just read, the test just run
4 (lowest)Old conversation + previous tool resultsCan be replaced with a summary

When NOT to compress: In the following situations, it is better to end the session and start fresh rather than compact:

  • The conversation topic has completely shifted (previous context is a hindrance)
  • The same error has repeated 5+ times in a row (Context Rot is already severe)
  • The summary itself is 50%+ the size of the original (no compression benefit)

This is why Huntley in Week 4 chose “fresh context” — between loop iterations, full reset + state file handoff is more deterministic than compaction.

The Ralph Loop Solution: Context Window Wiping

Section titled “The Ralph Loop Solution: Context Window Wiping”

One of the key innovations of the Ralph Loop is completely resetting the context after a task completes or fails:

class RalphContextManager:
def __init__(self, max_tokens: int = 200_000):
self.max_tokens = max_tokens
self.state_file = "claude-progress.txt"
def should_wipe_context(self, current_tokens: int) -> bool:
"""Reset context when more than 75% of the window is used"""
return current_tokens > self.max_tokens * 0.75
def build_fresh_context(self) -> str:
"""Deterministically reconstruct context from the state file"""
state = self.load_state()
return f"""
# Project State
{state['completed_tasks']}
# Current Task
{state['current_task']}
# Relevant Code (current version only)
{state['relevant_code_snippet']}
"""
def save_state(self, task: str, status: str):
"""Save state for the next loop iteration"""
with open(self.state_file, 'a') as f:
f.write(f"[{status}] {task}\n")
claude-progress.txtRecords completed/failed tasks
fix_plan.mdStructured task queue
@codebase_map.mdFile structure snapshot (kept up to date)

fix_plan.md Template:

# Project: Calculator App
## Completed Tasks
- [x] Create basic file structure (2026-03-31 14:23)
- [x] Implement add() function and pass tests (2026-03-31 14:45)
## Current Task
- [ ] Implement subtract() function
- Expected file: calculator.py:15-25
- Related tests: tests/test_calculator.py:20-35
## Pending Tasks
- [ ] multiply() function
- [ ] divide() function (must handle division-by-zero exception)

The ultimate goal of context management is to do more useful work on the same budget. Empirical data shows that 40-70% of agent input tokens are wasted — duplicate tool results, unnecessary file contents, bloated system prompts.

Model Routing — You Don’t Need Opus for Everything

Section titled “Model Routing — You Don’t Need Opus for Everything”
Task TypeShareRecommended ModelCost (1M tokens)
Simple lookups, formatting, type checking60-70%Haiku$1 / $5
Standard coding, bug fixes, feature additions25-30%Sonnet$15 / $75
Architecture design, complex debugging5-10%Opus$15 / $75

Model routing alone enables 5-8x cost reduction. Claude Code’s effort parameter (see Week 4) is the productized form of this routing.

Prompt Caching — Turning Repetition into an Asset

Section titled “Prompt Caching — Turning Repetition into an Asset”

On every agent turn, the system prompt, tool schemas, and CLAUDE.md contain the same content repeated. Prompt caching stores this static portion and reuses it:

OperationPrice (vs. baseline)
Cache write (5-min TTL)1.25x
Cache write (1-hour TTL)2x
Cache read0.1x (90% savings)

Implications for the loop paradigm:

  • Continuous session: Create cache on first turn → read at 0.1x on subsequent turns = very economical
  • Ralph fresh context: New session every loop → cache must be recreated = higher cost
  • Trade-off: Context Rot prevention (fresh) vs. cache efficiency (continuous). Same problem as Week 4’s Huntley Showdown

The Initializer Pattern — 2-Phase State Management

Section titled “The Initializer Pattern — 2-Phase State Management”

The 2-phase pattern recommended in Anthropic’s official harness guide systematizes the state file design above:

Phase 1 — Initializer (first loop):

  1. Parse requirements and generate a feature list as JSON
  2. Initialize claude-progress.txt
  3. Generate init.sh (environment setup script)
{
"features": [
{"id": "F001", "name": "User Authentication", "status": "pending", "files": ["src/auth.py"]},
{"id": "F002", "name": "Dashboard UI", "status": "pending", "files": ["src/dashboard.py"]}
],
"constraints": ["pytest must pass", "100% type hints"]
}

Phase 2 — Coding Agent (subsequent loops):

  1. Run init.sh to configure the environment
  2. Pull a "status": "pending" item from the JSON and work on it
  3. On completion: set "status": "done" + record in claude-progress.txt
  4. The next loop reads the JSON and picks up from remaining items

This pattern is a higher-level abstraction of the three state files in today’s Week 5 (claude-progress.txt, fix_plan.md, @codebase_map.md). The JSON feature list replaces fix_plan.md; init.sh replaces @codebase_map.md.


Context management approaches differ by tool. As of 2026, three strategies are competing:

StrategyRepresentative ToolApproachProsCons
ExplicitCursorUser manually selects which files go into contextPrecise control, minimal token wasteManual labor, risk of omission
AmbientWindsurf (Cascade)Tool automatically detects relevant filesConvenient, prevents omissionsRisk of over-inclusion, token waste
HybridClaude CodeFile-based persistence (CLAUDE.md) + auto-compactionBalanced, loop-friendlyRequires setup, learning curve

VS Code Copilot introduced 3-tier memory (user/repository/session) in 2026, separating user preferences (global) → project rules (repo) → current conversation (session). This is the same design principle as Claude Code’s 3-level CLAUDE.md hierarchy (global/project/local).


When agents send structured data to LLMs, the serialization format can cause 2–3x differences in token consumption. Results from serializing the same 50-user list in 7 formats:

FormatTokensLLM AccuracyBest For
CSV~80044.3%Pure tables (accuracy risk)
Markdown-KV~95060.7%Simple key-value retrieval
TOON99376.4%Uniform array data
JSON (compact)~1,10073.7%General purpose — safest balance
JSON (pretty)1,48175.0%When human readability is needed
YAML1,71074.5%Nested configs, prompt structuring
XML2,69072.1%Legacy system integration

TOON — A Case Study in Token-Optimized Serialization

Section titled “TOON — A Case Study in Token-Optimized Serialization”

TOON (Token Oriented Object Notation, 2025) preserves JSON’s structure while removing quotes, braces, and commas, representing uniform arrays as CSV-style tables. With 23.7K GitHub stars and 1.6M monthly npm downloads, community interest is real.

// TOON example: uniform array → header + row format
users[3]{id,name,email}:
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Carol,carol@example.com

Strengths: 40–60% token reduction on uniform arrays vs pretty JSON. Limitations: Can be 15–20% larger than compact JSON for non-uniform/nested data. The spec is a Working Draft (v3.0 reached in just 3 weeks from v0.8), and LLMs haven’t been trained on TOON, requiring format explanation in prompts.

  • Default: Compact JSON — all LLMs understand it, mature ecosystem. 25–40% savings vs pretty JSON.
  • Large uniform tables (100+ rows): Consider TOON or CSV — but always measure the accuracy trade-off.
  • API-based structured output: Function calling / structured output APIs are optimal (Microsoft measured: 42% savings vs JSON).
  • Week 7 multi-agent: Design inter-agent artifacts with compact JSON. See Week 7 Artifact Handoff.

  1. Does a 1M token context solve Context Rot? Answer using evidence from the Chroma research data.
  2. Ralph’s fresh context vs. continuous session — when you factor in caching costs (fresh = recreate every time, continuous = read at 0.1x), which is more economical? Under what conditions does this flip?
  3. What is the basis for the advice to keep CLAUDE.md under 200 lines? Connect your answer to SkillReducer’s “less-is-more” effect.
  4. Reason why shuffled documents scored higher accuracy than ordered documents in the Chroma study. What does this imply for context design?
  5. Why does the Initializer pattern store the feature list as JSON? What problem arises if you use Markdown instead?

  1. Measure Token Usage

    Run the /cost command in a Claude Code session to check the current session’s token usage. After 10 turns of conversation, measure again and record the increase.

    import os
    import anthropic
    def count_tokens(messages: list) -> int:
    client = anthropic.Anthropic()
    response = client.messages.count_tokens(
    model=os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-5"),
    messages=messages
    )
    return response.input_tokens
  2. Before/After Compaction Comparison

    In a session with 20+ turns of conversation, run the /compact command. Record and compare token counts, response quality, and task continuity before and after compression.

  3. Build a State File System

    Write helper functions that automatically update fix_plan.md and claude-progress.txt. Refer to state_tracker.py in Lab 05.

  4. Connect to Lab 05

    Based on the experiment results above, implement the four modules in Lab 05: token_counter.py, context_manager.py, state_tracker.py, and main.py.

Submission deadline: 2026-04-07 23:59

Requirements:

  1. Ralph Loop with integrated token counter (ralph_with_context.sh)
  2. Context Rot simulation and a graph of measurement results
  3. Automatic context reset logic implementation
  4. Demonstration that the state tracking system (fix_plan.md + claude-progress.txt) works correctly

  1. Context Rot is an empirical phenomenon — Chroma study: accuracy degrades as input grows longer across all 18 models. 30%+ degradation at mid-window positions.
  2. A 1M token window is not the solution — a bigger window just means a larger space for Context Rot to occur in. Effective usage is 60-70%.
  3. Compaction is token-based — auto-triggered at ~75% of model maximum. Preserves the last 4 messages, summarizes the rest.
  4. Ralph’s fresh context is the default between loops — use compaction inside a loop, full reset + state file handoff between loops.
  5. 40-70% of tokens are wasted — cut costs with model routing (Haiku 60-70%), prompt caching (read at 0.1x), and keeping CLAUDE.md under 200 lines.
  6. Initializer pattern — manage state deterministically with a 2-phase structure: JSON feature list + claude-progress.txt + init.sh.