
Week 5: Context Management and Preventing Context Rot

Phase 2 · Week 5 · Intermediate · Lecture: 2026-03-31

Why Context Management is Central in Week 5


In Week 4 we learned the power of the loop paradigm — calling the same model repeatedly while using deterministic validation to ensure quality. But for a loop to work, there is a prerequisite: the context must be clean.

Huntley, whom we met in Week 4, knew about this problem. That’s why he chose a “fresh context” strategy — starting a new session on every loop iteration. But that alone isn’t enough:

  • Even inside a loop, context gets polluted — tool call results, failed attempts, and error messages accumulate
  • State must be passed between sessions — restarting from scratch every time loses the learning from the previous loop
  • The prerequisite for Week 6’s instruction tuning: even if you add constraints in CLAUDE.md, a polluted context will cause the agent to “forget” those constraints

The central question for this week: how do we manage context deterministically?

To answer that question, we’ll first confirm with empirical data why Context Rot happens and how serious it is, then build up solutions from there.


What Is Context Rot?

Context Rot is the phenomenon where an agent’s context window becomes increasingly polluted over a long session:

```
Clean Context (initial):
[System Prompt] [Task Spec] [Current Code]

Context Rot (after 30 attempts):
[System Prompt] [Task Spec] [Failure #1] [Error #1] [Failure #2] [Error #2] … (128K tokens)
→ hallucinations occur, reasoning quality drops sharply
```

Empirical Data: Chroma’s Context Rot Research


“Longer context gets worse” is intuitive, but how much worse was first quantified by Chroma’s 2026 empirical study. Testing 18 frontier models (Claude, GPT, Gemini, Llama, and others):

| Finding | Data |
|---|---|
| Accuracy drop at mid-window position | 30%+ |
| Correlation between input length and accuracy | Negative across all models — no exceptions |
| Counter-intuitive result | Shuffled documents scored higher accuracy than logically ordered documents |

The last finding is especially important. When documents are arranged in logical order, models tend to judge “I already saw this earlier, I can skim the rest.” Shuffling forces attention at every position, which actually raises accuracy.

Why It Happens — Three Compounding Mechanisms


The Chroma study established how much accuracy degrades. Follow-up research identified why — three mechanisms that amplify each other:

| Mechanism | Description | Scaling |
|---|---|---|
| Lost-in-the-middle | Information in mid-context positions is retrieved less accurately than at the start or end | Position-dependent |
| Attention dilution | As context grows, each token’s share of attention decreases | Quadratic — doubling length quadruples dilution |
| Distractor interference | Irrelevant information actively degrades reasoning about relevant information | Proportional to noise ratio |

An important concept from this research: MECW (Maximum Effective Context Window) — the point at which a model’s accuracy on in-context information drops below a usable threshold. MECW ≠ the advertised maximum context. In benchmarks, top models begin to fail well before their nominal limits. This is why effective usage in the table below is 60-70%, not 100%.

The 1M Token Era — Is a Larger Window the Solution?


Frontier model context windows as of 2026:

| Model | Official Context | Effective Usage |
|---|---|---|
| Claude Opus/Sonnet 4.6 | 1M tokens | ~600-700K |
| GPT-5.4 | 1M tokens | ~600-700K |
| Gemini 2.5 Pro | 1M tokens | ~600-700K |

The reason effective usage is 60-70%: the remainder is consumed by the system prompt (~50K), tool schemas (~30K), and safety margin (~200K).
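The arithmetic behind that figure can be checked directly; the overhead numbers below are the estimates quoted above, not measured values:

```python
# Rough context budget for a nominal 1M-token window.
# Overhead figures are the estimates quoted in the text, not measured values.
NOMINAL = 1_000_000
overheads = {"system prompt": 50_000, "tool schemas": 30_000, "safety margin": 200_000}

effective = NOMINAL - sum(overheads.values())
print(f"Effective window: {effective:,} tokens ({effective / NOMINAL:.0%} of nominal)")
# → Effective window: 720,000 tokens (72% of nominal)
```

In practice the safety margin is the dominant cost: it is what keeps the session below the MECW threshold discussed above.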

Compaction Strategy — What to Discard, What to Keep


When auto-compaction fires, preservation priority is decisive:

| Priority | What to Keep | Why |
|---|---|---|
| 1 (highest) | System prompt + CLAUDE.md | The agent’s “constitution” — losing this erases behavioral rules |
| 2 | Last 4 messages | Immediate context of the current task |
| 3 | Tool results for the current task | The file just read, the test just run |
| 4 (lowest) | Old conversation + previous tool results | Can be replaced with a summary |

When NOT to compress: In the following situations, it is better to end the session and start fresh rather than compact:

  • The conversation topic has completely shifted (previous context is a hindrance)
  • The same error has repeated 5+ times in a row (Context Rot is already severe)
  • The summary itself is 50%+ the size of the original (no compression benefit)

This is why Huntley in Week 4 chose “fresh context” — between loop iterations, full reset + state file handoff is more deterministic than compaction.
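The priority table can be turned into a minimal compactor. This is a sketch of the heuristic approach, not Claude Code’s actual implementation; `summarize` stands in for an LLM summarization call:

```python
def compact(messages: list[dict], summarize) -> list[dict]:
    """Heuristic compaction following the priority table above (a sketch).

    Priority 1: system messages are always kept.
    Priority 2/3: the last 4 messages (which carry the current task's
    immediate context and tool results) are kept verbatim.
    Priority 4: everything older is collapsed into one summary message.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent, old = rest[-4:], rest[:-4]

    compacted = system[:]
    if old:
        compacted.append({"role": "user", "content": f"[Summary] {summarize(old)}"})
    return compacted + recent
```

Note what this rule cannot do: it keeps the last 4 messages whether or not they are relevant, which is exactly the limitation the optimization-based approaches below address.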

Beyond Heuristics — The ACON Compression Framework


Current compaction strategies (including Claude Code’s) are heuristic-based — rules like “keep last 4 messages” and “summarize the rest.” ACON (Agent Context Optimization, arXiv:2510.00615, October 2025) proposes treating compression as a formal optimization problem:

| Metric | Result |
|---|---|
| Peak token reduction | 26-54% |
| Accuracy preservation | 95%+ |
| Method | Gradient-free, API-compatible (works with any model via API) |

ACON’s key insight: instead of fixed rules, dynamically score each context segment’s contribution to the current task and prune those below a threshold. This is the difference between “always keep the last 4 messages” (heuristic) and “keep the messages most relevant to the current task” (optimization).
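The difference can be illustrated with a toy version of the optimization view. Word overlap here is only a stand-in for ACON’s actual learned, gradient-free scoring:

```python
def prune_by_relevance(segments: list[str], task: str, threshold: float = 0.2) -> list[str]:
    """Keep only context segments scored relevant to the current task.

    Scoring by word overlap is purely illustrative; ACON derives its
    scorer via gradient-free optimization against task performance.
    """
    task_words = set(task.lower().split())

    def score(segment: str) -> float:
        words = set(segment.lower().split())
        return len(words & task_words) / max(len(words), 1)

    return [s for s in segments if score(s) >= threshold]
```

A fixed heuristic keeps the last N segments regardless of content; this keeps whichever segments actually mention the task at hand.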

Claude Code currently uses heuristics, but ACON represents the direction the field is moving — from rule-based to optimization-based compaction.

The Ralph Loop Solution: Context Window Wiping


One of the key innovations of the Ralph Loop is completely resetting the context after a task completes or fails:

```python
class RalphContextManager:
    def __init__(self, max_tokens: int = 200_000):
        self.max_tokens = max_tokens
        self.state_file = "claude-progress.txt"

    def should_wipe_context(self, current_tokens: int) -> bool:
        """Reset context when more than 75% of the window is used."""
        return current_tokens > self.max_tokens * 0.75

    def build_fresh_context(self) -> str:
        """Deterministically reconstruct context from the state file.

        load_state() (not shown here) is assumed to parse self.state_file
        into a dict with 'completed_tasks', 'current_task', and
        'relevant_code_snippet' keys.
        """
        state = self.load_state()
        return f"""
# Project State
{state['completed_tasks']}

# Current Task
{state['current_task']}

# Relevant Code (current version only)
{state['relevant_code_snippet']}
"""

    def save_state(self, task: str, status: str):
        """Append state for the next loop iteration."""
        with open(self.state_file, "a") as f:
            f.write(f"[{status}] {task}\n")
```
The Ralph Loop persists state across resets in three files:

| File | Role |
|---|---|
| claude-progress.txt | Records completed/failed tasks |
| fix_plan.md | Structured task queue |
| @codebase_map.md | File structure snapshot (kept up to date) |

fix_plan.md Template:

```markdown
# Project: Calculator App

## Completed Tasks
- [x] Create basic file structure (2026-03-31 14:23)
- [x] Implement add() function and pass tests (2026-03-31 14:45)

## Current Task
- [ ] Implement subtract() function
  - Expected file: calculator.py:15-25
  - Related tests: tests/test_calculator.py:20-35

## Pending Tasks
- [ ] multiply() function
- [ ] divide() function (must handle division-by-zero exception)
```

Cost Optimization — Doing More on the Same Budget

The ultimate goal of context management is to do more useful work on the same budget. Empirical data shows that 40-70% of agent input tokens are wasted — duplicate tool results, unnecessary file contents, bloated system prompts.

Model Routing — You Don’t Need Opus for Everything

| Task Type | Share | Recommended Model | Cost per 1M tokens (input / output) |
|---|---|---|---|
| Simple lookups, formatting, type checking | 60-70% | Haiku | $1 / $5 |
| Standard coding, bug fixes, feature additions | 25-30% | Sonnet | $3 / $15 |
| Architecture design, complex debugging | 5-10% | Opus | $5 / $25 |

Model routing alone enables 5-8x cost reduction. Claude Code’s effort parameter (see Week 4) is the productized form of this routing.
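In code, routing is just a lookup from task category to model tier. The categories below follow the table above; the tier names are placeholders, not exact API model identifiers:

```python
# Routing table following the cost tiers above.
# Tier names are placeholders, not exact API model identifiers.
ROUTES = {
    "lookup": "haiku", "formatting": "haiku", "typecheck": "haiku",
    "coding": "sonnet", "bugfix": "sonnet", "feature": "sonnet",
    "architecture": "opus", "debugging": "opus",
}

def route_model(task_type: str) -> str:
    """Pick the cheapest model tier adequate for the task type."""
    return ROUTES.get(task_type, "sonnet")  # unknown tasks default to the middle tier
```

Defaulting unknown tasks to the middle tier keeps the common case cheap while avoiding the worst failure mode (sending a hard task to the weakest model).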

Prompt Caching — Turning Repetition into an Asset


On every agent turn, the system prompt, tool schemas, and CLAUDE.md contain the same content repeated. Prompt caching stores this static portion and reuses it:

| Operation | Price (vs. baseline) |
|---|---|
| Cache write (5-min TTL) | 1.25x |
| Cache write (1-hour TTL) | 2x |
| Cache read | 0.1x (90% savings) |

Implications for the loop paradigm:

  • Continuous session: Create cache on first turn → read at 0.1x on subsequent turns = very economical
  • Ralph fresh context: New session every loop → cache must be recreated = higher cost
  • Trade-off: Context Rot prevention (fresh) vs. cache efficiency (continuous). Same problem as Week 4’s Huntley Showdown
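The trade-off can be quantified with the multipliers from the table. A simplified sketch that prices only the static prefix, ignoring per-turn dynamic tokens; `price_per_mtok` is an assumed base input price:

```python
def prefix_cost(static_tokens: int, turns: int, fresh_each_turn: bool,
                price_per_mtok: float = 3.0) -> float:
    """Cost of (re)sending a static prefix over `turns` turns.

    Multipliers from the pricing table above: cache write 1.25x, read 0.1x.
    Simplified: only the static prefix is priced, dynamic tokens ignored.
    """
    WRITE, READ = 1.25, 0.10
    per_turn = static_tokens / 1_000_000 * price_per_mtok
    if fresh_each_turn:  # Ralph-style: new session each loop, cache rewritten
        return turns * per_turn * WRITE
    # continuous session: one cache write, then cheap reads
    return per_turn * WRITE + (turns - 1) * per_turn * READ

fresh = prefix_cost(100_000, turns=10, fresh_each_turn=True)
continuous = prefix_cost(100_000, turns=10, fresh_each_turn=False)
print(f"fresh: ${fresh:.2f}  continuous: ${continuous:.2f}")
```

On these assumptions the continuous session is several times cheaper per prefix — which is exactly why the fresh-context choice must buy enough quality (Context Rot prevention) to justify its cache cost.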

The Initializer Pattern — 2-Phase State Management


The 2-phase pattern recommended in Anthropic’s official harness guide systematizes the state file design above:

Phase 1 — Initializer (first loop):

  1. Parse requirements and generate a feature list as JSON
  2. Initialize claude-progress.txt
  3. Generate init.sh (environment setup script)
An example feature list (step 1):

```json
{
  "features": [
    {"id": "F001", "name": "User Authentication", "status": "pending", "files": ["src/auth.py"]},
    {"id": "F002", "name": "Dashboard UI", "status": "pending", "files": ["src/dashboard.py"]}
  ],
  "constraints": ["pytest must pass", "100% type hints"]
}
```

Phase 2 — Coding Agent (subsequent loops):

  1. Run init.sh to configure the environment
  2. Pull a "status": "pending" item from the JSON and work on it
  3. On completion: set "status": "done" + record in claude-progress.txt
  4. The next loop reads the JSON and picks up from remaining items

This pattern is a higher-level abstraction of the three state files in today’s Week 5 (claude-progress.txt, fix_plan.md, @codebase_map.md). The JSON feature list replaces fix_plan.md; init.sh replaces @codebase_map.md.


Persistent Knowledge Systems — The LLM Wiki Pattern


While the Initializer pattern manages state for a single task, the LLM Wiki pattern accumulates knowledge across an entire project permanently. Proposed by Karpathy, its core principle is “compile knowledge once, keep it current.”

Architecture:

| Layer | Role | Example |
|---|---|---|
| Raw Sources | Immutable original materials | Paper PDFs, meeting recordings, codebases |
| Wiki | LLM-generated and maintained markdown files | Summaries, cross-references, entity pages |
| Schema | Configuration file defining structure and workflows | Wiki rules described in CLAUDE.md |

Three Operations:

  1. Ingest — Read new source material, write summaries, update related wiki pages
  2. Query — Search the wiki to synthesize answers, file new discoveries back into the wiki
  3. Lint — Check for contradictions, stale information, orphaned pages, missing cross-references
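The Lint operation is the most mechanical of the three and easy to sketch. The `[[Page Name]]` wiki-link convention below is our assumption for illustration, not part of the pattern’s definition:

```python
import re

def lint_wiki(pages: dict[str, str]) -> list[str]:
    """Flag dangling links and orphaned pages.

    `pages` maps page name -> markdown body; links are assumed to look
    like [[Name]]. Contradiction and staleness checks would need an LLM
    pass and are out of scope for this sketch.
    """
    warnings, linked = [], set()
    for name, body in pages.items():
        for target in re.findall(r"\[\[([^\]]+)\]\]", body):
            linked.add(target)
            if target not in pages:
                warnings.append(f"dangling link in '{name}': [[{target}]]")
    for name in pages:
        if name not in linked and name != "index":  # the index page needs no inbound link
            warnings.append(f"orphaned page: '{name}'")
    return warnings
```

Running a check like this on every Ingest keeps the wiki trustworthy enough that Query can rely on it instead of rereading raw sources.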

This pattern is the evolutionary next step beyond the Initializer: 3 state files (single task) → JSON feature list (multi-feature project) → Wiki (persistent knowledge graph). The ultimate form of context management is not re-understanding everything each session, but referencing accumulated knowledge.


Knowledge Graph-Based Token Reduction — Graphify


Instead of putting raw files directly into context, transforming a codebase into a knowledge graph and querying only the needed parts can dramatically reduce token consumption. Graphify is a practical implementation of this approach.

How it works:

  • Code analysis: tree-sitter AST extraction parses classes, functions, imports, and call graphs locally (no LLM calls needed, code never leaves your machine)
  • Document/media analysis: Semantic extraction from PDFs, images, videos via LLM
  • Graph construction: NetworkX + Leiden clustering for community detection, interactive HTML visualization + JSON output
  • Confidence tagging: Relationships classified as EXTRACTED (confirmed), INFERRED (with confidence scores), or AMBIGUOUS

Token efficiency: 71.5x token reduction compared to reading raw files directly on mixed corpora. Once the graph is built, only incremental updates via SHA256 caching are needed.

This approach can be combined with the LLM Wiki pattern: use Graphify to graph the codebase structure, then layer insights and decisions on top with an LLM Wiki.
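The core idea of the code-analysis step — extract structure locally, without any LLM calls — can be shown in miniature. Graphify itself uses tree-sitter; the sketch below uses Python’s stdlib `ast` instead to stay self-contained and Python-only:

```python
import ast

def call_graph(source: str) -> dict[str, set[str]]:
    """Extract a function-level call graph from Python source, locally.

    Maps each top-level function name to the set of simple names it calls.
    A toy stand-in for Graphify's tree-sitter analysis: no LLM calls,
    and the source never leaves the machine.
    """
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return graph
```

An agent can then answer “what calls what?” from this compact graph instead of loading every raw file into context — the source of the token savings quoted above.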

LazyGraphRAG — Lightweight Graph Retrieval at Scale


While Graphify graphs a single codebase locally, Microsoft’s LazyGraphRAG applies graph-based retrieval to enterprise-scale document collections:

| Metric | LazyGraphRAG vs Full GraphRAG |
|---|---|
| Indexing cost | 0.1% (1/1000) |
| Query cost | 4% (1/25) |
| Quality | Outperforms vector RAG, RAPTOR, and competing methods |

LazyGraphRAG builds graphs via lazy evaluation without pre-summarization, achieving retrieval quality superior even to 1M-token context windows. It is deployed in production on Microsoft Discovery (Azure).

Graphify (local codebase, individual use) vs LazyGraphRAG (massive document sets, enterprise) — two points on the cost-quality spectrum for knowledge graph approaches.


Tool Landscape — Competing Context Strategies

Context management approaches differ by tool. As of 2026, several strategies are competing:

| Strategy | Representative Tool | Approach | Pros | Cons |
|---|---|---|---|---|
| Explicit | Cursor | User manually selects which files go into context | Precise control, minimal token waste | Manual labor, risk of omission |
| Ambient | Windsurf (Cascade) | Tool automatically detects relevant files | Convenient, prevents omissions | Risk of over-inclusion, token waste |
| Hybrid | Claude Code | File-based persistence (CLAUDE.md) + auto-compaction | Balanced, loop-friendly | Requires setup, learning curve |
| Tiered | GitHub Copilot | 3-tier memory (user/repo/session) + Agent Mode | Familiar VS Code UX, 28-day auto-memory | VS Code-centric, limited CLI control |
| Sandboxed | Codex CLI | Container isolation + AGENTS.md | Safest execution, tool-neutral | No MCP support, 192K context limit |
| Large-window | Gemini CLI | 1M context + GEMINI.md | Free tier, largest context | Newer ecosystem, less tooling |
| Autonomous | Devin | Full autonomous sessions | End-to-end automation | Degrades after ~35 min / 10 ACUs |

VS Code Copilot introduced 3-tier memory (user/repository/session) in 2026, separating user preferences (global) → project rules (repo) → current conversation (session). This is the same design principle as Claude Code’s 3-level CLAUDE.md hierarchy (global/project/local).

Instruction File Convergence — The Industry Validates File-Based Context


A remarkable trend in 2025-2026: every major AI coding tool independently converged on file-based context management. They compete on nearly everything else, but arrived at the same architecture for context management:

| Tool | Instruction File | Hierarchy |
|---|---|---|
| Claude Code | CLAUDE.md | 3-level: ~/.claude/ (global) → project/.claude/ (project) → workspace/ (local) |
| OpenAI Codex CLI | AGENTS.md | Hierarchical directory tree discovery (closest file wins) |
| Google Gemini CLI | GEMINI.md | Project root |
| Cursor | .cursor/rules/*.mdc | Glob-pattern scoped rules |
| GitHub Copilot | .github/copilot-instructions.md | Single file + 28-day auto-memory |
| Windsurf | Cascade Memories | Auto-generated from usage patterns |

Serialization Formats and Token Cost

When agents send structured data to LLMs, the serialization format can cause 2–3x differences in token consumption. Results from serializing the same 50-user list in 7 formats:

| Format | Tokens | LLM Accuracy | Best For |
|---|---|---|---|
| CSV | ~800 | 44.3% | Pure tables (accuracy risk) |
| Markdown-KV | ~950 | 60.7% | Simple key-value retrieval |
| TOON | 993 | 76.4% | Uniform array data |
| JSON (compact) | ~1,100 | 73.7% | General purpose — safest balance |
| JSON (pretty) | 1,481 | 75.0% | When human readability is needed |
| YAML | 1,710 | 74.5% | Nested configs, prompt structuring |
| XML | 2,690 | 72.1% | Legacy system integration |
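The size comparison is easy to reproduce approximately. The sketch below uses a crude ~4 characters-per-token heuristic rather than a real tokenizer, so absolute numbers won’t match the table, but the relative ordering of these three formats should:

```python
import csv
import io
import json

# Same shape of data as the benchmark: a 50-user list.
users = [{"id": i, "name": f"user{i}", "email": f"user{i}@example.com"}
         for i in range(50)]

def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token. Use a real tokenizer for precision.
    return len(text) // 4

compact = json.dumps(users, separators=(",", ":"))
pretty = json.dumps(users, indent=2)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "email"])
writer.writeheader()
writer.writerows(users)
table = buf.getvalue()

for label, text in [("csv", table), ("json-compact", compact), ("json-pretty", pretty)]:
    print(f"{label:>12}: ~{approx_tokens(text)} tokens")
```

CSV wins on size because field names appear once in the header instead of in every row — the same trick TOON borrows for uniform arrays.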

TOON — A Case Study in Token-Optimized Serialization


TOON (Token Oriented Object Notation, 2025) preserves JSON’s structure while removing quotes, braces, and commas, representing uniform arrays as CSV-style tables. With 23.7K GitHub stars and 1.6M monthly npm downloads, community interest is real.

TOON example — a uniform array becomes a header plus rows:

```
users[3]{id,name,email}:
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Carol,carol@example.com
```

Strengths: 40–60% token reduction on uniform arrays vs pretty JSON. Limitations: Can be 15–20% larger than compact JSON for non-uniform/nested data. The spec is a Working Draft (v3.0 reached in just 3 weeks from v0.8), and LLMs haven’t been trained on TOON, requiring format explanation in prompts.
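The uniform-array form is simple enough to encode by hand. A minimal sketch covering only flat, uniform rows (the real spec also handles nesting, quoting, and delimiters):

```python
def to_toon(key: str, rows: list[dict]) -> str:
    """Encode a uniform array of flat dicts in TOON's tabular form.

    Minimal sketch: assumes every row has the same keys and no values
    need quoting. The actual TOON spec covers far more.
    """
    fields = list(rows[0])
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = [",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])
```

Running this on the three-user list reproduces the example shown above: one header carrying the field names, then one comma-separated line per row.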

Practical recommendations:

  • Default: Compact JSON — all LLMs understand it, mature ecosystem. 25–40% savings vs pretty JSON.
  • Large uniform tables (100+ rows): Consider TOON or CSV — but always measure the accuracy trade-off.
  • API-based structured output: Function calling / structured output APIs are optimal (Microsoft measured: 42% savings vs JSON).
  • Week 7 multi-agent: Design inter-agent artifacts with compact JSON. See Week 7 Artifact Handoff.

Discussion Questions

  1. Does a 1M token context solve Context Rot? Answer using evidence from the Chroma research data.
  2. Ralph’s fresh context vs. continuous session — when you factor in caching costs (fresh = recreate every time, continuous = read at 0.1x), which is more economical? Under what conditions does this flip?
  3. What is the basis for the advice to keep CLAUDE.md under 200 lines? Connect your answer to SkillReducer’s “less-is-more” effect.
  4. Reason why shuffled documents scored higher accuracy than ordered documents in the Chroma study. What does this imply for context design?
  5. Why does the Initializer pattern store the feature list as JSON? What problem arises if you use Markdown instead?
  6. MECW (Maximum Effective Context Window) is always less than the advertised maximum. If 65% of enterprise AI failures stem from context drift during multi-step reasoning, how should you design agent sessions? Discuss in connection with the “sprint” pattern (short 5-10 message interactions with fresh context).

Exercises

  1. Measure Token Usage

    Run the /cost command in a Claude Code session to check the current session’s token usage. After 10 turns of conversation, measure again and record the increase.

    ```python
    import anthropic

    def count_tokens(messages: list) -> int:
        client = anthropic.Anthropic()
        response = client.messages.count_tokens(
            model="claude-opus-4-6",
            messages=messages,
        )
        return response.input_tokens
    ```
  2. Before/After Compaction Comparison

    In a session with 20+ turns of conversation, run the /compact command. Record and compare token counts, response quality, and task continuity before and after compression.

  3. Build a State File System

    Write helper functions that automatically update fix_plan.md and claude-progress.txt. Refer to state_tracker.py in Lab 05.

  4. Connect to Lab 05

    Based on the experiment results above, implement the four modules in Lab 05: token_counter.py, context_manager.py, state_tracker.py, and main.py.

Assignment

Submission deadline: 2026-04-07 23:59

Requirements:

  1. Ralph Loop with integrated token counter (ralph_with_context.sh)
  2. Context Rot simulation and a graph of measurement results
  3. Automatic context reset logic implementation
  4. Demonstration that the state tracking system (fix_plan.md + claude-progress.txt) works correctly

Key Takeaways

  1. Context Rot is an empirical phenomenon — Chroma study: accuracy degrades as input grows longer across all 18 models. 30%+ degradation at mid-window positions.
  2. A 1M token window is not the solution — a bigger window just means a larger space for Context Rot to occur in. Effective usage is 60-70%.
  3. Compaction is token-based — auto-triggered at ~75% of model maximum. Preserves the last 4 messages, summarizes the rest.
  4. Ralph’s fresh context is best between loops — use compaction inside a loop, full reset + state file handoff between loops.
  5. 40-70% of tokens are wasted — cut costs with model routing (Haiku 60-70%), prompt caching (read at 0.1x), and keeping CLAUDE.md under 200 lines.
  6. Initializer pattern — manage state deterministically with a 2-phase structure: JSON feature list + claude-progress.txt + init.sh.
  7. LLM Wiki pattern — beyond per-session state files, an architecture where LLMs permanently maintain a knowledge wiki. Knowledge accumulates through Ingest → Query → Lint cycles.
  8. Knowledge graph token reduction — tools like Graphify that transform codebases into graphs achieve 71.5x token savings vs raw file reading. LazyGraphRAG achieves similar quality at 0.1% of full GraphRAG cost for enterprise scale.
  9. Instruction file convergence — All major coding tools (Claude Code, Codex CLI, Gemini CLI, Cursor, Copilot, Windsurf) independently converged on file-based context persistence. Configuration files are the universal answer to cross-session context management.
  10. MECW < advertised maximum — Maximum Effective Context Window is always less than the nominal limit. 65% of enterprise failures stem from context drift, not context exhaustion. Design for short, focused sessions.