Concepts
Define Context Rot and explain why it persists in long-context models in terms of attention dilution and instruction drift.
Concepts
Define Context Rot and explain why it persists in long-context models in terms of attention dilution and instruction drift.
Design
Compare three strategies — context window wiping, summarization, state-tracking files — and decide which combination fits a Ralph loop.
Implementation
Author tasks.md (or an equivalent state file) that serializes compact progress every turn in a token-efficient way.
Operations
Measure cache hit ratio, prompt tokens, and summarization cost to report the ROI of your context strategy quantitatively.
In Week 4 we learned the power of the loop paradigm — calling the same model repeatedly while using deterministic validation to ensure quality. But for a loop to work, there is a prerequisite: the context must be clean.
Week 4’s Huntley knew about this problem. That’s why he chose a “fresh context” strategy — starting a new session on every loop iteration. But that alone isn’t enough:
The central question for this week: how do we manage context deterministically?
To answer that question, we’ll first confirm with empirical data why Context Rot happens and how serious it is, then build up solutions from there.
Context Rot is the phenomenon where an agent’s context window becomes increasingly polluted over a long session:
“Longer context gets worse” is intuitive, but Chroma’s 2026 empirical study provides a representative quantitative measurement of how much worse it gets. Testing 18 frontier models (Claude, GPT, Gemini, Llama, and others):
| Finding | Data |
|---|---|
| Accuracy drop at mid-window position | 30%+ |
| Correlation between input length and accuracy | Negative across all models — no exceptions |
| Counter-intuitive result | Shuffled documents scored higher accuracy than logically ordered documents |
The last finding is especially important. When documents are arranged in logical order, models tend to judge “I already saw this earlier, I can skim the rest.” Shuffling forces attention at every position, which actually raises accuracy.
Frontier model context windows as of 2026:
| Model | Official Context | Effective Usage |
|---|---|---|
| current Claude frontier family | 1M-token class | ~600-700K |
| GPT-5.4 | 1M tokens | ~600-700K |
| Gemini 2.5 Pro | 1M tokens | ~600-700K |
The reason effective usage is 60-70%: the remainder is consumed by the system prompt (~50K), tool schemas (~30K), and safety margin (~200K).
When auto-compaction fires, preservation priority is decisive:
| Priority | What to Keep | Why |
|---|---|---|
| 1 (highest) | System prompt + CLAUDE.md | The agent’s “constitution” — losing this erases behavioral rules |
| 2 | Last 4 messages | Immediate context of the current task |
| 3 | Tool results for the current task | The file just read, the test just run |
| 4 (lowest) | Old conversation + previous tool results | Can be replaced with a summary |
When NOT to compress: In the following situations, it is better to end the session and start fresh rather than compact:
This is why Huntley in Week 4 chose “fresh context” — between loop iterations, full reset + state file handoff is more deterministic than compaction.
One of the key innovations of the Ralph Loop is completely resetting the context after a task completes or fails:
class RalphContextManager: def __init__(self, max_tokens: int = 200_000): self.max_tokens = max_tokens self.state_file = "claude-progress.txt"
def should_wipe_context(self, current_tokens: int) -> bool: """Reset context when more than 75% of the window is used""" return current_tokens > self.max_tokens * 0.75
def build_fresh_context(self) -> str: """Deterministically reconstruct context from the state file""" state = self.load_state() return f"""# Project State{state['completed_tasks']}
# Current Task{state['current_task']}
# Relevant Code (current version only){state['relevant_code_snippet']}"""
def save_state(self, task: str, status: str): """Save state for the next loop iteration""" with open(self.state_file, 'a') as f: f.write(f"[{status}] {task}\n")fix_plan.md Template:
# Project: Calculator App## Completed Tasks- [x] Create basic file structure (2026-03-31 14:23)- [x] Implement add() function and pass tests (2026-03-31 14:45)
## Current Task- [ ] Implement subtract() function - Expected file: calculator.py:15-25 - Related tests: tests/test_calculator.py:20-35
## Pending Tasks- [ ] multiply() function- [ ] divide() function (must handle division-by-zero exception)The ultimate goal of context management is to do more useful work on the same budget. Empirical data shows that 40-70% of agent input tokens are wasted — duplicate tool results, unnecessary file contents, bloated system prompts.
| Task Type | Share | Recommended Model | Cost (1M tokens) |
|---|---|---|---|
| Simple lookups, formatting, type checking | 60-70% | Haiku | $1 / $5 |
| Standard coding, bug fixes, feature additions | 25-30% | Sonnet | $15 / $75 |
| Architecture design, complex debugging | 5-10% | Opus | $15 / $75 |
Model routing alone enables 5-8x cost reduction. Claude Code’s effort parameter (see Week 4) is the productized form of this routing.
On every agent turn, the system prompt, tool schemas, and CLAUDE.md contain the same content repeated. Prompt caching stores this static portion and reuses it:
| Operation | Price (vs. baseline) |
|---|---|
| Cache write (5-min TTL) | 1.25x |
| Cache write (1-hour TTL) | 2x |
| Cache read | 0.1x (90% savings) |
Implications for the loop paradigm:
The 2-phase pattern recommended in Anthropic’s official harness guide systematizes the state file design above:
Phase 1 — Initializer (first loop):
claude-progress.txtinit.sh (environment setup script){ "features": [ {"id": "F001", "name": "User Authentication", "status": "pending", "files": ["src/auth.py"]}, {"id": "F002", "name": "Dashboard UI", "status": "pending", "files": ["src/dashboard.py"]} ], "constraints": ["pytest must pass", "100% type hints"]}Phase 2 — Coding Agent (subsequent loops):
init.sh to configure the environment"status": "pending" item from the JSON and work on it"status": "done" + record in claude-progress.txtThis pattern is a higher-level abstraction of the three state files in today’s Week 5 (claude-progress.txt, fix_plan.md, @codebase_map.md). The JSON feature list replaces fix_plan.md; init.sh replaces @codebase_map.md.
Context management approaches differ by tool. As of 2026, three strategies are competing:
| Strategy | Representative Tool | Approach | Pros | Cons |
|---|---|---|---|---|
| Explicit | Cursor | User manually selects which files go into context | Precise control, minimal token waste | Manual labor, risk of omission |
| Ambient | Windsurf (Cascade) | Tool automatically detects relevant files | Convenient, prevents omissions | Risk of over-inclusion, token waste |
| Hybrid | Claude Code | File-based persistence (CLAUDE.md) + auto-compaction | Balanced, loop-friendly | Requires setup, learning curve |
VS Code Copilot introduced 3-tier memory (user/repository/session) in 2026, separating user preferences (global) → project rules (repo) → current conversation (session). This is the same design principle as Claude Code’s 3-level CLAUDE.md hierarchy (global/project/local).
When agents send structured data to LLMs, the serialization format can cause 2–3x differences in token consumption. Results from serializing the same 50-user list in 7 formats:
| Format | Tokens | LLM Accuracy | Best For |
|---|---|---|---|
| CSV | ~800 | 44.3% | Pure tables (accuracy risk) |
| Markdown-KV | ~950 | 60.7% | Simple key-value retrieval |
| TOON | 993 | 76.4% | Uniform array data |
| JSON (compact) | ~1,100 | 73.7% | General purpose — safest balance |
| JSON (pretty) | 1,481 | 75.0% | When human readability is needed |
| YAML | 1,710 | 74.5% | Nested configs, prompt structuring |
| XML | 2,690 | 72.1% | Legacy system integration |
TOON (Token Oriented Object Notation, 2025) preserves JSON’s structure while removing quotes, braces, and commas, representing uniform arrays as CSV-style tables. With 23.7K GitHub stars and 1.6M monthly npm downloads, community interest is real.
// TOON example: uniform array → header + row formatusers[3]{id,name,email}: 1,Alice,alice@example.com 2,Bob,bob@example.com 3,Carol,carol@example.comStrengths: 40–60% token reduction on uniform arrays vs pretty JSON. Limitations: Can be 15–20% larger than compact JSON for non-uniform/nested data. The spec is a Working Draft (v3.0 reached in just 3 weeks from v0.8), and LLMs haven’t been trained on TOON, requiring format explanation in prompts.
Measure Token Usage
Run the /cost command in a Claude Code session to check the current session’s token usage. After 10 turns of conversation, measure again and record the increase.
import osimport anthropic
def count_tokens(messages: list) -> int: client = anthropic.Anthropic() response = client.messages.count_tokens( model=os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-5"), messages=messages ) return response.input_tokensBefore/After Compaction Comparison
In a session with 20+ turns of conversation, run the /compact command. Record and compare token counts, response quality, and task continuity before and after compression.
Build a State File System
Write helper functions that automatically update fix_plan.md and claude-progress.txt. Refer to state_tracker.py in Lab 05.
Connect to Lab 05
Based on the experiment results above, implement the four modules in Lab 05: token_counter.py, context_manager.py, state_tracker.py, and main.py.
Submission deadline: 2026-04-07 23:59
Requirements:
ralph_with_context.sh)fix_plan.md + claude-progress.txt) works correctly