
Week 14: Ralphthon Execution

Phase 5 · Week 14 · Ralphthon · Lecture: 2026-06-02

Concepts

Define what a “closed-loop MVP” is and is not — a six-component checklist — and pin the minimum state your team must reach by the end of Week 14.

Design

Specify handoff contracts among Lead / Planner / Worker / Reviewer / Operator with explicit input/output schemas, retry budgets, and escalation conditions.

Implementation

Build a runtime skeleton that runs with a single make run (or make replay) and complete the first run.started → run.closed cycle.

Operations

Use the five runbook patterns and the pivot decision tree to author a midpoint report with evidence-backed risks and scope cuts.


This week’s goal is not a polished product. It is a closed-loop MVP: one task packet goes in, the agent executes, a gate decides, and the event log enables replay.

Closed-loop MVP — Six Components
① Task packet: JSON Schema validated
② Worker loop: a single worker executes
③ Tool boundary: allow / deny enforced
④ Deterministic gate: at least one automated test
⑤ Event store: .events.jsonl, replayable
⑥ Run report: success / failure run ids tracked

All six must be alive simultaneously. A missing component fails the Week 14 Definition of Done.

| Failure mode | Signal | Response |
|--------------|--------|----------|
| Open loop (gate missing) | run ends without run.closed | add at least one deterministic gate immediately |
| Replay impossible | partial event log | check flush policy, force run.started/run.closed |
| Tool boundary ignored | worker edits unscoped files | validate tool policy before invocation |
| Demo dependency | demo only runs by hand | unify behind make run TASK=... |
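
The "validate tool policy before invocation" response can be sketched as a small path check that runs before any write tool is called. This is a minimal sketch, not the course's reference implementation: `ALLOWED_PREFIXES`, `is_path_allowed`, and `guarded_write` are hypothetical names, and the allowlist values are placeholders a team would replace with the scope from its own task packet.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist: directories this task packet may touch.
ALLOWED_PREFIXES = ("runtime/", "tasks/")


def is_path_allowed(path: str, allowed_prefixes: tuple[str, ...] = ALLOWED_PREFIXES) -> bool:
    """Return True only for relative paths that stay inside an allowed prefix."""
    candidate = PurePosixPath(path)
    # Reject absolute paths and any attempt to climb out with "..".
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    normalized = candidate.as_posix()  # a leading "./" is already dropped here
    return any(normalized.startswith(prefix) for prefix in allowed_prefixes)


def guarded_write(path: str, content: str, writer) -> None:
    """Check the policy first; raise instead of writing on a breach."""
    if not is_path_allowed(path):
        raise PermissionError(f"tool boundary breach: {path}")
    writer(path, content)
```

Raising before the writer runs keeps the breach out of the working tree, so the runner only has to catch `PermissionError`, log it, and close the run.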
  1. Day 1: runtime skeleton

    • Lock the repository layout
    • Create tasks/, runs/, artifacts/, reports/
    • Author task-packet schema v1
    • Record the first run.started / run.closed events
  2. Day 2: first agent loop

    • Connect the agent CLI or API client
    • A single worker performs one small task
    • Run a deterministic gate (pytest/lint/schema)
    • On failure, do not retry — fail clearly
  3. Day 3: review and retry

    • Add a Reviewer or LLM Judge
    • Distinguish fail / revise / pass
    • Apply retry budget and max_turns
    • Verify event-log replay
  4. Day 4: telemetry and dashboard

    • Connect OpenTelemetry or CSV metrics
    • Track tokens, latency, pass rate, judge score
    • Run the demo three times in a row
    • Write the failure runbook
  5. Day 5: midpoint report

    • Write the midpoint report
    • Cut remaining scope
    • Lock the demo path
    • List risks for the final week

| Day | Most common failure | Preventive action |
|-----|---------------------|-------------------|
| Day 1 | directories created, first event missing | add a run.started/run.closed smoke test in the first commit |
| Day 2 | infinite retry | apply max_turns 6, max_tokens 120K immediately |
| Day 3 | broken judge JSON | structured output + fallback verdict=revise |
| Day 4 | metrics exist but no dashboard | even a CSV-to-Streamlit page or static HTML works |
| Day 5 | scope refuses to shrink | delete every “Could have” before writing the report |
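
The Day 2 preventive action (max_turns 6, max_tokens 120K) can be sketched as a loop that stops on whichever limit is hit first. This is an illustrative sketch: `Budget`, `run_with_budget`, and the `step` callable are hypothetical names, and `step` is assumed to return a `(passed, tokens_used)` pair for one worker turn.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_turns: int = 6         # value from the Day 2 row
    max_tokens: int = 120_000  # "120K" from the same row


def run_with_budget(step, budget: Budget) -> dict:
    """Call step() until it passes or the budget is exhausted.

    step is a hypothetical callable returning (passed: bool, tokens_used: int).
    """
    turns = 0
    tokens = 0
    while turns < budget.max_turns and tokens < budget.max_tokens:
        passed, used = step()
        turns += 1
        tokens += used
        if passed:
            return {"status": "pass", "turns": turns, "tokens": tokens}
    # Record which limit stopped the run so the failure is explainable later.
    reason = "max_turns" if turns >= budget.max_turns else "max_tokens"
    return {"status": "fail", "failure_reason": reason, "turns": turns, "tokens": tokens}
```

Because the loop condition checks both limits before every turn, the run cannot retry forever, and the returned `failure_reason` can go straight into the event log.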

You do not need a complex multi-agent system on day one. You do need to separate the artifacts.

| Stage | Input | Output |
|-------|-------|--------|
| Lead | problem statement | task packet |
| Planner | task packet | implementation plan |
| Worker | plan + repo | patch / artifact |
| Reviewer | patch + criteria | review verdict |
| Gate | verdict + tests | pass / revise / fail |

# contracts.py: pydantic-enforced step contracts
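from typing import Literal

from pydantic import BaseModel


# Lead output: what to do and within which limits.
class LeadDirectiveV1(BaseModel):
    task_id: str
    objective: str
    risk_boundary: str
    deadline: str


# Planner output: the steps and files the worker is expected to touch.
class PlanV1(BaseModel):
    task_id: str
    steps: list[str]
    files_to_modify: list[str]
    estimated_turns: int


# Worker output: where the patch lives and how to test it.
class WorkerReportV1(BaseModel):
    task_id: str
    run_id: str
    patch_path: str
    tests_command: str
    notes: str


# Reviewer output: the verdict the gate consumes.
class ReviewResultV1(BaseModel):
    run_id: str
    verdict: Literal["pass", "revise", "fail"]
    reasons: list[str]
    judge_overall: float | None = None  # "X | None" syntax needs Python 3.10+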

When every step refuses to forward an artifact that fails its pydantic schema, 90%+ of typo and formatting errors disappear by midweek.
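
The Gate stage (verdict + tests in, pass / revise / fail out) can be sketched as a pure function over the reviewer verdict and the deterministic test result. The precedence rules below are an assumption made for illustration, not something the course prescribes: a reviewer "fail" is final, a "pass" needs green tests, and anything else becomes "revise" only while retries remain. `gate_decision` and `retries_left` are hypothetical names.

```python
from typing import Literal

Verdict = Literal["pass", "revise", "fail"]


def gate_decision(review_verdict: Verdict, tests_passed: bool, retries_left: int) -> Verdict:
    """Combine the reviewer verdict with the deterministic test result."""
    # Assumption: an explicit reviewer "fail" is never overridden.
    if review_verdict == "fail":
        return "fail"
    # A pass requires both signals; a failing test can never become a pass.
    if review_verdict == "pass" and tests_passed:
        return "pass"
    # Otherwise ask for a revision while the retry budget allows it.
    return "revise" if retries_left > 0 else "fail"
```

Keeping the gate free of I/O makes it easy to unit-test and to re-run during replay with the same inputs.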

capstone/teams/[team]/
├── README.md
├── design.md
├── tasks/
│   ├── task-001.yaml
│   └── task-002.yaml
├── runtime/
│   ├── runner.py
│   ├── events.py
│   ├── judge.py
│   └── gates.py
├── runs/
│   └── run-001.events.jsonl
├── artifacts/
│   └── run-001.patch
└── reports/
    └── progress-week14.md

If teammates remember different commands, reproducibility breaks. Use make run (or uv run) as the single entry point for everyone.

make run TASK=tasks/task-001.yaml
make replay RUN=runs/run-001.events.jsonl
make test
make report
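
What make replay does is left to each team. One possible core, sketched here under assumptions, is a function that reads the JSONL event lines and recomputes the final status. The event shape is hypothetical: each line is a JSON object with a `type` field, and the `run.closed` event carries a `status` field.

```python
import json


def replay_status(lines) -> str:
    """Recompute a run's final status from its event lines.

    Assumed (hypothetical) event shape:
      {"type": "run.started"} ... {"type": "run.closed", "status": "pass"}
    """
    started = False
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the log
        event = json.loads(raw)
        if event["type"] == "run.started":
            started = True
        elif event["type"] == "run.closed":
            # A close without a start means the log is incomplete.
            return event["status"] if started else "invalid"
    # No run.closed seen: the loop never closed.
    return "open" if started else "invalid"
```

Passing an open file handle for runs/run-001.events.jsonl works as the `lines` argument, since a text file iterates line by line. An "open" result is the same signal as the open-loop failure mode: the run ended without run.closed.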

Improvising every time something breaks repeats the same mistakes. Pre-define five standard patterns.

| Pattern | Trigger | Immediate action | Follow-up |
|---------|---------|------------------|-----------|
| Tool boundary breach | worker edits an unscoped file | abort run, drop patch | enable strict tool policy |
| Test thrash | the same test fails twice | stop retry, log failure_reason | split the task into smaller pieces |
| Judge JSON corruption | invalid JSON rate > 10% | repair step + fallback verdict | turn on structured output |
| Cost spiral | per-run token budget exceeded | force-close the run | tighten max_turns / context trim |
| Demo flake | one of three reruns fails | shrink the live demo | keep a recorded fallback |

| Problem | Immediate action | Long-term fix | Recurrence prevention |
|---------|------------------|---------------|-----------------------|
| Agent edits an unscoped file | abort the run, drop the patch | tighten task-packet scope and tool policy | path allowlist check inside the tool wrapper |
| Tests keep failing | stop retry, record failure_reason | split the task | put smoke tests inside the task packet |
| Judge breaks JSON | mark schema-validation failure | structured output or repair step | stress “JSON only” in the judge prompt |
| Token cost overruns budget | apply max_turns / max_tokens | reorganize prompt prefix, raise cache hit | cost-dashboard alarm threshold |
| Teammate file conflicts | declare file ownership | use worktrees or branches | record file owner in the task packet |
| Unstable demo | prepare a recording | shorten the happy path | add a demo-regression test |
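
The "repair step + fallback verdict" response to judge JSON corruption can be sketched with the standard library alone. This is one simple repair strategy chosen for illustration (keep only the outermost braces), and `parse_judge_output` and the `schema_ok` flag are hypothetical names. Structured output from the model, where available, remains the more robust fix.

```python
import json

VALID_VERDICTS = ("pass", "revise", "fail")


def parse_judge_output(raw: str) -> dict:
    """Parse judge output; fall back to verdict=revise when it cannot be trusted."""
    # Repair step: keep only the outermost {...} span, dropping any prose
    # or code-fence markers the judge wrapped around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and data.get("verdict") in VALID_VERDICTS:
            return {
                "verdict": data["verdict"],
                "reasons": data.get("reasons", []),
                "schema_ok": True,
            }
    # Fallback: never crash the run on a malformed judge reply.
    return {
        "verdict": "revise",
        "reasons": ["judge output failed schema validation"],
        "schema_ok": False,
    }
```

Counting the replies where `schema_ok` is False gives the invalid-JSON rate that the runbook trigger (> 10%) refers to.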

Teams hesitate to pivot because “how much should we cut?” feels vague. A decision tree fixes that.

Pivot Decision Tree
Q1. Has the happy path passed end-to-end at least once?
    ✓ Yes → move to reliability hardening
    ✗ No → force a scope cut
Q2. Can the latest failure be reconstructed from the event log?
    ✓ Yes → continue
    ✗ No → harden the event store first
Q3. Is at least one automated gate working?
    ✓ Yes → evaluate “Should have”
    ✗ No → add a single smoke test
Q4. Is the demo path narrowed to one?
    ✓ Yes → proceed to reliability / cost / report
    ✗ No → collapse to one immediately
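
Read top to bottom, the tree reduces to four ordered checks in which the first "no" decides the next action. The sketch below encodes that reading. The sequential interpretation is an assumption, and `pivot_action` and its parameter names are hypothetical.

```python
def pivot_action(
    happy_path_passed: bool,
    failure_reconstructable: bool,
    gate_automated: bool,
    single_demo_path: bool,
) -> str:
    """Return the action for the first question answered "no" (Q1 to Q4)."""
    if not happy_path_passed:          # Q1
        return "force a scope cut"
    if not failure_reconstructable:    # Q2
        return "harden the event store first"
    if not gate_automated:             # Q3
        return "add a single smoke test"
    if not single_demo_path:           # Q4
        return "collapse to one demo path immediately"
    return "proceed to reliability / cost / report"
```

For example, a team with a passing happy path and a replayable log but no automated gate gets "add a single smoke test" before it considers anything else.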

If any of the following holds at the midpoint, cut scope.

  • The happy path has not yet passed end-to-end.
  • Without the event log you cannot reconstruct the failure.
  • Not a single gate (judge, policy, test) is automated.
  • Model connectivity issues are eating the schedule.
  • Teammates are building toward different goals.

A demo built without operational evidence makes Week 15 integration collapse. If any of the following patterns is present in Week 14, you are already carrying observability debt.

| Debt pattern | Meaning | Pay it down by |
|--------------|---------|----------------|
| logs without run id | cannot identify which run produced the log | replace prints with logger calls |
| missing event flush | events lost on crash | per-line fsync or transactional log |
| metrics without alerts | breaches go unnoticed | add a single alert (cost or failure rate) |
| dashboard only on one screen | team cannot see the same view | export a shareable static URL |
| overrides only in chat | no audit | record human.override events |
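
Two of these debts (logs without run id, missing event flush) can be paid down with one small append helper. This is a minimal sketch of the per-line fsync option: `append_event` and the field names `ts`, `run_id`, and `type` are hypothetical, not a schema the course defines.

```python
import json
import os
import time


def append_event(path: str, run_id: str, event_type: str, **fields) -> dict:
    """Append one event as a JSON line and force it to disk before returning."""
    event = {"ts": time.time(), "run_id": run_id, "type": event_type, **fields}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write the file to storage
    return event
```

A human override then becomes one more call, for example `append_event(path, "run-001", "human.override", reason="...")`, so it lands in the same audit trail. Per-event fsync costs latency, which is usually acceptable at the event rates of a single-worker run.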

In mentoring time, instead of narrating progress, be ready to answer these in one sentence.

  1. What is the current demo path?
  2. What is the last successful run id?
  3. What is the failure reason for the last failed run?
  4. Which scope items can still be dropped?
  5. Which numbers will appear on the final-presentation slide?

A progress report is evidence-driven, not narrative.

# Week 14 Progress Report
## Current demo path
- Task packet:
- Last successful run id:
- Failure recovery path:
## Runtime layers completed
| Layer | Status | Evidence |
|-------|--------|----------|
| Tool boundary | | |
| Event store | | |
| Policy gate | | |
| Orchestration | | |
## Metrics
| Run id | Latency | Tokens | Gate result | Notes |
|--------|---------|--------|-------------|-------|
## Quantitative summary
- success_rate (3 attempts): %
- p95 latency: s
- token cost / run: $
- override count:
## Risks and scope cuts
1.
2.
3.

| Item | Minimum pass |
|------|--------------|
| Closed loop | a single task packet executes end-to-end |
| Event log | run.started, key tool events, and run.closed are recorded |
| Gate | at least one deterministic gate runs automatically |
| Replay | the event log recomputes the final status |
| Report | the last successful run and the last failed run are explained |
| Dashboard | at least one of success rate / cost / latency is visualized |

If you cannot pass these, Week 15 is dedicated to closing the loop, not adding features.
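
The quantitative summary in the report template (success_rate, p95 latency, tokens per run) can be computed from per-run records. This is a sketch under assumptions: the record keys `passed`, `latency_s`, and `tokens` are hypothetical, and p95 uses the nearest-rank method, which with only three runs is simply the slowest run. Converting tokens to dollars depends on the model's pricing and is left out.

```python
import math


def summarize_runs(runs: list[dict]) -> dict:
    """Summarize runs shaped like {"passed": bool, "latency_s": float, "tokens": int}."""
    if not runs:
        raise ValueError("no runs to summarize")
    latencies = sorted(r["latency_s"] for r in runs)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(latencies))
    return {
        "success_rate": sum(r["passed"] for r in runs) / len(runs),
        "p95_latency_s": latencies[rank - 1],
        "avg_tokens_per_run": sum(r["tokens"] for r in runs) / len(runs),
    }
```

Feeding it the three consecutive demo runs from Day 4 yields the numbers for the "success_rate (3 attempts)" line directly.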

  1. Day 1 — build the runtime skeleton

    Land directories and the first run.started/run.closed event in one sitting.

  2. Day 2 — first agent loop

    With one task packet, have the worker produce a patch and pass a deterministic gate.

  3. Day 3 — reviewer + judge integration

    Apply handoff contracts and separate fail / revise / pass.

  4. Day 4 — telemetry + dashboard

    Expose tokens / latency / pass rate as metrics and stand up a four-panel dashboard.

  5. Day 5 — submit the midpoint report

    Write an evidence-driven progress report and decide on scope cuts.

Due: 2026-06-09 23:59

Submission path: capstone/teams/[team]/reports/progress-week14.md

Required:

  1. Current demo path and the latest successful run id
  2. Completed runtime-layer list
  3. One .events.jsonl example with replay output
  4. Implemented items among deterministic gate, judge gate, and human review
  5. Top three risks and the scope-cut plan
  6. Draft demo script for the final presentation
  7. Quantitative summary table (success_rate, p95 latency, token cost)

  1. The goal is a closed-loop MVP: a small task goes in, passes a gate, and the event log can replay.
  2. Six components live together: task packet / worker loop / tool boundary / deterministic gate / event store / run report.
  3. Handoff contracts enforced by schema: pydantic or JSON Schema absorbs typos and formatting drift.
  4. Carry five runbook patterns: tool breach, test thrash, judge JSON corruption, cost spiral, demo flake.
  5. Pivot through the decision tree: happy path → event log → gate → demo path; any “no” forces a scope cut.
  6. Pay observability debt now: logs without run_id, missing flushes, and alertless metrics must be cleared in Week 14.
  7. The midpoint report is evidence-driven: run ids, metric tables, and three scope cuts make Week 15 integration possible.
