Week 14: Ralphthon Execution

Phase 5 · Week 14 · Ralphthon · Lecture: 2026-06-02

Theory

Learning Objectives

Concepts

Define what a “closed-loop MVP” is and is not — a six-component checklist — and pin the minimum state your team must reach by the end of Week 14.

Design

Specify handoff contracts among Lead / Planner / Worker / Reviewer / Operator with explicit input/output schemas, retry budgets, and escalation conditions.

Implementation

Build a runtime skeleton that runs with a single `make run` (or `make replay`) and complete the first `run.started` → `run.closed` cycle.

Operations

Use the five runbook patterns and the pivot decision tree to author a midpoint report with evidence-backed risks and scope cuts.

Defining the closed-loop MVP

This week’s goal is not a polished product. It is a closed-loop MVP: one task packet goes in, the agent executes, a gate decides, and the event log enables replay.

The six components of a closed-loop MVP

Closed-loop MVP — Six Components

① Task packet: JSON Schema validated

② Worker loop: a single worker executes

③ Tool boundary: allow / deny enforced

④ Deterministic gate: at least one automated test

⑤ Event store: .events.jsonl, replayable

⑥ Run report: success / failure run ids tracked

All six must be alive simultaneously. A missing component fails the Week 14 Definition of Done.
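To make the loop concrete, here is a minimal sketch of one packet going in, a worker running, a gate deciding, and events being recorded. The function name `run_closed_loop`, the required packet keys, and the event names other than `run.started` and `run.closed` are assumptions made for illustration, not part of the course runtime.

```python
import uuid
from typing import Callable

# Hypothetical minimal packet schema; a real packet would be JSON Schema validated.
REQUIRED_PACKET_KEYS = {"task_id", "objective"}


def run_closed_loop(
    packet: dict,
    worker: Callable[[dict], str],
    gate: Callable[[str], bool],
) -> tuple[str, list[dict]]:
    """Run one packet through worker and gate; return (status, events)."""
    run_id = f"run-{uuid.uuid4().hex[:8]}"
    events: list[dict] = []

    def emit(event_type: str, **payload) -> None:
        events.append({"run_id": run_id, "type": event_type, **payload})

    emit("run.started", task_id=packet.get("task_id"))

    missing = REQUIRED_PACKET_KEYS - set(packet)
    if missing:
        # An invalid packet still closes the run, so the loop never stays open.
        emit("run.closed", status="fail", reason=f"missing keys: {sorted(missing)}")
        return "fail", events

    artifact = worker(packet)
    emit("worker.finished", artifact=artifact)

    passed = gate(artifact)
    emit("gate.evaluated", passed=passed)

    status = "pass" if passed else "fail"
    emit("run.closed", status=status)
    return status, events
```

The sketch keeps events in memory and does not handle a worker that raises; a real runner would persist each event and close the run in a `finally` block.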

Common failure modes

Failure mode	Signal	Response
Open loop (gate missing)	run ends without `run.closed`	add at least one deterministic gate immediately
Replay impossible	partial event log	check the flush policy; always emit `run.started` and `run.closed`
Tool boundary ignored	worker edits unscoped files	validate tool policy before invocation
Demo dependency	demo only runs by hand	unify behind `make run TASK=...`
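The "validate tool policy before invocation" response can be sketched as a check that runs before any tool call touches the file system. The policy keys `allowed_tools` and `writable_dirs` and both function names are hypothetical; only the idea of an allow / deny check comes from the table above.

```python
from pathlib import PurePosixPath


def is_path_allowed(path: str, writable_dirs: list[str]) -> bool:
    """Return True only if `path` stays inside one of the writable directories."""
    candidate = PurePosixPath(path)
    # Reject absolute paths and parent-directory escapes outright.
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    for directory in writable_dirs:
        prefix = PurePosixPath(directory).parts
        # Compare whole path components so "srcx/" does not match "src".
        if prefix and candidate.parts[: len(prefix)] == prefix:
            return True
    return False


def check_tool_call(tool: str, path: str, policy: dict) -> None:
    """Raise PermissionError before the tool runs if the call is out of scope."""
    if tool not in policy["allowed_tools"]:
        raise PermissionError(f"tool not allowed: {tool}")
    if not is_path_allowed(path, policy["writable_dirs"]):
        raise PermissionError(f"path outside task scope: {path}")
```

This only inspects the path string; it does not resolve symlinks, so treat it as a first line of defense rather than a sandbox.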

Day-by-day plan

Day 1: runtime skeleton
- Lock the repository layout
- Create tasks/, runs/, artifacts/, reports/
- Author task-packet schema v1
- Record the first run.started / run.closed events
Day 2: first agent loop
- Connect the agent CLI or API client
- A single worker performs one small task
- Run a deterministic gate (pytest/lint/schema)
- On failure, do not retry — fail clearly
Day 3: review and retry
- Add a Reviewer or LLM Judge
- Distinguish fail / revise / pass
- Apply retry budget and max_turns
- Verify event-log replay
Day 4: telemetry and dashboard
- Connect OpenTelemetry or CSV metrics
- Track tokens, latency, pass rate, judge score
- Run the demo three times in a row
- Write the failure runbook
Day 5: midpoint report
- Write the midpoint report
- Cut remaining scope
- Lock the demo path
- List risks for the final week

Likely failure modes by day

Day	Most common failure	Preventive action
Day 1	directories created, first event missing	add a `run.started`/`run.closed` smoke test in the first commit
Day 2	infinite retry	apply max_turns 6, max_tokens 120K immediately
Day 3	broken judge JSON	structured output + fallback `verdict=revise`
Day 4	metrics exist but no dashboard	even a CSV-to-Streamlit page or static HTML works
Day 5	scope refuses to shrink	delete every “Could have” before writing the report
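The Day 2 preventive action (max_turns 6, max_tokens 120K) can be enforced with a small budget object that the loop consults before every turn. This is a sketch under assumptions: `RunBudget`, `run_with_budget`, and the `step` callback that returns `(done, tokens_used)` are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunBudget:
    max_turns: int = 6
    max_tokens: int = 120_000
    turns_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one finished turn and the tokens it consumed."""
        self.turns_used += 1
        self.tokens_used += tokens

    def exhausted(self) -> Optional[str]:
        """Return the name of the limit that was hit, or None if budget remains."""
        if self.turns_used >= self.max_turns:
            return "max_turns"
        if self.tokens_used >= self.max_tokens:
            return "max_tokens"
        return None


def run_with_budget(step: Callable[[int], tuple[bool, int]], budget: RunBudget) -> str:
    """Call `step(turn)` until it reports done or the budget runs out."""
    while True:
        reason = budget.exhausted()
        if reason is not None:
            return f"fail: {reason}"  # fail clearly instead of retrying forever
        done, tokens = step(budget.turns_used)
        budget.charge(tokens)
        if done:
            return "pass"
```

Because the limit is checked before each turn, the final turn may overshoot the token budget by at most one turn's usage.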

Standardize handoff contracts

You do not need a complex multi-agent system on day one. You do need to separate the artifacts.

Stage	Input	Output
Lead	problem statement	task packet
Planner	task packet	implementation plan
Worker	plan + repo	patch / artifact
Reviewer	patch + criteria	review verdict
Gate	verdict + tests	pass / revise / fail
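The last row of the table, where the gate combines a review verdict with test results, can be sketched as a pure function. The precedence used here (a reviewer "fail" is final, failing tests override a reviewer "pass", and "revise" becomes "fail" once retries are spent) is an assumption for illustration; adjust it to your team's contract.

```python
from typing import Literal

Verdict = Literal["pass", "revise", "fail"]


def gate_decision(review_verdict: Verdict, tests_passed: bool, retries_left: int) -> Verdict:
    """Combine the reviewer verdict with deterministic test results."""
    if review_verdict == "fail":
        return "fail"
    if review_verdict == "pass" and tests_passed:
        return "pass"
    # Either the reviewer asked for a revision or the tests failed.
    return "revise" if retries_left > 0 else "fail"
```

For example, `gate_decision("pass", False, 1)` yields `"revise"`: the deterministic tests win over the reviewer's opinion.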

Handoff schema example

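# contracts.py: pydantic-enforced step contracts
# Note: the `float | None` annotation below requires Python 3.10+.
from typing import Literal

from pydantic import BaseModel


class LeadDirectiveV1(BaseModel):
    """Lead output: what to do and within which limits."""
    task_id: str
    objective: str
    risk_boundary: str
    deadline: str


class PlanV1(BaseModel):
    """Planner output: ordered steps and the files they touch."""
    task_id: str
    steps: list[str]
    files_to_modify: list[str]
    estimated_turns: int


class WorkerReportV1(BaseModel):
    """Worker output: where the patch is and how to test it."""
    task_id: str
    run_id: str
    patch_path: str
    tests_command: str
    notes: str


class ReviewResultV1(BaseModel):
    """Reviewer output: a three-way verdict with reasons."""
    run_id: str
    verdict: Literal["pass", "revise", "fail"]
    reasons: list[str]
    judge_overall: float | None = None  # optional; None when absent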

When every step refuses to forward an artifact that fails its pydantic schema, typos and formatting errors are rejected at the handoff where they occur instead of surfacing downstream; expect 90%+ of them to disappear by midweek.

Minimum runtime layout

capstone/teams/[team]/
├── README.md
├── design.md
├── tasks/
│   ├── task-001.yaml
│   └── task-002.yaml
├── runtime/
│   ├── runner.py
│   ├── events.py
│   ├── judge.py
│   └── gates.py
├── runs/
│   └── run-001.events.jsonl
├── artifacts/
│   └── run-001.patch
└── reports/
    └── progress-week14.md

One command, one entry point

If teammates remember different commands, reproducibility breaks. Standardize on `make run` (or `uv run`) as the single entry point.

make run TASK=tasks/task-001.yaml
make replay RUN=runs/run-001.events.jsonl
make test
make report

Runbook pattern catalog

Improvising every time something breaks repeats the same mistakes. Pre-define five standard patterns.

Pattern	Trigger	Immediate action	Follow-up
Tool boundary breach	worker edits an unscoped file	abort run, drop patch	enable strict tool policy
Test thrash	the same test fails twice	stop retry, log failure_reason	split the task into smaller pieces
Judge JSON corruption	invalid JSON rate > 10%	repair step + fallback verdict	turn on structured output
Cost spiral	per-run token budget exceeded	force-close the run	tighten max_turns / context trim
Demo flake	one of three reruns fails	shrink the live demo	keep a recorded fallback

Per-incident runbook table

Problem	Immediate action	Long-term fix	Recurrence prevention
Agent edits an unscoped file	abort the run, drop the patch	tighten task-packet scope and tool policy	path allowlist check inside the tool wrapper
Tests keep failing	stop retry, record failure_reason	split the task	put smoke tests inside the task packet
Judge breaks JSON	mark schema-validation failure	structured output or repair step	emphasize “JSON only” in the judge prompt
Token cost overruns budget	apply max_turns / max_tokens	reorganize prompt prefix, raise cache hit	cost-dashboard alarm threshold
Teammate file conflicts	declare file ownership	use worktrees or branches	record file owner in the task packet
Unstable demo	prepare a recording	shorten the happy path	add a demo-regression test

Pivot decision framework

Teams hesitate to pivot because “how much should we cut?” feels vague. A decision tree fixes that.

Pivot Decision Tree

Q1. Has the happy path passed end-to-end at least once?

✓ Yes → move to reliability hardening

✗ No → force a scope cut

Q2. Can the latest failure be reconstructed from the event log?

✓ Yes → continue

✗ No → harden the event store first

Q3. Is at least one automated gate working?

✓ Yes → evaluate “Should have”

✗ No → add a single smoke test

Q4. Is the demo path narrowed to one?

✓ Yes → proceed to reliability / cost / report

✗ No → collapse to one immediately
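The tree can also be written down as a function so the team answers four yes/no questions and gets one action back. This sketch assumes the questions are evaluated in order and the first "No" wins; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class MidpointState:
    happy_path_passed: bool    # Q1
    failure_replayable: bool   # Q2
    gate_automated: bool       # Q3
    single_demo_path: bool     # Q4


def next_action(state: MidpointState) -> str:
    """Walk Q1..Q4 in order and return the action for the first 'No'."""
    if not state.happy_path_passed:
        return "force a scope cut"
    if not state.failure_replayable:
        return "harden the event store first"
    if not state.gate_automated:
        return "add a single smoke test"
    if not state.single_demo_path:
        return "collapse to one demo path"
    return "proceed to reliability / cost / report"
```

For instance, a team whose happy path passes but whose failures cannot be replayed gets "harden the event store first", regardless of the later answers.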

Pivot triggers (summary)

If any of the following holds at the midpoint, cut scope.

The happy path has not yet passed end-to-end.
The latest failure cannot be reconstructed from the event log.
Not a single gate (judge, policy, test) is automated.
Model connectivity issues are eating the schedule.
Teammates are building toward different goals.

Recognize observability debt

Building the demo without operational evidence collapses Week 15 integration. If any of the patterns below is present by Week 14, you are already in debt.

Debt pattern	Meaning	Pay it down by
logs without run id	cannot identify which run produced the log	replace prints with logger calls that attach the run id
missing event flush	events lost on crash	per-line fsync or transactional log
metrics without alerts	breaches go unnoticed	add a single alert (cost or failure rate)
dashboard only on one screen	team cannot see the same view	export a shareable static URL
overrides only in chat	no audit	record `human.override` events
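Two of these debts, logs without a run id and missing event flush, can be paid down together with an append helper that stamps every event with its run id and syncs each line to disk. This is a sketch; `append_event` and the `ts` field are assumed names, and per-line `os.fsync` trades throughput for durability.

```python
import json
import os
import time


def append_event(log_path: str, run_id: str, event_type: str, **payload) -> dict:
    """Append one JSON line to the event log and force it to disk."""
    event = {"ts": time.time(), "run_id": run_id, "type": event_type, **payload}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write the line to disk
    return event
```

A human override recorded through the same helper, for example `append_event(path, run_id, "human.override", reason="...")`, lands in the audit trail instead of staying in chat.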

Mentoring check-in questions

During mentoring, instead of narrating progress, be ready to answer each of these in one sentence.

What is the current demo path?
What is the last successful run id?
What is the failure reason for the last failed run?
Which scope items can still be dropped?
Which numbers will appear on the final-presentation slide?

Midpoint report template

A progress report is evidence-driven, not narrative.

# Week 14 Progress Report

## Current demo path
- Task packet:
- Last successful run id:
- Failure recovery path:

## Runtime layers completed
| Layer | Status | Evidence |
|-------|--------|----------|
| Tool boundary | | |
| Event store | | |
| Policy gate | | |
| Orchestration | | |

## Metrics
| Run id | Latency | Tokens | Gate result | Notes |
|--------|---------|--------|-------------|-------|

## Quantitative summary
- success_rate (3 attempts): %
- p95 latency: s
- token cost / run: $
- override count:

## Risks and scope cuts
1.
2.
3.
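The quantitative summary in the template can be computed from per-run records rather than filled in by hand. The sketch below assumes each run is a dict with `latency_s`, `tokens`, and `gate_result` keys (hypothetical field names) and uses the nearest-rank method for p95; converting tokens to a dollar cost depends on your model pricing and is left out.

```python
import math


def summarize_runs(runs: list[dict]) -> dict:
    """Compute success rate, nearest-rank p95 latency, and mean tokens per run."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    latencies = sorted(r["latency_s"] for r in runs)
    # Nearest-rank percentile: the smallest value with at least 95% of runs at or below it.
    rank = math.ceil(0.95 * len(latencies))
    successes = sum(1 for r in runs if r["gate_result"] == "pass")
    return {
        "success_rate": successes / len(runs),
        "p95_latency_s": latencies[rank - 1],
        "tokens_per_run": sum(r["tokens"] for r in runs) / len(runs),
        "override_count": sum(r.get("overrides", 0) for r in runs),
    }
```

With only three attempts, the nearest-rank p95 is simply the slowest run, which is worth stating in the report so readers do not over-interpret the number.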

Week 14 Definition of Done

Item	Minimum pass
Closed loop	a single task packet executes end-to-end
Event log	`run.started`, key tool events, and `run.closed` are recorded
Gate	at least one deterministic gate runs automatically
Replay	the event log recomputes the final status
Report	the last successful run and the last failed run are explained
Dashboard	at least one of success rate / cost / latency is visualized

If you cannot pass these, Week 15 is dedicated to closing the loop, not adding features.
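The Replay row, where the event log recomputes the final status, can be checked with a few lines. This is a sketch assuming one JSON object per line with a `type` field and a `status` field on `run.closed`; the return values `"open"` and `"invalid"` are illustrative labels, not course-defined statuses.

```python
import json
from typing import Iterable


def replay_status(lines: Iterable[str]) -> str:
    """Recompute a run's final status from its event-log lines."""
    started = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        if event["type"] == "run.started":
            started = True
        elif event["type"] == "run.closed":
            if not started:
                return "invalid"  # closed without ever starting
            return event.get("status", "unknown")
    # No run.closed event: the loop is still open (or the log was truncated).
    return "open"
```

Because it accepts any iterable of strings, the same function works on an open file handle, for example `replay_status(open("runs/run-001.events.jsonl", encoding="utf-8"))`.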

Practicum

Day 1: build the runtime skeleton

Land the directories and the first run.started/run.closed event in one sitting.

Day 2: first agent loop

With one task packet, have the worker produce a patch and pass a deterministic gate.

Day 3: reviewer + judge integration

Apply handoff contracts and separate fail / revise / pass.

Day 4: telemetry + dashboard

Expose tokens / latency / pass rate as metrics and stand up a four-panel dashboard.

Day 5: submit the midpoint report

Write an evidence-driven progress report and decide on scope cuts.

Assignment

Capstone: midpoint progress report

Due: 2026-06-09 23:59

Submission path: capstone/teams/[team]/reports/progress-week14.md

Required:

Current demo path and the latest successful run id
Completed runtime-layer list
One .events.jsonl example with replay output
Implemented items among deterministic gate, judge gate, and human review
Top three risks and the scope-cut plan
Draft demo script for the final presentation
Quantitative summary table (success_rate, p95 latency, token cost)

Key Takeaways

The goal is a closed-loop MVP: a small task goes in, passes a gate, and the event log can replay.
Six components live together: task packet / worker loop / tool boundary / deterministic gate / event store / run report.
Handoff contracts enforced by schema: pydantic or JSON Schema absorbs typos and formatting drift.
Carry five runbook patterns: tool breach, test thrash, judge JSON corruption, cost spiral, demo flake.
Pivot through the decision tree: happy path → event log → gate → demo path; any “no” forces a scope cut.
Pay observability debt now: logs without run_id, missing flushes, and alertless metrics must be cleared in Week 14.
The midpoint report is evidence-driven: run ids, metric tables, and three scope cuts make Week 15 integration possible.