Concepts
Define what a “closed-loop MVP” is and is not — a six-component checklist — and pin the minimum state your team must reach by the end of Week 14.
Design
Specify handoff contracts among Lead / Planner / Worker / Reviewer / Operator with explicit input/output schemas, retry budgets, and escalation conditions.
Implementation
Build a runtime skeleton that runs with a single make run (or make replay) and complete the first run.started → run.closed cycle.
Operations
Use the five runbook patterns and the pivot decision tree to author a midpoint report with evidence-backed risks and scope cuts.
This week’s goal is not a polished product. It is a closed-loop MVP: one task packet goes in, the agent executes, a gate decides, and the event log enables replay.
All six must be alive simultaneously. A missing component fails the Week 14 Definition of Done.
| Failure mode | Signal | Response |
|---|---|---|
| Open loop (gate missing) | run ends without run.closed | add at least one deterministic gate immediately |
| Replay impossible | partial event log | check flush policy, force run.started/run.closed |
| Tool boundary ignored | worker edits unscoped files | validate tool policy before invocation |
| Demo dependency | demo only runs by hand | unify behind make run TASK=... |
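
The first two rows of the table share one fix: emit run.started up front, let a deterministic gate decide the status, and guarantee run.closed in a finally block. The following is a minimal sketch, not the course's runner. The helper names (run_task, deterministic_gate), the in-memory event list, the run.error event, and every event field other than the two event names are assumptions made for illustration.

```python
import time
import uuid
from typing import Callable


def deterministic_gate(tests_passed: bool) -> str:
    """Hypothetical gate: a pure function of the test result, no model call."""
    return "pass" if tests_passed else "revise"


def run_task(task_id: str, worker: Callable[[], bool], events: list) -> str:
    """Run one task packet and always close the loop with a run.closed event."""
    run_id = f"run-{uuid.uuid4().hex[:8]}"
    events.append({"type": "run.started", "run_id": run_id,
                   "task_id": task_id, "ts": time.time()})
    status = "fail"  # default if the worker crashes before the gate runs
    try:
        tests_passed = worker()  # assumed: worker returns whether its tests passed
        status = deterministic_gate(tests_passed)
    except Exception as exc:
        # run.error is an illustrative event name, not one defined by the course
        events.append({"type": "run.error", "run_id": run_id, "error": repr(exc)})
    finally:
        events.append({"type": "run.closed", "run_id": run_id,
                       "status": status, "ts": time.time()})
    return status
```

Because run.closed is written in the finally block, a crashing worker still produces a closed run with status fail, so the log stays replayable.
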
Day 1: runtime skeleton
- tasks/, runs/, artifacts/, reports/
- run.started / run.closed events

Day 2: first agent loop
Day 3: review and retry
Day 4: telemetry and dashboard
Day 5: midpoint report
| Day | Most common failure | Preventive action |
|---|---|---|
| Day 1 | directories created, first event missing | add a run.started/run.closed smoke test in the first commit |
| Day 2 | infinite retry | apply max_turns 6, max_tokens 120K immediately |
| Day 3 | broken judge JSON | structured output + fallback verdict=revise |
| Day 4 | metrics exist but no dashboard | even a CSV-to-Streamlit page or static HTML works |
| Day 5 | scope refuses to shrink | delete every “Could have” before writing the report |
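
The Day 3 row (broken judge JSON, fallback verdict=revise) can be sketched as a small parser that never raises. This is one possible shape, assuming the judge returns a JSON object with verdict and reasons keys as in the review contract; the function name parse_verdict and the fallback reason text are hypothetical.

```python
import json

# A tuple (not a set) so that unhashable values such as lists compare safely.
ALLOWED_VERDICTS = ("pass", "revise", "fail")


def parse_verdict(raw: str) -> dict:
    """Parse judge output; fall back to verdict=revise on any malformed payload."""
    fallback = {"verdict": "revise", "reasons": ["judge output failed validation"]}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("verdict") not in ALLOWED_VERDICTS:
        return fallback
    reasons = data.get("reasons", [])
    if not isinstance(reasons, list):
        return fallback
    return {"verdict": data["verdict"], "reasons": [str(r) for r in reasons]}
```

Falling back to revise rather than fail keeps a formatting glitch in the judge from being recorded as a defect in the worker's patch.
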
You do not need a complex multi-agent system on day one. You do need to separate the artifacts.
| Stage | Input | Output |
|---|---|---|
| Lead | problem statement | task packet |
| Planner | task packet | implementation plan |
| Worker | plan + repo | patch / artifact |
| Reviewer | patch + criteria | review verdict |
| Gate | verdict + tests | pass / revise / fail |
```python
# contracts.py: pydantic-enforced step contracts
from pydantic import BaseModel
from typing import Literal


class LeadDirectiveV1(BaseModel):
    task_id: str
    objective: str
    risk_boundary: str
    deadline: str


class PlanV1(BaseModel):
    task_id: str
    steps: list[str]
    files_to_modify: list[str]
    estimated_turns: int


class WorkerReportV1(BaseModel):
    task_id: str
    run_id: str
    patch_path: str
    tests_command: str
    notes: str


class ReviewResultV1(BaseModel):
    run_id: str
    verdict: Literal["pass", "revise", "fail"]
    reasons: list[str]
    judge_overall: float | None = None
```

When every step refuses to forward an artifact that fails its pydantic schema, 90%+ of typo and formatting errors disappear by midweek.
```
capstone/teams/[team]/
├── README.md
├── design.md
├── tasks/
│   ├── task-001.yaml
│   └── task-002.yaml
├── runtime/
│   ├── runner.py
│   ├── events.py
│   ├── judge.py
│   └── gates.py
├── runs/
│   └── run-001.events.jsonl
├── artifacts/
│   └── run-001.patch
└── reports/
    └── progress-week14.md
```

If teammates remember different commands, reproducibility breaks. Make make run (or uv run) the single entry point.
```
make run TASK=tasks/task-001.yaml
make replay RUN=runs/run-001.events.jsonl
make test
make report
```

Improvising every time something breaks repeats the same mistakes. Pre-define five standard patterns.
| Pattern | Trigger | Immediate action | Follow-up |
|---|---|---|---|
| Tool boundary breach | worker edits an unscoped file | abort run, drop patch | enable strict tool policy |
| Test thrash | the same test fails twice | stop retry, log failure_reason | split the task into smaller pieces |
| Judge JSON corruption | invalid JSON rate > 10% | repair step + fallback verdict | turn on structured output |
| Cost spiral | per-run token budget exceeded | force-close the run | tighten max_turns / context trim |
| Demo flake | one of three reruns fails | shrink the live demo | keep a recorded fallback |
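
The test-thrash and cost-spiral triggers above can be enforced by one small budget object that the runner consults after every worker turn. This is a sketch under stated assumptions: the class name RetryBudget, the method record_turn, and the failure_reason strings are hypothetical, while the defaults of 6 turns and 120K tokens come from the Day 2 preventive action.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RetryBudget:
    max_turns: int = 6          # Day 2 preventive action
    max_tokens: int = 120_000   # Day 2 preventive action
    turns: int = 0
    tokens: int = 0
    failures: Dict[str, int] = field(default_factory=dict)

    def record_turn(self, tokens_used: int,
                    failed_test: Optional[str] = None) -> Optional[str]:
        """Record one worker turn; return a failure_reason if the run must stop."""
        self.turns += 1
        self.tokens += tokens_used
        if failed_test is not None:
            self.failures[failed_test] = self.failures.get(failed_test, 0) + 1
            if self.failures[failed_test] >= 2:
                return f"test_thrash:{failed_test}"  # same test failed twice
        if self.tokens > self.max_tokens:
            return "cost_spiral:token_budget_exceeded"
        if self.turns >= self.max_turns:
            return "max_turns_reached"  # budget used up, no further turns
        return None
```

A runner would call record_turn after each turn and, on any non-None result, stop retrying, log the returned string as failure_reason, and force-close the run.
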
| Problem | Immediate action | Long-term fix | Recurrence prevention |
|---|---|---|---|
| Agent edits an unscoped file | abort the run, drop the patch | tighten task-packet scope and tool policy | path allowlist check inside the tool wrapper |
| Tests keep failing | stop retry, record failure_reason | split the task | put smoke tests inside the task packet |
| Judge breaks JSON | mark schema-validation failure | structured output or repair step | stress “JSON only” in the judge prompt |
| Token cost overruns budget | apply max_turns / max_tokens | reorganize prompt prefix, raise cache hit | cost-dashboard alarm threshold |
| Teammate file conflicts | declare file ownership | use worktrees or branches | record file owner in the task packet |
| Unstable demo | prepare a recording | shorten the happy path | add a demo-regression test |
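
The "path allowlist check inside the tool wrapper" from the first row might look like the following. It is a sketch only: is_in_scope and guarded_write are hypothetical names, the allowlist is assumed to hold repo-relative POSIX paths taken from the task packet, and the dictionary stands in for a real file write.

```python
from pathlib import PurePosixPath


def is_in_scope(path: str, allowed: list) -> bool:
    """Return True only if `path` equals, or sits under, an allowed entry."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute() or ".." in candidate.parts:
        return False  # reject absolute paths and parent traversal outright
    for entry in allowed:
        root = PurePosixPath(entry)
        if candidate == root or root in candidate.parents:
            return True
    return False


def guarded_write(path: str, content: str, allowed: list, writes: dict) -> None:
    """Hypothetical tool wrapper: validate scope before the write happens."""
    if not is_in_scope(path, allowed):
        raise PermissionError(f"tool boundary breach: {path}")
    writes[path] = content  # stand-in for the real file write
```

Checking before the invocation means an out-of-scope edit never reaches the working tree, so there is no patch to drop afterwards.
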
Teams hesitate to pivot because “how much should we cut?” feels vague. A decision tree fixes that.
If any of the following holds at the midpoint, cut scope.
Building the demo without operational evidence collapses Week 15 integration. If any of the following patterns is still present in Week 14, you are already in debt.
| Debt pattern | Meaning | Pay it down by |
|---|---|---|
| logs without run id | cannot identify which run produced the log | replace prints with logger calls |
| missing event flush | events lost on crash | per-line fsync or transactional log |
| metrics without alerts | breaches go unnoticed | add a single alert (cost or failure rate) |
| dashboard only on one screen | team cannot see the same view | export a shareable static URL |
| overrides only in chat | no audit | record human.override events |
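
Three of these debts (logs without run id, missing event flush, overrides only in chat) can be paid down with a single append helper. The sketch below takes the per-line fsync option; the function name append_event and the ts field are assumptions, and extra fields are passed through as given.

```python
import json
import os
import time


def append_event(log_path: str, run_id: str, event_type: str, **fields) -> dict:
    """Append one JSON line and fsync it so a crash cannot lose a written event."""
    event = {"ts": time.time(), "run_id": run_id, "type": event_type, **fields}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to persist the data to disk
    return event
```

For example, append_event("runs/run-001.events.jsonl", "run-001", "human.override", reason="manual pass") records an override with its run id in the same log that replay reads. Per-line fsync trades throughput for durability, which is usually acceptable at the event rates of a single agent run.
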
In mentoring time, instead of narrating progress, be ready to answer these in one sentence.
A progress report is evidence-driven, not narrative.
```markdown
# Week 14 Progress Report

## Current demo path
- Task packet:
- Last successful run id:
- Failure recovery path:

## Runtime layers completed
| Layer | Status | Evidence |
|-------|--------|----------|
| Tool boundary | | |
| Event store | | |
| Policy gate | | |
| Orchestration | | |

## Metrics
| Run id | Latency | Tokens | Gate result | Notes |
|--------|---------|--------|-------------|-------|

## Quantitative summary
- success_rate (3 attempts): %
- p95 latency: s
- token cost / run: $
- override count:

## Risks and scope cuts
1.
2.
3.
```

| Item | Minimum pass |
|---|---|
| Closed loop | a single task packet executes end-to-end |
| Event log | run.started, key tool events, and run.closed are recorded |
| Gate | at least one deterministic gate runs automatically |
| Replay | the event log recomputes the final status |
| Report | the last successful run and the last failed run are explained |
| Dashboard | at least one of success rate / cost / latency is visualized |
If you cannot pass these, Week 15 is dedicated to closing the loop, not adding features.
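
The Replay row ("the event log recomputes the final status") can be checked with a function that reads nothing but the log. This is a minimal sketch: the function name replay_status and the status field on run.closed are assumptions, since the course text fixes only the event names.

```python
import json


def replay_status(lines: list) -> str:
    """Recompute a run's final status from its event log lines alone."""
    started = False
    status = "incomplete"  # no run.closed seen: the loop never closed
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the .jsonl file
        event = json.loads(line)
        if event["type"] == "run.started":
            started = True
        elif event["type"] == "run.closed" and started:
            status = event.get("status", "unknown")
    return status
```

A make replay target could pass the lines of runs/run-001.events.jsonl to this function and compare the result with the status the live run reported; a mismatch or an "incomplete" result means the log is not yet sufficient for replay.
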
Day 1 — build the runtime skeleton
Land directories and the first run.started/run.closed event in one sitting.
Day 2 — first agent loop
With one task packet, have the worker produce a patch and pass a deterministic gate.
Day 3 — reviewer + judge integration
Apply handoff contracts and separate fail / revise / pass.
Day 4 — telemetry + dashboard
Expose tokens / latency / pass rate as metrics and stand up a four-panel dashboard.
Day 5 — submit the midpoint report
Write an evidence-driven progress report and decide on scope cuts.
Due: 2026-06-09 23:59
Submission path: capstone/teams/[team]/reports/progress-week14.md
Required:
.events.jsonl example with replay output