
Week 13: Capstone Project Design

Phase 5 · Week 13 · Advanced · Lecture: 2026-05-26

Concepts

Explain why a Ralphthon is unlike a hackathon (closed-loop system framing) and apply the five conditions of a good capstone topic to your team’s candidate ideas.

Design

Map the Agent OS Runtime L1-L7 core onto your capstone, and add L8 workflow-plane cycle / phase / policy definitions only if your team needs multi-phase orchestration.

Implementation

Distribute responsibilities across Lead / Planner / Worker / Reviewer / Operator, and write three ADRs and five risk-register items.

Operations

Reverse-plan Weeks 14-16 (build / integrate / present), pin the demo path, and explicitly declare a “Won’t have” list.


Ralphthon is a team capstone that uses the Ralph-loop methodology to solve a real software problem. Unlike a typical hackathon, the deliverable is not “an app” — it is a repeatable agent system.

| Axis | Generic hackathon | Ralphthon |
| --- | --- | --- |
| Deliverable | one demo app | a repeatable agent system |
| Success criterion | demo works | the same task packet executes consistently three times |
| AI usage | freeform (code copilot) | task packet → harness → gate |
| Evaluation evidence | demo video | event log, replay snapshot, gate results |
| Team composition | freeform | Lead / Planner / Worker / Reviewer / Operator |
| Failure handling | improvised | runbook + retry budget + escalation |
| Post-event operations | usually ends | replay and ADRs make it reproducible |

| Condition | Description | Bad example |
| --- | --- | --- |
| Repeatable | the same kind of task happens many times | one-off demo |
| Verifiable | tests, rubrics, judges, and human review apply | “write something good” |
| Bounded | files, tools, and permissions can be limited | free rein over the internet |
| Recoverable | wrong outputs can be rolled back or retried | direct DB mutations |
| Role-separable | planner / worker / reviewer / operator split | one person owns everything |
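The success criterion in the comparison table ("the same task packet executes consistently three times") can be checked mechanically. The sketch below is one possible reading of it, assuming "consistent" means every run passes its gate. The names `check_repeatability`, `fake_runner`, and the `gate_passed` field are hypothetical and not part of any course-provided harness.

```python
from typing import Callable


def check_repeatability(
    run_packet: Callable[[dict], dict], packet: dict, runs: int = 3
) -> dict:
    """Run the same packet `runs` times and report whether every gate passed."""
    outcomes = []
    for _ in range(runs):
        result = run_packet(packet)  # one full harness execution
        outcomes.append(bool(result["gate_passed"]))
    return {"runs": runs, "gate_outcomes": outcomes, "consistent": all(outcomes)}


def fake_runner(packet: dict) -> dict:
    # Stand-in for a real harness call (hypothetical); its gate always passes.
    return {"task_id": packet["task_id"], "gate_passed": True}


report = check_repeatability(fake_runner, {"task_id": "capstone-017"})
print(report["consistent"])
```

A team could swap `fake_runner` for its own harness entry point and store `report` next to the event log as evaluation evidence.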

Capstone through the Agent OS Runtime lens


The Week 13 design document must specify at least five of the L1-L7 core layers below. Add the L8 workflow plane only if your team is explicitly designing a multi-phase cycle.

| Layer | What to define for the capstone |
| --- | --- |
| L1 MCP Tool Protocol | allowed / denied tools, tool input/output schemas, tool events |
| L2 Provider Completion | model profiles, cost/latency budgets, fallback rules |
| L3 Plan-Work-Review Collaboration | Lead, Planner, Worker, Reviewer state transitions |
| L4 Event Store | `.events.jsonl`, replay snapshot |
| L5 Markdown-SSOT Skill Runtime | role instructions, rubrics, allowed tool scope |
| L6 Hook Lifecycle | approvals, secret scans, loop stops, escalation hooks |
| L7 Schema IPC Registry | task packet, worker report, review verdict, run report schemas |
| Optional L8 Workflow Plane | cycle / phase / policy / persona / artifact Markdown SSOT |

See Agent OS 7+1-Layer Architecture (L1–L7 core + L8 workflow plane) for layer-by-layer detail. Teams that want to design multi-phase cycles (e.g. brainstorm → fix → ship) should also consult the L8 Workflow Plane five-axis model (cycle / phase / policy / persona / artifact). L8 is optional — the capstone rubric is fully satisfied by the L1–L7 core alone; L8 is for teams that want to lift cycle sequencing, policies, and personas into Markdown SSOT.

Agent OS Runtime: L1-L7 Core + Optional L8 (stack, top to bottom)

Optional L8 Workflow Plane: cycle · phase · policy · persona · artifact
L7 Schema IPC: task packet · report · verdict schemas
L6 Hook Lifecycle: approval · secret scan · loop stop
L5 Skill Runtime: Markdown instruction · allowed tools
L4 Event Store: .events.jsonl · replay snapshot
L3 Collaboration: Lead · Planner · Worker · Reviewer
L2 Provider Completion: model profile · fallback · budget
L1 MCP Tool Protocol: allowed tools · input/output schema

Every core layer carries an artifact. You do not have to build all of L1-L7 in Week 13, but at least five core layers must be concrete and the rest must have Week 14 owners. Teams that choose L8 also keep cycle/phase/policy Markdown and workflow.* event evidence as separate deliverables.
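The rule above (at least five concrete core layers, and a Week 14 owner for every remaining one) is easy to check before submission. The following is a minimal sketch under assumed conventions: the `artifact` and `week14_owner` keys and the example file names are hypothetical placeholders, not a required format.

```python
CORE_LAYERS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7"]


def check_layer_coverage(layers: dict) -> list:
    """Return a list of problems; an empty list means the Week 13 rule is met."""
    problems = []
    concrete = [name for name in CORE_LAYERS if layers.get(name, {}).get("artifact")]
    if len(concrete) < 5:
        problems.append(
            f"only {len(concrete)} core layers are concrete; need at least 5"
        )
    for name in CORE_LAYERS:
        entry = layers.get(name, {})
        # A layer without an artifact must at least name who builds it in Week 14.
        if not entry.get("artifact") and not entry.get("week14_owner"):
            problems.append(f"{name} has neither an artifact nor a Week 14 owner")
    return problems


# Hypothetical team mapping: five concrete layers, one deferred, one forgotten.
team_layers = {
    "L1": {"artifact": "tools/policy.json"},
    "L2": {"artifact": "models/profiles.json"},
    "L3": {"artifact": "docs/state-machine.md"},
    "L4": {"artifact": ".events.jsonl"},
    "L5": {"artifact": "skills/reviewer.md"},
    "L6": {"week14_owner": "Harness Engineer"},
    "L7": {},
}
problems = check_layer_coverage(team_layers)
print(problems)
```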

Lead / Architect

Owns problem framing, scope, and final design. Prevents scope creep and locks acceptance criteria.

Harness Engineer

Builds task packets, the event store, policy gates, and the retry/rollback logic.

Agent Engineer

Builds role prompts, tool policies, model routing, and CLI integration.

QA / Operator

Owns tests, judge rubrics, telemetry dashboards, and demo stability.

Capstone Team Topology

Lead / Architect: issues directives · controls scope
▼ directive
Planner Agent: spec / plan generation
▼ plan
Worker Agent: patch / artifact production
▼ patch
Reviewer Agent: review verdict
▼ verdict
Operator / QA: gate integration · dashboards · demo stability
▲ gate result back to Lead
Event Store: tool · review · metric events all converge → replay snapshot

The point is that the Lead does not make every decision. The Operator reads metrics and the dashboard for live signal.
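To make the Event Store node in the topology concrete, here is a minimal sketch of an append-only `.events.jsonl` log and a replay that folds it into a snapshot. The event types `tool`, `review`, and `metric` come from the topology; every field name and the snapshot shape are assumptions made for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line (the log is never rewritten)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")


def replay_snapshot(log_path: Path) -> dict:
    """Rebuild a summary of the run purely from the recorded events."""
    counts = Counter()
    last_verdict = None
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event["type"]] += 1
            if event["type"] == "review":
                last_verdict = event.get("verdict")
    return {"event_counts": dict(counts), "last_verdict": last_verdict}


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / ".events.jsonl"
    append_event(log, {"type": "tool", "name": "run_tests", "ok": True})
    append_event(log, {"type": "review", "verdict": "approve"})
    append_event(log, {"type": "metric", "name": "turns", "value": 4})
    snapshot = replay_snapshot(log)

print(snapshot)
```

Because the snapshot is derived only from the log, the Operator and the Lead read the same evidence rather than separate status reports.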

  1. Autonomous code review — Read GitHub PRs and emit risk-ranked review comments.
  2. Test generator — Find missing branches in existing code and generate pytest cases.
  3. Documentation automation — Refresh README/API docs after a change and detect drift.
  4. Bug triage + patch agent — Convert issues into task packets and propose small patches.
  5. Performance optimization agent — Find benchmark regressions and propose candidate fixes.
  6. Course-material validator — Inspect MDX links, frontmatter, and outdated tech tables.
  7. MCP security audit agent — Find risky patterns in MCP server manifests and tool descriptions.

A good capstone does not announce a finished product on day one. Split must-haves and won’t-haves explicitly.

| Bucket | Examples | Decision rule |
| --- | --- | --- |
| Must have | task packet, single worker loop, deterministic gate, event log | without it, there is no closed loop |
| Should have | reviewer / judge, replay snapshot, simple dashboard | strengthens the final evidence |
| Could have | web UI, multi-model router, agent marketplace | only if time remains |
| Won’t have | full autonomous deploy, external account control, complex permission delegation | risk and cost too high |

The Week 13 design document must contain a Won't have list. A plan without scope cuts is a wish list, not a plan.

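A free-form natural-language request is too ambiguous to hand to an agent: it fixes neither the scope, nor the acceptance criteria, nor the budget. In the capstone, every task is sent as a structured packet.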

```yaml
task_id: capstone-017
objective: "Add retry handling to the GitHub issue importer"
scope:
  files:
    - src/importer/github.py
    - tests/test_github_importer.py
allowed_tools:
  - read_file
  - edit_file
  - run_tests
acceptance:
  - "pytest tests/test_github_importer.py passes"
  - "No network call in unit tests"
  - "Retry count is configurable"
budget:
  max_turns: 6
  max_tokens: 120000
escalation:
  ask_human_if:
    - "API contract must change"
    - "Secret or credential is required"
```
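A packet only bounds an agent if the harness checks every tool call against it. The sketch below shows one way such a check might look, using the `scope.files` and `allowed_tools` fields from the packet above. The function `is_action_allowed` is hypothetical, and the packet is written as a Python dict to avoid depending on a YAML parser.

```python
import posixpath
from typing import Optional

PACKET = {
    "task_id": "capstone-017",
    "scope": {
        "files": ["src/importer/github.py", "tests/test_github_importer.py"]
    },
    "allowed_tools": ["read_file", "edit_file", "run_tests"],
}


def is_action_allowed(packet: dict, tool: str, path: Optional[str] = None) -> bool:
    """Allow a tool call only if the tool and (when given) the file are in scope."""
    if tool not in packet.get("allowed_tools", []):
        return False
    if path is None:
        return True  # tools such as run_tests take no file argument here
    # Normalise both sides so "a/../b" style paths cannot dodge the allow-list.
    allowed = {posixpath.normpath(p) for p in packet.get("scope", {}).get("files", [])}
    return posixpath.normpath(path) in allowed


print(is_action_allowed(PACKET, "edit_file", "src/importer/github.py"))
print(is_action_allowed(PACKET, "edit_file", "src/config.py"))
```

Denied calls are a natural candidate for an event in the log, since they are direct evidence that the boundary holds.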

Task packet rubric — good / borderline / anti-pattern

| Field | Good | Borderline | Anti-pattern |
| --- | --- | --- | --- |
| Objective | verb + measurable outcome | vague verb | “make it better” |
| Scope | files / dirs explicit | “related code” | entire repo |
| Allowed tools | limited to read/edit/run_tests | broad | “any” |
| Acceptance | 3-5 pass/fail items | one item | none |
| Budget | turns and tokens specified | one of the two | unlimited |
| Escalation | condition + owner | condition only | none |

```json
{
  "type": "object",
  "required": ["task_id", "objective", "scope", "acceptance", "budget"],
  "properties": {
    "task_id": {"type": "string", "pattern": "^[a-z0-9-]{4,}$"},
    "objective": {"type": "string", "minLength": 10},
    "scope": {
      "type": "object",
      "properties": {
        "files": {"type": "array", "items": {"type": "string"}}
      }
    },
    "allowed_tools": {"type": "array", "items": {"type": "string"}},
    "acceptance": {
      "type": "array", "minItems": 1, "items": {"type": "string"}
    },
    "budget": {
      "type": "object",
      "required": ["max_turns", "max_tokens"],
      "properties": {
        "max_turns": {"type": "integer", "maximum": 15},
        "max_tokens": {"type": "integer"}
      }
    },
    "escalation": {"type": "object"}
  }
}
```

Commit up front to rejecting any task packet that fails schema validation. That discipline pays large dividends once the capstone build starts.
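As an illustration of what rejecting a packet looks like in code, the sketch below hand-checks the same constraints the schema states (required fields, the `task_id` pattern, `minLength`, `minItems`, and the `max_turns` ceiling) using only the standard library. It is a simplified stand-in, not a full JSON Schema implementation; a team may prefer to run the schema through a JSON Schema library instead. The name `validate_packet` is hypothetical.

```python
import re

REQUIRED_FIELDS = ("task_id", "objective", "scope", "acceptance", "budget")
TASK_ID = re.compile(r"[a-z0-9-]{4,}")


def _is_int(value) -> bool:
    # JSON Schema "integer" excludes booleans, unlike Python's isinstance check.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_packet(packet: dict) -> list:
    """Return a list of violations; an empty list means the packet is accepted."""
    errors = [f"missing required field: {n}" for n in REQUIRED_FIELDS if n not in packet]
    if "task_id" in packet:
        tid = packet["task_id"]
        if not (isinstance(tid, str) and TASK_ID.fullmatch(tid)):
            errors.append("task_id must match ^[a-z0-9-]{4,}$")
    if "objective" in packet:
        obj = packet["objective"]
        if not (isinstance(obj, str) and len(obj) >= 10):
            errors.append("objective must be a string of at least 10 characters")
    if "scope" in packet and not isinstance(packet["scope"], dict):
        errors.append("scope must be an object")
    if "acceptance" in packet:
        acc = packet["acceptance"]
        if not (isinstance(acc, list) and acc and all(isinstance(a, str) for a in acc)):
            errors.append("acceptance must be a non-empty list of strings")
    if "budget" in packet:
        budget = packet["budget"]
        if not isinstance(budget, dict):
            errors.append("budget must be an object")
        else:
            for key in ("max_turns", "max_tokens"):
                if not _is_int(budget.get(key)):
                    errors.append(f"budget.{key} must be an integer")
            if _is_int(budget.get("max_turns")) and budget["max_turns"] > 15:
                errors.append("budget.max_turns must be at most 15")
    return errors


good = {
    "task_id": "capstone-017",
    "objective": "Add retry handling to the GitHub issue importer",
    "scope": {"files": ["src/importer/github.py"]},
    "acceptance": ["pytest tests/test_github_importer.py passes"],
    "budget": {"max_turns": 6, "max_tokens": 120000},
}
bad = {
    "task_id": "X1",
    "objective": "fix it",
    "scope": {},
    "acceptance": [],
    "budget": {"max_turns": 40, "max_tokens": 1000},
}
print(validate_packet(good))
print(validate_packet(bad))
```

A worker that refuses to start while the returned list is non-empty enforces the rule at the boundary instead of relying on reviewers to notice a malformed packet.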

Do not list risks abstractly. Use a STRIDE-inspired table.

| Threat category | Capstone application | Example |
| --- | --- | --- |
| Spoofing | model identity | another model answers under the same alias |
| Tampering | task / artifact tampering | event log post-processed, results forgeable |
| Repudiation | accountability dodge | overrides happen anonymously |
| Information disclosure | sensitive data leak | code or secrets sent to an external API |
| Denial of service | resource exhaustion | infinite retry, queue overflow |
| Elevation of privilege | privilege overreach | edits files outside the allowed scope |

For each row, state in one line how the current plan blocks the threat. If there is no block, that is a Week 14 to-build item.
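The "one line per row, otherwise a Week 14 to-build item" rule can be kept honest with a small helper. This is only a sketch: the function name `week14_to_build` and the two example mitigations are hypothetical, chosen to match fields that already appear in the task packet.

```python
STRIDE = [
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
]


def week14_to_build(blocks: dict) -> list:
    """Return every STRIDE category that has no stated block in the current plan."""
    return [threat for threat in STRIDE if not blocks.get(threat, "").strip()]


# Hypothetical team answers: only two threats are blocked so far.
current_blocks = {
    "Denial of service": "max_turns and max_tokens budget in every task packet",
    "Elevation of privilege": "scope.files allow-list checked before each edit",
    "Tampering": "",
}
todo = week14_to_build(current_blocks)
print(todo)
```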

Each team writes ADRs for major decisions. At least three are required.

```markdown
# ADR-001: Use vLLM OpenAI-compatible API
## Context
We need to compare local and commercial models through the same harness.
## Decision
Expose local models through vLLM's OpenAI-compatible API and route calls through one client wrapper.
## Consequences
- Good: model provider can be swapped without changing agent code.
- Bad: model-specific tool-call parsers still require configuration.
```
| Aspect | Good ADR | Bad ADR |
| --- | --- | --- |
| Context | one or two sentences on “why now” | a corporate vision |
| Decision | one sentence + the alternative we rejected | “Use X” |
| Consequences | both Good and Bad | only Good |
| Owner | one named person | anonymous |
| Date | the decision date | empty |
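Part of the Good ADR column can be linted automatically. The sketch below checks only what the ADR layout used in this section makes machine-readable: the three `##` sections and the presence of both a `Good:` and a `Bad:` consequence. Owner, date, and the rejected alternative are left to human review because the example format does not fix where they appear. The function `lint_adr` is hypothetical.

```python
import re

REQUIRED_SECTIONS = ("Context", "Decision", "Consequences")

ADR_TEXT = """\
# ADR-001: Use vLLM OpenAI-compatible API
## Context
We need to compare local and commercial models through the same harness.
## Decision
Expose local models through vLLM's OpenAI-compatible API and route calls through one client wrapper.
## Consequences
- Good: model provider can be swapped without changing agent code.
- Bad: model-specific tool-call parsers still require configuration.
"""


def lint_adr(text: str) -> list:
    """Return a list of problems found in one ADR; empty means it passes."""
    sections = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^##\s+(.+?)\s*$", line)  # level-2 headings only
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    problems = []
    for name in REQUIRED_SECTIONS:
        if not "\n".join(sections.get(name, [])).strip():
            problems.append(f"missing or empty section: {name}")
    consequences = "\n".join(sections.get("Consequences", []))
    for label in ("Good:", "Bad:"):
        if label not in consequences:
            problems.append(f"Consequences lists no '{label}' item")
    return problems


print(lint_adr(ADR_TEXT))
```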
```markdown
# Team Name Capstone Design
## 1. Problem
## 2. Users and Risk Boundary
## 3. Agent Architecture
## 4. Runtime Layers
## 5. Task Packet Schema
## 6. Evaluation Gates
## 7. Telemetry and Replay
## 8. Implementation Plan
## 9. Demo Scenario
```

Risks belong in a structured register: trigger, owner, response.

| Risk | Trigger | Owner | Response |
| --- | --- | --- | --- |
| Model output repeatedly violates JSON schema | invalid JSON rate > 20% | Agent Engineer | structured output or repair step |
| No tests, so quality cannot be evaluated | deterministic gate is empty | QA / Operator | minimum smoke test + fixture |
| Scope creep | happy path not passing by end of Week 14 | Lead | drop all “Could have” items |
| Cost overrun | run-level token budget exceeded | Harness Engineer | apply max_turns, context trim, cache prefix |
| Unstable demo | one of three repeated runs fails | QA / Operator | shrink live demo, prepare a recording |

This table is used as input for the Week 14 midpoint report and the Week 15 release gate.
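A register is most useful when its triggers are evaluated against real metrics rather than read aloud in meetings. The sketch below encodes three rows of the table as predicates. The metric names (`invalid_json_rate`, `failed_runs_of_three`, `run_tokens`, `run_token_budget`) are assumptions for illustration; a team would substitute whatever its telemetry actually records.

```python
RISK_REGISTER = [
    {
        "risk": "Model output repeatedly violates JSON schema",
        "owner": "Agent Engineer",
        "response": "structured output or repair step",
        "trigger": lambda m: m["invalid_json_rate"] > 0.20,
    },
    {
        "risk": "Cost overrun",
        "owner": "Harness Engineer",
        "response": "apply max_turns, context trim, cache prefix",
        "trigger": lambda m: m["run_tokens"] > m["run_token_budget"],
    },
    {
        "risk": "Unstable demo",
        "owner": "QA / Operator",
        "response": "shrink live demo, prepare a recording",
        "trigger": lambda m: m["failed_runs_of_three"] >= 1,
    },
]


def fired_risks(metrics: dict) -> list:
    """Return (risk, owner, response) for every entry whose trigger is met."""
    return [
        (entry["risk"], entry["owner"], entry["response"])
        for entry in RISK_REGISTER
        if entry["trigger"](metrics)
    ]


# Hypothetical weekly metrics: JSON failures are high, cost and demo are fine.
weekly_metrics = {
    "invalid_json_rate": 0.25,
    "run_tokens": 90_000,
    "run_token_budget": 120_000,
    "failed_runs_of_three": 0,
}
alerts = fired_risks(weekly_metrics)
print(alerts)
```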

  1. Define the team’s problem

    Pick a real, repeatable task and define the user in one sentence.

  2. Write success criteria

    Replace “good” with five pass/fail criteria.

  3. Decompose agent roles

    Pick only the necessary roles among Lead / Planner / Worker / Reviewer / Operator.

  4. Map runtime layers

    For each layer, decide which file, script, log, or policy belongs there.

  5. Author the task packet schema

    Attach one JSON Schema and three examples (good / borderline / anti-pattern).

  6. Write three ADRs

    One for model selection, one for harness library, one for evaluation strategy.

  7. Pin the demo path

    Choose one happy path and one failure-recovery path to show in the final presentation.

Due: 2026-06-02 23:59

Submission path: capstone/teams/[team]/design.md

Required:

  1. Problem statement and user / risk boundary
  2. Agent architecture diagram
  3. Mapping for at least five Agent OS Runtime layers
  4. Task packet schema and three examples (good / borderline / anti-pattern)
  5. Deterministic gate + LLM Judge + human review criteria
  6. Risk register with five entries (trigger / owner / response)
  7. Three or more ADRs
  8. Implementation plan for Weeks 14-16 and a draft demo script

  1. A Ralphthon is not “build an app”: it is designing, operating, and proving a closed-loop system.
  2. Five conditions of a good topic: repeatable / verifiable / bounded / recoverable / role-separable.
  3. You do not have to build all L1-L7 core layers: declare at least five core layers in Week 13 and assign owners for the rest. L8 is optional.
  4. “Won’t have” is the real plan: a plan without scope cuts is a wish list.
  5. Force the task packet through schema: only packets that pass the JSON Schema validator are accepted by workers.
  6. An ADR is a record of decisions: what you rejected matters more than what you chose.
  7. The risk register is the entrance to Week 14: each risk is a triple of trigger, owner, and response — and is tracked weekly.

Foundational

ADR / design culture

  • Michael Nygard, “Documenting Architecture Decisions” (the original ADR pattern)
  • ThoughtWorks Tech Radar — ADR adoption

Evaluation / risk

  • STRIDE threat modeling primer
  • Google SRE Workbook — Risk Analysis

Capstone artifacts