Concepts
Explain why a Ralphthon is unlike a hackathon (closed-loop system framing) and apply the five conditions of a good capstone topic to your team’s candidate ideas.
Design
Map the Agent OS Runtime L1-L7 core onto your capstone, and add L8 workflow-plane cycle / phase / policy definitions only if your team needs multi-phase orchestration.
Implementation
Distribute responsibilities across Lead / Planner / Worker / Reviewer / Operator, and write three ADRs and five risk-register items.
Operations
Reverse-plan Weeks 14-16 (build / integrate / present), pin the demo path, and explicitly declare a “Won’t have” list.
Ralphthon is a team capstone that uses the Ralph-loop methodology to solve a real software problem. Unlike a typical hackathon, the deliverable is not “an app” — it is a repeatable agent system.
| Axis | Generic hackathon | Ralphthon |
|---|---|---|
| Deliverable | one demo app | a repeatable agent system |
| Success criterion | demo works | the same task packet executes consistently three times |
| AI usage | freeform (code copilot) | task packet → harness → gate |
| Evaluation evidence | demo video | event log, replay snapshot, gate results |
| Team composition | freeform | Lead / Planner / Worker / Reviewer / Operator |
| Failure handling | improvised | runbook + retry budget + escalation |
| Post-event operations | usually ends | replay and ADRs make it reproducible |
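The success criterion in the table (the same task packet executing consistently three times) can be sketched as a small check. This is a minimal illustration, not the course harness: `run_once`, `gate_fingerprint`, and the shape of the gate-result dictionary are assumptions made for the example.

```python
import hashlib
import json
from typing import Callable


def gate_fingerprint(gate_results: dict) -> str:
    """Hash gate results so that key order does not affect the comparison."""
    canonical = json.dumps(gate_results, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def runs_are_consistent(run_once: Callable[[], dict], repeats: int = 3) -> bool:
    """Execute the same task packet `repeats` times and compare gate results."""
    fingerprints = {gate_fingerprint(run_once()) for _ in range(repeats)}
    return len(fingerprints) == 1


def fake_run() -> dict:
    """Hypothetical stand-in for a real harness run; always passes both gates."""
    return {"tests": "pass", "schema": "pass"}


print(runs_are_consistent(fake_run))
```

Comparing gate results rather than raw model output is a deliberate choice here: model text may vary between runs while the pass/fail evidence stays the same.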
| Condition | Description | Bad example |
|---|---|---|
| Repeatable | the same kind of task happens many times | one-off demo |
| Verifiable | tests, rubrics, judges, and human review apply | “write something good” |
| Bounded | files, tools, and permissions can be limited | free rein over the internet |
| Recoverable | wrong outputs can be rolled back or retried | direct DB mutations |
| Role-separable | planner / worker / reviewer / operator split | one person owns everything |
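One way to apply the five conditions to candidate ideas is a plain yes/no checklist per idea. The sketch below assumes each team answers the five questions by hand; the candidate ideas and the `unmet_conditions` helper are hypothetical.

```python
CONDITIONS = ("repeatable", "verifiable", "bounded", "recoverable", "role_separable")


def unmet_conditions(answers: dict) -> list:
    """Return the conditions a candidate idea does not satisfy, in table order."""
    return [name for name in CONDITIONS if not answers.get(name, False)]


# Hypothetical candidate ideas with hand-entered yes/no answers.
candidates = {
    "triage incoming GitHub issues": dict.fromkeys(CONDITIONS, True),
    "one-off launch demo": {"verifiable": True, "bounded": True},
}

for idea, answers in candidates.items():
    missing = unmet_conditions(answers)
    verdict = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{idea} -> {verdict}")
```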
The Week 13 design document must specify at least five of the L1-L7 core layers below. Add the L8 workflow plane only if your team is explicitly designing a multi-phase cycle.
| Layer | What to define for the capstone |
|---|---|
| L1 MCP Tool Protocol | allowed / denied tools, tool input/output schemas, tool events |
| L2 Provider Completion | model profiles, cost/latency budgets, fallback rules |
| L3 Plan-Work-Review Collaboration | Lead, Planner, Worker, Reviewer state transitions |
| L4 Event Store | .events.jsonl, replay snapshot |
| L5 Markdown-SSOT Skill Runtime | role instructions, rubrics, allowed tool scope |
| L6 Hook Lifecycle | approvals, secret scans, loop stops, escalation hooks |
| L7 Schema IPC Registry | task packet, worker report, review verdict, run report schemas |
| Optional L8 Workflow Plane | cycle / phase / policy / persona / artifact Markdown SSOT |
See Agent OS 7+1-Layer Architecture (L1–L7 core + L8 workflow plane) for layer-by-layer detail. Teams that want to design multi-phase cycles (e.g. brainstorm → fix → ship) should also consult the L8 Workflow Plane five-axis model (cycle / phase / policy / persona / artifact). L8 is optional — the capstone rubric is fully satisfied by the L1–L7 core alone; L8 is for teams that want to lift cycle sequencing, policies, and personas into Markdown SSOT.
Every core layer carries an artifact. You do not have to build all of L1-L7 in Week 13, but at least five core layers must be concrete and the rest must have Week 14 owners. Teams that choose L8 also keep cycle/phase/policy Markdown and workflow.* event evidence as separate deliverables.
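The coverage rule above (at least five concrete core layers, with a Week 14 owner for every remaining one) is easy to check mechanically. The following sketch assumes a hypothetical plan format in which each layer maps to an `artifact` and a `week14_owner`; neither field name comes from the course materials.

```python
CORE_LAYERS = ("L1", "L2", "L3", "L4", "L5", "L6", "L7")


def check_layer_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan meets the rule."""
    problems = []
    concrete = [layer for layer in CORE_LAYERS if plan.get(layer, {}).get("artifact")]
    if len(concrete) < 5:
        problems.append(f"only {len(concrete)} core layers are concrete; need at least 5")
    for layer in CORE_LAYERS:
        entry = plan.get(layer, {})
        if not entry.get("artifact") and not entry.get("week14_owner"):
            problems.append(f"{layer} has neither an artifact nor a Week 14 owner")
    return problems


# Hypothetical plan: five layers have artifacts, two are deferred with owners.
plan = {
    "L1": {"artifact": "tools.allowlist.json"},
    "L2": {"week14_owner": "Agent Engineer"},
    "L3": {"artifact": "roles/state-machine.md"},
    "L4": {"artifact": ".events.jsonl"},
    "L5": {"artifact": "skills/reviewer.md"},
    "L6": {"week14_owner": "Harness Engineer"},
    "L7": {"artifact": "schemas/task-packet.json"},
}
print(check_layer_plan(plan))
```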
Lead / Architect
Owns problem framing, scope, and final design. Prevents scope creep and locks acceptance criteria.
Harness Engineer
Builds task packets, the event store, policy gates, and the retry/rollback logic.
Agent Engineer
Builds role prompts, tool policies, model routing, and CLI integration.
QA / Operator
Owns tests, judge rubrics, telemetry dashboards, and demo stability.
Decisions are distributed across these roles: the Lead does not make every call, and the Operator reads metrics and the dashboard for live signal.
A good capstone does not announce a finished product on day one. Split must-haves and won’t-haves explicitly.
| Bucket | Examples | Decision rule |
|---|---|---|
| Must have | task packet, single worker loop, deterministic gate, event log | without it, there is no closed loop |
| Should have | reviewer / judge, replay snapshot, simple dashboard | strengthens the final evidence |
| Could have | web UI, multi-model router, agent marketplace | only if time remains |
| Won’t have | full autonomous deploy, external account control, complex permission delegation | risk and cost too high |
The Week 13 design document must contain a Won't have list. A plan without scope cuts is a wish list, not a plan.
A free-form natural-language request is too ambiguous to hand to an agent reliably. In the capstone, every task is sent as a packet.
```yaml
task_id: capstone-017
objective: "Add retry handling to the GitHub issue importer"
scope:
  files:
    - src/importer/github.py
    - tests/test_github_importer.py
allowed_tools:
  - read_file
  - edit_file
  - run_tests
acceptance:
  - "pytest tests/test_github_importer.py passes"
  - "No network call in unit tests"
  - "Retry count is configurable"
budget:
  max_turns: 6
  max_tokens: 120000
escalation:
  ask_human_if:
    - "API contract must change"
    - "Secret or credential is required"
```

| Field | Good | Borderline | Anti-pattern |
|---|---|---|---|
| Objective | verb + measurable outcome | vague verb | “make it better” |
| Scope | files / dirs explicit | “related code” | entire repo |
| Allowed tools | limited to read/edit/run_tests | broad | “any” |
| Acceptance | 3-5 pass/fail items | one item | none |
| Budget | turns and tokens specified | one of the two | unlimited |
| Escalation | condition + owner | condition only | none |
```json
{
  "type": "object",
  "required": ["task_id", "objective", "scope", "acceptance", "budget"],
  "properties": {
    "task_id": {"type": "string", "pattern": "^[a-z0-9-]{4,}$"},
    "objective": {"type": "string", "minLength": 10},
    "scope": {
      "type": "object",
      "properties": {
        "files": {"type": "array", "items": {"type": "string"}}
      }
    },
    "allowed_tools": {"type": "array", "items": {"type": "string"}},
    "acceptance": {
      "type": "array",
      "minItems": 1,
      "items": {"type": "string"}
    },
    "budget": {
      "type": "object",
      "required": ["max_turns", "max_tokens"],
      "properties": {
        "max_turns": {"type": "integer", "maximum": 15},
        "max_tokens": {"type": "integer"}
      }
    },
    "escalation": {"type": "object"}
  }
}
```

Committing to reject any task packet that fails validation against this schema pays off quickly once the capstone starts.
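The rejection rule can be sketched as an intake check. To stay dependency-free, the example below hand-codes the main constraints of the schema (required fields, the `task_id` pattern, `minLength`, `minItems`, and the `max_turns` ceiling) instead of calling a JSON Schema library; a real harness would more likely use one. The function name `packet_errors` is an assumption.

```python
import re

REQUIRED = ("task_id", "objective", "scope", "acceptance", "budget")
TASK_ID = re.compile(r"[a-z0-9-]{4,}")  # used with fullmatch, so no anchors needed
MAX_TURNS_LIMIT = 15


def _is_int(value) -> bool:
    """True for integers, excluding booleans (bool is a subclass of int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def packet_errors(packet: dict) -> list:
    """Return validation errors; an empty list means the packet is accepted."""
    errors = [f"missing field: {name}" for name in REQUIRED if name not in packet]
    if errors:
        return errors
    task_id, objective = packet["task_id"], packet["objective"]
    if not isinstance(task_id, str) or not TASK_ID.fullmatch(task_id):
        errors.append("task_id must match ^[a-z0-9-]{4,}$")
    if not isinstance(objective, str) or len(objective) < 10:
        errors.append("objective must be a string of at least 10 characters")
    acceptance = packet["acceptance"]
    if not isinstance(acceptance, list) or not acceptance:
        errors.append("acceptance must be a non-empty list")
    elif not all(isinstance(item, str) for item in acceptance):
        errors.append("acceptance items must be strings")
    budget = packet["budget"]
    if not isinstance(budget, dict):
        errors.append("budget must be an object")
        return errors
    for key in ("max_turns", "max_tokens"):
        if not _is_int(budget.get(key)):
            errors.append(f"budget.{key} must be an integer")
    if _is_int(budget.get("max_turns")) and budget["max_turns"] > MAX_TURNS_LIMIT:
        errors.append(f"budget.max_turns must be <= {MAX_TURNS_LIMIT}")
    return errors


packet = {
    "task_id": "capstone-017",
    "objective": "Add retry handling to the GitHub issue importer",
    "scope": {"files": ["src/importer/github.py", "tests/test_github_importer.py"]},
    "acceptance": ["pytest tests/test_github_importer.py passes"],
    "budget": {"max_turns": 6, "max_tokens": 120000},
}
print(packet_errors(packet))
```

Returning every error at once, rather than stopping at the first, lets the Planner fix a rejected packet in a single pass.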
Do not list risks abstractly. Use a STRIDE-inspired table.
| Threat category | Capstone application | Example |
|---|---|---|
| Spoofing | model identity | another model answers under the same alias |
| Tampering | task / artifact tampering | event log post-processed, results forgeable |
| Repudiation | accountability dodge | overrides happen anonymously |
| Information disclosure | sensitive data leak | code or secrets sent to an external API |
| Denial of service | resource exhaustion | infinite retry, queue overflow |
| Elevation of privilege | privilege overreach | edits files outside the allowed scope |
For each row, state in one line how the current plan blocks the threat. If there is no block, that is a Week 14 to-build item.
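As one example of a block, the elevation-of-privilege row could be covered by a scope guard that compares the files a worker touched against the packet's `scope.files`. This is a sketch under assumptions: paths are repository-relative POSIX paths, and `out_of_scope` is a hypothetical helper that the harness would call before accepting a worker report.

```python
import posixpath


def out_of_scope(edited_paths, allowed_files):
    """Return edited paths that are not in the allowed file list.

    Paths are normalised first, so './a.py' matches 'a.py' and
    '../' tricks cannot disguise a file outside the scope.
    """
    allowed = {posixpath.normpath(path) for path in allowed_files}
    return [path for path in edited_paths if posixpath.normpath(path) not in allowed]


allowed = ["src/importer/github.py", "tests/test_github_importer.py"]
edited = ["./src/importer/github.py", "src/importer/../../.env"]
print(out_of_scope(edited, allowed))
```

An exact allowlist is stricter than a directory prefix check; a team that scopes by directory would need a different comparison.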
Each team writes ADRs for major decisions. At least three are required.
```markdown
# ADR-001: Use vLLM OpenAI-compatible API

## Context
We need to compare local and commercial models through the same harness.

## Decision
Expose local models through vLLM's OpenAI-compatible API and route calls through one client wrapper.

## Consequences
- Good: model provider can be swapped without changing agent code.
- Bad: model-specific tool-call parsers still require configuration.
```

| Aspect | Good ADR | Bad ADR |
|---|---|---|
| Context | one or two sentences on “why now” | a corporate vision |
| Decision | one sentence + the alternative we rejected | “Use X” |
| Consequences | both Good and Bad | only Good |
| Owner | one named person | anonymous |
| Date | the decision date | empty |
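Parts of this table can be checked automatically before review. The sketch below lints an ADR written in the Context / Decision / Consequences layout of ADR-001 and flags missing sections or one-sided consequences. It does not check Owner or Date, since the source does not show how those are recorded; `adr_problems` is a hypothetical helper.

```python
import re

REQUIRED_SECTIONS = ("Context", "Decision", "Consequences")


def adr_problems(text: str) -> list:
    """Return structural problems found in an ADR written as Markdown."""
    sections, current = {}, None
    for line in text.splitlines():
        heading = re.match(r"^##\s+(.+?)\s*$", line)  # level-2 headings only
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    problems = []
    for name in REQUIRED_SECTIONS:
        if not "\n".join(sections.get(name, [])).strip():
            problems.append(f"missing or empty section: {name}")
    consequences = "\n".join(sections.get("Consequences", []))
    for label in ("Good:", "Bad:"):
        if label not in consequences:
            problems.append(f"Consequences has no '{label}' entry")
    return problems


EXAMPLE_ADR = """# ADR-001: Use vLLM OpenAI-compatible API

## Context
We need to compare local and commercial models through the same harness.

## Decision
Expose local models through vLLM's OpenAI-compatible API.

## Consequences
- Good: model provider can be swapped without changing agent code.
- Bad: model-specific tool-call parsers still require configuration.
"""
print(adr_problems(EXAMPLE_ADR))
```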
```markdown
# Team Name Capstone Design

## 1. Problem
## 2. Users and Risk Boundary
## 3. Agent Architecture
## 4. Runtime Layers
## 5. Task Packet Schema
## 6. Evaluation Gates
## 7. Telemetry and Replay
## 8. Implementation Plan
## 9. Demo Scenario
```

Risks belong in a structured register: trigger, owner, response.
| Risk | Trigger | Owner | Response |
|---|---|---|---|
| Model output repeatedly violates JSON schema | invalid JSON rate > 20% | Agent Engineer | structured output or repair step |
| No tests, so quality cannot be evaluated | deterministic gate is empty | QA / Operator | minimum smoke test + fixture |
| Scope creep | happy path not passing by end of Week 14 | Lead | drop all “Could have” items |
| Cost overrun | run-level token budget exceeded | Harness Engineer | apply max_turns, context trim, cache prefix |
| Unstable demo | one of three repeated runs fails | QA / Operator | shrink live demo, prepare a recording |
This table is used as input for the Week 14 midpoint report and the Week 15 release gate.
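Because most triggers in the register are measurable, they can be evaluated from run metrics instead of judged by feel. The following is a minimal sketch; the metric names (`model_calls`, `invalid_json_outputs`, and so on) are assumptions, and the scope-creep trigger is left out because it depends on a calendar date, not a run metric.

```python
def fired_risks(metrics: dict) -> list:
    """Return (risk, owner) pairs whose trigger condition currently holds."""
    fired = []
    calls = metrics.get("model_calls", 0)
    # Trigger from the register: invalid JSON rate > 20%.
    if calls and metrics.get("invalid_json_outputs", 0) / calls > 0.20:
        fired.append(("schema violations", "Agent Engineer"))
    # Trigger: the deterministic gate is empty.
    if metrics.get("gate_test_count", 0) == 0:
        fired.append(("empty deterministic gate", "QA / Operator"))
    # Trigger: run-level token budget exceeded.
    if metrics.get("tokens_used", 0) > metrics.get("token_budget", float("inf")):
        fired.append(("cost overrun", "Harness Engineer"))
    # Trigger: one of three repeated runs fails.
    if metrics.get("failed_repeat_runs", 0) >= 1:
        fired.append(("unstable demo", "QA / Operator"))
    return fired


# Hypothetical metrics from one evaluation session.
metrics = {
    "model_calls": 50,
    "invalid_json_outputs": 12,
    "gate_test_count": 8,
    "tokens_used": 90000,
    "token_budget": 120000,
    "failed_repeat_runs": 1,
}
for risk, owner in fired_risks(metrics):
    print(f"{risk}: notify {owner}")
```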
Define the team’s problem
Pick a real, repeatable task and define the user in one sentence.
Write success criteria
Replace “good” with five pass/fail criteria.
Decompose agent roles
Pick only the necessary roles among Lead / Planner / Worker / Reviewer / Operator.
Map runtime layers
For each layer, decide which file, script, log, or policy belongs there.
Author the task packet schema
Attach one JSON Schema and three examples (good / borderline / anti-pattern).
Write three ADRs
One for model selection, one for harness library, one for evaluation strategy.
Pin the demo path
Choose one happy path and one failure-recovery path to show in the final presentation.
Due: 2026-06-02 23:59
Submission path: capstone/teams/[team]/design.md
Required:
Foundational
ADR / design culture
Evaluation / risk
Capstone artifacts