Concepts
Explain how a release gate differs from a deterministic gate and compare four integration-test strategies — smoke, integration, soak, chaos.
Design
Author a feature-freeze policy and a release-readiness checklist, and structure the presentation with the STAR framework.
Implementation
Pass release-gate items — three repeats of the same task packet, replay snapshot verification, automated failure-recovery path — using code and evidence.
Operations
Attach token, cost, latency, and failure-rate tables to the slides with source run ids, turning a “results page” into a “traceable results page”.
Week 15 is not for adding features. It is for turning the closed loop you already built into a trustworthy demo and report. Reduce failure rate, finalize the numbers, and lock the presentation flow rather than adding capability.
A release gate is not a single test — it is a policy that bundles multiple gates, each with one of three characters: deterministic, probabilistic, or approval.
| Gate | Type | Pass criterion | Example |
|---|---|---|---|
| Build/Test | deterministic | exit code 0 | make test, pytest |
| Schema | deterministic | JSON Schema validates | task packet, judge output |
| Replay | deterministic | event log → snapshot identical | replay() output match |
| Performance | deterministic-ish | threshold check | p95 latency < N |
| Judge | probabilistic | overall ≥ 7.0 etc. | LLM Judge JSON |
| Security | approval | manual checklist signed | secret scan + manual review |
| Demo | approval | live demo + recorded backup | 90-second video |
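One way to picture "a policy that bundles multiple gates" is the minimal sketch below. The `Gate` dataclass, the `GateType` enum, `evaluate_release`, and the sample evidence dicts are all hypothetical names introduced for illustration; only the three gate characters and the `overall ≥ 7.0` threshold come from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class GateType(Enum):
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"
    APPROVAL = "approval"


@dataclass
class Gate:
    name: str
    type: GateType
    check: Callable[[], bool]  # returns True when the gate passes


def evaluate_release(gates: list[Gate]) -> dict:
    """Run every gate; the release passes only if all gates pass."""
    results = {g.name: bool(g.check()) for g in gates}
    return {"passed": all(results.values()), "results": results}


# Hypothetical evidence standing in for real CI logs, judge JSON, and sign-offs.
judge_output = {"overall": 7.4}
security_signoff = {"signed_by": "reviewer-a"}

gates = [
    Gate("build_test", GateType.DETERMINISTIC, lambda: True),  # e.g. exit code 0
    Gate("judge", GateType.PROBABILISTIC, lambda: judge_output["overall"] >= 7.0),
    Gate("security", GateType.APPROVAL, lambda: bool(security_signoff.get("signed_by"))),
]
verdict = evaluate_release(gates)
```

Keeping the gate type next to each check makes it easy to report which failures are reproducible (deterministic), which may need a re-run (probabilistic), and which are waiting on a person (approval).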
Each team writes freeze.md at the start of Week 15.
| Item | Content |
|---|---|
| Demo path | the one happy path the presentation must show |
| Recovery path | the one failure-handling path |
| Frozen features | features that will not change anymore |
| Allowed fixes | bug fixes, test stabilization, doc / number polish |
| Explicit cuts | features intentionally excluded from the final demo |
After the freeze, changes are allowed only if they raise the release-gate pass rate.
Pass every gate before final submission.
| Gate | Criterion | Evidence |
|---|---|---|
| Build/Test | make test (or equivalent) passes | CI log or terminal capture |
| End-to-end | three repeats of the same task packet | three run ids |
| Replay | reconstruct final state from the event log | replay_snapshot.json |
| Evaluation | record deterministic + judge results | evaluation table |
| Security | no secret, credential, or out-of-scope writes | checklist |
| Demo | offline-explainable backup material | recording / screenshots |
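The Replay gate in the checklist above can be sketched as a fold over the event log followed by an exact comparison. The event schema here (`type`, `path`, `status` fields and the `retry` / `artifact.saved` event names) is an assumption for illustration; a real team's log format will differ, and only the idea of "event log → identical snapshot" is taken from the text.

```python
import json


def replay(events):
    """Fold an ordered list of events into a final state dict."""
    state = {"status": "open", "artifacts": [], "retries": 0}
    for ev in events:
        kind = ev["type"]
        if kind == "artifact.saved":
            state["artifacts"].append(ev["path"])
        elif kind == "retry":
            state["retries"] += 1
        elif kind == "run.closed":
            state["status"] = ev["status"]
    return state


def load_events(jsonl_text):
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def verify_replay(jsonl_text, snapshot):
    """Deterministic gate: the replayed state must equal the snapshot exactly."""
    return replay(load_events(jsonl_text)) == snapshot


# Hypothetical three-event log and the snapshot it should reproduce.
log = "\n".join([
    '{"type": "retry"}',
    '{"type": "artifact.saved", "path": "out/report.md"}',
    '{"type": "run.closed", "status": "passed"}',
])
snapshot = {"status": "passed", "artifacts": ["out/report.md"], "retries": 1}
replay_ok = verify_replay(log, snapshot)
```

Because the comparison is plain equality, any drift between the log and the stored snapshot fails the gate rather than being averaged away.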
Before the final rehearsal, run a 20-minute internal release review.
- Run make test (or equivalent) and capture the result.
- Confirm that replay_snapshot.json matches the event log.

Issues found during this review take priority over new features.
Capstone systems are not validated by unit tests alone. Run four kinds of integration tests in balance.
| Test type | Purpose | Pass criterion | Frequency |
|---|---|---|---|
| Smoke | does the system come up? | make run once | every commit |
| Integration | does the end-to-end happy path pass? | three repeats clean | once a day |
| Soak | does memory / queue stay stable over time? | no OOM in 10 min / 100 calls | once before presentation |
| Chaos | does the system recover from injected faults? | invalid task / network drop / model 500 → safe close | once before presentation |
task packet -> planner -> worker -> tests pass -> judge pass -> artifact saved
Run at least three times. If even one fails, log the failure reason and fix the cause.

invalid task -> schema validation fail -> no file write -> run.closed(failed)
Demonstrate that bad input halts the system safely.

tests fail -> reviewer says revise -> retry once -> pass or closed(failed)
Verify that the retry budget prevents infinite loops.

api 500 -> retry with backoff -> still 500 -> escalate to human + run.closed(failed)
Inject deliberate faults to prove recovery behavior.
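The api 500 scenario and the retry budget can be illustrated with a small sketch. `ModelServerError`, `run_with_retry`, and the returned dict fields are hypothetical; the sleep function is injectable so that a test can run without real delays. It assumes exponential backoff, which is one common choice rather than something the scenarios prescribe.

```python
import time


class ModelServerError(Exception):
    """Stand-in for an HTTP 500 from the model API (hypothetical)."""


def run_with_retry(call, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call `call()`, retrying on ModelServerError with exponential backoff.

    The retry budget caps attempts at max_retries + 1, so the loop cannot
    run forever. On exhaustion the run closes as failed and is escalated.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            result = call()
            return {"status": "passed", "attempts": attempts,
                    "result": result, "escalated": False}
        except ModelServerError as exc:
            if attempts > max_retries:
                return {"status": "failed", "attempts": attempts,
                        "reason": str(exc), "escalated": True}
            sleep(base_delay * 2 ** (attempts - 1))  # 0.5 s, 1.0 s, ...


def always_500():
    # Injected fault: the model endpoint never recovers.
    raise ModelServerError("api 500")


delays = []  # records requested backoff delays instead of sleeping
outcome = run_with_retry(always_500, max_retries=2, sleep=delays.append)
```

With `max_retries=2` the injected fault is attempted three times and then closed as failed with `escalated` set, which is the evidence a chaos run needs to show.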
Each team includes at minimum the following table.
| Indicator | Run 1 | Run 2 | Run 3 | Mean | vs baseline |
|---|---|---|---|---|---|
| total latency (s) | | | | | |
| prompt tokens | | | | | |
| completion tokens | | | | | |
| tool calls | | | | | |
| retry count | | | | | |
| judge score | | | | | |
| pass/fail | | | | | |
Cost is a comparison formula, not a billed amount.
commercial_api_cost = input_tokens * input_price + output_tokens * output_price
local_gpu_cost = gpu_hour_price * elapsed_seconds / 3600
operator_cost = human_minutes * hourly_rate / 60

Three to five runs are too few to claim a clean mean, so add one line of notation that shows how reliable the numbers are.
| Item | Recommended notation |
|---|---|
| Average | mean ± stdev |
| Distribution | min / median / p95 |
| Sample size | n=N (state explicitly if < 3) |
| Failure rate | failures / total |
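The notation in the table above can be produced with the standard library alone. This is a minimal sketch: `summarize` is a hypothetical helper, the latency values are made up for illustration, and p95 uses the nearest-rank method, which is one of several accepted definitions. It assumes at least one sample.

```python
import math
import statistics


def summarize(values, failures=0, total=None):
    """Return mean, stdev, min / median / p95, n, and failure rate for a metric."""
    n = len(values)
    ordered = sorted(values)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    p95 = ordered[max(0, math.ceil(0.95 * n) - 1)]
    return {
        "n": n,
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if n >= 2 else None,
        "min": ordered[0],
        "median": statistics.median(values),
        "p95": p95,
        "failure_rate": f"{failures}/{total if total is not None else n}",
    }


latencies = [41.2, 39.8, 44.5]  # made-up latencies in seconds, illustration only
s = summarize(latencies, failures=1, total=4)
line = (
    f"{s['mean']:.1f} ± {s['stdev']:.1f} s (n={s['n']}), "
    f"min/median/p95 = {s['min']}/{s['median']}/{s['p95']}, "
    f"failures {s['failure_rate']}"
)
```

With only three samples the nearest-rank p95 is simply the maximum, which is one more reason to state n next to every number.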
A demo by itself does not let evaluators tell luck from design. STAR provides intent.
| Beat | STAR meaning | Capstone application |
|---|---|---|
| Situation | why this problem | user / repeated task / failure cost |
| Task | the scope you solved | demo path · scope cut |
| Action | how you built it | task packet · gate · replay design choices |
| Result | quantitative evidence | success rate · cost · judge · failure cases |
The recommended length is 15 minutes.
| Time | Section | Evidence |
|---|---|---|
| 2 min | Problem (Situation) | user / repeated task / risk boundary |
| 3 min | Architecture (Task + Action) | agent diagram + runtime layers |
| 5 min | Live demo (Action) | happy path + failure recovery |
| 3 min | Results (Result) | success rate, cost, latency, judge |
| 2 min | Reflection | failures and scope cuts, next improvements |
| Technique | Description | When to apply |
|---|---|---|
| Pre-warm | load the model and warm caches one hour before the talk | live demo |
| Recorded fallback | a 90-second recording ready to play | as soon as the live run stalls |
| Frozen task packet | demo task packet pinned to a git tag | right before the talk |
| Hardcoded inputs | pre-captured external API responses used as fixtures | chaos scenarios |
| Watchdog timer | a 90-second live demo timer auto-switches to fallback | presentation mode |
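The watchdog row can be sketched with `subprocess.run` and its `timeout` argument, which terminates the child process and raises `subprocess.TimeoutExpired` when the limit is hit. The function name, the fallback callback, and the demo command are hypothetical; a real setup would start a video player in the fallback.

```python
import subprocess
import sys


def run_demo_with_watchdog(live_cmd, fallback, timeout_s=90.0):
    """Run the live demo command; switch to the fallback if it stalls or fails."""
    try:
        # check=True turns a non-zero exit code into CalledProcessError.
        subprocess.run(live_cmd, check=True, timeout=timeout_s)
        return "live"
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        fallback()
        return "fallback"


played = []  # stands in for "start playing the recorded backup"

# Simulated stall: the "demo" sleeps longer than the (shortened) watchdog limit.
stalled = [sys.executable, "-c", "import time; time.sleep(5)"]
mode = run_demo_with_watchdog(stalled, lambda: played.append("demo.mp4"), timeout_s=0.5)
```

Rehearsing with a deliberately short timeout, as here, confirms the switch to the fallback works before it is needed on stage.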
# Final Report
## 1. Problem and User
## 2. System Architecture
## 3. Runtime and Harness Design
## 4. Evaluation Gates
## 5. Telemetry Results
## 6. Failure Cases and Fixes
## 7. Cost and Performance
## 8. Limitations
## 9. What We Would Do Next

Cold-start rehearsal
Start the presentation environment from scratch, with nothing pre-loaded.
Time-box rehearsal
Speak under 15 minutes, ending the demo at the 12-minute mark.
Failure injection
Deliberately submit a failing task and verify the system halts safely on stage.
Evidence check
Confirm every number ties back to a run id, an event log entry, and a dashboard screenshot.
Speaker handoff
Pre-write transition lines so the talk continues even if one teammate is missing.
| Item | Path | Due |
|---|---|---|
| Source code | capstone/teams/[team]/runtime/ | 6/16 |
| README | capstone/teams/[team]/README.md | 6/16 |
| Design doc | capstone/teams/[team]/design.md | 6/16 |
| Final report | capstone/teams/[team]/reports/final-report.md | 6/16 |
| Slides | capstone/teams/[team]/presentation.pdf | 6/16 |
| Demo video | capstone/teams/[team]/demo.mp4 | 6/16 |
| Run logs | capstone/teams/[team]/runs/*.events.jsonl | 6/16 |
| Replay snapshot | capstone/teams/[team]/reports/replay_snapshot.json | 6/16 |
| freeze.md | capstone/teams/[team]/freeze.md | 6/16 |
| Category | Weight | Evaluation points |
|---|---|---|
| Problem framing and scope | 15% | real repeated task, risk boundary, scope cuts |
| Harness design | 25% | task packet, gate, retry, replay |
| Implementation completeness | 25% | E2E execution, tests, stability |
| Observability and evaluation | 20% | telemetry, judge, cost / performance numbers |
| Presentation and reflection | 15% | clarity, sharing failures, next improvements |
Author freeze.md
Fill the five fields: demo path / recovery path / frozen features / allowed fixes / explicit cuts.
Run the integration matrix
Run smoke / integration / soak / chaos at least once each, recording run ids.
Pass the six release gates
Mark each of build/test, 3× E2E, replay, evaluation, security, and demo with evidence.
Build the performance / cost table
Record three-to-five-run averages and distributions, plus a commercial-vs-local cost comparison.
STAR slides v1
Construct the slide skeleton along Situation / Task / Action / Result.
Three rehearsals
Run cold start, time-box, and failure injection back to back.
Due: 2026-06-16 23:59
Required:
Foundational
Testing / reliability
Presentation / reporting