
Week 15: System Integration and Final Testing

Phase 5 · Week 15 · Wrap-up · Lecture: 2026-06-09

Concepts

Explain how a release gate differs from a deterministic gate and compare four integration-test strategies — smoke, integration, soak, chaos.

Design

Author a feature-freeze policy and a release-readiness checklist, and structure the presentation with the STAR framework.

Implementation

Pass release-gate items — three repeats of the same task packet, replay snapshot verification, automated failure-recovery path — using code and evidence.

Operations

Attach token, cost, latency, and failure-rate tables to the slides with source run ids, turning a “results page” into a “traceable results page”.


Week 15 is not for adding features. It is for turning the closed loop you already built into a trustworthy demo and report. Reduce failure rate, finalize the numbers, and lock the presentation flow rather than adding capability.

Week 15 — Integration to Presentation
① Declare feature freeze: freeze.md authored · scope cuts confirmed
② Run the integration matrix: smoke · integration · soak · chaos
③ Pass the release gate: build/test · 3× E2E · replay · evaluation · security · demo
④ Performance / cost report: latency / tokens / cost / judge / pass-fail
⑤ Presentation rehearsal (STAR): cold start · time-box · failure injection · evidence check

A release gate is not a single test. It is a policy that bundles multiple gates, each of which is one of three types: deterministic, probabilistic, or approval.

| Gate | Type | Pass criterion | Example |
| --- | --- | --- | --- |
| Build/Test | deterministic | exit code 0 | make test, pytest |
| Schema | deterministic | JSON Schema validates | task packet, judge output |
| Replay | deterministic | event log → snapshot identical | replay() output match |
| Performance | deterministic-ish | threshold check | p95 latency < N |
| Judge | probabilistic | overall ≥ 7.0 etc. | LLM Judge JSON |
| Security | approval | manual checklist signed | secret scan + manual review |
| Demo | approval | live demo + recorded backup | 90-second video |
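The idea of a release gate as a policy over several typed gates can be sketched as follows. This is a minimal illustration, not the course's reference implementation: `GateType`, `GateResult`, `release_ready`, and `judge_gate` are hypothetical names, and the 7.0 threshold is taken from the Judge row of the table.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class GateType(Enum):
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"
    APPROVAL = "approval"


@dataclass(frozen=True)
class GateResult:
    name: str
    gate_type: GateType
    passed: bool
    evidence: str  # e.g. a CI log path, run ids, or a signed checklist


def judge_gate(scores: list[float], threshold: float = 7.0) -> bool:
    """Probabilistic gate: pass when the mean judge score meets the threshold."""
    return bool(scores) and sum(scores) / len(scores) >= threshold


def release_ready(results: list[GateResult]) -> tuple[bool, list[str]]:
    """Policy: every gate must pass AND carry non-empty evidence."""
    blockers = [
        r.name for r in results
        if not r.passed or not r.evidence.strip()
    ]
    return (not blockers, blockers)


if __name__ == "__main__":
    results = [
        GateResult("build/test", GateType.DETERMINISTIC, True, "ci-log.txt"),
        GateResult("judge", GateType.PROBABILISTIC,
                   judge_gate([7.5, 6.9, 7.2]), "run-a, run-b, run-c"),
        GateResult("security", GateType.APPROVAL, False, ""),
    ]
    print(release_ready(results))  # (False, ['security'])
```

Treating missing evidence as a blocker is a design choice in this sketch: a gate that passed but cannot be traced is handled the same as a gate that failed.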

Each team writes freeze.md at the start of Week 15.

| Item | Content |
| --- | --- |
| Demo path | the one happy path the presentation must show |
| Recovery path | the one failure-handling path |
| Frozen features | features that will not change anymore |
| Allowed fixes | bug fixes, test stabilization, doc / number polish |
| Explicit cuts | features intentionally excluded from the final demo |

After the freeze, changes are allowed only if they raise the release-gate pass rate.

Pass every gate before final submission.

| Gate | Criterion | Evidence |
| --- | --- | --- |
| Build/Test | make test (or equivalent) passes | CI log or terminal capture |
| End-to-end | three repeats of the same task packet | three run ids |
| Replay | reconstruct final state from the event log | replay_snapshot.json |
| Evaluation | record deterministic + judge results | evaluation table |
| Security | no leaked secrets or credentials, no out-of-scope writes | checklist |
| Demo | offline-explainable backup material | recording / screenshots |

Before the final rehearsal, run a 20-minute internal release review.

  1. Run make test (or equivalent) and capture the result.
  2. Run the same task packet three times and record the run ids.
  3. Pick the most recent failed run and explain its failure reason.
  4. Confirm that replay_snapshot.json matches the state reconstructed from the event log.
  5. Annotate every number on the slide with its source run id.
  6. Cross-check that no secrets or credentials leaked and that no out-of-scope writes occurred.

Issues found during this review take priority over new features.
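The replay check in the review can be sketched as a fold over the event log followed by an equality comparison. The event schema here (`type`, `status`, `path`) is an assumption made for illustration; a team's actual `*.events.jsonl` fields will differ, and `replay` and `snapshot_matches` are hypothetical helper names.

```python
import json


def replay(event_lines):
    """Fold an event log (one JSON object per line) into a final state dict."""
    state = {"status": "pending", "artifacts": [], "retries": 0}
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        kind = event["type"]  # hypothetical field name
        if kind == "status_changed":
            state["status"] = event["status"]
        elif kind == "artifact_saved":
            state["artifacts"].append(event["path"])
        elif kind == "retry":
            state["retries"] += 1
    return state


def snapshot_matches(event_lines, snapshot):
    """Deterministic gate: the replayed state must equal the stored snapshot."""
    return replay(event_lines) == snapshot


if __name__ == "__main__":
    log = [
        '{"type": "status_changed", "status": "running"}',
        '{"type": "retry"}',
        '{"type": "artifact_saved", "path": "out/report.md"}',
        '{"type": "status_changed", "status": "passed"}',
    ]
    snapshot = {"status": "passed", "artifacts": ["out/report.md"], "retries": 1}
    print(snapshot_matches(log, snapshot))  # True
```

In practice the lines would come from a run's event log file and the snapshot from `json.load` on replay_snapshot.json. The comparison stays a plain equality so that the gate remains deterministic.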

Capstone systems are not validated by unit tests alone. Run four kinds of integration tests in balance.

| Test type | Purpose | Pass criterion | Frequency |
| --- | --- | --- | --- |
| Smoke | does the system come up? | make run once | every commit |
| Integration | does the end-to-end happy path pass? | three repeats clean | once a day |
| Soak | does memory / queue stay stable over time? | no OOM in 10 min / 100 calls | once before presentation |
| Chaos | does the system recover from injected faults? | invalid task / network drop / model 500 → safe close | once before presentation |

task packet -> planner -> worker -> tests pass -> judge pass -> artifact saved

Run this path at least three times. If even one run fails, log the failure reason and fix the root cause.
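A small harness for the three-repeat rule might look like the sketch below. It assumes the whole pipeline is exposed as a single callable (`run_once`, a hypothetical name) that raises on failure; `RunRecord`, `repeat_e2e`, and `integration_gate` are likewise illustrative, and the random run id stands in for whatever id scheme the team already uses.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunRecord:
    run_id: str
    passed: bool
    failure_reason: Optional[str] = None


def repeat_e2e(run_once: Callable[[dict], None], task_packet: dict,
               repeats: int = 3) -> list[RunRecord]:
    """Run the same task packet several times, recording a run id per attempt."""
    records = []
    for _ in range(repeats):
        run_id = uuid.uuid4().hex[:8]  # placeholder id scheme
        try:
            run_once(task_packet)
            records.append(RunRecord(run_id, True))
        except Exception as exc:
            # Keep the reason so the failure can be explained and fixed later.
            reason = f"{type(exc).__name__}: {exc}"
            records.append(RunRecord(run_id, False, reason))
    return records


def integration_gate(records: list[RunRecord]) -> bool:
    """Pass only with at least three runs and zero failures."""
    return len(records) >= 3 and all(r.passed for r in records)
```

The harness does not stop at the first failure, so every attempt leaves a run id and a reason that can be cited in the report.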

Each team includes at minimum the following table.

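| Indicator | Run 1 | Run 2 | Run 3 | Mean | vs baseline |
| --- | --- | --- | --- | --- | --- |
| total latency (sec) | | | | | |
| prompt tokens | | | | | |
| completion tokens | | | | | |
| tool calls | | | | | |
| retry count | | | | | |
| judge score | | | | | |
| pass/fail | | | | | |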

Cost here is a comparison estimate computed from formulas, not a billed amount.

commercial_api_cost = input_tokens * input_price + output_tokens * output_price
local_gpu_cost = gpu_hour_price * elapsed_seconds / 3600
operator_cost = human_minutes * hourly_rate / 60
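The three formulas translate directly into functions. The sketch below assumes prices are expressed per token and per hour in the same currency; the numbers in the usage example are placeholders chosen for arithmetic convenience, not real vendor prices.

```python
def commercial_api_cost(input_tokens, output_tokens, input_price, output_price):
    """Estimated API cost; prices are per token."""
    return input_tokens * input_price + output_tokens * output_price


def local_gpu_cost(gpu_hour_price, elapsed_seconds):
    """Estimated local cost from an hourly GPU price and wall-clock seconds."""
    return gpu_hour_price * elapsed_seconds / 3600


def operator_cost(human_minutes, hourly_rate):
    """Estimated cost of human attention spent on the run."""
    return human_minutes * hourly_rate / 60


if __name__ == "__main__":
    # Placeholder inputs for one run (not real prices).
    api = commercial_api_cost(12_000, 3_000, 0.000002, 0.000008)
    gpu = local_gpu_cost(1.20, 90)
    human = operator_cost(5, 30)
    print(f"api={api:.3f} gpu={gpu:.3f} operator={human:.3f}")
```

Reporting all three side by side for the same run id keeps the commercial-vs-local comparison honest, since operator time often dominates both.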

Three to five runs are too few to support a clean mean. Add one line of notation that states how reliable each number is.

| Item | Recommended notation |
| --- | --- |
| Average | mean ± stdev |
| Distribution | min / median / p95 |
| Sample size | n=N (state explicitly if < 3) |
| Failure rate | failures / total |
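The recommended notation can be produced with the standard library alone. This sketch uses a nearest-rank p95, which is one of several common percentile definitions; `summarize` is a hypothetical helper name.

```python
import math
import statistics


def summarize(values, failures, total):
    """Summarize one indicator across runs in the recommended notation."""
    if not values:
        raise ValueError("need at least one measured run")
    n = len(values)
    ordered = sorted(values)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    p95 = ordered[max(0, math.ceil(0.95 * n) - 1)]
    return {
        "n": n,
        "mean": statistics.mean(values),
        # Sample stdev is undefined for a single run.
        "stdev": statistics.stdev(values) if n >= 2 else None,
        "min": ordered[0],
        "median": statistics.median(values),
        "p95": p95,
        "failure_rate": f"{failures}/{total}",
    }


if __name__ == "__main__":
    latency_sec = [10.0, 12.0, 14.0]  # placeholder measurements
    print(summarize(latency_sec, failures=1, total=4))
```

With only three to five runs the nearest-rank p95 is simply the slowest run, which is one more reason to print `n` next to every figure.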

Presentation storytelling — the STAR framework


A demo by itself does not let evaluators tell luck from design. STAR makes the intent behind the demo explicit.

| Beat | STAR meaning | Capstone application |
| --- | --- | --- |
| Situation | why this problem | user / repeated task / failure cost |
| Task | the scope you solved | demo path · scope cut |
| Action | how you built it | task packet · gate · replay design choices |
| Result | quantitative evidence | success rate · cost · judge · failure cases |

The recommended length is 15 minutes.

| Time | Section | Evidence |
| --- | --- | --- |
| 2 min | Problem (Situation) | user / repeated task / risk boundary |
| 3 min | Architecture (Task + Action) | agent diagram + runtime layers |
| 5 min | Live demo (Action) | happy path + failure recovery |
| 3 min | Results (Result) | success rate, cost, latency, judge |
| 2 min | Reflection | failures and scope cuts, next improvements |

| Technique | Description | When to apply |
| --- | --- | --- |
| Pre-warm | load the model and warm caches one hour before the talk | live demo |
| Recorded fallback | a 90-second recording ready to play | as soon as the live run stalls |
| Frozen task packet | demo task packet pinned to a git tag | right before the talk |
| Hardcoded inputs | pre-captured external API responses used as fixtures | chaos scenarios |
| Watchdog timer | a 90-second live demo timer auto-switches to fallback | presentation mode |
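The watchdog row can be sketched with a worker thread and a timeout. This is one possible shape under stated assumptions: `live_demo` and `play_recording` are hypothetical callables supplied by the team, and the 90-second default comes from the table.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_fallback(live_demo, play_recording, timeout_s=90.0):
    """Run live_demo; if it stalls past timeout_s or raises, play the recording.

    Returns a (mode, value) pair where mode is "live" or "fallback".
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(live_demo)
    try:
        return "live", future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any error raised by the live run.
        return "fallback", play_recording()
    finally:
        # Do not block on a stalled demo thread; it cannot be force-killed here.
        executor.shutdown(wait=False)
```

A thread cannot be cancelled once it is running, so a stalled live run keeps going in the background while the recording plays. If the demo holds a GPU or a port, running it in a subprocess that can be terminated may be the safer variant.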
# Final Report
## 1. Problem and User
## 2. System Architecture
## 3. Runtime and Harness Design
## 4. Evaluation Gates
## 5. Telemetry Results
## 6. Failure Cases and Fixes
## 7. Cost and Performance
## 8. Limitations
## 9. What We Would Do Next
  1. Cold-start rehearsal

    Start the presentation environment from scratch, with nothing pre-loaded.

  2. Time-box rehearsal

    Speak under 15 minutes, ending the demo at the 12-minute mark.

  3. Failure injection

    Deliberately submit a failing task and verify the system halts safely on stage.

  4. Evidence check

    Confirm every number ties back to a run id, an event log entry, and a dashboard screenshot.

  5. Speaker handoff

    Pre-write transition lines so the talk continues even if one teammate is missing.
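The failure-injection rehearsal is easier to run on stage if "halts safely" is a checkable outcome rather than an impression. The sketch below assumes a task packet is a dict with a `goal` field and that the model call is injectable; `Outcome`, `ModelServerError`, and `handle_task` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    status: str       # "passed" or "halted"
    reason: str = ""  # recorded so the halt can be explained afterwards


class ModelServerError(Exception):
    """Stand-in for an upstream model returning HTTP 500."""


def handle_task(packet, call_model):
    """Run one task; on an injected fault, halt with a recorded reason."""
    if not isinstance(packet, dict) or "goal" not in packet:
        return Outcome("halted", "invalid task packet")
    try:
        call_model(packet["goal"])
    except ModelServerError:
        return Outcome("halted", "model 500")
    except ConnectionError:
        return Outcome("halted", "network drop")
    return Outcome("passed")


if __name__ == "__main__":
    def model_500(goal):
        raise ModelServerError()

    # Inject each fault in turn and show that the system closes safely.
    print(handle_task({}, model_500))
    print(handle_task({"goal": "summarize"}, model_500))
```

Each injected fault maps to one named reason, which lets the failure path be shown alongside the happy path with the same kind of evidence.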

| Item | Path | Due |
| --- | --- | --- |
| Source code | capstone/teams/[team]/runtime/ | 6/16 |
| README | capstone/teams/[team]/README.md | 6/16 |
| Design doc | capstone/teams/[team]/design.md | 6/16 |
| Final report | capstone/teams/[team]/reports/final-report.md | 6/16 |
| Slides | capstone/teams/[team]/presentation.pdf | 6/16 |
| Demo video | capstone/teams/[team]/demo.mp4 | 6/16 |
| Run logs | capstone/teams/[team]/runs/*.events.jsonl | 6/16 |
| Replay snapshot | capstone/teams/[team]/reports/replay_snapshot.json | 6/16 |
| freeze.md | capstone/teams/[team]/freeze.md | 6/16 |

| Category | Weight | Evaluation points |
| --- | --- | --- |
| Problem framing and scope | 15% | real repeated task, risk boundary, scope cuts |
| Harness design | 25% | task packet, gate, retry, replay |
| Implementation completeness | 25% | E2E execution, tests, stability |
| Observability and evaluation | 20% | telemetry, judge, cost / performance numbers |
| Presentation and reflection | 15% | clarity, sharing failures, next improvements |
  1. Author freeze.md

    Fill the five fields: demo path / recovery path / frozen features / allowed fixes / explicit cuts.

  2. Run the integration matrix

    Run smoke / integration / soak / chaos at least once each, recording run ids.

  3. Pass the six release gates

    Mark each of build/test, 3× E2E, replay, evaluation, security, and demo with evidence.

  4. Build the performance / cost table

    Record three-to-five-run averages and distributions, plus a commercial-vs-local cost comparison.

  5. STAR slides v1

    Construct the slide skeleton along Situation / Task / Action / Result.

  6. Three rehearsals

    Run cold start, time-box, and failure injection back to back.

Due: 2026-06-16 23:59

Required:

  1. Complete every item on the final-deliverables list
  2. Include the release-gate checklist
  3. Include performance / cost / quality numbers tied to run ids
  4. Include 15-minute slides and a 90-second backup demo video
  5. Attach freeze.md and the integration matrix results
  1. Week 15 is integration and stabilization: demo-path reliability outranks new features.
  2. A release gate is not a single test: it is a policy that bundles deterministic, probabilistic, and approval gates.
  3. freeze.md is the deciding tool: pin what will not change so fixes and features stop blurring together.
  4. Integration testing is a four-way matrix: smoke / integration / soak / chaos.
  5. Numbers connect to source run ids: every on-slide number must trace back to a run id, an event log entry, or a dashboard.
  6. STAR structures the talk: Situation·Task·Action·Result reveals a demo as design, not luck.
  7. Demos always run on two tracks: live + 90-second recorded fallback, switched by a watchdog timer.
