Week 15: System Integration and Final Testing

Phase 5 · Week 15 · Wrap-up · Lecture: 2026-06-09

Theory

Learning Objectives

Concepts

Explain how a release gate differs from a deterministic gate and compare four integration-test strategies — smoke, integration, soak, chaos.

Design

Author a feature-freeze policy and a release-readiness checklist, and structure the presentation with the STAR framework.

Implementation

Pass release-gate items — three repeats of the same task packet, replay snapshot verification, automated failure-recovery path — using code and evidence.

Operations

Attach token, cost, latency, and failure-rate tables to the slides with source run ids, turning a “results page” into a “traceable results page”.

The integration week’s purpose

Week 15 is not for adding features. It is for turning the closed loop you already built into a trustworthy demo and report: reduce the failure rate, finalize the numbers, and lock the presentation flow.

Week 15 flow

Week 15 — Integration to Presentation

① Declare feature freeze: freeze.md authored · scope cuts confirmed

▼

② Run the integration matrix: smoke · integration · soak · chaos

▼

③ Pass the release gate: build/test · 3× E2E · replay · evaluation · security · demo

▼

④ Performance / cost report: latency / tokens / cost / judge / pass-fail

▼

⑤ Presentation rehearsal (STAR): cold start · time-box · failure injection · evidence check

Release-gate design principles

A release gate is not a single test. It is a policy that bundles multiple gates, each of one of three types: deterministic, probabilistic, or approval.

Gate	Type	Pass criterion	Example
Build/Test	deterministic	exit code 0	`make test`, `pytest`
Schema	deterministic	JSON Schema validates	task packet, judge output
Replay	deterministic	event log → snapshot identical	`replay()` output match
Performance	deterministic-ish	threshold check	p95 latency < N
Judge	probabilistic	overall score ≥ 7.0 (or another agreed threshold)	LLM Judge JSON
Security	approval	manual checklist signed	secret scan + manual review
Demo	approval	live demo + recorded backup	90-second video
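
The bundling idea in the table can be sketched as below. This is a minimal illustration, not the course harness: `GateResult`, `judge_gate`, and `release_ready` are hypothetical names, and treating a gate without attached evidence as blocking is an assumption made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    kind: str      # "deterministic", "probabilistic", or "approval"
    passed: bool
    evidence: str  # run id, log path, or sign-off reference (empty if missing)


def judge_gate(overall: float, threshold: float = 7.0, evidence: str = "") -> GateResult:
    """Probabilistic gate: passes when the judge's overall score meets the threshold."""
    return GateResult("Judge", "probabilistic", overall >= threshold, evidence)


def release_ready(results: list[GateResult]) -> tuple[bool, list[str]]:
    """The release passes only if every gate passed and carries evidence."""
    blocking = [r.name for r in results if not r.passed or not r.evidence]
    return (not blocking, blocking)


results = [
    GateResult("Build/Test", "deterministic", True, "ci-log-001"),
    judge_gate(6.5, evidence="run-002"),           # below 7.0, so it blocks
    GateResult("Security", "approval", True, ""),  # signed off, but no evidence attached
]
ready, blocking = release_ready(results)
print(ready, blocking)
```

In this example the release is blocked by two gates: the judge score falls below the threshold, and the security sign-off has no evidence attached.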

Feature-freeze policy

Each team writes freeze.md at the start of Week 15.

Item	Content
Demo path	the one happy path the presentation must show
Recovery path	the one failure-handling path
Frozen features	features that will no longer change
Allowed fixes	bug fixes, test stabilization, doc / number polish
Explicit cuts	features intentionally excluded from the final demo

After the freeze, changes are allowed only if they raise the release-gate pass rate.

Release-gate procedure

Pass every gate before final submission.

Gate	Criterion	Evidence
Build/Test	`make test` (or equivalent) passes	CI log or terminal capture
End-to-end	three repeats of the same task packet	three run ids
Replay	reconstruct final state from the event log	`replay_snapshot.json`
Evaluation	record deterministic + judge results	evaluation table
Security	no leaked secrets or credentials, no out-of-scope writes	checklist
Demo	offline-explainable backup material	recording / screenshots
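
The replay gate can be sketched as a fold over the event log followed by an equality check against the stored snapshot. This is a sketch under stated assumptions: the event types `artifact.saved` and `retry`, the state fields, and the helper names are hypothetical; only `run.closed` and the JSONL log format come from the course material.

```python
import json


def load_events(path):
    """Read one JSON event per line from a *.events.jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def replay(events):
    """Rebuild the final run state from events alone (assumed event shapes)."""
    state = {"status": "open", "artifacts": [], "retries": 0}
    for ev in events:
        kind = ev["type"]
        if kind == "artifact.saved":
            state["artifacts"].append(ev["path"])
        elif kind == "retry":
            state["retries"] += 1
        elif kind == "run.closed":
            state["status"] = ev["status"]
    return state


def snapshot_matches(events, snapshot):
    """Deterministic gate: the replayed state must equal the stored snapshot."""
    return replay(events) == snapshot


events = [
    {"type": "retry"},
    {"type": "artifact.saved", "path": "out/report.md"},
    {"type": "run.closed", "status": "passed"},
]
snapshot = {"status": "passed", "artifacts": ["out/report.md"], "retries": 1}
print(snapshot_matches(events, snapshot))
```

Because `replay` reads nothing but the events, running it twice on the same log yields the same state, which is what makes this gate deterministic.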

Release readiness review

Before the final rehearsal, run a 20-minute internal release review.

Run make test (or equivalent) and capture the result.
Run the same task packet three times and record the run ids.
Pick the most recent failed run and explain its failure reason.
Confirm that replay_snapshot.json matches the state replayed from the event log.
Annotate every number on the slides with its source run id.
Confirm that no secrets or credentials leaked and that no out-of-scope writes occurred.

Issues found during this review take priority over new features.
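
The run-id annotation step lends itself to an automated check. The sketch below assumes slide numbers are kept in a small list of records; the record fields (`label`, `value`, `run_id`) and the function name are hypothetical.

```python
def untraceable_numbers(slide_numbers, known_run_ids):
    """Return labels of slide numbers whose run id is missing or unknown."""
    known = set(known_run_ids)
    return [
        item["label"]
        for item in slide_numbers
        if item.get("run_id") not in known
    ]


slide_numbers = [
    {"label": "mean latency", "value": 41.2, "run_id": "run-001"},
    {"label": "judge score", "value": 7.8, "run_id": "run-009"},  # not in the logs
    {"label": "retry count", "value": 1},                         # no run id at all
]
known_run_ids = ["run-001", "run-002", "run-003"]
print(untraceable_numbers(slide_numbers, known_run_ids))
```

An empty result means every number on the slides can be traced to a recorded run.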

Integration test matrix

Capstone systems are not validated by unit tests alone. Run all four kinds of integration test rather than relying on any one of them.

Test type	Purpose	Pass criterion	Frequency
Smoke	does the system come up?	`make run` succeeds once	every commit
Integration	does the end-to-end happy path pass?	three repeats clean	once a day
Soak	does memory / queue stay stable over time?	no OOM in 10 min / 100 calls	once before presentation
Chaos	does the system recover from injected faults?	invalid task / network drop / model 500 → safe close	once before presentation
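
A soak check along the lines of the table row might look like the sketch below. It uses the standard-library `tracemalloc` module, which tracks only memory allocated through Python, so it is a rough stand-in for a real OOM check. The function name `soak` and the 1 MB growth limit are assumptions chosen for illustration.

```python
import tracemalloc


def soak(call, iterations=100, max_growth_bytes=1_000_000):
    """Call `call` repeatedly and report growth in traced Python memory."""
    tracemalloc.start()
    try:
        call()  # warm-up so one-time allocations are not counted as growth
        baseline, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            call()
        current, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    growth = current - baseline
    return {
        "iterations": iterations,
        "growth_bytes": growth,
        "stable": growth <= max_growth_bytes,
    }


leak = []
leaky = soak(lambda: leak.append(bytearray(100_000)), iterations=50)
steady = soak(lambda: sum(range(1000)), iterations=50)
print(leaky["stable"], steady["stable"])
```

The first call keeps every buffer alive and so exceeds the growth limit; the second allocates nothing that outlives the call.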

task packet -> planner -> worker -> tests pass -> judge pass -> artifact saved

Run at least three times. If even one fails, log the failure reason and fix the cause.

invalid task -> schema validation fail -> no file write -> run.closed(failed)

Demonstrate that bad input halts the system safely.
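
A minimal sketch of this safe-halt path follows. The hand-rolled field check stands in for real JSON Schema validation, and the packet fields (`task_id`, `goal`), the `schema.invalid` event, and the function names are hypothetical.

```python
REQUIRED_FIELDS = {"task_id": str, "goal": str}  # hypothetical packet fields


def validate_packet(packet):
    """Return a list of validation errors; an empty list means valid."""
    if not isinstance(packet, dict):
        return ["packet is not an object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in packet:
            errors.append(f"missing field: {name}")
        elif not isinstance(packet[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    return errors


def run_task(packet, events, write_artifact):
    """Validate first; on failure, close the run without touching any file."""
    errors = validate_packet(packet)
    if errors:
        events.append({"type": "schema.invalid", "errors": errors})
        events.append({"type": "run.closed", "status": "failed"})
        return "failed"
    write_artifact(packet)
    events.append({"type": "run.closed", "status": "passed"})
    return "passed"


written, events = [], []
status = run_task({"goal": 3}, events, written.append)
print(status, written, events[-1])
```

The key property is ordering: validation runs before any write, so a rejected packet leaves `written` empty and the log ends with a failed `run.closed` event.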

tests fail -> reviewer says revise -> retry once -> pass or closed(failed)

Verify that the retry budget prevents infinite loops.
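
The retry budget can be sketched as a bounded loop. The names `run_with_review` and `attempt` are hypothetical; `attempt(n)` is assumed to return `True` when the tests pass on attempt `n`.

```python
def run_with_review(attempt, max_retries=1):
    """Try once, then retry at most `max_retries` times; never loop forever."""
    for n in range(max_retries + 1):
        if attempt(n):
            return {"status": "passed", "attempts": n + 1}
    return {"status": "failed", "attempts": max_retries + 1}


always_fails = run_with_review(lambda n: False)
passes_on_retry = run_with_review(lambda n: n == 1)
print(always_fails, passes_on_retry)
```

With the default budget of one retry, a task that never passes is closed as failed after exactly two attempts.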

api 500 -> retry with backoff -> still 500 -> escalate to human + run.closed(failed)

Inject deliberate faults to prove recovery behavior.
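
The backoff-then-escalate path can be sketched as follows. The function name, the attempt count, and the delay values are assumptions for illustration; `call` is a hypothetical callable that returns an HTTP-style status code, and `sleep` is injectable so the fault can be simulated without waiting.

```python
import time


def call_with_backoff(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry on 5xx with exponential backoff, then escalate instead of looping."""
    for attempt in range(max_attempts):
        status = call()
        if status < 500:
            return {"status": "ok", "attempts": attempt + 1, "escalated": False}
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1.0 s, ...
    # Budget exhausted: hand off to a human and close the run as failed.
    return {"status": "failed", "attempts": max_attempts, "escalated": True}


delays = []
result = call_with_backoff(lambda: 500, sleep=delays.append)
print(result, delays)
```

Injecting a stub that always returns 500 exercises the whole path: two backoff waits, three attempts, then escalation.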

Writing the performance / cost report

Each team includes at minimum the following table.

Indicator	Run 1	Run 2	Run 3	Mean	vs baseline
total latency (s)
prompt tokens
completion tokens
tool calls
retry count
judge score
pass/fail

Cost here is an estimate computed from a formula for comparison purposes, not an actual billed amount.

commercial_api_cost = input_tokens * input_price + output_tokens * output_price
local_gpu_cost = gpu_hour_price * elapsed_seconds / 3600
operator_cost = human_minutes * hourly_rate / 60
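
The three formulas translate directly into functions. In the sketch below, prices are per token, per GPU hour, and per operator hour; every number in the example calls is a placeholder, not a real rate.

```python
def commercial_api_cost(input_tokens, output_tokens, input_price, output_price):
    """Prices are per token."""
    return input_tokens * input_price + output_tokens * output_price


def local_gpu_cost(gpu_hour_price, elapsed_seconds):
    """Convert elapsed seconds to hours before applying the hourly price."""
    return gpu_hour_price * elapsed_seconds / 3600


def operator_cost(human_minutes, hourly_rate):
    """Convert minutes of human attention to hours before applying the rate."""
    return human_minutes * hourly_rate / 60


# Placeholder inputs for one run; substitute your own measured values.
api = commercial_api_cost(1000, 500, input_price=0.001, output_price=0.002)
gpu = local_gpu_cost(gpu_hour_price=2.0, elapsed_seconds=1800)
human = operator_cost(human_minutes=30, hourly_rate=40.0)
print(api, gpu, human)
```

Keeping the three terms separate makes the commercial-vs-local comparison explicit: the same run can be priced under either model from its recorded tokens and elapsed time.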

Statistical reliability

Three to five runs are too few for a mean to stand on its own. Add one line of supporting statistics so readers can judge the spread and the sample size.

Item	Recommended notation
Average	mean ± stdev
Distribution	min / median / p95
Sample size	n=N (state explicitly if < 3)
Failure rate	failures / total
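
The recommended notation can be produced with the standard-library `statistics` module. This is a sketch: the function names are hypothetical, the stdev is the sample standard deviation, and p95 uses the nearest-rank method, which with three to five runs simply returns the maximum.

```python
import math
import statistics


def summarize(values):
    """Summary statistics for a non-empty list of per-run measurements."""
    n = len(values)
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * n))  # nearest-rank percentile
    return {
        "n": n,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if n >= 2 else None,
        "min": ordered[0],
        "median": statistics.median(values),
        "p95": ordered[rank - 1],
    }


def report_line(summary):
    """Format as 'mean ± stdev (n=N)'; stdev is undefined for a single run."""
    stdev = "n/a" if summary["stdev"] is None else f"{summary['stdev']:.2f}"
    return f"{summary['mean']:.2f} ± {stdev} (n={summary['n']})"


def failure_rate(failures, total):
    """Report failures as a fraction of total runs, e.g. '1/3'."""
    return f"{failures}/{total}"


latency = summarize([12.0, 10.0, 14.0])
print(report_line(latency), failure_rate(1, 3))
```

Reporting `n` alongside the mean lets a reader see immediately how little data supports the number.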

Presentation storytelling — the STAR framework

A demo by itself does not let evaluators tell luck from design. STAR makes the intent behind the demo explicit.

Beat	STAR meaning	Capstone application
Situation	why this problem	user / repeated task / failure cost
Task	the scope you solved	demo path · scope cut
Action	how you built it	task packet · gate · replay design choices
Result	quantitative evidence	success rate · cost · judge · failure cases

Slide structure

The recommended length is 15 minutes.

Time	Section	Evidence
2 min	Problem (Situation)	user / repeated task / risk boundary
3 min	Architecture (Task + Action)	agent diagram + runtime layers
5 min	Live demo (Action)	happy path + failure recovery
3 min	Results (Result)	success rate, cost, latency, judge
2 min	Reflection	failures and scope cuts, next improvements

Demo stabilization techniques

Technique	Description	When to apply
Pre-warm	load the model and warm caches one hour before the talk	live demo
Recorded fallback	a 90-second recording ready to play	as soon as the live run stalls
Frozen task packet	demo task packet pinned to a git tag	right before the talk
Hardcoded inputs	pre-captured external API responses used as fixtures	chaos scenarios
Watchdog timer	a 90-second live demo timer auto-switches to fallback	presentation mode
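
The watchdog row can be sketched with the standard-library `threading.Timer`. The class name `DemoWatchdog` and the callback are hypothetical; in a real presentation the callback would start the recorded fallback, and the timeout would be 90 seconds rather than the short value used here.

```python
import threading


class DemoWatchdog:
    """Fire `on_timeout` unless the live demo finishes and cancels first."""

    def __init__(self, timeout_s, on_timeout):
        self.fired = threading.Event()
        self._on_timeout = on_timeout
        self._timer = threading.Timer(timeout_s, self._fire)
        self._timer.daemon = True  # do not keep the process alive for the timer

    def _fire(self):
        self._on_timeout()
        self.fired.set()

    def start(self):
        self._timer.start()

    def cancel(self):
        """Call this when the live demo completes in time."""
        self._timer.cancel()


switched = []
stalled = DemoWatchdog(0.05, lambda: switched.append("fallback"))
stalled.start()
stalled.fired.wait(timeout=2.0)  # the live run "stalls", so the watchdog fires

finished = DemoWatchdog(0.2, lambda: switched.append("unexpected"))
finished.start()
finished.cancel()                # the live run finished before the deadline
print(switched)
```

Cancelling on success matters as much as firing on stall: without it, the fallback would interrupt a demo that is going well.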

Final report structure

# Final Report

## 1. Problem and User
## 2. System Architecture
## 3. Runtime and Harness Design
## 4. Evaluation Gates
## 5. Telemetry Results
## 6. Failure Cases and Fixes
## 7. Cost and Performance
## 8. Limitations
## 9. What We Would Do Next

Rehearsal checklist

Cold-start rehearsal

Start the presentation environment from scratch, with nothing pre-loaded.

Time-box rehearsal

Keep the talk under 15 minutes, ending the demo by the 12-minute mark.

Failure injection

Deliberately submit a failing task and verify the system halts safely on stage.

Evidence check

Confirm every number ties back to a run id, an event log entry, and a dashboard screenshot.

Speaker handoff

Pre-write transition lines so the talk continues even if one teammate is missing.

Final deliverables

Item	Path	Due
Source code	`capstone/teams/[team]/runtime/`	6/16
README	`capstone/teams/[team]/README.md`	6/16
Design doc	`capstone/teams/[team]/design.md`	6/16
Final report	`capstone/teams/[team]/reports/final-report.md`	6/16
Slides	`capstone/teams/[team]/presentation.pdf`	6/16
Demo video	`capstone/teams/[team]/demo.mp4`	6/16
Run logs	`capstone/teams/[team]/runs/*.events.jsonl`	6/16
Replay snapshot	`capstone/teams/[team]/reports/replay_snapshot.json`	6/16
freeze.md	`capstone/teams/[team]/freeze.md`	6/16

Grading rubric

Category	Weight	Evaluation points
Problem framing and scope	15%	real repeated task, risk boundary, scope cuts
Harness design	25%	task packet, gate, retry, replay
Implementation completeness	25%	E2E execution, tests, stability
Observability and evaluation	20%	telemetry, judge, cost / performance numbers
Presentation and reflection	15%	clarity, sharing failures, next improvements

Practicum

Author freeze.md

Fill in the five fields: demo path / recovery path / frozen features / allowed fixes / explicit cuts.

Run the integration matrix

Run smoke / integration / soak / chaos at least once each, recording run ids.

Pass the six release gates

Mark each of build/test, 3× E2E, replay, evaluation, security, and demo with evidence.

Build the performance / cost table

Record three-to-five-run averages and distributions, plus a commercial-vs-local cost comparison.

STAR slides v1

Construct the slide skeleton along Situation / Task / Action / Result.

Three rehearsals

Run cold start, time-box, and failure injection back to back.

Assignment

Capstone: final presentation package

Due: 2026-06-16 23:59

Required:

Complete every item on the final-deliverables list
Include the release-gate checklist
Include performance / cost / quality numbers tied to run ids
Include 15-minute slides and a 90-second backup demo video
Attach freeze.md and the integration matrix results

Key Takeaways

Week 15 is integration and stabilization: demo-path reliability outranks new features.
A release gate is not a single test: it is a policy that bundles deterministic, probabilistic, and approval gates.
freeze.md is the deciding tool: pin what will not change so fixes and features stop blurring together.
Integration testing is a four-way matrix: smoke / integration / soak / chaos.
Numbers connect to source run ids: every on-slide number must trace back to a run id, an event log entry, or a dashboard.
STAR structures the talk: Situation·Task·Action·Result reveals a demo as design, not luck.
Demos always run on two tracks: live + 90-second recorded fallback, switched by a watchdog timer.