Week 16: Final Presentations and Course Wrap-Up

Phase 5 · Week 16 · Final Presentation · Lecture: 2026-06-16

Theory

Capabilities you take from this course

Systems thinking

Treat AI not as a model but as a system of governance, tool boundary, events, gates, and humans.

Design ability

Design task packets, runtime layers, gate policies, and escalation paths — and leave decision traces in ADRs.

Implementation ability

Wire Ralph loops, MCP, vLLM, OpenTelemetry, and LLM-as-Judge into a closed-loop MVP.

Operational ability

Operate telemetry, replay, release gates, and human override; report cost and failure rate quantitatively.

The point of the final presentation

The final presentation is not “show off a slick demo.” It is the moment to prove you can treat an AI system as an engineering object. Evaluators want to see:

Which repeated work did you solve?
How far did the agent run autonomously?
How did the harness contain failure?
On what grounds did you judge result quality?
What will you cut or strengthen in the next version?

Final presentation schedule

Time	Team	Notes
09:00-09:15	Team 1	12-min talk + 3-min Q&A
09:20-09:35	Team 2	12-min talk + 3-min Q&A
09:40-09:55	Team 3	12-min talk + 3-min Q&A
10:00-10:15	Team 4	12-min talk + 3-min Q&A
10:20-10:35	Team 5	12-min talk + 3-min Q&A
10:40-11:00	-	Break + extended Q&A
11:00-11:30	-	Peer review
11:30-12:00	-	Course wrap-up

Presentation rubric

Each talk is graded out of 100.

Item	Points	Core question
Technical completeness	25	Does the closed loop actually run?
Harness design	25	Are gate, retry, replay, and policy implemented?
Problem framing	15	Is the chosen task appropriate and repeatable?
Observability and evaluation	20	Are numbers, event logs, and judge/test results connected?
Presentation and reflection	15	Are failures and scope cuts shared honestly?
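
As a quick sanity check on the rubric above, the sketch below encodes the five maximums and totals a set of awarded marks. The function name `total_score` and the example marks are hypothetical illustrations, not part of the official grading tooling.

```python
# Maximum points per rubric item, copied from the rubric table.
RUBRIC_MAX = {
    "Technical completeness": 25,
    "Harness design": 25,
    "Problem framing": 15,
    "Observability and evaluation": 20,
    "Presentation and reflection": 15,
}

# The five maximums should add up to the stated total of 100.
assert sum(RUBRIC_MAX.values()) == 100


def total_score(awarded: dict[str, int]) -> int:
    """Sum awarded points after checking every item against its maximum."""
    missing = set(RUBRIC_MAX) - set(awarded)
    if missing:
        raise ValueError(f"missing rubric items: {sorted(missing)}")
    for item, points in awarded.items():
        if item not in RUBRIC_MAX:
            raise KeyError(f"unknown rubric item: {item}")
        if not 0 <= points <= RUBRIC_MAX[item]:
            raise ValueError(f"{item}: {points} is outside 0..{RUBRIC_MAX[item]}")
    return sum(awarded.values())


# Hypothetical marks for one team (illustration only).
example = {
    "Technical completeness": 20,
    "Harness design": 22,
    "Problem framing": 12,
    "Observability and evaluation": 15,
    "Presentation and reflection": 13,
}
print(total_score(example))  # 82
```

The two largest items, technical completeness and harness design, together account for half of the grade.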

Evidence checklist

Every claim on the slide must come with evidence.

Claim	Required evidence
The system works	live demo or recording + run id
It is repeatable	three runs with the same task packet
It fails safely	the failure-path event log
Quality was evaluated	a deterministic-test + judge + human-review table
Cost is understood	a token / GPU / operator cost formula
Next step is known	a backlog grounded in failure reasons

Cut any claim without evidence.
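
One way to apply the "cut any claim without evidence" rule before the talk is to list every slide claim next to its evidence references and filter mechanically. The following is a minimal sketch; the `Claim` type, the `split_by_evidence` helper, and the example references are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    # Evidence references such as a run id, an event-log path, or a table name.
    evidence: list[str] = field(default_factory=list)


def split_by_evidence(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Return (keep, cut): claims with at least one non-blank reference, and the rest."""
    keep: list[Claim] = []
    cut: list[Claim] = []
    for claim in claims:
        has_evidence = any(ref.strip() for ref in claim.evidence)
        (keep if has_evidence else cut).append(claim)
    return keep, cut


# Hypothetical slide claims (illustration only).
claims = [
    Claim("The system works", ["recording + run id run-001"]),
    Claim("It is repeatable", []),
    Claim("It fails safely", ["   "]),  # a blank reference does not count
]
keep, cut = split_by_evidence(claims)
for claim in cut:
    print(f"CUT: {claim.text}")
```

The check only confirms that a reference exists. Whether the reference actually supports the claim still needs a human reader.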

Peer-review criteria

Peer review is engineering review, not a popularity vote. For each team, write five answers.

The most convincing design decision.
The most dangerous failure mode.
A gate this system would need to be production-ready.
The clearest number or piece of evidence in the talk.
One-sentence improvement suggestion.

Peer-review scoring grid

Area	1 point	3 points	5 points
Problem clarity	“wanted to try AI”	one user / one task	quantified repeated cost
System design	single prompt	role separation	runtime-layer mapping
Evidence	demo video only	one run	three runs + replay + judge
Failure sharing	hidden	partial	failure log + response
Next steps	vague	feature additions	gate / observability priorities

The future of AI systems

The patterns you built in Phases 4-5 line up directly with the trends from 2026 onward. The six axes below will keep guiding you, whatever you build next.

2026-2027: Harness as product

Codex, Claude Code, Gemini CLI, GitHub Agent HQ: the differentiator is not the model but the execution environment, permissions, sandbox, MCP, and review workflow.

Agent Ops as a discipline

MLOps owned model deployment. Agent Ops will own tool boundary, memory, event log, policy gate, and human override.

Eval-first culture

LLM-as-Judge, human-in-the-loop, and offline eval sets become first-line criteria for model selection and release.

Memory & long context

Beyond 1M context, the new design variable is what to keep, summarize, and forget — and how.

Tool / Skill marketplace

MCP, skills, and subagents become OS-level interfaces, with security, billing, and evaluation following along.

Compliance & safety

EU AI Act, ISO 42001, NIST AI RMF — governance-as-code becomes part of the engineering standard.

Capabilities that travel beyond this course

What you learned	Portfolio framing
HOTL governance	designed human approval and automated policy gates as separate concerns
MCP and tool boundary	constrained the agent’s tool surface to be auditable
Ralph Loop	implemented repeated execution with deterministic backpressure and rollback
Context engineering	designed instruction, task packet, memory, and compaction strategy
vLLM / MLOps	served local models with an OpenAI-compatible API plus telemetry
LLM-as-Judge	integrated probabilistic evaluation with deterministic gates
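
The last row of the table, probabilistic evaluation integrated with deterministic gates, can be sketched as a small decision function. This is an illustrative sketch under stated assumptions: the judge score is assumed to be normalized to 0.0..1.0, and the thresholds `release_at` and `reject_below` are hypothetical values a team would tune for its own task.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RELEASE = "release"
    ESCALATE = "escalate"  # hand the result to a human reviewer
    REJECT = "reject"


@dataclass(frozen=True)
class GateInput:
    tests_passed: bool   # deterministic gate, e.g. the test suite's exit status
    schema_valid: bool   # deterministic gate, e.g. output schema validation
    judge_score: float   # LLM-as-Judge score, assumed normalized to 0.0..1.0


def decide(inp: GateInput, release_at: float = 0.8, reject_below: float = 0.5) -> Decision:
    """Combine deterministic gates with a judge score (thresholds are assumptions)."""
    if not 0.0 <= inp.judge_score <= 1.0:
        raise ValueError(f"judge_score out of range: {inp.judge_score}")
    # Deterministic gates run first and cannot be outvoted by the judge.
    if not (inp.tests_passed and inp.schema_valid):
        return Decision.REJECT
    if inp.judge_score >= release_at:
        return Decision.RELEASE
    if inp.judge_score < reject_below:
        return Decision.REJECT
    # The uncertain middle band goes to a human instead of being auto-decided.
    return Decision.ESCALATE


print(decide(GateInput(tests_passed=True, schema_valid=True, judge_score=0.6)))
```

The ordering is the design choice worth explaining in a portfolio: a high judge score never rescues a failed test, and an ambiguous score goes to a human rather than being released automatically.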

Course-wide synthesis

Week 16 is when Weeks 1-15 become one connected story.

From Phase 1 Concepts to Capstone Artifacts

Phase 1: HOTL · MIG · MCP → approval / resource boundary / tool boundary
Phase 2: Ralph Loop · Context · Instruction → repeated run / state file / PROMPT.md
Phase 3: Multi-agent SDLC → Lead / Planner / Worker / Reviewer
Phase 4: vLLM · Telemetry · Judge → serving / OTel / event store / LLM-as-Judge
Phase 5: Ralphthon → closed-loop MVP / release gate / final demo
Final presentation: Synthesis → AI as an engineering target

Early concept	How it surfaces in the capstone
HOTL	human approval, override, peer review
MIG / resource boundary	model serving budget, per-team GPU/queue separation
MCP / tool boundary	allowed tools, denied tools, audit log
Ralph Loop	repeated runs, deterministic backpressure
Context engineering	task packet, instruction, compact replay state
Multi-agent SDLC	Lead, Planner, Worker, Reviewer handoff
MLOps / telemetry	vLLM metrics, OTel spans, event store

A team that explains these connections is presenting a synthesis of the whole course, not just a demo.

Capstone case study (fictional team)

A worked example so students can copy the structure.

Case Study — “Repo Doctor” Team Final Demo

Situation: Department GitHub with 30+ PRs/week. Reviewer fatigue; average reaction time 38h.

Task: Read PR diffs and auto-generate three risk-ranked review comments. A human approves the merge.

Action: task packet · MCP read-only tools · pytest gate · LLM Judge · OTel · replay snapshot.

Result: 3-run average reaction time 38h → 11h. Judge correlation: Spearman 0.78. Cost per PR: $0.04 (local) vs $0.18 (API).

Reflection: One false pass (1 override). Next: secret-scan policy gate, paired-reviewer mode.
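
The case study reports a judge correlation as a Spearman coefficient. Assuming that number compares LLM-judge scores with human ratings of the same outputs, a figure like it could be computed as in the dependency-free sketch below. The scores are made-up illustration data, not the Repo Doctor team's results.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores for five review comments (illustration only).
judge_scores = [4, 2, 5, 1, 3]
human_scores = [5, 1, 4, 2, 3]
print(round(spearman(judge_scores, human_scores), 2))  # 0.8
```

A rank correlation is a reasonable fit here because judge and human scales rarely share the same spacing; only the ordering of outputs needs to agree.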

Designing the portfolio narrative

In a personal portfolio, “what I built” is weaker than “what engineering judgments I made.”

## Ralphthon Capstone: Agentic Code Review Harness

- Built a human-on-the-loop agent runtime for repeatable PR review tasks.
- Designed task packets, scoped tool permissions, event-sourced run logs, and replay snapshots.
- Integrated deterministic gates (pytest, schema validation) with an LLM-as-Judge quality rubric.
- Served local coding models through a vLLM OpenAI-compatible API and compared cost/latency against commercial APIs.
- Reported 3-run success rate, token cost, latency, and failure recovery behavior.
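
The headline numbers in the last bullet (3-run success rate, token cost, latency) could be produced by a small summary helper like the sketch below. The `Run` fields, the per-1k-token prices, and the run ids are hypothetical; the sketch covers token cost only and leaves GPU and operator cost to be added as further terms.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    run_id: str
    succeeded: bool
    prompt_tokens: int
    completion_tokens: int
    latency_s: float


def token_cost(run: Run, usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Token cost of one run in USD; prices are caller-supplied assumptions."""
    return (
        run.prompt_tokens / 1000 * usd_per_1k_prompt
        + run.completion_tokens / 1000 * usd_per_1k_completion
    )


def summarize(runs: list[Run], usd_per_1k_prompt: float, usd_per_1k_completion: float) -> dict:
    """Aggregate success rate, mean token cost, and mean latency over the runs."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        "runs": len(runs),
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_cost_usd": mean(
            token_cost(r, usd_per_1k_prompt, usd_per_1k_completion) for r in runs
        ),
        "mean_latency_s": mean(r.latency_s for r in runs),
        # Failed run ids point straight at the event logs worth replaying.
        "failed_run_ids": [r.run_id for r in runs if not r.succeeded],
    }


# Three hypothetical runs of the same task packet (illustration only).
runs = [
    Run("run-001", True, 1000, 500, 10.0),
    Run("run-002", True, 1000, 500, 12.0),
    Run("run-003", False, 1000, 500, 14.0),
]
print(summarize(runs, usd_per_1k_prompt=0.01, usd_per_1k_completion=0.02))
```

Keeping the failed run ids in the summary ties each reported number back to a specific run.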

STAR + Engineering Decisions

Beat	Recommended phrasing
Situation	“User Y had problem X with cost Z”
Task	“We solved scope A; out of scope was B”
Action	“Three ADRs + task packet + gate policy”
Result	“Numbers + run id + limitations”
Engineering decisions	“Why this model / harness / evaluation”
Next	“One or two items to cut or harden in the next version”

Continuing-learning roadmap

The first six months after graduation are the riskiest. Writing your milestones down in advance keeps you pointed in the right direction.

Next 6 months

Add one new model or task type to the capstone system. Read one SRE primer. Accumulate ten ADRs.

Next 1 year

Operate for five real users. Catalog 30 failure modes. Submit one Agent Ops conference talk.

Next 2 years

Move into a platform role on a team. Publish one MLOps + LLMOps + Agent Ops integration writeup.

Personal reflection prompts

What was the largest scope cut?
Frame the cut as an engineering decision, not a failure.

Where did the AI repeatedly fail?
Do not blame the model; describe what changed in instruction / gate / context.

What dangerous behavior would have happened without the harness?
Pick one of file writes, external APIs, cost, security, or wrong judgments and be specific.

What is the next version’s first improvement?
Prioritize observability, evaluation, and stability before new features.

Course wrap-up

This course can be reduced to one sentence.

The stronger the model, the more important the harness around it.

Your capstone may not be a finished product. But if you connected the task packet, tool boundary, event log, policy gate, judge, and telemetry yourself, you have already treated AI as engineering rather than as prompts.

Practicum

Deliver the final presentation
One 15-minute slot (12-min talk + 3-min Q&A) in STAR order. Live demo plus a 90-second fallback.

Write peer reviews
For each team, answer the five questions and fill in the 1-3-5 grid.

Answer the four reflection prompts
One paragraph each on scope cut, repeated failures, harness value, and next improvements.

Author a one-page portfolio narrative
Five sentences in STAR + Engineering Decisions form, with headline numbers.

File a continuing-learning roadmap
Save the 6-month / 1-year / 2-year cards in your personal notes.

Assignment

Peer review and personal reflection

Due: 2026-06-23 23:59

Peer review:

Five-question evaluation per team
1-3-5 scoring grid in five areas
One most-instructive design decision
One most-dangerous failure mode

Personal reflection:

Your role and actual contribution
Where AI failed and how the harness compensated
Three principles you will carry to the next project
A five-sentence portfolio summary
Three roadmap cards (6 months / 1 year / 2 years)

Key Takeaways

It is a system, not a model: AI is a system that bundles governance, tools, events, gates, and humans.
Cut every claim without evidence: every line on the slide should map to a run id, an event log entry, or a dashboard.
Peer review is review, not applause: name the most dangerous failure mode and the next gate to add.
Read the future across six axes: harness as product / Agent Ops / Eval-first / Memory / Marketplace / Compliance.
Portfolio = decision trace: what you rejected is a stronger signal than what you built.
Carry three roadmap cards: pre-written 6-month / 1-year / 2-year goals keep you on track.
The course in one sentence: the stronger the model, the more important the harness around it.