Week 16: Final Presentations and Course Wrap-Up

Phase 5 · Week 16 · Final Presentation · Lecture: 2026-06-16

Systems thinking

Treat AI not as a model but as a system of governance, tool boundaries, events, gates, and humans.

Design ability

Design task packets, runtime layers, gate policies, and escalation paths — and leave decision traces in ADRs.

Implementation ability

Wire Ralph loops, MCP, vLLM, OpenTelemetry, and LLM-as-Judge into a closed-loop MVP.

Operational ability

Operate telemetry, replay, release gates, and human override; report cost and failure rate quantitatively.


The final presentation is not “show off a slick demo.” It is the moment to prove you can treat an AI system as an engineering object. Evaluators want to see:

  1. Which repeated work did you solve?
  2. How far did the agent run autonomously?
  3. How did the harness contain failure?
  4. On what grounds did you judge result quality?
  5. What will you cut or strengthen in the next version?

| Time | Team | Notes |
| --- | --- | --- |
| 09:00-09:15 | Team 1 | 12-min talk + 3-min Q&A |
| 09:20-09:35 | Team 2 | 12-min talk + 3-min Q&A |
| 09:40-09:55 | Team 3 | 12-min talk + 3-min Q&A |
| 10:00-10:15 | Team 4 | 12-min talk + 3-min Q&A |
| 10:20-10:35 | Team 5 | 12-min talk + 3-min Q&A |
| 10:40-11:00 | Break + extended Q&A | |
| 11:00-11:30 | Peer review | |
| 11:30-12:00 | Course wrap-up | |

Each talk is graded out of 100.

| Item | Points | Core question |
| --- | --- | --- |
| Technical completeness | 25 | Does the closed loop actually run? |
| Harness design | 25 | Are gate, retry, replay, and policy implemented? |
| Problem framing | 15 | Is the chosen task appropriate and repeatable? |
| Observability and evaluation | 20 | Are numbers, event logs, and judge/test results connected? |
| Presentation and reflection | 15 | Are failures and scope cuts shared honestly? |
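
As a quick arithmetic check on the rubric, the sketch below totals one talk's score and rejects any item score above its maximum. The `RUBRIC` values are copied from the table; the function name `total_score` and the sample scores are hypothetical and only illustrate the calculation.

```python
# Maximum points per rubric item, copied from the grading table.
RUBRIC = {
    "Technical completeness": 25,
    "Harness design": 25,
    "Problem framing": 15,
    "Observability and evaluation": 20,
    "Presentation and reflection": 15,
}


def total_score(scores: dict[str, int]) -> int:
    """Sum item scores after checking each one against its rubric maximum."""
    if set(scores) != set(RUBRIC):
        raise ValueError("scores must cover exactly the rubric items")
    for item, points in scores.items():
        if not 0 <= points <= RUBRIC[item]:
            raise ValueError(f"{item}: {points} is outside 0..{RUBRIC[item]}")
    return sum(scores.values())


# Hypothetical scores for one team (illustration only).
example = {
    "Technical completeness": 20,
    "Harness design": 22,
    "Problem framing": 12,
    "Observability and evaluation": 15,
    "Presentation and reflection": 13,
}
print(total_score(example))  # 82
```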

Every claim on the slide must come with evidence.

| Claim | Required evidence |
| --- | --- |
| The system works | live demo or recording + run id |
| It is repeatable | three runs with the same task packet |
| It fails safely | the failure-path event log |
| Quality was evaluated | a deterministic-test + judge + human-review table |
| Cost is understood | a token / GPU / operator cost formula |
| Next step is known | a backlog grounded in failure reasons |

Cut any claim without evidence.
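
One way to back the "It is repeatable" and "Cost is understood" rows is a small script that turns run records into a success rate and a per-run cost. The sketch below assumes a simple additive formula (token cost + GPU cost + operator cost). The `RunRecord` fields, the rate parameters, and all numbers are hypothetical placeholders, not values from the course.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    run_id: str
    succeeded: bool
    tokens: int              # prompt + completion tokens for the run
    gpu_seconds: float       # GPU time consumed by local serving
    operator_minutes: float  # human time spent on approval or override


def run_cost(run: RunRecord, usd_per_1k_tokens: float,
             usd_per_gpu_hour: float, usd_per_operator_hour: float) -> float:
    """Assumed formula: token cost + GPU cost + operator cost, in USD."""
    token_cost = run.tokens / 1000 * usd_per_1k_tokens
    gpu_cost = run.gpu_seconds / 3600 * usd_per_gpu_hour
    operator_cost = run.operator_minutes / 60 * usd_per_operator_hour
    return token_cost + gpu_cost + operator_cost


def summarize(runs: list[RunRecord], **rates: float) -> dict[str, float]:
    """Report run count, success rate, and mean cost over the given runs."""
    costs = [run_cost(r, **rates) for r in runs]
    return {
        "runs": len(runs),
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_cost_usd": sum(costs) / len(costs),
    }


# Three hypothetical runs of the same task packet.
runs = [
    RunRecord("run-001", True, 12_000, 90.0, 1.0),
    RunRecord("run-002", True, 10_000, 80.0, 0.0),
    RunRecord("run-003", False, 15_000, 120.0, 2.0),
]
summary = summarize(runs, usd_per_1k_tokens=0.001,
                    usd_per_gpu_hour=1.2, usd_per_operator_hour=30.0)
print(summary)
```

Keeping operator time in the formula makes human override visible as a cost rather than hiding it behind token and GPU numbers.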

Peer review is engineering review, not a popularity vote. For each team, write five answers.

  1. The most convincing design decision.
  2. The most dangerous failure mode.
  3. A gate this system would need to be production-ready.
  4. The clearest number or piece of evidence in the talk.
  5. One-sentence improvement suggestion.

| Area | 1 point | 3 points | 5 points |
| --- | --- | --- | --- |
| Problem clarity | “wanted to try AI” | one user / one task | quantified repeated cost |
| System design | single prompt | role separation | runtime-layer mapping |
| Evidence | demo video only | one run | three + replay + judge |
| Failure sharing | hidden | partial | failure log + response |
| Next steps | vague | feature additions | gate / observability priorities |

The patterns you built in Phases 4-5 line up directly with the trends after 2026. Six axes will keep guiding you whatever you build next.

2026-2027: Harness as product

Codex, Claude Code, Gemini CLI, GitHub Agent HQ — the differentiator is not the model but execution environment, permissions, sandbox, MCP, and review workflow.

Agent Ops as a discipline

MLOps owned model deployment. Agent Ops will own tool boundary, memory, event log, policy gate, and human override.

Eval-first culture

LLM-as-Judge, human-in-the-loop, and offline eval sets become first-line criteria for model selection and release.

Memory & long context

Beyond 1M context, the new design variable is what to keep, summarize, and forget — and how.

Tool / Skill marketplace

MCP, skills, and subagents become OS-level interfaces, with security, billing, and evaluation following along.

Compliance & safety

EU AI Act, ISO 42001, NIST AI RMF — governance-as-code becomes part of the engineering standard.

Capabilities that travel beyond this course

| What you learned | Portfolio framing |
| --- | --- |
| HOTL governance | designed human approval and automated policy gates as separate concerns |
| MCP and tool boundary | constrained the agent’s tool surface to be auditable |
| Ralph Loop | implemented repeated execution with deterministic backpressure and rollback |
| Context engineering | designed instruction, task packet, memory, and compaction strategy |
| vLLM / MLOps | served local models with an OpenAI-compatible API plus telemetry |
| LLM-as-Judge | integrated probabilistic evaluation with deterministic gates |
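
The last row, combining probabilistic evaluation with deterministic gates, can be sketched as a single release decision. The version below is a minimal illustration under stated assumptions: deterministic checks are hard requirements, the judge score is on a 0 to 1 scale, and borderline scores go to a human. The names `Verdict` and `release_gate` and both thresholds are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    RELEASE = "release"
    ESCALATE = "escalate_to_human"
    REJECT = "reject"


def release_gate(tests_passed: bool, schema_valid: bool, judge_score: float,
                 pass_threshold: float = 0.8,
                 review_threshold: float = 0.5) -> Verdict:
    """Combine deterministic gates with a judge score (assumed range 0..1)."""
    # Deterministic gates come first: a high judge score cannot override them.
    if not (tests_passed and schema_valid):
        return Verdict.REJECT
    # The judge is probabilistic, so only clear passes are released automatically.
    if judge_score >= pass_threshold:
        return Verdict.RELEASE
    # Borderline scores are handed to a human instead of being guessed at.
    if judge_score >= review_threshold:
        return Verdict.ESCALATE
    return Verdict.REJECT


print(release_gate(True, True, 0.91))   # Verdict.RELEASE
print(release_gate(True, True, 0.62))   # Verdict.ESCALATE
print(release_gate(False, True, 0.99))  # Verdict.REJECT
```

Ordering the checks this way keeps the cheap, reproducible gates authoritative and limits the judge to ranking outputs that already passed them.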

Week 16 is when Weeks 1-15 become one connected story.

From Phase 1 Concepts to Capstone Artifacts

| Concepts | Phase | Capstone artifacts |
| --- | --- | --- |
| HOTL · MIG · MCP | Phase 1 | approval / resource boundary / tool boundary |
| Ralph Loop · Context · Instruction | Phase 2 | repeated run / state file / PROMPT.md |
| Multi-agent SDLC | Phase 3 | Lead / Planner / Worker / Reviewer |
| vLLM · Telemetry · Judge | Phase 4 | serving / OTel / event store / LLM-as-Judge |
| Ralphthon | Phase 5 | closed-loop MVP / release gate / final demo |
| Synthesis | Final presentation | AI as an engineering target |

| Early concept | How it surfaces in the capstone |
| --- | --- |
| HOTL | human approval, override, peer review |
| MIG / resource boundary | model serving budget, per-team GPU/queue separation |
| MCP / tool boundary | allowed tools, denied tools, audit log |
| Ralph Loop | repeated runs, deterministic backpressure |
| Context engineering | task packet, instruction, compact replay state |
| Multi-agent SDLC | Lead, Planner, Worker, Reviewer handoff |
| MLOps / telemetry | vLLM metrics, OTel spans, event store |

A team that explains these connections is presenting a synthesis of the whole course, not just a demo.
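
To make the "allowed tools, denied tools, audit log" row concrete, here is a minimal sketch of a tool boundary that checks an allowlist and records every attempt. It is plain Python and does not use the MCP SDK; the class `ToolBoundary`, the tool names, and the handler are hypothetical stand-ins for whatever your harness actually wires up.

```python
import time
from typing import Any, Callable


class ToolBoundary:
    """Allowlist check plus audit log around tool calls (illustrative only)."""

    def __init__(self, allowed: set[str]):
        self.allowed = set(allowed)
        self.audit_log: list[dict[str, Any]] = []

    def call(self, tool: str, handler: Callable[..., Any], **kwargs: Any) -> Any:
        permitted = tool in self.allowed
        # Log before executing so that denied attempts are also recorded.
        self.audit_log.append(
            {"ts": time.time(), "tool": tool, "args": kwargs, "permitted": permitted}
        )
        if not permitted:
            raise PermissionError(f"tool '{tool}' is not on the allowlist")
        return handler(**kwargs)


# Hypothetical read-only tool surface for a PR review agent.
boundary = ToolBoundary(allowed={"read_diff"})
print(boundary.call("read_diff", lambda path: f"diff of {path}", path="src/app.py"))

try:
    boundary.call("write_file", lambda path: None, path="src/app.py")
except PermissionError as err:
    print(err)

print([(e["tool"], e["permitted"]) for e in boundary.audit_log])
```

The denied call still lands in the audit log, which is the kind of failure-path evidence the talk is expected to show.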

The worked example below shows a structure that students can copy.

Case Study: “Repo Doctor” Team Final Demo

Situation: Department GitHub: 30+ PRs/week. Reviewer fatigue, reaction time 38h average.

Task: Read PR diffs, auto-generate three risk-ranked review comments. Human approves the merge.

Action: task packet · MCP read-only tools · pytest gate · LLM Judge · OTel · replay snapshot.

Result: 3-run avg reaction 38h → 11h. Judge correlation Spearman 0.78. Cost / PR $0.04 (local) vs $0.18 (API).

Reflection: One false-pass (1 override). Next: secret-scan policy gate, paired-reviewer mode.
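
The case study reports judge quality as a Spearman correlation. A figure like that could be computed roughly as follows: rank the judge scores and the human scores for the same items, then take the Pearson correlation of the ranks. The sketch is pure Python with average ranks for ties. The function names and the five sample scores are hypothetical and are not the team's data.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for the same five review comments.
judge_scores = [4, 2, 5, 3, 1]
human_scores = [5, 1, 4, 3, 2]
rho = spearman(judge_scores, human_scores)
print(round(rho, 2))  # 0.8
```

With only a handful of items the coefficient is noisy, so it is worth reporting the sample size next to the correlation.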

In a personal portfolio, “what I built” is weaker than “what engineering judgments I made.”

## Ralphthon Capstone: Agentic Code Review Harness
- Built a human-on-the-loop agent runtime for repeatable PR review tasks.
- Designed task packets, scoped tool permissions, event-sourced run logs, and replay snapshots.
- Integrated deterministic gates (pytest, schema validation) with an LLM-as-Judge quality rubric.
- Served local coding models through a vLLM OpenAI-compatible API and compared cost/latency against commercial APIs.
- Reported 3-run success rate, token cost, latency, and failure recovery behavior.

| Beat | Recommended phrasing |
| --- | --- |
| Situation | “User Y had problem X with cost Z” |
| Task | “We solved scope A; out of scope was B” |
| Action | “Three ADRs + task packet + gate policy” |
| Result | “Numbers + run id + limitations” |
| Engineering decisions | “Why this model / harness / evaluation” |
| Next | “One or two items to cut or harden in the next version” |

The first six months after graduation are the riskiest. Writing your milestones down in advance keeps you on course.

Next 6 months

Add one new model or task type to the capstone system. Read one SRE primer. Accumulate ten ADRs.

Next 1 year

Operate for five real users. Catalog 30 failure modes. Submit one Agent Ops conference talk.

Next 2 years

Move into a platform role on a team. Publish one MLOps + LLMOps + Agent Ops integration writeup.

  1. What was the largest scope cut?

    Frame the cut as an engineering decision, not a failure.

  2. Where did the AI repeatedly fail?

    Do not blame the model — describe what changed in instruction / gate / context.

  3. What dangerous behavior would have happened without the harness?

    Pick one of file writes, external APIs, cost, security, or wrong judgments and be specific.

  4. What is the next version’s first improvement?

    Prioritize observability, evaluation, and stability before new features.

This course can be reduced to one sentence.

The stronger the model, the more important the harness around it.

Your capstone may not be a finished product. But if you connected task packet, tool boundary, event log, policy gate, judge, and telemetry yourself, you already treated AI as engineering rather than as prompts.

  1. Deliver the final presentation

    A 15-minute slot (12-minute talk + 3-minute Q&A) in STAR order. Live demo + 90-second fallback.

  2. Write peer reviews

    For each team, fill the five questions and the 1-3-5 grid.

  3. Answer the four reflection prompts

    One paragraph each on scope cut, repeated failures, harness value, and next improvements.

  4. Author a one-page portfolio narrative

    Five sentences in STAR + Engineering Decisions form, with headline numbers.

  5. File a continuing-learning roadmap

    Save the 6-month / 1-year / 2-year cards in your personal notes.

Due: 2026-06-23 23:59

Peer review:

  1. Five-question evaluation per team
  2. 1-3-5 scoring grid in five areas
  3. One most-instructive design decision
  4. One most-dangerous failure mode

Personal reflection:

  1. Your role and actual contribution
  2. Where AI failed and how the harness compensated
  3. Three principles you will carry to the next project
  4. A five-sentence portfolio summary
  5. Three roadmap cards (6 months / 1 year / 2 years)

  1. It is a system, not a model: AI is a system that bundles governance, tools, events, gates, and humans.
  2. Cut every claim without evidence: every line on the slide should map to a run id, an event log entry, or a dashboard.
  3. Peer review is review, not applause: name the most dangerous failure mode and the next gate to add.
  4. Read the future across six axes: harness as product / Agent Ops / Eval-first / Memory / Marketplace / Compliance.
  5. Portfolio = decision trace: what you rejected is a stronger signal than what you built.
  6. Carry three roadmap cards: pre-written 6-month / 1-year / 2-year goals keep you on track.
  7. The course in one sentence: the stronger the model, the more important the harness around it.

Career / learning roadmap

  • Will Larson, Staff Engineer
  • Tanya Reilly, The Staff Engineer’s Path
  • Charity Majors, “The Engineer / Manager Pendulum”

Outlook / policy

  • Anthropic, “Building Anthropic” (recurring updates)
  • EU AI Act full text / digest
  • NIST AI RMF 1.0
  • Stanford HAI AI Index Report

Community

  • LangChain · LlamaIndex · MCP community channels
  • KAIST AI Systems Seminar, Korean Software Engineering Society

Questions or feedback: yj.lee@chu.ac.kr or GitHub Issue