Systems thinking
Treat AI not as a model but as a system of governance, tool boundary, events, gates, and humans.
Design ability
Design task packets, runtime layers, gate policies, and escalation paths — and leave decision traces in ADRs.
Implementation ability
Wire Ralph loops, MCP, vLLM, OpenTelemetry, and LLM-as-Judge into a closed-loop MVP.
Operational ability
Operate telemetry, replay, release gates, and human override; report cost and failure rate quantitatively.
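The final presentation is not “show off a slick demo.” It is the moment to prove you can treat an AI system as an engineering object. Evaluators want to see the four abilities above: systems thinking, design, implementation, and operations. The talks run on this schedule: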
| Time | Team | Notes |
|---|---|---|
| 09:00-09:15 | Team 1 | 12-min talk + 3-min Q&A |
| 09:20-09:35 | Team 2 | 12-min talk + 3-min Q&A |
| 09:40-09:55 | Team 3 | 12-min talk + 3-min Q&A |
| 10:00-10:15 | Team 4 | 12-min talk + 3-min Q&A |
| 10:20-10:35 | Team 5 | 12-min talk + 3-min Q&A |
| 10:40-11:00 | Break + extended Q&A | |
| 11:00-11:30 | Peer review | |
| 11:30-12:00 | Course wrap-up | |
Each talk is graded out of 100.
| Item | Points | Core question |
|---|---|---|
| Technical completeness | 25 | Does the closed loop actually run? |
| Harness design | 25 | Are gate, retry, replay, and policy implemented? |
| Problem framing | 15 | Is the chosen task appropriate and repeatable? |
| Observability and evaluation | 20 | Are numbers, event logs, and judge/test results connected? |
| Presentation and reflection | 15 | Are failures and scope cuts shared honestly? |
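As a quick sanity check on the rubric, the sketch below encodes the five weights from the table and totals a score. The weights come from the table; the idea of awarding each item as a fraction between 0 and 1, the function name `total_score`, and the sample fractions are assumptions made only for illustration, not the course's official grading procedure.

```python
# Weights copied from the grading table above.
RUBRIC = {
    "Technical completeness": 25,
    "Harness design": 25,
    "Problem framing": 15,
    "Observability and evaluation": 20,
    "Presentation and reflection": 15,
}


def total_score(fractions: dict[str, float]) -> float:
    """Sum the points earned, given the fraction (0.0 to 1.0) earned per item.

    Assumption: partial credit is expressed as a fraction of each item's weight.
    """
    if set(fractions) != set(RUBRIC):
        raise ValueError("scores must cover exactly the rubric items")
    for item, fraction in fractions.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{item}: fraction must be between 0 and 1")
    return sum(RUBRIC[item] * fraction for item, fraction in fractions.items())


# The weights must add up to the stated maximum of 100.
assert sum(RUBRIC.values()) == 100

# Made-up sample fractions, for illustration only.
sample = {
    "Technical completeness": 0.8,
    "Harness design": 0.6,
    "Problem framing": 1.0,
    "Observability and evaluation": 0.5,
    "Presentation and reflection": 0.8,
}
print(total_score(sample))
```

Half of the points (50 of 100) sit in technical completeness and harness design, so a team that cannot show the closed loop and its gates running loses more there than polish can recover elsewhere.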
Every claim on your slides must come with evidence.
| Claim | Required evidence |
|---|---|
| The system works | live demo or recording + run id |
| It is repeatable | three runs with the same task packet |
| It fails safely | the failure-path event log |
| Quality was evaluated | a deterministic-test + judge + human-review table |
| Cost is understood | a token / GPU / operator cost formula |
| Next step is known | a backlog grounded in failure reasons |
Cut any claim without evidence.
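Two rows in the table ask for numbers: a three-run repeatability result and a token / GPU / operator cost formula. The sketch below shows one possible way to compute both. The `Run` fields, the function names, and all rate parameters are hypothetical; substitute whatever your own event log and billing data actually record.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One execution of the same task packet (field names are hypothetical)."""
    run_id: str
    passed: bool
    tokens: int              # prompt + completion tokens
    gpu_seconds: float       # GPU time spent serving the model
    operator_minutes: float  # human time spent on approval or override


def success_rate(runs: list[Run]) -> float:
    """Fraction of runs that passed."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(run.passed for run in runs) / len(runs)


def run_cost(run: Run, usd_per_1k_tokens: float,
             usd_per_gpu_hour: float, usd_per_operator_hour: float) -> float:
    """Assumed formula: cost = token cost + GPU cost + operator cost."""
    token_cost = run.tokens / 1000 * usd_per_1k_tokens
    gpu_cost = run.gpu_seconds / 3600 * usd_per_gpu_hour
    operator_cost = run.operator_minutes / 60 * usd_per_operator_hour
    return token_cost + gpu_cost + operator_cost
```

Reporting the formula alongside the run ids lets a reviewer recompute the figure from the event log instead of taking the slide on trust.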
Peer review is an engineering review, not a popularity vote. For each team, write five answers.
| Area | 1 point | 3 points | 5 points |
|---|---|---|---|
| Problem clarity | “wanted to try AI” | one user / one task | quantified repeated cost |
| System design | single prompt | role separation | runtime-layer mapping |
| Evidence | demo video only | one run | three + replay + judge |
| Failure sharing | hidden | partial | failure log + response |
| Next steps | vague | feature additions | gate / observability priorities |
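If you collect the grid digitally, a small validator keeps entries on the 1-3-5 scale. This is a minimal sketch: the area names come from the table, while the function name `grid_total` and the choice to sum the five areas into a total (5 to 25) are assumptions, since the course material does not say the grid is totalled.

```python
# Area names copied from the peer-review grid above.
AREAS = (
    "Problem clarity",
    "System design",
    "Evidence",
    "Failure sharing",
    "Next steps",
)
ALLOWED_SCORES = {1, 3, 5}


def grid_total(scores: dict[str, int]) -> int:
    """Validate one team's grid and return the sum across the five areas.

    Assumption: summing the areas is only one possible way to aggregate.
    """
    if set(scores) != set(AREAS):
        raise ValueError("scores must cover exactly the five areas")
    for area in AREAS:
        if scores[area] not in ALLOWED_SCORES:
            raise ValueError(f"{area}: score must be 1, 3, or 5")
    return sum(scores[area] for area in AREAS)
```

Rejecting 2s and 4s forces the reviewer to decide which column's description the team actually matches.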
The patterns you built in Phases 4-5 line up directly with the trends after 2026. Six axes will keep guiding you whatever you build next.
2026-2027: Harness as product
Codex, Claude Code, Gemini CLI, GitHub Agent HQ — the differentiator is not the model but execution environment, permissions, sandbox, MCP, and review workflow.
Agent Ops as a discipline
MLOps owned model deployment. Agent Ops will own tool boundary, memory, event log, policy gate, and human override.
Eval-first culture
LLM-as-Judge, human-in-the-loop, and offline eval sets become first-line criteria for model selection and release.
Memory & long context
Beyond 1M context, the new design variable is what to keep, summarize, and forget — and how.
Tool / Skill marketplace
MCP, skills, and subagents become OS-level interfaces, with security, billing, and evaluation following along.
Compliance & safety
EU AI Act, ISO 42001, NIST AI RMF — governance-as-code becomes part of the engineering standard.
| What you learned | Portfolio framing |
|---|---|
| HOTL governance | designed human approval and automated policy gates as separate concerns |
| MCP and tool boundary | constrained the agent’s tool surface to be auditable |
| Ralph Loop | implemented repeated execution with deterministic backpressure and rollback |
| Context engineering | designed instruction, task packet, memory, and compaction strategy |
| vLLM / MLOps | served local models with an OpenAI-compatible API plus telemetry |
| LLM-as-Judge | integrated probabilistic evaluation with deterministic gates |
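To make the "Ralph Loop" and "deterministic gates" rows concrete, here is a minimal sketch of a bounded retry loop in which a deterministic gate decides whether a candidate is accepted and feeds its failure message back into the next attempt. It is one reading of "repeated execution with deterministic backpressure", not the course's reference implementation. `run_loop`, `worker`, and `gate` are hypothetical names; in a real harness the worker would call a model and the gate would run something like pytest or schema validation.

```python
from typing import Callable, Optional


def run_loop(
    task: str,
    worker: Callable[[str, str], str],         # (task, feedback) -> candidate
    gate: Callable[[str], tuple[bool, str]],   # candidate -> (passed, feedback)
    max_attempts: int = 3,
) -> tuple[Optional[str], list[dict]]:
    """Retry the worker until the gate passes or attempts run out.

    Returns the accepted candidate (or None) plus an event log of every attempt.
    """
    events: list[dict] = []
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = worker(task, feedback)
        passed, feedback = gate(candidate)
        events.append({"attempt": attempt, "passed": passed, "feedback": feedback})
        if passed:
            return candidate, events
        # A failed candidate is simply discarded here; a real harness would
        # also roll back any side effects before the next attempt.
    return None, events
```

Returning the event log alongside the result means the failure path is recorded even when every attempt is rejected, which is the evidence the presentation asks for.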
Week 16 is when Weeks 1-15 become one connected story.
| Early concept | How it surfaces in the capstone |
|---|---|
| HOTL | human approval, override, peer review |
| MIG / resource boundary | model serving budget, per-team GPU/queue separation |
| MCP / tool boundary | allowed tools, denied tools, audit log |
| Ralph Loop | repeated runs, deterministic backpressure |
| Context engineering | task packet, instruction, compact replay state |
| Multi-agent SDLC | Lead, Planner, Worker, Reviewer handoff |
| MLOps / telemetry | vLLM metrics, OTel spans, event store |
A team that explains these connections is presenting a synthesis of the whole course, not just a demo.
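One of these connections, the tool boundary row (allowed tools, denied tools, audit log), can be sketched in a few lines. This is an illustration under stated assumptions, not an MCP API: the tool names, the two sets, and the `authorize` function are all hypothetical, and the deny-by-default behavior for unlisted tools is a design choice made for this example.

```python
# Hypothetical tool names, for illustration only.
ALLOWED_TOOLS = {"read_file", "run_tests"}
DENIED_TOOLS = {"shell", "http_post"}


def authorize(tool: str, audit_log: list[dict]) -> bool:
    """Decide whether a tool call may proceed and record the decision."""
    if tool in DENIED_TOOLS:
        decision = "denied"
    elif tool in ALLOWED_TOOLS:
        decision = "allowed"
    else:
        # Assumption: anything not explicitly listed is rejected.
        decision = "denied_by_default"
    audit_log.append({"tool": tool, "decision": decision})
    return decision == "allowed"


audit_log: list[dict] = []
for requested in ("read_file", "shell", "delete_repo"):
    authorize(requested, audit_log)
print(audit_log)
```

Every request is logged, including the rejected ones, so the audit log doubles as evidence of what the agent tried to do outside its boundary.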
In a personal portfolio, “what I built” is weaker than “what engineering judgments I made.”
The worked example below shows a structure you can copy.
## Ralphthon Capstone: Agentic Code Review Harness
- Built a human-on-the-loop agent runtime for repeatable PR review tasks.
- Designed task packets, scoped tool permissions, event-sourced run logs, and replay snapshots.
- Integrated deterministic gates (pytest, schema validation) with an LLM-as-Judge quality rubric.
- Served local coding models through a vLLM OpenAI-compatible API and compared cost/latency against commercial APIs.
- Reported 3-run success rate, token cost, latency, and failure recovery behavior.

| Beat | Recommended phrasing |
|---|---|
| Situation | “User Y had problem X with cost Z” |
| Task | “We solved scope A; out of scope was B” |
| Action | “Three ADRs + task packet + gate policy” |
| Result | “Numbers + run id + limitations” |
| Engineering decisions | “Why this model / harness / evaluation” |
| Next | “One or two items to cut or harden in the next version” |
The first six months after graduation are the riskiest. Writing your milestones down in advance helps you keep direction.
Next 6 months
Add one new model or task type to the capstone system. Read one SRE primer. Accumulate ten ADRs.
Next 1 year
Operate for five real users. Catalog 30 failure modes. Submit one Agent Ops conference talk.
Next 2 years
Move into a platform role on a team. Publish one MLOps + LLMOps + Agent Ops integration writeup.
What was the largest scope cut?
Frame the cut as an engineering decision, not a failure.
Where did the AI repeatedly fail?
Do not blame the model — describe what changed in instruction / gate / context.
What dangerous behavior would have happened without the harness?
Pick one of file writes, external APIs, cost, security, or wrong judgments and be specific.
What is the next version’s first improvement?
Prioritize observability, evaluation, and stability before new features.
This course can be reduced to one sentence.
The stronger the model, the more important the harness around it.
Your capstone may not be a finished product. But if you connected task packet, tool boundary, event log, policy gate, judge, and telemetry yourself, you already treated AI as engineering rather than as prompts.
Deliver the final presentation
A 15-minute slot (12-minute talk + 3-minute Q&A) in STAR order. Live demo + 90-second fallback.
Write peer reviews
For each team, answer the five questions and fill in the 1-3-5 grid.
Answer the four reflection prompts
One paragraph each on scope cut, repeated failures, harness value, and next improvements.
Author a one-page portfolio narrative
Five sentences in STAR + Engineering Decisions form, with headline numbers.
File a continuing-learning roadmap
Save the 6-month / 1-year / 2-year cards in your personal notes.
Due: 2026-06-23 23:59
Peer review:
Personal reflection:
Questions or feedback: yj.lee@chu.ac.kr or GitHub Issue