Concepts
Explain why telemetry (trace/metrics/logs) and append-only event stores are separate axes, and articulate which question each one answers in HOTL operations.
Design
Design span structures and event schemas, defining the seven attributes that connect run, task, agent, model, tool, and gate.
Implementation
Emit spans through the OpenTelemetry SDK, write .events.jsonl, and wire an LLM-as-Judge that returns strict JSON and combines with a deterministic gate.
Operations
Run a judge calibration pipeline using Spearman/Pearson correlation, monitor the trust band, and decide how to harden the gate policy when judge bias is suspected.
In a human-on-the-loop system, the human does not approve every tool call. The system runs autonomously inside boundaries, while telemetry and gates expose abnormal behavior.
| Observable | Question | Example |
|---|---|---|
| Trace | Where did time go? | agent_loop -> model_call -> tool_call -> test |
| Metrics | Is this within normal range? | success_rate, token_usage, ttft_ms |
| Logs | What happened? | permission denied, tool timeout |
| Event store | Can we reconstruct the run? | .events.jsonl, replay snapshot |
| Evaluation | Is the result usable? | tests, lint, LLM-as-Judge, human review |
Stamping the same run_id on both OTel span attributes and event-log lines lets a dashboard click through to the audit log in one step.
Logging an entire run as one string makes analysis impossible. Use at minimum one span per stage: agent.run, model.call, tool.invoke, acceptance.test, and judge.evaluate.
Each span should share run_id, task_id, agent_role, model, repository, and commit_sha. This is what lets Grafana answer “which model failed in which role.”
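The idea of child spans inheriting run-level attributes can be sketched without a tracing backend. The snippet below is a minimal stand-in, not the OpenTelemetry API: `SpanRecord`, `SHARED_KEYS`, `build_example_run`, and `flatten` are hypothetical names, and the `repository` and `commit_sha` values are placeholders chosen for illustration.

```python
from dataclasses import dataclass, field

# Keys every span in a run is expected to share (from the sentence above).
SHARED_KEYS = ("run_id", "task_id", "agent_role", "model", "repository", "commit_sha")


@dataclass
class SpanRecord:
    """Hypothetical in-memory stand-in for a tracing span."""
    name: str
    attributes: dict
    children: list = field(default_factory=list)

    def child(self, name: str, **extra) -> "SpanRecord":
        # Copy the shared keys from the parent, then add span-specific ones.
        inherited = {key: self.attributes[key] for key in SHARED_KEYS}
        span = SpanRecord(name, {**inherited, **extra})
        self.children.append(span)
        return span


def build_example_run() -> SpanRecord:
    """Build one agent.run span with four child spans."""
    root = SpanRecord("agent.run", {
        "run_id": "run-20260519-001",
        "task_id": "capstone-017",
        "agent_role": "worker",
        "model": "local-coder",
        "repository": "example/repo",  # placeholder value
        "commit_sha": "0000000",       # placeholder value
    })
    root.child("model.call")
    root.child("tool.invoke", tool_name="run_tests")
    root.child("acceptance.test")
    root.child("judge.evaluate")
    return root


def flatten(span: SpanRecord) -> list:
    """Return the span and all of its descendants in depth-first order."""
    spans = [span]
    for child in span.children:
        spans.extend(flatten(child))
    return spans
```

Because every child copies the shared keys, filtering any span by `run_id` or `agent_role` gives the same answer as filtering the root span.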
Telemetry is not “more logs.” It is a stable set of keys that make analysis possible.
| Attribute | Example | Why it matters |
|---|---|---|
| run.id | run-20260519-001 | connects event log, dashboard, and report |
| task.id | capstone-017 | compares repeated runs of the same task |
| agent.role | worker | separates planner, worker, and reviewer failures |
| model.name | local-coder | compares model cost and quality |
| tool.name | run_tests | identifies slow or risky tools |
| gate.result | pass, revise, fail | supports release readiness decisions |
| artifact.path | artifacts/run-001.patch | traces final outputs |
Without these keys, Week 15 reports will have numbers but no connected evidence.
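One way to keep the keys stable is to check them before a span or event is emitted. The helper below is a sketch under that assumption; `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the key list is copied from the attribute table.

```python
# The seven attribute keys from the table above.
REQUIRED_KEYS = (
    "run.id", "task.id", "agent.role", "model.name",
    "tool.name", "gate.result", "artifact.path",
)


def missing_keys(attributes: dict, required: tuple = REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty, in table order."""
    return [key for key in required if attributes.get(key) in (None, "")]
```

Passing a shorter `required` tuple lets spans that legitimately lack a key (for example, a span with no tool call) be checked against a smaller set.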
```python
# telemetry.py: OpenTelemetry-based agent monitoring
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

# Assumed setup: register the SDK providers so spans and metrics are recorded.
# Exporters (for example, to a collector) must be configured separately.
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer("agent-runtime")
meter = metrics.get_meter("agent-runtime")

loop_counter = meter.create_counter("agent.loop.count")
token_usage = meter.create_histogram("agent.tokens.used")
judge_score = meter.create_histogram("agent.judge.score")
ttft_ms = meter.create_histogram("agent.model.ttft_ms")


def traced_agent_loop(task: dict):
    with tracer.start_as_current_span("agent.run") as span:
        # Stable keys first, so every downstream query can join on them.
        span.set_attribute("run.id", task["run_id"])
        span.set_attribute("task.id", task["id"])
        span.set_attribute("task.objective", task["objective"])
        span.set_attribute("agent.role", task.get("role", "worker"))
        span.set_attribute("model.name", task.get("model", "local-coder"))

        # run_agent is the agent entry point; it is defined elsewhere.
        result = run_agent(task)

        # Metrics for dashboards.
        loop_counter.add(1, {"status": "success" if result.passed else "failure"})
        token_usage.record(result.tokens_used, {"model": result.model})
        judge_score.record(result.judge_overall, {"task": task["id"]})
        ttft_ms.record(result.ttft_ms, {"model": result.model})

        # Outcome attributes on the span itself.
        span.set_attribute("result.passed", result.passed)
        span.set_attribute("tokens.used", result.tokens_used)
        span.set_attribute("tests.failed", result.tests_failed)
        span.set_attribute("gate.result", result.gate_result)
        return result
```

OpenTelemetry is excellent for live monitoring, but it is not the full source of truth for reconstructing an agent run. Append-only event logs in the style of Agent OS Runtime preserve the exact sequence of events.
```json
{"type":"run.started","run_id":"r-001","task_id":"t-017","model":"local-coder","ts":"2026-05-19T09:00:00Z"}
{"type":"tool.invoke","run_id":"r-001","tool":"read_file","input":{"path":"src/app.py"}}
{"type":"tool.result","run_id":"r-001","tool":"read_file","status":"ok","bytes":1842}
{"type":"test.result","run_id":"r-001","command":"pytest","passed":false,"failed":2}
{"type":"judge.result","run_id":"r-001","overall":7.2,"verdict":"revise"}
{"type":"run.closed","run_id":"r-001","status":"failed","reason":"tests_failed"}
```

The eight event types you handle in the capstone are:
| Event type | When it fires | Required fields | Replay meaning |
|---|---|---|---|
| run.started | task arrives at the worker | run_id, task_id, model, ts | start of a new run |
| plan.created | planner produces spec/plan | run_id, plan_path | locks input for downstream steps |
| tool.invoke | just before tool call | run_id, tool, input | records intent before side effects |
| tool.result | after tool returns | run_id, tool, status | one line gives the diagnosis on failure |
| test.result | deterministic gate output | run_id, command, passed | input to the release gate |
| judge.result | LLM-as-Judge output | run_id, scores, verdict | input to the probabilistic gate |
| human.override | human flips the verdict | run_id, from, to, approver, reason | core of the audit trail |
| run.closed | run terminates | run_id, status, reason | replay terminator |
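The required fields in the table can be enforced at write time. The following is a minimal sketch of an append-only writer and reader, assuming a single writer process and a flush after every line; `REQUIRED_FIELDS`, `append_event`, `read_events`, and `demo` are hypothetical names, and the field lists are copied from the table above.

```python
import json
import tempfile
from pathlib import Path

# Required fields per event type, as listed in the table above.
REQUIRED_FIELDS = {
    "run.started": ("run_id", "task_id", "model", "ts"),
    "plan.created": ("run_id", "plan_path"),
    "tool.invoke": ("run_id", "tool", "input"),
    "tool.result": ("run_id", "tool", "status"),
    "test.result": ("run_id", "command", "passed"),
    "judge.result": ("run_id", "scores", "verdict"),
    "human.override": ("run_id", "from", "to", "approver", "reason"),
    "run.closed": ("run_id", "status", "reason"),
}


def append_event(path: Path, event: dict) -> None:
    """Validate an event, then append it as one JSON line."""
    event_type = event.get("type")
    if event_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    missing = [name for name in REQUIRED_FIELDS[event_type] if name not in event]
    if missing:
        raise ValueError(f"{event_type} is missing fields: {missing}")
    # Append mode never rewrites earlier lines; flush after each event.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")
        handle.flush()


def read_events(path: Path) -> list:
    """Read the log back in write order, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def demo() -> list:
    """Write two events to a temporary log and return the types read back."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / ".events.jsonl"
        append_event(log, {"type": "run.started", "run_id": "r-001",
                           "task_id": "t-017", "model": "local-coder",
                           "ts": "2026-05-19T09:00:00Z"})
        append_event(log, {"type": "run.closed", "run_id": "r-001",
                           "status": "failed", "reason": "tests_failed"})
        return [event["type"] for event in read_events(log)]
```

Validation runs before the file is opened, so a rejected event leaves the log untouched. Multiple concurrent writers would need a file lock on top of this sketch.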
The replay function reads these events and recomputes the final state.
```python
def replay(events: list[dict]) -> dict:
    state = {
        "closed": False,
        "tools": [],
        "tests": [],
        "judge": None,
        "overrides": [],
    }
    for event in events:
        match event["type"]:
            case "tool.result":
                state["tools"].append(event)
            case "test.result":
                state["tests"].append(event)
            case "judge.result":
                state["judge"] = event
            case "human.override":
                state["overrides"].append(event)
            case "run.closed":
                state["closed"] = True
                state["status"] = event["status"]
                state["reason"] = event.get("reason")
    return state
```

| Scenario | Without an event store | With an event store |
|---|---|---|
| “Why did it fail?” | grep logs and guess | scroll the run_id; the sequence is right there |
| “Can I rerun this?” | environment drift breaks it | replay() reconstructs deterministically |
| “Did this model improve?” | no comparable baseline | rerun the same task_id set with another model |
| “How do I prove this to an auditor?” | only narrative reports | event sequence plus override records |
LLM-as-Judge evaluates qualities that deterministic tests miss, such as readability and design fit. It should not replace tests, policy checks, or human ownership.
| Gate | Type | Example |
|---|---|---|
| Static | deterministic | ruff, mypy, eslint, schema validation |
| Runtime | deterministic | pytest, integration test, smoke test |
| Policy | deterministic/approval | secret scan, permission boundary |
| Judge | probabilistic | readability, maintainability, design fit |
| Human | final authority | capstone acceptance, production release |
Five biases are repeatedly observed in research; recognizing them lets you design safer prompts and gates.
| Bias | Description | Signal | Mitigation |
|---|---|---|---|
| Length bias | longer answers score higher | short correct answers underrated on the gold set | rubric demands brevity, normalize by length |
| Position bias | A/B order affects the score | scores flip when order is swapped | evaluate both orders and average |
| Self-preference | judge prefers its own model family | judge rates its own outputs higher than peers’ | cross-judge with another family |
| Style bias | familiar formats (numbered lists, headings) win | identical content in plain text scores lower | rubric scores content, not format |
| Refusal asymmetry | safe refusals are underrated | a wrong answer scores higher than a safe refusal | safety becomes a separate deterministic gate |
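The position-bias mitigation in the table (evaluate both orders and average) can be sketched as follows. The pairwise judge signature is an assumption, and `order_averaged` and `toy_biased_judge` are hypothetical names; the toy judge exists only to make the bias visible.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: takes (first, second) answers and
# returns (score_for_first, score_for_second).
PairJudge = Callable[[str, str], Tuple[float, float]]


def order_averaged(judge: PairJudge, a: str, b: str) -> dict:
    """Score a and b in both presentation orders and average per answer."""
    a_first, b_second = judge(a, b)  # a shown first
    b_first, a_second = judge(b, a)  # b shown first
    return {
        "a": (a_first + a_second) / 2,
        "b": (b_first + b_second) / 2,
        # True when the winner changes with presentation order.
        "order_sensitive": (a_first > b_second) != (a_second > b_first),
    }


def toy_biased_judge(first: str, second: str) -> Tuple[float, float]:
    """Toy judge: every answer is worth 5.0, the first position gets +2.0."""
    return (7.0, 5.0)
```

With the toy judge, a single-order comparison always favors whichever answer is shown first, while the averaged scores come out equal and `order_sensitive` flags the pair for review.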
A judge prompt is not a one-time document. Teams need a small gold set to understand where the judge fails.
| Step | Action | Failure signal |
|---|---|---|
| Build gold set | prepare 10 samples already scored by humans | examples are all too good or too bad |
| Blind scoring | judge scores without seeing human scores | all scores cluster around 7-9 |
| Correlation | compare human and judge scores | low correlation or one criterion overrated |
| Error analysis | classify false pass and false fail cases | prompt changes do not address the errors |
| Rubric update | clarify criteria and examples | scoring remains vague |
The capstone goal is not a perfect judge. The goal is to know where the judge fails and make the gate policy absorb that limitation.
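The error-analysis step can be made concrete by bucketing gold-set samples into false passes and false fails. This is a sketch that assumes human and judge scores share a 1 to 10 scale and that one pass threshold (7.0 here) applies to both; `classify_errors` is a hypothetical name.

```python
def classify_errors(human: list, judge: list, threshold: float = 7.0) -> dict:
    """Group sample indices by whether the judge agrees with the human label."""
    if len(human) != len(judge):
        raise ValueError("human and judge score lists must have the same length")
    buckets = {"false_pass": [], "false_fail": [], "agree": []}
    for index, (h, j) in enumerate(zip(human, judge)):
        human_pass = h >= threshold
        judge_pass = j >= threshold
        if judge_pass and not human_pass:
            buckets["false_pass"].append(index)   # judge too lenient
        elif human_pass and not judge_pass:
            buckets["false_fail"].append(index)   # judge too strict
        else:
            buckets["agree"].append(index)
    return buckets
```

False passes are the ones a gate policy has to absorb first, because they let weak changes through; false fails mostly cost review time.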
```python
# correlate.py: human vs judge correlation
from scipy.stats import spearmanr, pearsonr


def correlate(human: list[float], judge: list[float]) -> dict:
    rho, rho_p = spearmanr(human, judge)
    r, r_p = pearsonr(human, judge)
    return {
        "spearman_rho": rho,
        "spearman_p": rho_p,
        "pearson_r": r,
        "pearson_p": r_p,
        "n": len(human),
    }


# Example: a 12-sample gold set
human = [9, 8, 7, 6, 5, 9, 8, 4, 3, 7, 6, 8]
judge = [8.5, 8.0, 7.2, 6.8, 6.0, 8.7, 7.6, 5.5, 4.2, 7.0, 6.5, 7.9]
print(correlate(human, judge))

# Recommended thresholds:
#   spearman_rho >= 0.7 with p < 0.05 -> calibrated
#   0.4 <= spearman_rho < 0.7         -> use judge as advisory only
#   spearman_rho < 0.4                -> rewrite rubric or prompt
```

```python
JUDGE_SYSTEM_PROMPT = """You are a senior software engineering evaluator.
Score the submitted change from 1 to 10 for each criterion.
Return strict JSON only.

Criteria:
1. correctness
2. test_quality
3. maintainability
4. robustness
5. observability
"""


def gate(judge: dict, tests_passed: bool) -> str:
    if not tests_passed:
        return "fail"
    if judge["overall"] < 7.0:
        return "revise"
    if judge["scores"].get("correctness", 0) < 7:
        return "revise"
    return "pass"
```

Humans must be able to override tests or judge verdicts, but the override should become an auditable event rather than a silent overwrite.
```json
{"type":"human.override","run_id":"r-001","from":"revise","to":"pass","reason":"known flaky integration test; deterministic unit tests passed","approver":"instructor"}
```

Overrides preserve human authority and create data for improving next week’s gates.
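A sketch of how override events might be folded into a final verdict without losing the automated one. It assumes the automated verdict is passed in separately and that an override only applies when its `from` field matches the current verdict; `apply_overrides` is a hypothetical name.

```python
def apply_overrides(automated_verdict: str, events: list) -> dict:
    """Apply human.override events in log order on top of the automated verdict."""
    verdict = automated_verdict
    applied = []
    for event in events:
        if event.get("type") != "human.override":
            continue
        if event["from"] != verdict:
            # A mismatch means the override was written against a different state.
            raise ValueError(
                f"override expects {event['from']!r} but verdict is {verdict!r}"
            )
        verdict = event["to"]
        applied.append(event)
    return {
        "automated_verdict": automated_verdict,  # kept for the audit trail
        "final_verdict": verdict,
        "overridden": bool(applied),
        "approvers": [event["approver"] for event in applied],
    }
```

Keeping both verdicts in the result means a report can show how often humans disagreed with the gate, which is the data the next rubric revision needs.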
Each team needs at least 10 samples.
| Sample type | Why it matters |
|---|---|
| Clearly good code | checks false negatives |
| Clearly bad code | checks false positives |
| Tests pass but design is poor | shows judge value |
| Looks good but behavior is wrong | prevents judge overtrust |
| Security or permission issue | shows policy gate necessity |
Execution Health
Watch run_count, success_rate, retry_count, and failure_reason. Filter by team run_id during class.
Cost and Latency
Compare prompt_tokens, completion_tokens, cache_read_tokens, ttft_ms, and total_latency_ms by model.
Quality Gates
Show tests_passed, judge_overall, human_override, and final_verdict on a single screen.
Assume the OpenTelemetry collector exports metrics to Prometheus.
```promql
# Panel 1: model-by-model success rate (15-minute window)
# Depending on exporter settings, the counter may be exposed as agent_loop_count_total.
sum by (model) (rate(agent_loop_count{status="success"}[15m]))
/
sum by (model) (rate(agent_loop_count[15m]))

# Panel 2: token usage p95 by role
histogram_quantile(0.95, sum by (le, agent_role) (rate(agent_tokens_used_bucket[5m])))

# Panel 3: judge.overall mean vs deterministic pass rate
# agent.judge.score is a histogram, so the mean is sum / count.
# The label is "task" because telemetry.py records {"task": task["id"]}.
sum by (task) (rate(agent_judge_score_sum[15m]))
/
sum by (task) (rate(agent_judge_score_count[15m]))
```

```sql
-- Panel 4: top 5 failure reasons in the last 24 hours (DuckDB view over events)
SELECT reason, COUNT(*) AS n
FROM events
WHERE type = 'run.closed'
  AND status = 'failed'
  AND ts > now() - INTERVAL 1 DAY
GROUP BY reason
ORDER BY n DESC
LIMIT 5;
```

These four panels form the minimum skeleton of the “numbers slide” you will use in the capstone presentation.
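For teams taking the CSV or Streamlit route instead of DuckDB, Panel 4 can be approximated directly from parsed event dicts. This is a sketch only: `top_failure_reasons` is a hypothetical name, and it omits the 24-hour time filter used in the SQL version.

```python
from collections import Counter


def top_failure_reasons(events: list, limit: int = 5) -> list:
    """Count failure reasons across run.closed events, most frequent first."""
    reasons = Counter(
        event.get("reason", "unknown")
        for event in events
        if event.get("type") == "run.closed" and event.get("status") == "failed"
    )
    return reasons.most_common(limit)
```

The result is a list of `(reason, count)` pairs that can be written straight to a CSV or passed to a bar chart.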
Write trace wrappers
Create spans for agent.run, model.call, tool.invoke, acceptance.test, and judge.evaluate. Stamp the five attributes run.id, task.id, agent.role, model.name, gate.result consistently on every span.
Write an event writer
Append run.started, tool.invoke, tool.result, test.result, judge.result, and run.closed to .events.jsonl. Document the concurrency strategy (file lock or per-line flush).
Generate a replay snapshot
Build replay_snapshot.json from the event log and verify that the run is closed. The snapshot must include the final verdict, override status, and total tokens used.
Implement LLM Judge
Produce strict JSON scores for 10 code samples. Specify the retry logic and fail-safe (e.g., demote to verdict=revise) for schema violations.
Analyze correlation
Compute Spearman or Pearson correlation between human and judge scores. Revise the rubric where correlation is weak. Report correlation, n, and p-value.
Build a four-panel dashboard
Build (1) success rate, (2) token p95, (3) judge.overall vs tests_passed, (4) top 5 failure reasons in Grafana or a simple CSV/Streamlit. Each panel needs a one-line caption explaining which decision it supports.
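For the "Implement LLM Judge" step above, schema handling might look like the sketch below. Several things here are assumptions rather than course requirements: the judge is expected to return `{"scores": {...}}` with the five criteria from the judge prompt, `overall` is computed as an unweighted mean, and `call_judge` is a hypothetical zero-argument callable that returns the raw model text.

```python
import json
from typing import Callable

CRITERIA = ("correctness", "test_quality", "maintainability",
            "robustness", "observability")


def fail_safe() -> dict:
    """Result used when the judge output cannot be trusted: demote to revise."""
    return {"scores": {}, "overall": 0.0, "verdict": "revise", "schema_ok": False}


def parse_judge_output(raw: str) -> dict:
    """Parse strict JSON and validate that every criterion is a number in 1..10."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fail_safe()
    scores = data.get("scores") if isinstance(data, dict) else None
    if not isinstance(scores, dict):
        return fail_safe()
    for name in CRITERIA:
        value = scores.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            return fail_safe()
    clean = {name: scores[name] for name in CRITERIA}
    overall = sum(clean.values()) / len(CRITERIA)  # assumed: unweighted mean
    return {"scores": clean, "overall": overall, "schema_ok": True}


def judge_with_retry(call_judge: Callable[[], str], max_attempts: int = 2) -> dict:
    """Call the judge until the output validates, then fall back to fail_safe()."""
    for _ in range(max_attempts):
        result = parse_judge_output(call_judge())
        if result["schema_ok"]:
            return result
    return fail_safe()
```

The fail-safe result has an `overall` of 0.0 and empty `scores`, so a gate that requires `overall >= 7.0` returns "revise" rather than passing a run whose judge output was malformed.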
Due: 2026-05-26 23:59
Lab 11 requirements:
- .events.jsonl
- replay_snapshot.json recalculated from the event log.

Lab 12 requirements:
- LLMJudge that returns strict JSON.
- run.id, task.id, agent.role, model.name, tool.name, gate.result, artifact.path on every span and event so dashboards link to the audit log.