Week 12: Telemetry and LLM-as-Judge

Phase 4 · Week 12 · Advanced · Lecture: 2026-05-19

Theory

Learning Objectives

Concepts

Explain why telemetry (trace/metrics/logs) and append-only event stores are separate axes, and articulate which question each one answers in HOTL operations.

Design

Design span structures and event schemas, defining the seven attributes that connect run, task, agent, model, tool, gate, and artifact.

Implementation

Emit spans through the OpenTelemetry SDK, write .events.jsonl, and wire an LLM-as-Judge that returns strict JSON and combines with a deterministic gate.

Operations

Run a judge calibration pipeline using Spearman/Pearson correlation, monitor the trust band, and decide how to harden the gate policy when judge bias is suspected.

Observability is the nervous system of HOTL

In a human-on-the-loop system, the human does not approve every tool call. The system runs autonomously inside boundaries, while telemetry and gates expose abnormal behavior.

Observable	Question	Example
Trace	Where did time go?	agent_loop -> model_call -> tool_call -> test
Metrics	Is this within normal range?	success_rate, token_usage, ttft_ms
Logs	What happened?	permission denied, tool timeout
Event store	Can we reconstruct the run?	`.events.jsonl`, replay snapshot
Evaluation	Is the result usable?	tests, lint, LLM-as-Judge, human review

Telemetry flow at a glance

Telemetry Flow: Live Monitoring + Audit

User / HOTL: submits task packet

▼

Agent Runtime: starts span agent.run · appends run.started

▼ execution stage

Tool / Test: tool.invoke → tool.result

LLM Judge: scores + verdict

▼ fan out to observe + record

OTel Collector: spans · metrics (live)

Event Store: .events.jsonl · replay snapshot

▼

Dashboard: live monitoring + audit / replay in one view

Stamping the same run_id on both OTel span attributes and event-log lines lets a dashboard click through to the audit log in one step.

Break agent runs into spans

Logging an entire run as one string makes analysis impossible. Use at minimum the following span structure.

Recommended Span Tree for an Agent Run

agent.run: root; shared run_id, final gate.result

└ planner.step: spec / plan generation

└ model.call: prompt → completion (model.name attribute)

└ tool.invoke: tool.name, input attribute

└ tool.result: status, latency_ms

└ acceptance.test: deterministic gate result

└ judge.evaluate: probabilistic gate result

└ artifact.write: artifact.path attribute

Each span should share run_id, task_id, agent_role, model, repository, and commit_sha. This is what lets Grafana answer “which model failed in which role.”

Minimum span attributes

Telemetry is not “more logs.” It is a stable set of keys that make analysis possible.

Attribute	Example	Why it matters
`run.id`	`run-20260519-001`	connects event log, dashboard, and report
`task.id`	`capstone-017`	compares repeated runs of the same task
`agent.role`	`worker`	separates planner, worker, and reviewer failures
`model.name`	`local-coder`	compares model cost and quality
`tool.name`	`run_tests`	identifies slow or risky tools
`gate.result`	`pass`, `revise`, `fail`	supports release readiness decisions
`artifact.path`	`artifacts/run-001.patch`	traces final outputs

Without these keys, Week 15 reports will have numbers but no connected evidence.
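One way to enforce the attribute table is a small check that runs before a span is closed. The sketch below is illustrative only: the key list is copied from the table, while `missing_attributes` and the sample span are hypothetical names, and which spans must carry which keys is left to each team.

```python
# Keys copied from the attribute table above; the helper itself is a hypothetical sketch.
SPAN_KEYS = (
    "run.id", "task.id", "agent.role", "model.name",
    "tool.name", "gate.result", "artifact.path",
)

def missing_attributes(attrs: dict, required: tuple[str, ...] = SPAN_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in a span's attributes."""
    return [key for key in required if attrs.get(key) in (None, "")]

# Example attribute set for a root span that has not recorded a tool or artifact.
root_span = {
    "run.id": "run-20260519-001",
    "task.id": "capstone-017",
    "agent.role": "worker",
    "model.name": "local-coder",
    "gate.result": "revise",
}

print(missing_attributes(root_span))  # ['tool.name', 'artifact.path']
```

Passing a shorter `required` tuple lets the same helper check spans that legitimately lack a tool or artifact.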

# telemetry.py: OpenTelemetry-based agent monitoring
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

# Register SDK providers; without them the API records nothing.
# Exporters (span processors, metric readers) are configured separately.
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer("agent-runtime")
meter = metrics.get_meter("agent-runtime")

loop_counter = meter.create_counter("agent.loop.count")
token_usage = meter.create_histogram("agent.tokens.used")
judge_score = meter.create_histogram("agent.judge.score")
ttft_ms = meter.create_histogram("agent.model.ttft_ms")

def traced_agent_loop(task: dict):
    with tracer.start_as_current_span("agent.run") as span:
        role = task.get("role", "worker")
        span.set_attribute("run.id", task["run_id"])
        span.set_attribute("task.id", task["id"])
        span.set_attribute("task.objective", task["objective"])
        span.set_attribute("agent.role", role)
        span.set_attribute("model.name", task.get("model", "local-coder"))

        # run_agent() is assumed to be the agent harness, defined elsewhere.
        result = run_agent(task)

        # Metric attributes become the labels that dashboard queries group by
        # (model, agent_role, task_id), so keep the names identical to the panels.
        status = "success" if result.passed else "failure"
        loop_counter.add(1, {"status": status, "model": result.model})
        token_usage.record(result.tokens_used, {"model": result.model, "agent_role": role})
        judge_score.record(result.judge_overall, {"task_id": task["id"]})
        ttft_ms.record(result.ttft_ms, {"model": result.model})
        span.set_attribute("result.passed", result.passed)
        span.set_attribute("tokens.used", result.tokens_used)
        span.set_attribute("tests.failed", result.tests_failed)
        span.set_attribute("gate.result", result.gate_result)
        return result

Event store: replayable operations

OpenTelemetry is excellent for live monitoring, but it is not the full source of truth for reconstructing an agent run. Agent OS Runtime style append-only event logs preserve the exact sequence.

{"type":"run.started","run_id":"r-001","task_id":"t-017","model":"local-coder","ts":"2026-05-19T09:00:00Z"}
{"type":"tool.invoke","run_id":"r-001","tool":"read_file","input":{"path":"src/app.py"}}
{"type":"tool.result","run_id":"r-001","tool":"read_file","status":"ok","bytes":1842}
{"type":"test.result","run_id":"r-001","command":"pytest","passed":false,"failed":2}
{"type":"judge.result","run_id":"r-001","overall":7.2,"verdict":"revise"}
{"type":"run.closed","run_id":"r-001","status":"failed","reason":"tests_failed"}

The eight event types you handle in the capstone are:

Event type	When it fires	Required fields	Replay meaning
`run.started`	task arrives at the worker	run_id, task_id, model, ts	start of a new run
`plan.created`	planner produces spec/plan	run_id, plan_path	locks input for downstream steps
`tool.invoke`	just before tool call	run_id, tool, input	records intent before side effects
`tool.result`	after tool returns	run_id, tool, status	one line gives the diagnosis on failure
`test.result`	deterministic gate output	run_id, command, passed	input to the release gate
`judge.result`	LLM-as-Judge output	run_id, scores, verdict	input to the probabilistic gate
`human.override`	human flips the verdict	run_id, from, to, approver, reason	core of the audit trail
`run.closed`	run terminates	run_id, status, reason	replay terminator
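The required-field column above can be enforced at write time. The following is a minimal sketch rather than the course's reference writer: `append_event` and the file name are hypothetical, and it assumes a single writer process (several concurrent writers would additionally need a file lock).

```python
import json
import os
import tempfile
from pathlib import Path

# Required fields copied from the event-type table; everything else is a sketch.
REQUIRED_FIELDS = {
    "run.started": ("run_id", "task_id", "model", "ts"),
    "plan.created": ("run_id", "plan_path"),
    "tool.invoke": ("run_id", "tool", "input"),
    "tool.result": ("run_id", "tool", "status"),
    "test.result": ("run_id", "command", "passed"),
    "judge.result": ("run_id", "scores", "verdict"),
    "human.override": ("run_id", "from", "to", "approver", "reason"),
    "run.closed": ("run_id", "status", "reason"),
}

def append_event(path: Path, event: dict) -> None:
    """Validate one event and append it as a single JSON line."""
    required = REQUIRED_FIELDS.get(event.get("type"))
    if required is None:
        raise ValueError(f"unknown event type: {event.get('type')!r}")
    missing = [field for field in required if field not in event]
    if missing:
        raise ValueError(f"{event['type']} is missing fields: {missing}")
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # Mode "a" only adds to the end of the file; existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # per-line flush so a crash loses at most one event

# Usage: write two events to a temporary log and read them back.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "run-001.events.jsonl"
    append_event(log, {"type": "tool.invoke", "run_id": "r-001",
                       "tool": "read_file", "input": {"path": "src/app.py"}})
    append_event(log, {"type": "tool.result", "run_id": "r-001",
                       "tool": "read_file", "status": "ok", "bytes": 1842})
    lines = log.read_text(encoding="utf-8").splitlines()

print(len(lines))  # 2
```

Rejecting an incomplete event before it reaches the file keeps the log replayable, since later readers never have to guess at missing fields.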

The replay function reads these events and recomputes the final state.

def replay(events: list[dict]) -> dict:
    """Fold an ordered event list into the final run state (needs Python 3.10+ for match)."""
    state = {
        "closed": False,
        "status": None,   # stays None until run.closed is seen
        "reason": None,
        "tools": [],
        "tests": [],
        "judge": None,
        "overrides": [],
    }
    for event in events:
        match event["type"]:
            case "tool.result":
                state["tools"].append(event)
            case "test.result":
                state["tests"].append(event)
            case "judge.result":
                # Keep only the most recent judge verdict.
                state["judge"] = event
            case "human.override":
                state["overrides"].append(event)
            case "run.closed":
                state["closed"] = True
                state["status"] = event["status"]
                state["reason"] = event.get("reason")
    return state

What append-only buys you in operations

Scenario	Without an event store	With an event store
“Why did it fail?”	grep logs and guess	scroll the run_id; the sequence is right there
“Can I rerun this?”	environment drift breaks it	`replay()` reconstructs deterministically
“Did this model improve?”	no comparable baseline	rerun the same task_id set with another model
“How do I prove this to an auditor?”	only narrative reports	event sequence plus override records

LLM-as-Judge is an evaluator, not a test

LLM-as-Judge evaluates qualities that deterministic tests miss, such as readability and design fit. It should not replace tests, policy checks, or human ownership.

Gate	Type	Example
Static	deterministic	ruff, mypy, eslint, schema validation
Runtime	deterministic	pytest, integration test, smoke test
Policy	deterministic/approval	secret scan, permission boundary
Judge	probabilistic	readability, maintainability, design fit
Human	final authority	capstone acceptance, production release

A catalog of judge biases

Five biases are repeatedly observed in research; recognizing them lets you design safer prompts and gates.

Bias	Description	Signal	Mitigation
Length bias	longer answers score higher	short correct answers underrated on the gold set	rubric demands brevity, normalize by length
Position bias	A/B order affects the score	scores flip when order is swapped	evaluate both orders and average
Self-preference	judge prefers its own model family	judge rates its own outputs higher than peers’	cross-judge with another family
Style bias	familiar formats (numbered lists, headings) win	identical content in plain text scores lower	rubric scores content, not format
Refusal asymmetry	safe refusals are underrated	a wrong answer scores higher than a safe refusal	safety becomes a separate deterministic gate
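The position-bias mitigation in the table (evaluate both orders and average) can be sketched in a few lines. Here `judge_pair` is a hypothetical callable standing in for the real judge call, and `biased_stub` is a toy judge that deliberately favors whichever candidate is shown first.

```python
from typing import Callable

# Hypothetical interface: scores two candidates in the order shown and
# returns (score_of_first, score_of_second).
JudgePair = Callable[[str, str], tuple[float, float]]

def order_averaged_scores(judge_pair: JudgePair, a: str, b: str) -> tuple[float, float]:
    """Score (a, b) and (b, a), then average so the slot position cancels out."""
    a_first, b_second = judge_pair(a, b)
    b_first, a_second = judge_pair(b, a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2

def biased_stub(first: str, second: str) -> tuple[float, float]:
    """Toy judge that adds one point to whatever appears first."""
    base = {"patch A": 6.0, "patch B": 7.0}
    return base[first] + 1.0, base[second]

print(biased_stub("patch A", "patch B"))                            # (7.0, 7.0): a tie
print(order_averaged_scores(biased_stub, "patch A", "patch B"))     # (6.5, 7.5)
```

With a single ordering the stub reports a tie; averaging both orders restores the one-point gap between the two patches, at the cost of a second judge call.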

Judge calibration

A judge prompt is not a one-time document. Teams need a small gold set to understand where the judge fails.

Judge Calibration Loop

Gold Set (10-30): human-scored samples

Human Score: reference baseline

▼

Blind Score: judge scores without seeing the human score

▼

Correlation: Spearman / Pearson + p-value

low: Rubric / prompt update → retry blind score

stable: Calibrated Judge → Production Gate

▼

Override Log: human overrides feed the next round of gold-set growth

Step	Action	Failure signal
Build gold set	prepare 10-30 samples already scored by humans	examples are all too good or too bad
Blind scoring	judge scores without seeing human scores	all scores cluster around 7-9
Correlation	compare human and judge scores	low correlation or one criterion overrated
Error analysis	classify false pass and false fail cases	prompt changes do not address the errors
Rubric update	clarify criteria and examples	scoring remains vague

The capstone goal is not a perfect judge. The goal is to know where the judge fails and make the gate policy absorb that limitation.
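The error-analysis step in the table can start from a simple split of the gold set into agreement, false pass, and false fail. This is a sketch under assumptions: the sample records and `classify_errors` are hypothetical, and the 7.0 pass threshold is borrowed from the gate function used later in this lecture.

```python
def classify_errors(samples: list[dict], threshold: float = 7.0) -> dict[str, list[str]]:
    """Group gold-set samples by whether human and judge agree at the pass threshold."""
    buckets: dict[str, list[str]] = {"agree": [], "false_pass": [], "false_fail": []}
    for sample in samples:
        human_ok = sample["human"] >= threshold
        judge_ok = sample["judge"] >= threshold
        if human_ok == judge_ok:
            buckets["agree"].append(sample["id"])
        elif judge_ok:
            # Judge would pass something the human reviewer rejected.
            buckets["false_pass"].append(sample["id"])
        else:
            # Judge would block something the human reviewer accepted.
            buckets["false_fail"].append(sample["id"])
    return buckets

# Made-up scores for illustration only.
gold = [
    {"id": "s1", "human": 9, "judge": 8.5},
    {"id": "s2", "human": 5, "judge": 7.4},
    {"id": "s3", "human": 8, "judge": 6.1},
]
print(classify_errors(gold))
# {'agree': ['s1'], 'false_pass': ['s2'], 'false_fail': ['s3']}
```

False passes are usually the more dangerous bucket for a release gate, so they are a reasonable place to begin when revising the rubric.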

Spearman / Pearson correlation

# correlate.py — human vs judge correlation
from scipy.stats import spearmanr, pearsonr

def correlate(human: list[float], judge: list[float]) -> dict:
    rho, rho_p = spearmanr(human, judge)
    r, r_p = pearsonr(human, judge)
    return {
        "spearman_rho": rho,
        "spearman_p": rho_p,
        "pearson_r": r,
        "pearson_p": r_p,
        "n": len(human),
    }

# Example: a 12-sample gold set
human = [9, 8, 7, 6, 5, 9, 8, 4, 3, 7, 6, 8]
judge = [8.5, 8.0, 7.2, 6.8, 6.0, 8.7, 7.6, 5.5, 4.2, 7.0, 6.5, 7.9]
print(correlate(human, judge))
# Recommended thresholds:
#   spearman_rho >= 0.7 with p < 0.05 -> calibrated
#   0.4 <= spearman_rho < 0.7         -> use judge as advisory only
#   spearman_rho < 0.4                -> rewrite rubric or prompt
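The recommended thresholds can be turned into a small trust-band function so that the gate policy reads a label instead of a raw coefficient. This is a hedged sketch: `trust_band` is a hypothetical helper, and treating a high but non-significant correlation as advisory is an assumption, since the thresholds above do not cover that case.

```python
def trust_band(rho: float, p_value: float, alpha: float = 0.05) -> str:
    """Map a Spearman correlation and its p-value onto the three bands above."""
    if rho < 0.4:
        return "rewrite"        # rewrite rubric or prompt
    if rho >= 0.7 and p_value < alpha:
        return "calibrated"     # judge may feed the production gate
    # Covers 0.4 <= rho < 0.7, and (by assumption) rho >= 0.7 without significance.
    return "advisory"

print(trust_band(0.85, 0.001))  # calibrated
print(trust_band(0.85, 0.20))   # advisory
print(trust_band(0.55, 0.01))   # advisory
print(trust_band(0.20, 0.50))   # rewrite
```

Monitoring the band over successive calibration rounds shows whether rubric changes are actually moving the judge toward the human baseline.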

Judge prompt and JSON output

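JUDGE_SYSTEM_PROMPT = """You are a senior software engineering evaluator.
Score the submitted change from 1 to 10 for each criterion.
Return strict JSON only, with exactly these keys:
  "scores": an object mapping each criterion name below to its score
  "overall": a number from 1 to 10
  "verdict": one of "pass", "revise", "fail"

Criteria:
1. correctness
2. test_quality
3. maintainability
4. robustness
5. observability
"""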

def gate(judge: dict, tests_passed: bool) -> str:
    if not tests_passed:
        return "fail"
    if judge["overall"] < 7.0:
        return "revise"
    if judge["scores"].get("correctness", 0) < 7:
        return "revise"
    return "pass"
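Because a model can return malformed output even when asked for strict JSON, the gate needs a parser that fails safe. The sketch below is one possible shape, not the course's reference implementation: `parse_judge_output` and the `schema_ok` flag are hypothetical, recomputing `overall` as an unweighted mean is an assumption, and retry logic is left out.

```python
import json

# Criterion names come from the judge prompt above; the rest is a sketch.
CRITERIA = ("correctness", "test_quality", "maintainability", "robustness", "observability")

def parse_judge_output(raw: str) -> dict:
    """Validate the judge's reply; on any schema violation, demote to 'revise'."""
    try:
        data = json.loads(raw)
        scores = data["scores"]
        if not isinstance(scores, dict) or set(scores) != set(CRITERIA):
            raise ValueError("scores must contain exactly the rubric criteria")
        for value in scores.values():
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("scores must be numeric")
            if not 1 <= value <= 10:
                raise ValueError("scores must be between 1 and 10")
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so it lands here too.
        return {"scores": {}, "overall": 0.0, "verdict": "revise", "schema_ok": False}
    # Assumption: overall is recomputed as the unweighted mean of the five scores.
    overall = sum(scores.values()) / len(scores)
    return {"scores": scores, "overall": overall, "schema_ok": True}

good = json.dumps({"scores": {name: 8 for name in CRITERIA}})
print(parse_judge_output(good)["overall"])                            # 8.0
print(parse_judge_output("Sure! Here are my scores: ...")["verdict"])  # revise
```

The fail-safe result carries an overall of 0.0, so passing it to a threshold-based gate still yields a revise rather than an accidental pass.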

Record human override

Humans must be able to override tests or judge verdicts, but the override should become an auditable event rather than a silent overwrite.

{"type":"human.override","run_id":"r-001","from":"revise","to":"pass","reason":"known flaky integration test; deterministic unit tests passed","approver":"instructor"}

Overrides preserve human authority and create data for improving next week’s gates.
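When a snapshot is rebuilt, override events can be applied on top of the automated verdict instead of replacing it. The following is a minimal sketch under assumptions: `final_verdict` is a hypothetical helper, and skipping an override whose `from` value no longer matches the current verdict is a design choice made here, not something the event format requires.

```python
def final_verdict(events: list[dict], gate_verdict: str) -> dict:
    """Apply human.override events in log order on top of the automated verdict."""
    verdict = gate_verdict
    approvers: list[str] = []
    for event in events:
        if event.get("type") != "human.override":
            continue
        if event["from"] != verdict:
            # Stale override: it was written against a verdict that no longer holds.
            continue
        verdict = event["to"]
        approvers.append(event["approver"])
    # Keep both values so the audit trail shows what the automation decided.
    return {"automated": gate_verdict, "final": verdict, "overridden_by": approvers}

log = [
    {"type": "judge.result", "run_id": "r-001", "overall": 7.2, "verdict": "revise"},
    {"type": "human.override", "run_id": "r-001", "from": "revise", "to": "pass",
     "reason": "known flaky integration test; deterministic unit tests passed",
     "approver": "instructor"},
]
print(final_verdict(log, gate_verdict="revise"))
# {'automated': 'revise', 'final': 'pass', 'overridden_by': ['instructor']}
```

Keeping the automated verdict next to the final one makes every override visible in the report rather than hidden behind the final status.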

Build an evaluation set

Each team needs at least 10 samples.

Sample type	Why it matters
Clearly good code	checks false negatives
Clearly bad code	checks false positives
Tests pass but design is poor	shows judge value
Looks good but behavior is wrong	prevents judge overtrust
Security or permission issue	shows policy gate necessity

Minimum dashboard layout

Execution Health

Watch run_count, success_rate, retry_count, and failure_reason. Filter by team run_id during class.

Cost and Latency

Compare prompt_tokens, completion_tokens, cache_read_tokens, ttft_ms, and total_latency_ms by model.

Quality Gates

Show tests_passed, judge_overall, human_override, and final_verdict on a single screen.

Dashboard panel definitions (PromQL examples)

Assume the OpenTelemetry collector exports metrics to Prometheus.

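# Panel 1: model-by-model success rate (15-minute window)
# Note: many OTel-to-Prometheus exporters append _total to counter names;
# if yours does, query agent_loop_count_total instead.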
sum by (model) (rate(agent_loop_count{status="success"}[15m]))
  /
sum by (model) (rate(agent_loop_count[15m]))

# Panel 2: token usage p95 by role
histogram_quantile(0.95, sum by (le, agent_role)
  (rate(agent_tokens_used_bucket[5m])))

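# Panel 3: judge.overall mean by task (compare against the deterministic pass rate)
# agent.judge.score is a histogram, so Prometheus exposes _sum and _count
# series rather than a single value; the mean is sum / count.
sum by (task_id) (rate(agent_judge_score_sum[15m]))
  /
sum by (task_id) (rate(agent_judge_score_count[15m]))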

-- Panel 4: top 5 failure reasons in the last 24 hours (DuckDB view over events)
SELECT reason, COUNT(*) AS n
FROM events
WHERE type = 'run.closed'
  AND status = 'failed'
  AND ts > now() - INTERVAL 1 DAY
GROUP BY reason
ORDER BY n DESC
LIMIT 5;

These four panels form the minimum skeleton of the “numbers slide” you will use in the capstone presentation.
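For teams using the CSV or Streamlit route instead of DuckDB, Panel 4 can be computed straight from the event log with the standard library. This is a sketch with made-up sample events; `top_failure_reasons` is a hypothetical helper, and it omits the 24-hour filter because that would require a `ts` field on every `run.closed` event.

```python
import json
from collections import Counter

def top_failure_reasons(jsonl_text: str, limit: int = 5) -> list[tuple[str, int]]:
    """Count failure reasons across run.closed events, most frequent first."""
    reasons: Counter[str] = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        if event.get("type") == "run.closed" and event.get("status") == "failed":
            reasons[event.get("reason") or "unknown"] += 1
    return reasons.most_common(limit)

# Made-up events for illustration only.
sample_log = "\n".join([
    '{"type":"run.closed","run_id":"r-001","status":"failed","reason":"tests_failed"}',
    '{"type":"run.closed","run_id":"r-002","status":"passed","reason":"ok"}',
    '{"type":"run.closed","run_id":"r-003","status":"failed","reason":"tool_timeout"}',
    '{"type":"run.closed","run_id":"r-004","status":"failed","reason":"tests_failed"}',
    '{"type":"tool.result","run_id":"r-004","tool":"run_tests","status":"error"}',
])
print(top_failure_reasons(sample_log))
# [('tests_failed', 2), ('tool_timeout', 1)]
```

The returned list of pairs can be written directly to a CSV file or handed to a bar-chart widget.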

Practicum

Write trace wrappers

Create spans for agent.run, model.call, tool.invoke, acceptance.test, and judge.evaluate. Stamp the five run-level attributes (run.id, task.id, agent.role, model.name, gate.result) consistently on every span, and add tool.name and artifact.path on the spans where they apply.

Write an event writer

Append run.started, tool.invoke, tool.result, test.result, judge.result, and run.closed to .events.jsonl. Document the concurrency strategy (file lock or per-line flush).

Generate a replay snapshot

Build replay_snapshot.json from the event log and verify that the run is closed. The snapshot must include the final verdict, override status, and total tokens used.

Implement LLM Judge

Produce strict JSON scores for 10 code samples. Specify the retry logic and fail-safe (e.g., demote to verdict=revise) for schema violations.

Analyze correlation

Compute Spearman or Pearson correlation between human and judge scores. Revise the rubric where correlation is weak. Report correlation, n, and p-value.

Build a four-panel dashboard

Build (1) success rate, (2) token p95, (3) judge.overall vs tests_passed, and (4) top 5 failure reasons in Grafana or a simple CSV/Streamlit view. Each panel needs a one-line caption explaining which decision it supports.

Assignment

Lab 11: Telemetry & Lab 12: LLM-as-Judge

Due: 2026-05-26 23:59

Lab 11 requirements:

Ralph loop or agent harness with OpenTelemetry tracing.
Grafana/Prometheus screenshot or CSV-based dashboard (4+ panels).
Agent OS Runtime-style .events.jsonl.
replay_snapshot.json recalculated from the event log.
Evidence that the seven span attributes (run.id, task.id, agent.role, model.name, tool.name, gate.result, artifact.path) are consistently stamped.

Lab 12 requirements:

LLMJudge that returns strict JSON.
Automated evaluation results for 10 code samples.
Comparison table: deterministic tests vs. LLM Judge vs. human review.
Spearman/Pearson correlation with n and p-value.
Rationale for integrating judge results into a gate policy rather than using them alone.
At least one observed judge bias and the corresponding mitigation.

Key Takeaways

Telemetry has two axes: live monitoring (OTel) and audit/replay (append-only event store) do not replace each other.
Seven shared attributes make analysis possible: stamp run.id, task.id, agent.role, model.name, tool.name, gate.result, artifact.path on every span and event so dashboards link to the audit log.
The event store is append-only: it is the source of truth for deterministic reconstruction of a run, and even overrides are recorded as events.
LLM Judge is an evaluator, not a test: deterministic gates come first; the judge handles readability, maintainability, and design where tests cannot.
Be aware of five judge biases: length, position, self-preference, style, and refusal asymmetry each demand a specific prompt or gate-design response.
Calibration is not a one-shot exercise: gold set → blind score → correlation → error analysis → rubric update is a loop that runs at least once during the capstone.
The dashboard is a decision tool: four core panels (success rate, token p95, judge vs tests, top failure reasons) form the backbone of the final report.