
Week 12: Telemetry and LLM-as-Judge

Phase 4 · Week 12 · Advanced · Lecture: 2026-05-19

Concepts

Explain why telemetry (trace/metrics/logs) and append-only event stores are separate axes, and articulate which question each one answers in HOTL operations.

Design

Design span structures and event schemas, defining the seven attributes that connect run, task, agent, model, tool, gate, and artifact.

Implementation

Emit spans through the OpenTelemetry SDK, write .events.jsonl, and wire an LLM-as-Judge that returns strict JSON and combines with a deterministic gate.

Operations

Run a judge calibration pipeline using Spearman/Pearson correlation, monitor the trust band, and decide how to harden the gate policy when judge bias is suspected.


Observability is the nervous system of HOTL

In a human-on-the-loop system, the human does not approve every tool call. The system runs autonomously inside boundaries, while telemetry and gates expose abnormal behavior.

| Observable | Question | Example |
|---|---|---|
| Trace | Where did time go? | agent_loop -> model_call -> tool_call -> test |
| Metrics | Is this within normal range? | success_rate, token_usage, ttft_ms |
| Logs | What happened? | permission denied, tool timeout |
| Event store | Can we reconstruct the run? | .events.jsonl, replay snapshot |
| Evaluation | Is the result usable? | tests, lint, LLM-as-Judge, human review |

Telemetry Flow: Live Monitoring + Audit

  1. User / HOTL: submits task packet
  2. Agent Runtime: starts span agent.run, appends run.started
  3. Execution stage
       Tool / Test: tool.invoke → tool.result
       LLM Judge: scores + verdict
  4. Fan out to observe + record
       OTel Collector: spans, metrics (live)
       Event Store: .events.jsonl, replay snapshot
  5. Dashboard: live monitoring + audit / replay in one view

Stamping the same run_id on both OTel span attributes and event-log lines lets you click through from a dashboard panel to the matching audit-log entries in one step.

Logging an entire run as one string makes analysis impossible. Use at minimum the following span structure.

Recommended Span Tree for an Agent Run

agent.run: root; shared run_id, final gate.result
  └ planner.step: spec / plan generation
  └ model.call: prompt → completion (model.name attribute)
  └ tool.invoke: tool.name, input attribute
  └ tool.result: status, latency_ms
  └ acceptance.test: deterministic gate result
  └ judge.evaluate: probabilistic gate result
  └ artifact.write: artifact.path attribute

Each span should share run_id, task_id, agent_role, model, repository, and commit_sha. This is what lets Grafana answer “which model failed in which role.”

Telemetry is not “more logs.” It is a stable set of keys that make analysis possible.

| Attribute | Example | Why it matters |
|---|---|---|
| run.id | run-20260519-001 | connects event log, dashboard, and report |
| task.id | capstone-017 | compares repeated runs of the same task |
| agent.role | worker | separates planner, worker, and reviewer failures |
| model.name | local-coder | compares model cost and quality |
| tool.name | run_tests | identifies slow or risky tools |
| gate.result | pass, revise, fail | supports release readiness decisions |
| artifact.path | artifacts/run-001.patch | traces final outputs |

Without these keys, Week 15 reports will have numbers but no connected evidence.
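One way to keep the keys stable is to check them before a span is emitted. The sketch below is a minimal illustration, not part of the course harness: the split between run-wide keys and span-specific keys (`RUN_WIDE`, `SPAN_SPECIFIC`) and the helper name `missing_attributes` are assumptions; the lecture only names the seven attributes.

```python
# Assumption: which attributes are mandatory on which span is our own split;
# the lecture only lists the seven keys.
RUN_WIDE = {"run.id", "task.id", "agent.role", "model.name"}

# Hypothetical per-span additions on top of the run-wide keys.
SPAN_SPECIFIC = {
    "agent.run": {"gate.result"},
    "tool.invoke": {"tool.name"},
    "artifact.write": {"artifact.path"},
}


def missing_attributes(span_name: str, attributes: dict) -> list[str]:
    """Return the required keys that are absent or empty, sorted by name."""
    required = RUN_WIDE | SPAN_SPECIFIC.get(span_name, set())
    return sorted(k for k in required if attributes.get(k) in (None, ""))


attrs = {
    "run.id": "run-20260519-001",
    "task.id": "capstone-017",
    "agent.role": "worker",
    "model.name": "local-coder",
}
print(missing_attributes("model.call", attrs))   # []
print(missing_attributes("tool.invoke", attrs))  # ['tool.name']
```

Running a check like this in a unit test over recorded spans is one cheap way to produce the "consistently stamped" evidence the lab asks for.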

# telemetry.py: OpenTelemetry-based agent monitoring
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

# Register the SDK providers; without them the API hands out no-op
# tracers and meters. Exporters (span processors, metric readers) attach here.
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer("agent-runtime")
meter = metrics.get_meter("agent-runtime")

loop_counter = meter.create_counter("agent.loop.count")
token_usage = meter.create_histogram("agent.tokens.used")
judge_score = meter.create_histogram("agent.judge.score")
ttft_ms = meter.create_histogram("agent.model.ttft_ms")


def traced_agent_loop(task: dict):
    """Run one task inside an agent.run span and record its metrics."""
    role = task.get("role", "worker")
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("run.id", task["run_id"])
        span.set_attribute("task.id", task["id"])
        span.set_attribute("task.objective", task["objective"])
        span.set_attribute("agent.role", role)
        span.set_attribute("model.name", task.get("model", "local-coder"))

        # run_agent() is the agent harness entry point, defined elsewhere.
        result = run_agent(task)

        # Metric attribute keys become the labels the dashboard queries group by.
        status = "success" if result.passed else "failure"
        loop_counter.add(1, {"status": status, "model": result.model})
        token_usage.record(result.tokens_used, {"model": result.model, "agent_role": role})
        judge_score.record(result.judge_overall, {"task_id": task["id"]})
        ttft_ms.record(result.ttft_ms, {"model": result.model})

        span.set_attribute("result.passed", result.passed)
        span.set_attribute("tokens.used", result.tokens_used)
        span.set_attribute("tests.failed", result.tests_failed)
        span.set_attribute("gate.result", result.gate_result)
        return result

OpenTelemetry is excellent for live monitoring, but it is not the full source of truth for reconstructing an agent run. Agent OS Runtime-style append-only event logs fill that gap by preserving the exact sequence of events.

{"type":"run.started","run_id":"r-001","task_id":"t-017","model":"local-coder","ts":"2026-05-19T09:00:00Z"}
{"type":"tool.invoke","run_id":"r-001","tool":"read_file","input":{"path":"src/app.py"}}
{"type":"tool.result","run_id":"r-001","tool":"read_file","status":"ok","bytes":1842}
{"type":"test.result","run_id":"r-001","command":"pytest","passed":false,"failed":2}
{"type":"judge.result","run_id":"r-001","overall":7.2,"verdict":"revise"}
{"type":"run.closed","run_id":"r-001","status":"failed","reason":"tests_failed"}

The eight event types you handle in the capstone are:

| Event type | When it fires | Required fields | Replay meaning |
|---|---|---|---|
| run.started | task arrives at the worker | run_id, task_id, model, ts | start of a new run |
| plan.created | planner produces spec/plan | run_id, plan_path | locks input for downstream steps |
| tool.invoke | just before tool call | run_id, tool, input | records intent before side effects |
| tool.result | after tool returns | run_id, tool, status | one line gives the diagnosis on failure |
| test.result | deterministic gate output | run_id, command, passed | input to the release gate |
| judge.result | LLM-as-Judge output | run_id, scores, verdict | input to the probabilistic gate |
| human.override | human flips the verdict | run_id, from, to, approver, reason | core of the audit trail |
| run.closed | run terminates | run_id, status, reason | replay terminator |
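The table above can be turned into a small writer that refuses malformed events before they reach the log. This is a sketch under stated assumptions: the function name `append_event`, the `REQUIRED_FIELDS` mapping, and the flush-plus-fsync durability choice are illustrative, not the Agent OS Runtime API.

```python
import json
import os
from pathlib import Path

# Required fields per event type, copied from the table above.
REQUIRED_FIELDS = {
    "run.started": {"run_id", "task_id", "model", "ts"},
    "plan.created": {"run_id", "plan_path"},
    "tool.invoke": {"run_id", "tool", "input"},
    "tool.result": {"run_id", "tool", "status"},
    "test.result": {"run_id", "command", "passed"},
    "judge.result": {"run_id", "scores", "verdict"},
    "human.override": {"run_id", "from", "to", "approver", "reason"},
    "run.closed": {"run_id", "status", "reason"},
}


def append_event(path: Path, event: dict) -> None:
    """Validate one event and append it as a single JSON line."""
    event_type = event.get("type")
    if event_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    missing = REQUIRED_FIELDS[event_type] - event.keys()
    if missing:
        raise ValueError(f"{event_type} is missing {sorted(missing)}")
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # Mode "a" only ever adds to the end of the file. Flushing and syncing
    # each line is one simple durability choice (an assumption, not a
    # course requirement); it does not replace a lock for multiple writers.
    with path.open("a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
```

A call such as `append_event(Path("r-001.events.jsonl"), {"type": "tool.invoke", "run_id": "r-001", "tool": "read_file", "input": {"path": "src/app.py"}})` adds one line; an event with a missing required field raises `ValueError` and leaves the file untouched.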

The replay function reads these events and recomputes the final state.

def replay(events: list[dict]) -> dict:
    """Fold an ordered event list into the final run state (Python 3.10+)."""
    state = {
        "closed": False,
        "tools": [],
        "tests": [],
        "judge": None,
        "overrides": [],
    }
    for event in events:
        match event["type"]:
            case "tool.result":
                state["tools"].append(event)
            case "test.result":
                state["tests"].append(event)
            case "judge.result":
                # Keep only the most recent judge verdict.
                state["judge"] = event
            case "human.override":
                state["overrides"].append(event)
            case "run.closed":
                state["closed"] = True
                state["status"] = event["status"]
                state["reason"] = event.get("reason")
    return state

| Scenario | Without an event store | With an event store |
|---|---|---|
| “Why did it fail?” | grep logs and guess | scroll the run_id; the sequence is right there |
| “Can I rerun this?” | environment drift breaks it | replay() reconstructs deterministically |
| “Did this model improve?” | no comparable baseline | rerun the same task_id set with another model |
| “How do I prove this to an auditor?” | only narrative reports | event sequence plus override records |
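Because tool.invoke is written before the side effect and tool.result after it, the event sequence can also reveal calls that started but never finished. The following is a minimal sketch of that check; the function name `unmatched_invocations` is hypothetical, and it assumes tools run sequentially within a run so that each result closes the oldest open invocation of the same tool.

```python
def unmatched_invocations(events: list[dict]) -> list[dict]:
    """Return tool.invoke events that never received a tool.result.

    Assumes sequential tool execution within a run: each tool.result
    closes the oldest open invocation with the same run_id and tool.
    """
    open_calls: dict[tuple, list[dict]] = {}
    for event in events:
        key = (event.get("run_id"), event.get("tool"))
        if event["type"] == "tool.invoke":
            open_calls.setdefault(key, []).append(event)
        elif event["type"] == "tool.result" and open_calls.get(key):
            open_calls[key].pop(0)
    # Anything still open was started but has no recorded outcome.
    return [e for calls in open_calls.values() for e in calls]
```

A non-empty result on a closed run is a hint that the runtime crashed or timed out between intent and outcome, which is exactly the case where grepping free-form logs tends to fail.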

LLM-as-Judge evaluates qualities that deterministic tests miss, such as readability and design fit. It should not replace tests, policy checks, or human ownership.

| Gate | Type | Example |
|---|---|---|
| Static | deterministic | ruff, mypy, eslint, schema validation |
| Runtime | deterministic | pytest, integration test, smoke test |
| Policy | deterministic/approval | secret scan, permission boundary |
| Judge | probabilistic | readability, maintainability, design fit |
| Human | final authority | capstone acceptance, production release |

Five biases are repeatedly observed in research; recognizing them lets you design safer prompts and gates.

| Bias | Description | Signal | Mitigation |
|---|---|---|---|
| Length bias | longer answers score higher | short correct answers underrated on the gold set | rubric demands brevity, normalize by length |
| Position bias | A/B order affects the score | scores flip when order is swapped | evaluate both orders and average |
| Self-preference | judge prefers its own model family | judge rates its own outputs higher than peers’ | cross-judge with another family |
| Style bias | familiar formats (numbered lists, headings) win | identical content in plain text scores lower | rubric scores content, not format |
| Refusal asymmetry | safe refusals are underrated | a wrong answer scores higher than a safe refusal | safety becomes a separate deterministic gate |
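The position-bias mitigation in the table (evaluate both orders and average) can be sketched as follows. The pairwise judge signature `PairJudge` and the toy `biased_stub` are assumptions made for illustration; a real judge would call a model instead.

```python
from typing import Callable

# Hypothetical judge signature: takes (first, second) and returns the
# score it gives to each position, e.g. (7.0, 5.0).
PairJudge = Callable[[str, str], tuple[float, float]]


def order_balanced_scores(judge: PairJudge, a: str, b: str) -> dict:
    """Score A and B in both presentation orders and average per answer."""
    a_first, b_second = judge(a, b)
    b_first, a_second = judge(b, a)
    score_a = (a_first + a_second) / 2
    score_b = (b_first + b_second) / 2
    # A positive gap means the first slot is scored higher on average.
    first_slot_gap = ((a_first + b_first) - (a_second + b_second)) / 2
    return {"a": score_a, "b": score_b, "first_slot_gap": first_slot_gap}


def biased_stub(first: str, second: str) -> tuple[float, float]:
    """Toy judge: scores by length, plus a 1-point bonus for the first slot."""
    return len(first) + 1.0, float(len(second))


print(order_balanced_scores(biased_stub, "abcd", "ab"))
# {'a': 4.5, 'b': 2.5, 'first_slot_gap': 1.0}
```

Averaging cancels the bonus out of the A-versus-B comparison, while `first_slot_gap` keeps the bias itself visible so it can be tracked as a signal over the gold set.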

A judge prompt is not a one-time document. Teams need a small gold set to understand where the judge fails.

Judge Calibration Loop

  1. Gold Set (10-30): human-scored samples
  2. Human Score: reference baseline
  3. Blind Score: judge scores without seeing the human score
  4. Correlation: Spearman / Pearson + p-value
       low: Rubric / prompt update → retry blind score
       stable: Calibrated Judge → Production Gate
  5. Override Log: human overrides feed the next round of gold-set growth

| Step | Action | Failure signal |
|---|---|---|
| Build gold set | prepare 10 samples already scored by humans | examples are all too good or too bad |
| Blind scoring | judge scores without seeing human scores | all scores cluster around 7-9 |
| Correlation | compare human and judge scores | low correlation or one criterion overrated |
| Error analysis | classify false pass and false fail cases | prompt changes do not address the errors |
| Rubric update | clarify criteria and examples | scoring remains vague |

The capstone goal is not a perfect judge. The goal is to know where the judge fails and make the gate policy absorb that limitation.
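Knowing where the judge fails starts with the error-analysis step: sorting gold-set samples into false passes and false fails. The sketch below assumes each sample is a dict with hypothetical `id`, `human`, and `judge` fields, and reuses 7.0 as the pass threshold purely for illustration.

```python
def classify_errors(samples: list[dict], threshold: float = 7.0) -> dict:
    """Bucket gold-set samples by agreement between human and judge.

    A score >= threshold counts as a pass. "false_pass" means the judge
    passed something the human failed; "false_fail" is the reverse.
    """
    buckets = {"agree": [], "false_pass": [], "false_fail": []}
    for s in samples:
        human_pass = s["human"] >= threshold
        judge_pass = s["judge"] >= threshold
        if human_pass == judge_pass:
            buckets["agree"].append(s["id"])
        elif judge_pass:
            buckets["false_pass"].append(s["id"])
        else:
            buckets["false_fail"].append(s["id"])
    return buckets
```

False passes are usually the more dangerous bucket for a release gate, so a gate policy that absorbs judge limitations would typically pair the judge with a deterministic check on exactly those sample types.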

# correlate.py: human vs judge correlation
from scipy.stats import spearmanr, pearsonr


def correlate(human: list[float], judge: list[float]) -> dict:
    """Rank (Spearman) and linear (Pearson) agreement between two score lists."""
    rho, rho_p = spearmanr(human, judge)
    r, r_p = pearsonr(human, judge)
    return {
        "spearman_rho": rho,
        "spearman_p": rho_p,
        "pearson_r": r,
        "pearson_p": r_p,
        "n": len(human),
    }


# Example: a 12-sample gold set
human = [9, 8, 7, 6, 5, 9, 8, 4, 3, 7, 6, 8]
judge = [8.5, 8.0, 7.2, 6.8, 6.0, 8.7, 7.6, 5.5, 4.2, 7.0, 6.5, 7.9]
print(correlate(human, judge))

# Recommended thresholds:
#   spearman_rho >= 0.7 with p < 0.05  -> calibrated
#   0.4 <= spearman_rho < 0.7          -> use judge as advisory only
#   spearman_rho < 0.4                 -> rewrite rubric or prompt
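The recommended thresholds can be encoded as a small function so the trust band is computed the same way every time. This is a sketch: the name `trust_band` and the band labels are hypothetical, and treating a high rho with p >= 0.05 as advisory is an assumption, since the thresholds above do not cover that case.

```python
def trust_band(spearman_rho: float, spearman_p: float, alpha: float = 0.05) -> str:
    """Map a Spearman result onto the three recommended bands."""
    if spearman_rho >= 0.7 and spearman_p < alpha:
        return "calibrated"
    # Assumption: a high rho that is not significant (small gold set)
    # falls back to advisory rather than calibrated.
    if spearman_rho >= 0.4:
        return "advisory"
    # Also reached when rho is NaN (e.g. constant scores), because every
    # comparison with NaN is False.
    return "rewrite"


print(trust_band(0.82, 0.001))  # calibrated
print(trust_band(0.82, 0.20))   # advisory
print(trust_band(0.15, 0.60))   # rewrite
```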
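
JUDGE_SYSTEM_PROMPT = """You are a senior software engineering evaluator.
Score the submitted change from 1 to 10 for each criterion.
Return strict JSON only.
Criteria:
1. correctness
2. test_quality
3. maintainability
4. robustness
5. observability
"""


def gate(judge: dict, tests_passed: bool) -> str:
    """Combine the deterministic test result with the judge scores.

    Expects judge = {"overall": number, "scores": {criterion: number}}.
    """
    # Deterministic gate first: failing tests can never be judged into a pass.
    if not tests_passed:
        return "fail"
    if judge["overall"] < 7.0:
        return "revise"
    # A strong average must not hide a weak correctness score.
    if judge["scores"].get("correctness", 0) < 7:
        return "revise"
    return "pass"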
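"Return strict JSON only" is a request, not a guarantee, so the judge reply needs validation before it reaches the gate. The sketch below shows one possible fail-safe: any schema violation yields a result with `overall` 0.0, which a gate with a 7.0 cutoff would turn into revise. The reply shape `{"scores": {...}}`, the helper names, and computing `overall` as the plain mean of the five criteria are assumptions, not something the lecture specifies.

```python
import json

CRITERIA = ("correctness", "test_quality", "maintainability",
            "robustness", "observability")


def _fail_safe() -> dict:
    """Result used for any malformed reply: low enough to force a revise."""
    return {"scores": {}, "overall": 0.0, "valid": False}


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply; fall back to the fail-safe on any violation."""
    try:
        data = json.loads(raw)
        scores = data["scores"]
        if not isinstance(scores, dict) or set(scores) != set(CRITERIA):
            raise ValueError("criteria mismatch")
        for value in scores.values():
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("non-numeric score")
            if not 1 <= value <= 10:
                raise ValueError("score out of range")
    except (KeyError, TypeError, ValueError):  # JSONDecodeError is a ValueError
        return _fail_safe()
    # Assumption: overall is the unweighted mean of the criteria.
    overall = sum(scores.values()) / len(scores)
    return {"scores": scores, "overall": overall, "valid": True}
```

Keeping the `valid` flag in the result makes it possible to count schema violations as their own metric and to retry the judge call a bounded number of times before accepting the fail-safe.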

Humans must be able to override tests or judge verdicts, but the override should become an auditable event rather than a silent overwrite.

{"type":"human.override","run_id":"r-001","from":"revise","to":"pass","reason":"known flaky integration test; deterministic unit tests passed","approver":"instructor"}

Overrides preserve human authority and create data for improving next week’s gates.
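To keep overrides auditable, the final verdict can be derived from the gate verdict plus the recorded human.override events instead of being overwritten in place. A minimal sketch follows; the function name `final_verdict` and the rule that an override only applies when its `from` matches the current verdict are assumptions chosen to catch stale overrides.

```python
REQUIRED = ("from", "to", "approver", "reason")


def final_verdict(gate_verdict: str, overrides: list[dict]) -> dict:
    """Apply human.override events in log order on top of the gate verdict."""
    verdict = gate_verdict
    applied, rejected = [], []
    for event in overrides:
        # An override needs every audit field filled in, and must refer to
        # the verdict that is actually current (assumption: guards against
        # overrides written for an older gate result).
        complete = all(event.get(k) for k in REQUIRED)
        if complete and event["from"] == verdict:
            verdict = event["to"]
            applied.append(event)
        else:
            rejected.append(event)
    return {
        "verdict": verdict,
        "overridden": bool(applied),
        "applied": applied,
        "rejected": rejected,
    }
```

With the sample override above and a gate verdict of revise, the result is pass with `overridden` set; the same override against a gate verdict of fail is rejected and stays visible in `rejected`.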

Each team needs at least 10 samples.

| Sample type | Why it matters |
|---|---|
| Clearly good code | checks false negatives |
| Clearly bad code | checks false positives |
| Tests pass but design is poor | shows judge value |
| Looks good but behavior is wrong | prevents judge overtrust |
| Security or permission issue | shows policy gate necessity |

Execution Health

Watch run_count, success_rate, retry_count, and failure_reason. Filter by team run_id during class.

Cost and Latency

Compare prompt_tokens, completion_tokens, cache_read_tokens, ttft_ms, and total_latency_ms by model.

Quality Gates

Show tests_passed, judge_overall, human_override, and final_verdict on a single screen.

Dashboard panel definitions (PromQL examples)

Assume the OpenTelemetry collector exports metrics to Prometheus.

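# Panel 1: model-by-model success rate (15-minute window)
# Depending on exporter settings, the counter may be exposed as agent_loop_count_total.
sum by (model) (rate(agent_loop_count{status="success"}[15m]))
/
sum by (model) (rate(agent_loop_count[15m]))

# Panel 2: token usage p95 by role
histogram_quantile(0.95, sum by (le, agent_role)
  (rate(agent_tokens_used_bucket[5m])))

# Panel 3: judge.overall mean per task, plotted next to the deterministic pass rate
# agent.judge.score is a histogram, so its mean is _sum divided by _count.
sum by (task_id) (rate(agent_judge_score_sum[15m]))
/
sum by (task_id) (rate(agent_judge_score_count[15m]))

-- Panel 4: top 5 failure reasons in the last 24 hours (DuckDB view over events)
-- Assumes every event line carries a ts field, including run.closed.
SELECT reason, COUNT(*) AS n
FROM events
WHERE type = 'run.closed'
  AND status = 'failed'
  AND ts > now() - INTERVAL 1 DAY
GROUP BY reason
ORDER BY n DESC
LIMIT 5;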

These four panels form the minimum skeleton of the “numbers slide” you will use in the capstone presentation.
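For teams taking the CSV or Streamlit route instead of Prometheus, the success-rate and failure-reason numbers can be computed directly from the event log. The sketch below is one possible shape, assuming `run.closed` events use `status == "failed"` for failures (as in the sample log) and that any other status counts as success; the function name `summarize_runs` is hypothetical.

```python
from collections import Counter


def summarize_runs(events: list[dict], top_n: int = 5) -> dict:
    """Compute success rate and top failure reasons from run.closed events."""
    closed = [e for e in events if e["type"] == "run.closed"]
    failed = [e for e in closed if e.get("status") == "failed"]
    total = len(closed)
    # Assumption: every closed run that is not "failed" counts as a success.
    success_rate = (total - len(failed)) / total if total else None
    reasons = Counter(e.get("reason", "unknown") for e in failed)
    return {
        "runs": total,
        "success_rate": success_rate,
        "top_failure_reasons": reasons.most_common(top_n),
    }
```

Feeding the function a list parsed from `.events.jsonl` gives the same kind of numbers as Panels 1 and 4, without a time window; filtering the events by `ts` beforehand would add one.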

  1. Write trace wrappers

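    Create spans for agent.run, model.call, tool.invoke, acceptance.test, and judge.evaluate. Stamp the five run-wide attributes run.id, task.id, agent.role, model.name, and gate.result consistently on every span; add tool.name and artifact.path on the spans where they apply, which completes the seven attributes Lab 11 checks.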

  2. Write an event writer

    Append run.started, tool.invoke, tool.result, test.result, judge.result, and run.closed to .events.jsonl. Document the concurrency strategy (file lock or per-line flush).

  3. Generate a replay snapshot

    Build replay_snapshot.json from the event log and verify that the run is closed. The snapshot must include the final verdict, override status, and total tokens used.

  4. Implement LLM Judge

    Produce strict JSON scores for 10 code samples. Specify the retry logic and fail-safe (e.g., demote to verdict=revise) for schema violations.

  5. Analyze correlation

    Compute Spearman or Pearson correlation between human and judge scores. Revise the rubric where correlation is weak. Report correlation, n, and p-value.

  6. Build a four-panel dashboard

    Build (1) success rate, (2) token p95, (3) judge.overall vs tests_passed, (4) top 5 failure reasons in Grafana or a simple CSV/Streamlit. Each panel needs a one-line caption explaining which decision it supports.

Due: 2026-05-26 23:59

Lab 11 requirements:

  1. Ralph loop or agent harness with OpenTelemetry tracing.
  2. Grafana/Prometheus screenshot or CSV-based dashboard (4+ panels).
  3. Agent OS Runtime-style .events.jsonl.
  4. replay_snapshot.json recalculated from the event log.
  5. Evidence that the seven span attributes (run.id, task.id, agent.role, model.name, tool.name, gate.result, artifact.path) are consistently stamped.

Lab 12 requirements:

  1. LLMJudge that returns strict JSON.
  2. Automated evaluation results for 10 code samples.
  3. Comparison table: deterministic tests vs. LLM Judge vs. human review.
  4. Spearman/Pearson correlation with n and p-value.
  5. Rationale for integrating judge results into a gate policy rather than using them alone.
  6. At least one observed judge bias and the corresponding mitigation.

  1. Telemetry has two axes: live monitoring (OTel) and audit/replay (append-only event store) do not replace each other.
  2. Seven shared attributes make analysis possible: stamp run.id, task.id, agent.role, model.name, tool.name, gate.result, artifact.path on every span and event so dashboards link to the audit log.
  3. The event store is append-only: it is the source of truth for deterministic reconstruction of a run, and even overrides are recorded as events.
  4. LLM Judge is an evaluator, not a test: deterministic gates come first; the judge handles readability, maintainability, and design where tests cannot.
  5. Be aware of five judge biases: length, position, self-preference, style, and refusal asymmetry each demand a specific prompt or gate-design response.
  6. Calibration is not a one-shot: gold set → blind score → correlation → error analysis → rubric update is a loop that runs at least once during the capstone.
  7. The dashboard is a decision tool: four core panels (success rate, token p95, judge vs tests, top failure reasons) form the backbone of the final report.

Papers / reports

  • Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (NeurIPS 2023)
  • Chen et al., “Humans or LLMs as the Judge? A Study on Judgement Biases” (EMNLP 2024)
  • Liu et al., “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (EMNLP 2023)