Week 12: Telemetry and LLM-as-Judge
Theory
Telemetry: The Eyes and Ears of HOTL
In the HOTL architecture, real-time telemetry is essential for a human supervisor to trust the agents.
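The monitoring code in this section assumes OpenTelemetry tracer and meter providers have already been configured. A minimal bootstrap sketch (console exporters for local development; the wiring here is illustrative, not part of the course code) could look like this:

```python
# Provider bootstrap: a sketch assuming the opentelemetry-sdk package is installed.
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Traces: print finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Metrics: export readings periodically to stdout.
metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]
    )
)
```

In production you would swap the console exporters for OTLP exporters pointed at your collector.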
```python
# telemetry.py — OpenTelemetry-based agent monitoring
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

tracer = trace.get_tracer("ralph-loop")
meter = metrics.get_meter("ralph-loop")

# Key metrics
loop_counter = meter.create_counter("ralph.loop.count")
token_usage = meter.create_histogram("ralph.tokens.used")
success_rate = meter.create_gauge("ralph.success.rate")

def traced_agent_loop(task: str):
    with tracer.start_as_current_span("agent_loop") as span:
        span.set_attribute("task.description", task)

        # Execute the loop
        result = run_ralph_loop(task)

        # Record metrics
        loop_counter.add(1, {"status": "success" if result.passed else "failure"})
        token_usage.record(result.tokens_used)

        span.set_attribute("result.passed", result.passed)
        span.set_attribute("tokens.used", result.tokens_used)

        return result
```

LLM-as-Judge Evaluation Framework
Use an LLM to automatically evaluate code quality, readability, and design patterns that are difficult to verify with automated tests alone.
````python
import anthropic
import json

JUDGE_SYSTEM_PROMPT = """You are a senior software engineer with 10 years of experience.
Evaluate the given code on the following criteria, scoring each from 1 to 10:

1. Correctness: Does the code correctly implement the requirements?
2. Readability: Is the code easy to read?
3. Efficiency: Are there any unnecessary computations?
4. Robustness: Are edge cases handled?
5. Maintainability: Will the code be easy to modify in the future?

Output format:
{
  "scores": {"correctness": 8, "readability": 7, ...},
  "overall": 7.5,
  "strengths": ["...", "..."],
  "improvements": ["...", "..."]
}"""


class LLMJudge:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def evaluate(self, code: str, requirement: str) -> dict:
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            system=JUDGE_SYSTEM_PROMPT,
            messages=[{
                "role": "user",
                "content": f"Requirement: {requirement}\n\nCode:\n```python\n{code}\n```",
            }],
        )
        return json.loads(response.content[0].text)
````

Practicum
- OpenTelemetry Integration — Add tracing and metrics to the Ralph Loop
- Dashboard Setup — Real-time monitoring with Grafana + Prometheus
- LLM-as-Judge Implementation — Complete the LLMJudge class and evaluate real code
- Cost Optimization Analysis — Analyze the token usage vs. code quality trade-off
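One practical wrinkle for the LLM-as-Judge practicum item: models sometimes wrap their JSON answer in a markdown code fence, which would make a bare json.loads in evaluate raise. A defensive parsing sketch (the helper name and expected key set are assumptions derived from the judge prompt above):

````python
import json

# Criteria the judge prompt asks for; assumed from JUDGE_SYSTEM_PROMPT above.
EXPECTED_KEYS = {"correctness", "readability", "efficiency", "robustness", "maintainability"}

def parse_judge_response(text: str) -> dict:
    """Parse the judge's reply, tolerating a markdown code fence around the JSON."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    data = json.loads(cleaned)
    missing = EXPECTED_KEYS - set(data.get("scores", {}))
    if missing:
        raise ValueError(f"judge omitted criteria: {sorted(missing)}")
    # Recompute overall as the mean of the five criterion scores.
    data["overall"] = sum(data["scores"][k] for k in EXPECTED_KEYS) / len(EXPECTED_KEYS)
    return data

raw = """```json
{"scores": {"correctness": 8, "readability": 7, "efficiency": 6,
            "robustness": 7, "maintainability": 8},
 "overall": 7.0, "strengths": ["clear naming"], "improvements": ["add tests"]}
```"""
result = parse_judge_response(raw)
print(result["overall"])  # → 7.2
````

Recomputing the overall score locally also guards against a model that reports an overall inconsistent with its own per-criterion scores.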
Assignment
Lab 11: Telemetry & Lab 12: LLM-as-Judge
Submission deadline: 2026-05-26 23:59
Lab 11 Requirements:
- Ralph Loop with OpenTelemetry integration
- Grafana dashboard screenshot (loop_count, token_usage, success_rate)
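For the Grafana dashboard requirement, one common route is to expose the OpenTelemetry metrics on a Prometheus scrape endpoint and point Grafana at Prometheus. A configuration sketch (assumes the opentelemetry-exporter-prometheus and prometheus-client packages are installed; port 9464 is the conventional default, so verify against your setup):

```python
# Expose Ralph Loop metrics on a /metrics endpoint that Prometheus can scrape.
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

start_http_server(port=9464)  # Prometheus scrapes http://localhost:9464/metrics
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PrometheusMetricReader()])
)
```

With this in place, add a Prometheus data source in Grafana and build panels for loop_count, token_usage, and success_rate.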
Lab 12 Requirements:
- Complete LLMJudge implementation
- Automated evaluation results for 10 code samples
- Correlation analysis between LLM Judge and human evaluator scores
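The Lab 12 correlation analysis needs nothing beyond a Pearson coefficient, which the standard library can compute directly. In the sketch below, the judge and human score lists are illustrative placeholders, not real data:

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; assumes equal-length lists with nonzero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder scores: one pair per evaluated code sample.
judge_scores = [7.5, 6.0, 8.5, 5.0, 9.0]
human_scores = [7.0, 6.5, 8.0, 5.5, 8.5]
print(round(pearson(judge_scores, human_scores), 3))  # → 0.984
```

A high r would suggest the judge tracks human ratings well, but also inspect individual disagreements, since a high correlation can hide systematic offsets in absolute scores.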