
Week 9: Implementing the QA Agent

Phase 3 · Week 9 · Advanced · Lecture: 2026-04-28

Concepts

Explain the conflict-of-interest a worker faces when reviewing its own code, and define the four independence principles a QA agent must hold.

Design

Design the feedback loop and escalation path: deterministic test → policy gate → LLM-as-Judge → human review.

Implementation

Integrate the QA agent and LLM-as-Judge in code, automating strict-JSON verdicts and retry logic.

Operations

Report judge reliability (correlation), human-override rate, and false-pass rate on a regular cadence and refresh the prompt / rubric.

Do you remember Phase 5: Verify from the multi-agent SDLC pipeline designed in Week 7? The core design principle at the time was “the verification agent must be independent of the generation agent.” This principle is grounded not in intuition but in empirical data.

According to agent system scaling research from DeepMind and MIT, when a verification agent shares context with a generation agent, it inherits the same biases. If the coder wrote code under a certain assumption, a QA agent with the same context may treat that assumption as given and skip verification. Shared context weakens verification independence.

PwC’s 2025 AI Agent Report offers more specific numbers. In a single-agent structure (coder only), accuracy was around 10%, but adding an independent judge agent raised it to 70% — a 7x improvement. QA agent independence is not optional; it is a prerequisite for system reliability.

In sdlc-toolkit, this is implemented in two stages:

  • /reflect — self-review: the coder agent first reviews its own output
  • /review — independent review: a separate QA agent evaluates only the code and tests

The reason for having an independent review even after self-review is simple. /reflect quickly catches obvious errors and incomplete items to reduce the burden on /review, while /review acts as the final unbiased gate. The two stages serve different purposes.


The QA agent never shares context with the coder agent. Two agents that share a context share the same biases, which makes independent verification impossible.

Three mechanisms actually enforce independence:

1. Context Isolation

The QA agent cannot see the coder’s reasoning trace, intermediate decision process, or system prompt. Its only input is the code file and test file. The moment QA knows “why this was implemented this way,” it starts accepting the coder’s rationalization. Not knowing leads to more accurate judgment.

2. Tool Restriction

In sdlc-toolkit’s /review stage, the QA agent has no Edit permission; only Read and Bash (for running tests) are allowed. If QA could fix the bugs it finds, reviews would grow lax under an “I’ll just fix it myself” mindset. Restricting tools keeps QA focused on discovery and hands fixes back to the coder. Role separation improves quality.

3. Model Tier Separation

Using Claude Code’s model routing feature, a more powerful model can be assigned to the QA agent. When the coder uses claude-sonnet, the QA can use claude-opus. Investing more reasoning capacity in verification than in generation is cost-effective.
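The three mechanisms can be sketched as plain data plus two small checks. This is a minimal illustration, not sdlc-toolkit's actual configuration format: `QA_ALLOWED_INPUTS`, `QA_ALLOWED_TOOLS`, `MODEL_BY_ROLE`, `build_qa_input`, and `check_tool` are all hypothetical names, and the model names are the tier labels used in the text rather than exact API identifiers.

```python
from dataclasses import dataclass

# Hypothetical names; the real toolkit may express these rules differently.
QA_ALLOWED_INPUTS = {"code", "tests"}            # 1. context isolation
QA_ALLOWED_TOOLS = frozenset({"Read", "Bash"})   # 2. tool restriction (no Edit)
MODEL_BY_ROLE = {"coder": "claude-sonnet", "qa": "claude-opus"}  # 3. tier separation


@dataclass(frozen=True)
class QAInput:
    code: str
    tests: str


def build_qa_input(coder_artifacts: dict) -> QAInput:
    """Keep only code and tests; reasoning traces and prompts are dropped."""
    missing = QA_ALLOWED_INPUTS - set(coder_artifacts)
    if missing:
        raise ValueError(f"missing required artifacts: {sorted(missing)}")
    return QAInput(code=coder_artifacts["code"], tests=coder_artifacts["tests"])


def check_tool(role: str, tool: str) -> bool:
    """Return True if the role may call the tool. Only QA is restricted here."""
    if role == "qa":
        return tool in QA_ALLOWED_TOOLS
    return True
```

Because `QAInput` has only two fields, anything else the coder produced (a reasoning trace, its system prompt) cannot reach the QA prompt by accident.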


qa_agent.py
import json
import os
import subprocess

import anthropic


class QAAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def run_tests(self, test_dir: str) -> dict:
        """Run pytest and collect the result."""
        # Plain pytest flags only; add --json-report if the
        # pytest-json-report plugin is installed.
        result = subprocess.run(
            ["python", "-m", "pytest", test_dir, "-v", "--tb=short"],
            capture_output=True, text=True
        )
        return {
            "passed": result.returncode == 0,
            "output": result.stdout,
            "errors": result.stderr
        }

    def code_review(self, diff: str) -> str:
        """Code review via Claude."""
        response = self.client.messages.create(
            model=os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-5"),
            max_tokens=2048,
            system="""You are a senior software engineer.
Review the code diff and check for the following:
1. Logic errors
2. Unhandled edge cases
3. Security vulnerabilities
4. Performance problems
5. Missing tests
Output: JSON {"approved": bool, "issues": [...], "suggestions": [...]}""",
            messages=[{"role": "user", "content": f"Code review request:\n{diff}"}]
        )
        return response.content[0].text

    @staticmethod
    def _is_approved(review: str) -> bool:
        """Read the "approved" flag from the JSON reply; unparseable output counts as a rejection."""
        try:
            return json.loads(review).get("approved") is True
        except (json.JSONDecodeError, AttributeError):
            return False

    def review_pr(self, pr_diff: str, test_dir: str) -> dict:
        """Verify the whole PR."""
        test_result = self.run_tests(test_dir)
        review_result = self.code_review(pr_diff)
        return {
            "tests_passed": test_result["passed"],
            "test_output": test_result["output"],
            "code_review": review_result,
            "approved": test_result["passed"] and self._is_approved(review_result)
        }
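The review prompt in qa_agent.py asks for a JSON verdict, but nothing guarantees the model returns bare JSON. A model may wrap its reply in a Markdown fence or add prose. The sketch below shows one way to enforce a strict-JSON verdict with a bounded retry. `parse_verdict` and `review_with_retry` are hypothetical helpers, and `call_model` stands in for any function that sends the diff to the reviewer and returns its text.

```python
import json
from typing import Callable

FENCE = "`" * 3  # Markdown code-fence marker


def parse_verdict(raw: str) -> dict:
    """Parse a reviewer reply as strict JSON; raise ValueError if malformed."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("approved"), bool):
        raise ValueError("verdict must be an object with a boolean 'approved'")
    data.setdefault("issues", [])
    data.setdefault("suggestions", [])
    return data


def review_with_retry(call_model: Callable[[str], str], diff: str,
                      max_attempts: int = 2) -> dict:
    """Ask again when the reply is not valid JSON; fail closed after the last attempt."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_verdict(call_model(diff))
        except ValueError as exc:
            last_error = exc
    return {
        "approved": False,
        "issues": [f"unparseable verdict: {last_error}"],
        "suggestions": [],
    }
```

Failing closed (treating an unparseable reply as a rejection) is a design choice assumed here: a gate that approves on malformed output would hide judge failures.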

We now implement the 3-parallel reviewer pattern designed in Week 7. Three perspectives — Correctness, Quality, and Architecture — are reviewed simultaneously, and a severity-based PASS/FAIL gate delivers the final verdict.

parallel_reviewer.py
import concurrent.futures
import json
import os
from dataclasses import dataclass
from typing import Literal

import anthropic


@dataclass
class ReviewResult:
    dimension: str
    passed: bool
    severity: Literal["critical", "major", "minor", "info"]
    issues: list[str]
    score: int  # 0-10


class ParallelReviewer:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def _call_claude(self, system: str, user: str) -> str:
        response = self.client.messages.create(
            model=os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-5"),
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user}]
        )
        return response.content[0].text

    @staticmethod
    def _to_result(dimension: str, raw: str) -> ReviewResult:
        """Parse a reviewer's JSON reply and apply the per-dimension pass rule."""
        data = json.loads(raw)  # raises if the model did not return bare JSON
        return ReviewResult(
            dimension=dimension,
            passed=data["score"] >= 4 and data["severity"] != "critical",
            severity=data["severity"],
            issues=data["issues"],
            score=data["score"]
        )

    def review_correctness(self, code: str, tests: str) -> ReviewResult:
        """Correctness review: logic errors, edge cases, test adequacy."""
        result = self._call_claude(
            system="""Review only the correctness of the code. Ignore style.
Check for: logic errors, unhandled edge cases, gaps in test coverage.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Code:\n{code}\n\nTests:\n{tests}"
        )
        return self._to_result("correctness", result)

    def review_quality(self, code: str) -> ReviewResult:
        """Quality review: coding conventions, readability, maintainability."""
        result = self._call_claude(
            system="""Review only the quality of the code. Ignore functional correctness.
Check for: naming, function length, duplication, adequacy of comments.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Code:\n{code}"
        )
        return self._to_result("quality", result)

    def review_architecture(self, code: str, context: str) -> ReviewResult:
        """Architecture review: design decisions, dependencies, extensibility."""
        result = self._call_claude(
            system="""Review from an architecture perspective only.
Check for: single responsibility principle, dependency direction, interface design, extensibility.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Context:\n{context}\n\nCode:\n{code}"
        )
        return self._to_result("architecture", result)

    def parallel_review(self, code: str, tests: str, context: str) -> dict:
        """Run the three reviews in parallel and merge the results."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
            f_correctness = executor.submit(self.review_correctness, code, tests)
            f_quality = executor.submit(self.review_quality, code)
            f_architecture = executor.submit(self.review_architecture, code, context)
            results = [
                f_correctness.result(),
                f_quality.result(),
                f_architecture.result()
            ]
        # Severity-based PASS/FAIL gate
        has_critical = any(r.severity == "critical" for r in results)
        all_pass = all(r.passed for r in results)
        avg_score = sum(r.score for r in results) / len(results)
        return {
            "overall_passed": all_pass and not has_critical,
            "average_score": avg_score,
            "results": results,
            "blocking_issues": [
                issue
                for r in results if r.severity == "critical"
                for issue in r.issues
            ]
        }

Before the independent review, the coder agent reviews its own work first. This proactively catches obvious errors and surfaces ambiguous requirements as questions, reducing the burden on the QA agent.

self_reflect_agent.py
import json

import anthropic


class SelfReflectAgent:
    """Coder agent self-review: implements the /reflect pattern."""

    def __init__(self):
        self.client = anthropic.Anthropic()

    def reflect(self, code: str, original_requirement: str) -> dict:
        """Self-review the implementation against the requirements."""
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system="""You are the developer who just wrote this code. Review your own work dispassionately.
Check:
1. Is every item in the requirements implemented?
2. Are there any obvious bugs or typos?
3. Do the tests verify the actual requirements?
4. Is anything ambiguous or dependent on assumptions?
Output: JSON {
"obvious_issues": [...],
"ambiguous_requirements": [...],
"questions_for_qa": [...],
"self_confidence": 0-10
}""",
            messages=[{
                "role": "user",
                "content": f"Requirements:\n{original_requirement}\n\nCode I wrote:\n{code}"
            }]
        )
        return json.loads(response.content[0].text)

    def should_proceed_to_review(self, reflect_result: dict) -> bool:
        """Decide from the self-review result whether to proceed to independent review."""
        # If there are obvious issues, fix them first and retry
        if reflect_result["obvious_issues"]:
            return False
        # If confidence is too low, rework
        if reflect_result["self_confidence"] < 4:
            return False
        return True

QA Feedback Loop

  1. QA Agent detects failure
  2. Package failure information
       • Failed test cases
       • Stack trace
       • Code diff
  3. Reassign to Coder Agent: add new task to task_queue.json (priority: HIGH)
  4. Coder Agent re-runs
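The packaging and reassignment steps could look roughly like the sketch below. The layout of task_queue.json is an assumption (a JSON array of task objects); the field names `id`, `type`, `assignee`, and `failure` are hypothetical, and only the HIGH priority comes from the loop description.

```python
import json
from pathlib import Path


def package_failure(failed_tests: list, stack_trace: str, diff: str) -> dict:
    """Bundle the three pieces of failure information for the coder."""
    return {
        "failed_tests": list(failed_tests),
        "stack_trace": stack_trace,
        "diff": diff,
    }


def enqueue_fix_task(queue_path: str, task_id: str, failure: dict) -> list:
    """Append a HIGH-priority fix task to the queue file (assumed to hold a JSON array)."""
    path = Path(queue_path)
    queue = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    queue.append({
        "id": task_id,            # hypothetical field
        "type": "fix",            # hypothetical field
        "priority": "HIGH",
        "assignee": "coder",      # hypothetical field
        "failure": failure,
    })
    path.write_text(json.dumps(queue, indent=2), encoding="utf-8")
    return queue
```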

The feedback loop enables automatic recovery, but it also carries the “infinite refinement loop” risk warned about in Week 7. The loop therefore needs a design that guarantees convergence.

Iteration Cap: 3

The loop is limited to a maximum of 3 iterations. Problems not resolved within 3 attempts likely indicate a fundamentally flawed approach by the coder. Continuing automatic fixes can actually introduce more complex bugs.

Escalation Path

1st failure → Automatic fix attempt (Coder re-runs)
2nd failure → Detailed feedback + request to revisit requirements
3rd failure → Set human intervention flag (recorded in pipeline-state.json)

sdlc-toolkit’s /proceed Phase 5 integrates this pattern into three stages: reflect → review → escalate. The qa_iteration_count field in pipeline-state.json tracks the iteration count and determines escalation triggers.
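A minimal sketch of the capped loop follows, assuming the pipeline state is held in a dict that mirrors pipeline-state.json. Only `qa_iteration_count` is named in the text; `status`, `human_intervention`, and the callables `run_qa` and `run_coder` are hypothetical stand-ins for the real agents.

```python
from typing import Callable

MAX_QA_ITERATIONS = 3  # iteration cap from the text


def escalation_action(failure_count: int) -> str:
    """Map the number of QA failures so far to the next step in the escalation path."""
    if failure_count <= 1:
        return "auto_fix"            # 1st failure: coder re-runs
    if failure_count < MAX_QA_ITERATIONS:
        return "detailed_feedback"   # 2nd failure: feedback + revisit requirements
    return "human_review"            # 3rd failure: set the human intervention flag


def run_qa_loop(run_qa: Callable[[], bool],
                run_coder: Callable[[str], None],
                state: dict) -> dict:
    """Alternate QA and coder runs until QA passes or the cap is reached."""
    state.setdefault("qa_iteration_count", 0)
    while True:
        if run_qa():
            state["status"] = "passed"
            return state
        state["qa_iteration_count"] += 1
        action = escalation_action(state["qa_iteration_count"])
        if action == "human_review":
            state["status"] = "needs_human"
            state["human_intervention"] = True
            return state
        run_coder(action)
```

Every pass through the loop either returns or increments the counter, so the loop ends after at most three QA runs and two coder re-runs.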


This is the 4-dimensional scoring system defined in sdlc-toolkit’s llm-review-prompt.md. It converts subjective “good/bad” judgments into quantitative 0-10 scores, enabling pipeline automation. In Week 12’s telemetry system, these scores are aggregated to track overall quality trends across the system.

PASS Criteria: All 4 dimensions ≥ 4 AND 0 Critical issues

{
  "review_id": "string",
  "timestamp": "ISO-8601",
  "target": {
    "file": "string",
    "commit": "string"
  },
  "scores": {
    "correctness": {
      "score": 0,
      "max": 10,
      "rationale": "string",
      "issues": []
    },
    "conventions": {
      "score": 0,
      "max": 10,
      "rationale": "string",
      "issues": []
    },
    "test_coverage": {
      "score": 0,
      "max": 10,
      "rationale": "string",
      "issues": []
    },
    "security": {
      "score": 0,
      "max": 10,
      "rationale": "string",
      "issues": []
    }
  },
  "critical_issues": [],
  "verdict": "PASS | FAIL",
  "feedback_for_coder": "string"
}

This scoring system is itself an implementation of the LLM-as-Judge pattern by the QA agent. Week 12 covers how to aggregate this score data to build agent performance telemetry.
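The PASS criteria can be checked mechanically once a review document in the shape above is available. The sketch below is one way to derive the verdict. `compute_verdict`, `DIMENSIONS`, and `PASS_THRESHOLD` are illustrative names, not part of llm-review-prompt.md.

```python
PASS_THRESHOLD = 4
DIMENSIONS = ("correctness", "conventions", "test_coverage", "security")


def compute_verdict(review: dict) -> str:
    """Return "PASS" only if every dimension scores >= 4 and there are no critical issues."""
    scores = review["scores"]
    all_meet_threshold = all(
        scores[dim]["score"] >= PASS_THRESHOLD for dim in DIMENSIONS
    )
    no_critical = len(review.get("critical_issues", [])) == 0
    return "PASS" if all_meet_threshold and no_critical else "FAIL"
```

Computing the verdict in code rather than trusting the model's own `verdict` field keeps the gate deterministic even when the judge's scores and its stated verdict disagree.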


This is the end-to-end chain running from Week 8 (PlannerAgent) → Coder → Week 9 (QAAgent). The pipeline-state.json designed in Week 7 serves as the central state store tracking completion of each Phase.

Artifact Chain

  • Planner input: requirement.md
  • Planner output (Week 8): architecture.md
  • Coder assignment from task_queue.json: TASK-001.md, TASK-002.md
  • Coder output: PR (code + tests)
  • QA output (Week 9): review-results.json
  • Record of learned failure patterns: LESSON-001.md
  • Central state store: pipeline-state.json


pipeline-state.json records when each artifact was generated, the responsible agent, and the current Phase. Even if the pipeline is interrupted, you can identify how far it progressed and restart from that point.
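Restarting from the interruption point amounts to finding the first expected artifact that the state file has not recorded. The sketch below assumes a hypothetical layout in which pipeline-state.json maps artifact names to records under an `artifacts` key; the actual schema from Week 7 may differ.

```python
from typing import Optional

# Expected order, taken from the artifact chain; task files are abbreviated to one entry.
EXPECTED_ARTIFACTS = [
    "requirement.md",
    "architecture.md",
    "TASK-001.md",
    "PR",
    "review-results.json",
]


def next_missing_artifact(state: dict, expected: list) -> Optional[str]:
    """Return the first expected artifact with no record, or None if all exist."""
    recorded = state.get("artifacts", {})  # hypothetical key
    for name in expected:
        if name not in recorded:
            return name
    return None


# Example state: the planner finished, the coder has not started.
example_state = {
    "current_phase": "plan",  # hypothetical field
    "artifacts": {
        "requirement.md": {"agent": "human", "created_at": "2026-04-28T09:00:00Z"},
        "architecture.md": {"agent": "planner", "created_at": "2026-04-28T09:05:00Z"},
    },
}
resume_from = next_missing_artifact(example_state, EXPECTED_ARTIFACTS)
```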


The goal of discussion is not to find the right answer, but to clarify trade-offs.

Q1. What problems arise if the QA agent is given Edit permission?

“Wouldn’t it be more efficient to fix bugs directly when found?” — Find the logical flaw in this argument. How does QA’s role change when it has Edit permission? Discuss the trade-off between short-term efficiency and long-term reliability.

Q2. If you had to choose only one of the 3-parallel reviewers (correctness, quality, architecture)?

The answer may vary depending on the team’s current situation (startup MVP vs. financial system vs. open-source library). Which dimension is most important in each situation, and why? How can the two dimensions you didn’t choose be compensated for?

Q3. Why was the feedback loop iteration cap set to 3?

Explain in connection with the “infinite refinement loop” risk from Week 7. What problems arise if the cap is reduced to 1? What if it’s raised to 10? Is the number 3 mathematically grounded, or an empirical heuristic?

Q4. What is the relationship between Week 12’s LLM-as-Judge and this week’s QA agent?

The question “Isn’t the QA agent already an LLM-as-Judge?” is legitimate. Find the similarities and differences between the two. What does Week 12 add? How do telemetry and aggregation differ from a simple one-time judgment?


  1. Implement the QA Agent — Complete the QAAgent class based on the code above

  2. Automated Code Review Pipeline — Extract git diff → Claude review → Structure results

  3. Integrate the Feedback Loop — Automatically reassign to the Coder when QA fails

  4. Full Pipeline Integration — Run Planner → Coder → QA end-to-end

Submission deadline: 2026-05-05 23:59

Requirements:

  1. Working QAAgent implementation
  2. Automated code review feature (using Claude API)
  3. Feedback loop implementation (QA failure → Coder re-run)
  4. Video or log demonstrating the full Planner → Coder → QA 3-stage pipeline end-to-end

  1. QA independence = context isolation + tool restriction + separate model tier. Blocking bias inheritance is the core, empirically validated by PwC research showing accuracy improvements from 10% to 70%.

  2. 2-stage review: /reflect (self-review) → /review (independent review). Self-review proactively eliminates obvious errors, reducing the burden on the independent review.

  3. 3-parallel reviewers: each specialized in correctness, quality, and architecture. A severity-based PASS/FAIL gate delivers the final verdict, and parallel execution simultaneously obtains all three perspectives without delay.

  4. Feedback loop + iteration cap: a balance between automatic recovery and infinite-loop prevention. A cap of 3 and an escalation path (automatic fix → detailed feedback → human intervention) guarantee convergence.

  5. LLM-as-Judge scoring: 4 dimensions (correctness, conventions, test coverage, security) × 0-10 scores. PASS requires all dimensions ≥ 4 AND 0 Critical issues, and this becomes the data source for Week 12 telemetry.


  1. DeepMind + MIT “Towards a Science of Scaling Agent Systems” — Empirical basis showing that verification agents inherit biases when sharing context with generation agents. The theoretical foundation for designing agent independence.

  2. PwC AI Agent Report (2025) — Industry data showing accuracy improves from 10% to 70% when an independent judge agent is added compared to a single-agent setup. Includes analysis of the specific mechanisms behind the 7x improvement.

  3. sdlc-toolkit /review + /reflect official documentation — Reference implementation of the production-level 2-stage review pattern. Includes tool restriction settings, model routing, and the pipeline-state.json schema.

  4. “Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by ChatGPT” (arXiv, 2024) — Analysis of biases and limitations in LLM-based automatic evaluation. Provides practical guidance for improving scoring reliability when designing QA agents.