Week 9: Implementing the Planner and QA Agents
Theory
The two pillars of the multi-agent SDLC pipeline are the planner agent at the pipeline’s entry point and the QA agent at the validation gate. In the 9-phase pipeline designed in Week 7, Phases 1–3 belong to the planner and Phase 5 to the QA. This week we implement both agents in sequence — from the planner that generates spec.md to the QA that validates independently using only code and tests.
Why the Planner Agent Is the Pipeline Bottleneck
Phases 1–3 of the 9-phase agentic SDLC designed in Week 7 (requirements → architecture → task decomposition) are entirely the planner’s domain. The quality of these three phases determines the success rate of Phases 4–9.
Intuitively: no matter how capable the coder agent is, if the input spec.md just says “Add user authentication” in one line, it cannot produce correct code. Conversely, if the acceptance criteria are specified at a testable level, even a smaller model can generate adequate code.
MetaGPT (ICLR 2024) demonstrates this empirically. In the PM → Architect → Engineer sequence, the quality of each role’s SOP (Standard Operating Procedure) documents has a correlation of 0.72 with the final code quality. The clarity of the PRD written by the PM role was the strongest predictor.
2-Phase Separation Pattern: /spec → /architect
sdlc-toolkit deliberately separates requirement specification from architecture design, applying the software-engineering principle of Separation of Concerns to agent design.
- /spec — “what”: requirements, acceptance criteria, constraints, out-of-scope items. No implementation details.
- /architect — “how”: architectural decisions, module decomposition, TASK file generation, dependency DAG.
The practical benefit: during the spec phase, the LLM focuses on requirement completeness instead of getting lost in implementation details. The architect phase then takes the fixed requirements and searches for the optimal structure.
Planner Agent Implementation
```python
import json
from pathlib import Path

import anthropic

SYSTEM_PROMPT = """You are a planner agent playing the role of software architect.
Analyze the given requirement and codebase and produce a concrete, executable task list.

Output format: JSON
{
  "tasks": [
    {
      "id": "task-XXX",
      "description": "concrete implementation target",
      "target_files": ["file paths"],
      "dependencies": ["task-XXX"],
      "acceptance_criteria": ["verifiable conditions"],
      "assumptions": ["assumptions made"]
    }
  ],
  "out_of_scope": ["out-of-scope items"]
}"""


class PlannerAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def analyze_codebase(self, project_root: Path) -> str:
        """List the project's Python files so tasks can reference real paths."""
        structure = []
        for f in project_root.rglob("*.py"):
            structure.append(f"- {f.relative_to(project_root)}")
        return "\n".join(structure)

    def plan(self, requirement: str, project_root: Path) -> dict:
        codebase = self.analyze_codebase(project_root)
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            messages=[{
                "role": "user",
                "content": f"Requirement: {requirement}\n\nCodebase:\n{codebase}"
            }]
        )
        return json.loads(response.content[0].text)

    def generate_spec_md(self, plan: dict, output_path: Path):
        """Convert the JSON plan into a Markdown specification."""
        with open(output_path, 'w') as f:
            f.write("# Project Specification\n\n")
            if plan.get('out_of_scope'):
                f.write("## Out of Scope\n")
                for item in plan['out_of_scope']:
                    f.write(f"- {item}\n")
                f.write("\n")
            for task in plan['tasks']:
                f.write(f"## {task['id']}: {task['description']}\n")
                f.write(f"- Target files: {', '.join(task['target_files'])}\n")
                f.write("- Acceptance criteria:\n")
                for criterion in task['acceptance_criteria']:
                    f.write(f"  - [ ] {criterion}\n")
                f.write("\n")
```

Spec Quality Validation and Dependency DAG
Before handing the spec to the coder, automatic validation is required. Additionally, the dependencies array in TASK files determines tier-based parallel execution.
```python
class PlannerAgent:  # continued from above
    def validate_spec(self, plan: dict) -> tuple[bool, list[str]]:
        """Automatic spec quality validation — gate before handing off to the coder."""
        issues = []
        for task in plan['tasks']:
            if not task.get('acceptance_criteria'):
                issues.append(f"{task['id']}: missing acceptance_criteria")
            elif len(task['acceptance_criteria']) < 2:
                issues.append(f"{task['id']}: acceptance_criteria needs at least 2 entries")
            if 'assumptions' not in task:
                issues.append(f"{task['id']}: missing assumptions field")
        if 'out_of_scope' not in plan:
            issues.append("spec-level out_of_scope not defined")
        return len(issues) == 0, issues
```
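To see the gate in action, here is the same validation logic as a standalone function (no API client needed), run on a deliberately incomplete toy plan:

```python
# Standalone version of the validate_spec gate, applied to a toy plan.
def validate_spec(plan: dict) -> tuple[bool, list[str]]:
    issues = []
    for task in plan["tasks"]:
        if not task.get("acceptance_criteria"):
            issues.append(f"{task['id']}: missing acceptance_criteria")
        elif len(task["acceptance_criteria"]) < 2:
            issues.append(f"{task['id']}: acceptance_criteria needs at least 2 entries")
        if "assumptions" not in task:
            issues.append(f"{task['id']}: missing assumptions field")
    if "out_of_scope" not in plan:
        issues.append("spec-level out_of_scope not defined")
    return len(issues) == 0, issues

plan = {
    "tasks": [
        {"id": "task-001",
         "acceptance_criteria": ["returns 200", "rejects bad input"],
         "assumptions": ["single-tenant deployment"]},
        {"id": "task-002",
         "acceptance_criteria": ["works"]},  # too few criteria, no assumptions
    ]
}  # note: no spec-level out_of_scope either
ok, issues = validate_spec(plan)
print(ok)          # False
print(len(issues))  # 3
```

The gate rejects this plan for three reasons: task-002 has fewer than two acceptance criteria, task-002 lacks an assumptions field, and the spec-level out_of_scope is missing.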
```python
class PlannerAgent:  # continued from above
    def compute_tiers(self, plan: dict) -> dict[str, int]:
        """Implements the tier concept from Week 7 — detects parallel-executable groups.

        Tasks with no dependencies land in tier 1; a task's tier is
        1 + the maximum tier among its dependencies.
        """
        task_map = {t['id']: t for t in plan['tasks']}
        tiers = {}

        def get_tier(task_id: str) -> int:
            if task_id in tiers:
                return tiers[task_id]
            deps = task_map[task_id].get('dependencies', [])
            tiers[task_id] = 1 if not deps else max(get_tier(d) for d in deps) + 1
            return tiers[task_id]

        for task in plan['tasks']:
            get_tier(task['id'])
        return tiers
```

Why the QA Agent Must Be Separate
Do you remember Phase 5: Verify from the multi-agent SDLC pipeline designed in Week 7? The core design principle at the time was “the verification agent must be independent of the generation agent.” This principle is grounded not in intuition but in empirical data.
According to agent system scaling research from DeepMind and MIT, when a verification agent shares context with a generation agent, it inherits the same biases. If the coder wrote code under a certain assumption, a QA agent with the same context will take that assumption for granted and skip verification. This is the “I thought so too” bias.
PwC’s 2025 AI Agent Report offers more specific numbers. In a single-agent structure (coder only), accuracy was around 10%, but adding an independent judge agent raised it to 70% — a 7x improvement. QA agent independence is not optional; it is a prerequisite for system reliability.
In sdlc-toolkit, this is implemented in two stages:
- /reflect — self-review: the coder agent first reviews its own output
- /review — independent review: a separate QA agent evaluates only the code and tests
The reason for having an independent review even after self-review is simple. /reflect quickly catches obvious errors and incomplete items to reduce the burden on /review, while /review acts as the final unbiased gate. The two stages serve different purposes.
Independence Principles of the QA Agent
The QA agent never shares context with the coder agent. Two agents sharing the same context share the same biases, making independent verification impossible.
Three mechanisms actually enforce independence:
1. Context Isolation
The QA agent cannot see the coder’s reasoning trace, intermediate decision process, or system prompt. Its only input is the code file and test file. The moment QA knows “why this was implemented this way,” it starts accepting the coder’s rationalization. Not knowing leads to more accurate judgment.
2. Tool Restriction
In sdlc-toolkit’s /review stage, the QA agent has no Edit permission. Only Read and Bash (for running tests) are allowed. If QA can fix bugs it finds directly, loose reviews arise from the mindset of “I’ll fix it anyway.” Restricting tools keeps QA focused on discovery, handing fixes back to the coder. Role separation improves quality.
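Concretely, this can be enforced with a Claude Code permissions block along the following lines (placed in the QA agent's settings file, e.g. .claude/settings.json); treat the exact rule syntax as an assumption to verify against the current Claude Code documentation:

```json
{
  "permissions": {
    "allow": ["Read", "Bash(python -m pytest:*)"],
    "deny": ["Edit", "Write"]
  }
}
```

With Edit and Write denied, the QA agent can read code and run the test suite but must hand every fix back to the coder.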
3. Model Tier Separation
Using Claude Code’s model routing feature, a more powerful model can be assigned to the QA agent. When the coder uses claude-sonnet, the QA can use claude-opus. Investing more reasoning capacity in verification than in generation is cost-effective.
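As a sketch, the tier separation reduces to a small routing table; the model IDs follow the ones used in this chapter's code, and the helper name is illustrative:

```python
# Role-to-model routing: invest more reasoning capacity in verification
# than in generation. Model IDs follow this chapter's examples.
MODEL_ROUTING = {
    "coder": "claude-sonnet-4-6",
    "reflect": "claude-sonnet-4-6",  # self-review stays on the coder's tier
    "qa": "claude-opus-4-6",         # the independent gate gets the stronger model
}

def model_for(role: str) -> str:
    # Default to the cheaper tier for any unlisted role.
    return MODEL_ROUTING.get(role, "claude-sonnet-4-6")

print(model_for("qa"))  # claude-opus-4-6
```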
QA Agent Implementation
```python
import json
import subprocess

import anthropic


class QAAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def run_tests(self, test_dir: str) -> dict:
        """Run pytest and parse the result.

        Note: --json-report requires the pytest-json-report plugin.
        """
        result = subprocess.run(
            ["python", "-m", "pytest", test_dir, "-v", "--tb=short", "--json-report"],
            capture_output=True,
            text=True
        )
        return {
            "passed": result.returncode == 0,
            "output": result.stdout,
            "errors": result.stderr
        }

    def code_review(self, diff: str) -> str:
        """Code review via Claude."""
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system="""You are a senior software engineer.
Review the code diff and check for:
1. Logical errors
2. Unhandled edge cases
3. Security vulnerabilities
4. Performance problems
5. Missing tests

Output: JSON {"approved": bool, "issues": [...], "suggestions": [...]}""",
            messages=[{"role": "user", "content": f"Code review request:\n{diff}"}]
        )
        return response.content[0].text

    def review_pr(self, pr_diff: str, test_dir: str) -> dict:
        """Verify an entire PR."""
        test_result = self.run_tests(test_dir)
        review_result = self.code_review(pr_diff)
        # Parse the JSON verdict rather than substring-matching, which would
        # miss the actual '"approved": true' formatting in the response.
        try:
            review_approved = json.loads(review_result).get("approved", False)
        except json.JSONDecodeError:
            review_approved = False
        return {
            "tests_passed": test_result["passed"],
            "test_output": test_result["output"],
            "code_review": review_result,
            "approved": test_result["passed"] and review_approved
        }
```

3-Parallel Reviewer Implementation
We now implement the 3-parallel reviewer pattern designed in Week 7. Three perspectives — Correctness, Quality, and Architecture — are reviewed simultaneously, and a severity-based PASS/FAIL gate delivers the final verdict.
```python
import concurrent.futures
import json
from dataclasses import dataclass
from typing import Literal

import anthropic


@dataclass
class ReviewResult:
    dimension: str
    passed: bool
    severity: Literal["critical", "major", "minor", "info"]
    issues: list[str]
    score: int  # 0-10


class ParallelReviewer:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def _call_claude(self, system: str, user: str) -> str:
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user}]
        )
        return response.content[0].text

    def review_correctness(self, code: str, tests: str) -> ReviewResult:
        """Correctness review: logical errors, edge cases, test sufficiency."""
        result = self._call_claude(
            system="""Review only the code's correctness. Ignore style.
Check for: logical errors, unhandled edge cases, test coverage gaps.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Code:\n{code}\n\nTests:\n{tests}"
        )
        data = json.loads(result)
        return ReviewResult(
            dimension="correctness",
            passed=data["score"] >= 4 and data["severity"] != "critical",
            severity=data["severity"],
            issues=data["issues"],
            score=data["score"]
        )

    def review_quality(self, code: str) -> ReviewResult:
        """Quality review: coding conventions, readability, maintainability."""
        result = self._call_claude(
            system="""Review only the code quality. Ignore functional correctness.
Check for: naming, function length, duplication, comment sufficiency.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Code:\n{code}"
        )
        data = json.loads(result)
        return ReviewResult(
            dimension="quality",
            passed=data["score"] >= 4 and data["severity"] != "critical",
            severity=data["severity"],
            issues=data["issues"],
            score=data["score"]
        )

    def review_architecture(self, code: str, context: str) -> ReviewResult:
        """Architecture review: design decisions, dependencies, extensibility."""
        result = self._call_claude(
            system="""Review only from an architecture perspective.
Check for: single responsibility, dependency direction, interface design, extensibility.
Output: JSON {"score": 0-10, "severity": "critical|major|minor|info", "issues": [...]}""",
            user=f"Context:\n{context}\n\nCode:\n{code}"
        )
        data = json.loads(result)
        return ReviewResult(
            dimension="architecture",
            passed=data["score"] >= 4 and data["severity"] != "critical",
            severity=data["severity"],
            issues=data["issues"],
            score=data["score"]
        )

    def parallel_review(self, code: str, tests: str, context: str) -> dict:
        """Run the three reviews in parallel and merge the results."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
            f_correctness = executor.submit(self.review_correctness, code, tests)
            f_quality = executor.submit(self.review_quality, code)
            f_architecture = executor.submit(self.review_architecture, code, context)

            results = [
                f_correctness.result(),
                f_quality.result(),
                f_architecture.result()
            ]

        # Severity-based PASS/FAIL gate
        has_critical = any(r.severity == "critical" for r in results)
        all_pass = all(r.passed for r in results)
        avg_score = sum(r.score for r in results) / len(results)

        return {
            "overall_passed": all_pass and not has_critical,
            "average_score": avg_score,
            "results": results,
            "blocking_issues": [
                issue
                for r in results if r.severity == "critical"
                for issue in r.issues
            ]
        }
```

Self-Review Pattern (/reflect)
Before the independent review, the coder agent reviews its own work first. This proactively catches obvious errors and surfaces ambiguous requirements as questions, reducing the burden on the QA agent.
```python
import json

import anthropic


class SelfReflectAgent:
    """Self-review by the coder agent — implements the /reflect pattern."""

    def __init__(self):
        self.client = anthropic.Anthropic()

    def reflect(self, code: str, original_requirement: str) -> dict:
        """Self-review the implementation against the requirement."""
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system="""You are the developer who just wrote this code. Review your own work dispassionately.

Check:
1. Is every item in the requirement implemented?
2. Are there obvious bugs or typos?
3. Do the tests actually verify the requirement?
4. Are there ambiguous parts, or parts that rely on assumptions?

Output: JSON
{
  "obvious_issues": [...],
  "ambiguous_requirements": [...],
  "questions_for_qa": [...],
  "self_confidence": 0-10
}""",
            messages=[{
                "role": "user",
                "content": f"Requirement:\n{original_requirement}\n\nMy code:\n{code}"
            }]
        )
        return json.loads(response.content[0].text)

    def should_proceed_to_review(self, reflect_result: dict) -> bool:
        """Decide from the self-review whether to proceed to independent review."""
        # If there are obvious issues, fix them first and retry
        if reflect_result["obvious_issues"]:
            return False
        # If self-confidence is too low, rework
        if reflect_result["self_confidence"] < 4:
            return False
        return True
```

Feedback Loop Design
When QA fails, the following feedback is handed back to the coder:

- Failed test cases
- Stack trace
- Code diff
The feedback loop enables automatic recovery, but it also carries the “infinite refinement loop” risk warned about in Week 7. A convergence guarantee design is needed to address this.
Iteration Cap: 3
The loop is limited to a maximum of 3 iterations. Problems not resolved within 3 attempts likely indicate a fundamentally flawed approach by the coder. Continuing automatic fixes can actually introduce more complex bugs.
Escalation Path
1st failure → Automatic fix attempt (Coder re-runs)
2nd failure → Detailed feedback + request to revisit requirements
3rd failure → Set human intervention flag (recorded in pipeline-state.json)

sdlc-toolkit’s /proceed Phase 5 integrates this pattern into three stages: reflect → review → escalate. The qa_iteration_count field in pipeline-state.json tracks the iteration count and determines escalation triggers.
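The escalation path above can be sketched as a driver loop. Here run_qa and run_coder_fix are stand-in stubs with illustrative names; in the real pipeline they would call QAAgent.review_pr and re-dispatch the coder agent:

```python
# Sketch of the iteration-capped feedback loop with escalation.
MAX_ITERATIONS = 3

def run_qa(state: dict) -> dict:
    # Stub: QA passes once enough fixes have landed.
    return {"approved": state.get("fixes", 0) >= state.get("fixes_needed", 99),
            "issues": ["failing test"]}

def run_coder_fix(state: dict, feedback: list[str], revisit_requirements: bool = False) -> None:
    state["fixes"] = state.get("fixes", 0) + 1  # stub: each call "fixes" something

def feedback_loop(state: dict) -> dict:
    for attempt in range(1, MAX_ITERATIONS + 1):
        state["qa_iteration_count"] = attempt
        qa = run_qa(state)
        if qa["approved"]:
            state["phase_5_status"] = "passed"
            return state
        if attempt < MAX_ITERATIONS:
            # attempt 1: automatic fix; attempt 2: feedback + requirement revisit
            run_coder_fix(state, qa["issues"], revisit_requirements=(attempt == 2))
    state["needs_human_intervention"] = True  # 3rd failure: escalate to a human
    return state

converged = feedback_loop({"fixes_needed": 2})  # resolved within the cap
escalated = feedback_loop({"fixes_needed": 5})  # never converges
print(converged["phase_5_status"], escalated.get("needs_human_intervention"))  # passed True
```

The converging run passes on its third QA check; the non-converging run exhausts the cap and sets the human-intervention flag instead of looping forever.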
LLM-as-Judge Review Scoring
This is the 4-dimensional scoring system defined in sdlc-toolkit’s llm-review-prompt.md. It converts subjective “good/bad” judgments into quantitative 0-10 scores, enabling pipeline automation. In Week 12’s telemetry system, these scores are aggregated to track overall quality trends across the system.
PASS Criteria: All 4 dimensions ≥ 4 AND 0 Critical issues
```json
{
  "review_id": "string",
  "timestamp": "ISO-8601",
  "target": { "file": "string", "commit": "string" },
  "scores": {
    "correctness":   { "score": 0, "max": 10, "rationale": "string", "issues": [] },
    "conventions":   { "score": 0, "max": 10, "rationale": "string", "issues": [] },
    "test_coverage": { "score": 0, "max": 10, "rationale": "string", "issues": [] },
    "security":      { "score": 0, "max": 10, "rationale": "string", "issues": [] }
  },
  "critical_issues": [],
  "verdict": "PASS | FAIL",
  "feedback_for_coder": "string"
}
```

```python
from dataclasses import dataclass


@dataclass
class ReviewScores:
    correctness: int    # logical correctness, edge cases
    conventions: int    # coding conventions, naming
    test_coverage: int  # test sufficiency and quality
    security: int       # vulnerabilities, input validation

    def verdict(self, critical_issues: list[str]) -> str:
        """PASS criteria: all dimensions >= 4 AND no critical issues."""
        if critical_issues:
            return "FAIL"
        scores = [
            self.correctness, self.conventions,
            self.test_coverage, self.security
        ]
        if all(s >= 4 for s in scores):
            return "PASS"
        failing = [
            name for name, score in zip(
                ["correctness", "conventions", "test_coverage", "security"],
                scores
            ) if score < 4
        ]
        return f"FAIL (low scores: {', '.join(failing)})"

    def to_pipeline_state(self) -> dict:
        """Serialize for updating pipeline-state.json."""
        return {
            "qa_scores": {
                "correctness": self.correctness,
                "conventions": self.conventions,
                "test_coverage": self.test_coverage,
                "security": self.security,
                "average": sum([
                    self.correctness, self.conventions,
                    self.test_coverage, self.security
                ]) / 4
            }
        }
```

This scoring system is itself an implementation of the LLM-as-Judge pattern by the QA agent. Week 12 covers how to aggregate this score data to build agent performance telemetry.
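As a quick sanity check, the PASS rule from the verdict method can be exercised standalone, extracted here as a plain function over a score dict:

```python
# Standalone restatement of the PASS rule: every dimension >= 4 AND zero critical issues.
def verdict(scores: dict[str, int], critical_issues: list[str]) -> str:
    if critical_issues:
        return "FAIL"
    failing = [name for name, s in scores.items() if s < 4]
    return "PASS" if not failing else f"FAIL (low scores: {', '.join(failing)})"

print(verdict({"correctness": 7, "conventions": 5, "test_coverage": 4, "security": 8}, []))
# PASS
print(verdict({"correctness": 9, "conventions": 9, "test_coverage": 3, "security": 9}, []))
# FAIL (low scores: test_coverage)
print(verdict({"correctness": 10, "conventions": 10, "test_coverage": 10, "security": 10},
              ["hard-coded credential"]))
# FAIL
```

Note the last case: a single critical issue overrides even perfect dimension scores.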
Full Pipeline Integration
This is the end-to-end chain running from Week 8 (PlannerAgent) → Coder → Week 9 (QAAgent). The pipeline-state.json designed in Week 7 serves as the central state store tracking completion of each Phase.
Artifact Chain
```
requirement.md             ← Planner input
      ↓
architecture.md            ← Planner output (Week 8)
      ↓
TASK-001.md, TASK-002.md   ← Coder assignments based on task_queue.json
      ↓
PR (code + tests)          ← Coder output
      ↓
review-results.json        ← QA output (Week 9)
      ↓
LESSON-001.md              ← Record of learned failure patterns
```

pipeline-state.json records when each artifact was generated, the responsible agent, and the current Phase. Even if the pipeline is interrupted, you can identify how far it progressed and restart from that point.
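A resume check along these lines can sit at the pipeline entry point. The phase names and the completed_phases field are illustrative, not the toolkit's exact pipeline-state.json schema:

```python
# Hedged sketch of restart-from-interruption logic. Phase and field names
# are illustrative; the real pipeline-state.json schema may differ.
PHASES = ["plan", "architect", "decompose", "code", "verify", "merge"]

def next_phase(state: dict) -> str:
    """Return the first phase not yet marked complete, so an interrupted
    pipeline restarts where it stopped instead of from the beginning."""
    done = set(state.get("completed_phases", []))
    for phase in PHASES:
        if phase not in done:
            return phase
    return "done"

print(next_phase({}))                                           # plan
print(next_phase({"completed_phases": ["plan", "architect"]}))  # decompose
```

In practice the state dict would be loaded from pipeline-state.json with json.loads before this check runs.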
In-Class Discussion Questions
The goal of discussion is not to find the right answer, but to clarify trade-offs.
Q1. What problems arise if the QA agent is given Edit permission?
“Wouldn’t it be more efficient to fix bugs directly when found?” — Find the logical flaw in this argument. How does QA’s role change when it has Edit permission? Discuss the trade-off between short-term efficiency and long-term reliability.
Q2. If you had to choose only one of the 3-parallel reviewers (correctness, quality, architecture)?
The answer may vary depending on the team’s current situation (startup MVP vs. financial system vs. open-source library). Which dimension is most important in each situation, and why? How can the two dimensions you didn’t choose be compensated for?
Q3. Why was the feedback loop iteration cap set to 3?
Explain in connection with the “infinite refinement loop” risk from Week 7. What problems arise if the cap is reduced to 1? What if it’s raised to 10? Is the number 3 mathematically grounded, or an empirical heuristic?
Q4. What is the relationship between Week 12’s LLM-as-Judge and this week’s QA agent?
The question “Isn’t the QA agent already an LLM-as-Judge?” is legitimate. Find the similarities and differences between the two. What does Week 12 add? How does telemetry and aggregation differ from a simple one-time judgment?
Practicum
- Implement the QA Agent — Complete the QAAgent class based on the code above
- Automated Code Review Pipeline — Extract git diff → Claude review → Structure results
- Integrate the Feedback Loop — Automatically reassign to the Coder when QA fails
- Full Pipeline Integration — Run Planner → Coder → QA end-to-end
Assignment
Lab 09: QA Agent Implementation
Submission deadline: 2026-05-05 23:59
Requirements:
- Complete QAAgent implementation
- Automated code review feature (using Claude API)
- Feedback loop implementation (QA failure → Coder re-run)
- Video or log demonstrating the full Planner → Coder → QA 3-stage pipeline end-to-end
Key Takeaways
- QA independence = context isolation + tool restriction + separate model tier. Blocking bias inheritance is the core, empirically validated by PwC research showing accuracy improvements from 10% to 70%.
- 2-stage review: /reflect (self-review) → /review (independent review). Self-review proactively eliminates obvious errors, reducing the burden on the independent review.
- 3-parallel reviewers: each specialized in correctness, quality, and architecture. A severity-based PASS/FAIL gate delivers the final verdict, and parallel execution obtains all three perspectives simultaneously without added latency.
- Feedback loop + iteration cap: a balance between automatic recovery and infinite-loop prevention. A cap of 3 plus an escalation path (automatic fix → detailed feedback → human intervention) guarantees the loop terminates.
- LLM-as-Judge scoring: 4 dimensions (correctness, conventions, test coverage, security) × 0-10 scores. PASS requires all dimensions ≥ 4 AND 0 Critical issues, and these scores become the data source for Week 12 telemetry.
Further Reading
- DeepMind + MIT, “Towards a Science of Scaling Agent Systems” — Empirical basis showing that verification agents inherit biases when sharing context with generation agents. The theoretical foundation for designing agent independence.
- PwC AI Agent Report (2025) — Industry data showing accuracy improves from 10% to 70% when an independent judge agent is added compared to a single-agent setup. Includes analysis of the specific mechanisms behind the 7x improvement.
- sdlc-toolkit /review + /reflect official documentation — Reference implementation of the production-level 2-stage review pattern. Includes tool restriction settings, model routing, and the pipeline-state.json schema.
- “Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by ChatGPT” (arXiv, 2024) — Analysis of biases and limitations in LLM-based automatic evaluation. Provides practical guidance for improving scoring reliability when designing QA agents.