Week 7: Multi-Agent SDLC Design
Theory
Why Multi-Agent SDLC in Week 7
Through Week 6 we covered single agents — regulating behavior with CLAUDE.md (Week 6), preventing Context Rot (Week 5), and securing iterative quality with the Ralph Loop (Week 4). The question now is: where does the single-agent approach hit its limits?
A joint study by DeepMind and MIT, “Towards a Science of Scaling Agent Systems” (December 2025), provides decisive data:
An unstructured collection of agents (“bag of agents”) amplifies errors by 17.2×. In contrast, centralized coordination reduces error amplification to 4.4×.
The same pattern appears on SWE-Bench Pro — the same model (Claude Opus 4.5) shows a score range from 45.9% to 55.4% depending on the scaffolding. It is the system design wrapping the model, not the model itself, that determines performance.
This week covers the architecture and design principles of multi-agent SDLC. Code implementation happens in Week 8 (Planner Agent) and Week 9 (QA Agent).
Traditional SDLC → Agentic SDLC
| Traditional Role | Agentic Equivalent | Tool Access (MCP) | Output Artifact |
|---|---|---|---|
| Product Manager | Planner Agent | Web search, document reading | requirement.md |
| Software Architect | Architect Agent | Repository mapping, dependency analysis | architecture.md, TASK files |
| Developer | Coder Agent (Ralph Loop) | File editing, compiler, tests | Code changes, PR |
| QA Engineer | QA Agent | pytest, diff viewer, linter | Review results, severity report |
| DevOps | Deploy Agent | Docker, CI/CD, monitoring | Deploy results, smoke tests |
| Release Manager | Completion Agent | Git merge, tagging | ship-summary, release notes |
| Knowledge Manager | Retrospective Agent | File read/write | LESSON files, assumption verification |
This role separation is a pattern validated in academia as well:
- MetaGPT (ICLR 2024): Connects PM, Architect, Project Manager, Engineer, and QA through SOP (Standard Operating Procedure)-based structured documents. Structured document handoffs between roles — not natural-language chat — are the key.
- ChatDev (ACL 2024, v2.0 January 2026): Demonstrated via chat-based phase execution that role specialization consistently outperforms monolithic prompting.
Multi-Agent Pipeline Architecture
- Planner: parse requirements, generate spec.md, determine priorities
- Architect: analyze codebase, decompose subtasks, generate init.sh
- Coder (Ralph Loop): execute tasks in parallel; local tests must pass
- QA: independent code review, run integration tests, regression verification
- Deploy: staging deployment, E2E tests, human final approval (Hard Interrupt)
Gated Pipeline — A Production Example
What does the diagram above look like when implemented as a real production system? The diagram below visualizes the full pipeline of sdlc-toolkit — knowledge feedback loops, validation gates, and lesson capture.
SDLC Pipeline
Spec-based development lifecycle with knowledge feedback loops
/spec
References lessons. Writes a requirements spec based on the feature request.
/validate
Validates spec quality before architectural design begins.
/architect
References lessons. Designs the architecture and breaks it into detailed tasks (TASKs).
/validate
Validates the quality of the architecture and tasks.
Implement
Codes tasks in dependency order; independent tasks are processed in parallel.
/reflect
References lessons. Conducts a self-review after implementation is complete.
/review
Performs a multi-agent code review to ensure quality and correctness.
Create & Merge PR
Opens a pull request, passes final review, then merges.
/wrapup
Updates deployment and artifacts, then captures lessons learned and assumptions from development.
Lessons Learned
Captured via /wrapup at the end of every feature development cycle. Each lesson records what happened, why it matters, and when it applies.
Feedback Loop
Creates a continuous improvement cycle by reading lessons before performing work at three key stages:
- ← /spec — prevents repeating approaches that previously failed
- ← /architect — reuses proven patterns and avoids past mistakes
- ← /reflect — confirms that lessons relevant to the current task are applied
Validation Gates
Quality checks run between major stages. Up to 3 automatic fix retries are performed before halting the pipeline.
Assumptions
Tracked continuously alongside lessons. /architect references this content when making architectural design decisions.
/proceed REQ-xxx
Automatically runs the entire pipeline above in sequence, including validation gates and automatic fix retries.
/bugfix
Lightweight path — skips the spec and architecture stages for fast bug fixes.
The /proceed pipeline of sdlc-toolkit implements a 9-stage gated execution.
| Phase | Name | Agent | Gate |
|---|---|---|---|
| 0 | Create Worktree | Orchestrator | Branch isolation check |
| 1 | Validate Spec | Validator | Requirements completeness |
| 2 | Architecture + Task Decomposition | Architect | Dependency DAG validity |
| 3 | Validate Architecture | Validator | Pattern compatibility, task coverage |
| 4 | Implement (parallel) | Coder × N | Each task AC satisfied |
| 5 | Verify (Reflect + Review) | QA | PASS/FAIL verdict |
| 6 | Create PR | Orchestrator | CI passing |
| 7 | PR Cleanup + CI | Orchestrator | Lint/test passing |
| 8 | Wrapup (merge, deploy, knowledge capture) | Wrapup | LESSON file created |
Core principle: Each phase only starts after explicitly confirming completion of the previous phase. No skipping allowed.
```json
{
  "req": "REQ-023",
  "branch": "feat/REQ-023-user-auth",
  "startedAt": "2026-04-14T10:00:00Z",
  "completed": false,
  "currentPhase": 4,
  "completedPhases": [0, 1, 2, 3],
  "phaseHistory": [
    { "phase": 0, "name": "Create Worktree", "completedAt": "2026-04-14T10:01:00Z" },
    { "phase": 1, "name": "Validate Spec", "completedAt": "2026-04-14T10:03:00Z" },
    { "phase": 2, "name": "Architect", "completedAt": "2026-04-14T10:08:00Z" },
    { "phase": 3, "name": "Validate Architecture", "completedAt": "2026-04-14T10:10:00Z" }
  ]
}
```

pipeline-state.json tracks the progress of the entire pipeline. On interruption, the run resumes from currentPhase. This file itself is the synchronization mechanism for inter-agent handoffs.
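A minimal sketch of how a resumable orchestrator might consume this state file. The function names and the `PHASES` list are illustrative, not sdlc-toolkit's actual API:

```python
import json
from pathlib import Path

# Phase names mirroring the 9-stage table above (illustrative labels)
PHASES = [
    "Create Worktree", "Validate Spec", "Architecture + Task Decomposition",
    "Validate Architecture", "Implement", "Verify", "Create PR",
    "PR Cleanup + CI", "Wrapup",
]

def next_phase(state_file: Path) -> int:
    """Return the phase to (re)start from, based on pipeline-state.json.

    On a fresh run the file does not exist yet, so we start at phase 0;
    after an interruption we resume exactly from the recorded currentPhase.
    """
    if not state_file.exists():
        return 0
    state = json.loads(state_file.read_text())
    if state.get("completed"):
        return len(PHASES)          # nothing left to do
    return state["currentPhase"]

def mark_phase_done(state_file: Path, phase: int, timestamp: str) -> None:
    """Record completion of a phase and advance currentPhase."""
    state = json.loads(state_file.read_text())
    state["completedPhases"].append(phase)
    state["phaseHistory"].append(
        {"phase": phase, "name": PHASES[phase], "completedAt": timestamp}
    )
    state["currentPhase"] = phase + 1
    state["completed"] = phase + 1 >= len(PHASES)
    state_file.write_text(json.dumps(state, indent=2))
```

Because the state lives in a file rather than in memory, any agent (or a human) can inspect or resume the pipeline at any point.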
When a validation gate fails:
```
Failure detected → automatic fix attempt (attempt 1)
  ↓ fails
Re-validate → automatic fix attempt (attempt 2)
  ↓ fails
Re-validate → automatic fix attempt (attempt 3)
  ↓ fails
⚠️ Human escalation — pipeline paused
```

Rationale for the 3-attempt cap: if an agent fails to fix the same error three times, it does not understand the problem. Further attempts only consume tokens without improving quality. PwC research showed that combining a validation loop with a judge agent improved accuracy 7×, from 10% to 70% — this is the gate mechanism at work.
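The retry-then-escalate flow is a small loop; here is a sketch, with `run_gate`, `EscalateToHuman`, and the callback signatures as illustrative names rather than sdlc-toolkit's real interface:

```python
from typing import Callable

MAX_ATTEMPTS = 3  # the pipeline's 3-attempt cap

class EscalateToHuman(Exception):
    """Raised when a gate still fails after the retry budget is spent."""

def run_gate(validate: Callable[[], bool], fix: Callable[[], None]) -> None:
    """Validate; on failure, attempt an automatic fix and re-validate.

    After MAX_ATTEMPTS failed fix attempts, pause the pipeline and
    escalate to a human instead of burning more tokens.
    """
    if validate():
        return
    for attempt in range(1, MAX_ATTEMPTS + 1):
        fix()               # automatic fix attempt N
        if validate():      # re-validate after the fix
            return
    raise EscalateToHuman(f"gate failed after {MAX_ATTEMPTS} fix attempts")
```

Raising an exception (rather than silently continuing) is what makes the escalation a hard stop: the orchestrator cannot proceed to the next phase past an unhandled failure.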
Agent Coordination Topologies
How are agents in a multi-agent system coordinated? There are three fundamental topologies:
| Topology | Structure | Examples | Error Amplification |
|---|---|---|---|
| Centralized | Single orchestrator controls sequencing | sdlc-toolkit /proceed, Claude Code Agent Tool | 4.4× (DeepMind) |
| Hierarchical | Orchestrator of orchestrators | sdlc-toolkit /sprint (spawns 5 /proceed in parallel) | 4.4× compounded by management overhead |
| Distributed (peer-to-peer) | Agents communicate directly with each other | “bag of agents” | 17.2× (DeepMind) |
Inter-Agent Communication Protocols
How agents access external tools and how agents communicate with each other are different problems:
| Protocol | Purpose | Scale | Core Structure |
|---|---|---|---|
| MCP (Anthropic, 2024) | Agent → tool access | 97M+ monthly SDK downloads, 5,800+ servers | Server/Client, Tool/Resource |
| A2A (Google, 2025) | Agent → agent delegation | v0.2, 150+ partner orgs | Task, Artifact, Agent Card |
| AG-UI (CopilotKit, 2025) | Agent → user UI | LangGraph, CrewAI, MS integration | ~16 event types streaming |
| Artifact Handoff (this week) | Agent → agent (file-based) | Project local | Markdown/JSON files |
These three protocols form the agentic AI protocol stack — often called “the TCP/IP of agentic AI”:
```
AG-UI ← Agent ↔ User   (real-time streaming, approval UI)
A2A   ← Agent ↔ Agent  (discovery, delegation, task management)
MCP   ← Agent ↔ Tools  (tool invocation, data source access)
```

MCP was donated to AAIF under the Linux Foundation (December 2025) and is now the industry standard. A2A v0.2 supports stateless interactions and was enhanced at Google I/O with Agent Engine integration. AG-UI, which originated at CopilotKit, is an event-driven protocol standardizing bidirectional streaming (SSE/WebSocket) between agent backends and user frontends.
The artifact handoff covered this week is the simplest yet most deterministic approach — the filesystem serves as the communication channel, making everything debuggable, auditable, and reproducible.
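A minimal sketch of the file-as-channel idea: the producer writes a structured artifact, the consumer validates the contract before trusting it. The `REQUIRED_KEYS` contract and function names here are illustrative assumptions, not a fixed schema:

```python
import json
from pathlib import Path

# Illustrative minimal contract every artifact must satisfy
REQUIRED_KEYS = {"id", "title", "status"}

def write_artifact(path: Path, artifact: dict) -> None:
    """Producer side: the file IS the message; writing it completes the handoff."""
    missing = REQUIRED_KEYS - artifact.keys()
    if missing:
        raise ValueError(f"artifact missing required keys: {missing}")
    path.write_text(json.dumps(artifact, indent=2))

def read_artifact(path: Path) -> dict:
    """Consumer side: re-validate the contract before consuming the content."""
    artifact = json.loads(path.read_text())
    missing = REQUIRED_KEYS - artifact.keys()
    if missing:
        raise ValueError(f"artifact missing required keys: {missing}")
    return artifact
```

Validating on both write and read means a malformed handoff fails loudly at the boundary between agents, instead of propagating downstream.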
Claude Code Native Multi-Agent — A Lightweight Alternative
Before building the full pipeline above from scratch, let’s first understand the lightweight multi-agent tools built into Claude Code. These tools, revealed by Boris Cherny in February 2026, replace each pipeline stage with a single CLI flag.
Plan Mode — Built-in Planner Agent
```shell
# Press Shift+Tab to enter Plan Mode
# Draft plan → user confirmation → auto-execute
```

Pressing Shift+Tab makes Claude Code draft a plan before writing any code. Once the plan is confirmed, it automatically proceeds to implementation. Boris: “Claude 1-shots the implementation when the plan is right.”
This performs what the Planner Agent above does — requirements parsing, spec.md generation, priority assignment — in an interactive conversational flow. When you build PlannerAgent from scratch in Week 8, you’ll understand the internal structure of this process.
Custom Agents — Declarative Role Specialization
Add Markdown files to the .claude/agents/ directory to define specialized agents:
```markdown
---
name: code-simplifier
description: Code simplification specialist agent
tools: [Read, Edit, Grep, Glob]
---

Review changed code to:
1. Leverage existing reusable functions
2. Remove unnecessary complexity
3. Apply consistent patterns
```

```markdown
---
name: verify-app
description: Application verification agent
tools: [Read, Bash, Grep]
---

Verify changes by:
1. Confirming all tests pass
2. Confirming build succeeds
3. Confirming no lint errors
```

```shell
# Run with a specific agent
claude --agent code-simplifier "Refactor this module"

# Set default agent in settings.json
# { "defaultAgent": "code-simplifier" }
```

Agent Teams — Native Team Mode (Experimental)
If Custom Agents define individual roles, Agent Teams provide team coordination. An experimental feature enabled via CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
| Aspect | Subagents (Agent Tool) | Agent Teams |
|---|---|---|
| Execution | Child process within parent session | Independent context windows |
| Communication | Reports only to parent | Direct messaging between teammates |
| Display | Results returned to parent only | Split-pane view with each teammate visible |
Subagents are like microservice calls; Agent Teams are like a Slack channel — teammates see each other’s progress and communicate directly when needed. This is the closest native tool for implementing this week’s multi-agent pipeline in Claude Code.
This is the role assignment design above — Coder Agent, QA Agent — implemented declaratively in a single .md file. The same principle of MCP-governed tool access applies via the tools field. The 11 skills of sdlc-toolkit (/spec, /architect, /validate, /review, etc.) are production examples that leverage exactly this Skills system.
/simplify — Built-in QA Agent
```shell
# Auto-review after code changes
claude /simplify
```

Parallel agents review changed code simultaneously across three dimensions: reuse, quality, and efficiency. Boris: “It catches the structural issues a senior engineer would flag in the first five minutes of code review.”
/batch — Large-Scale Parallel Execution Engine
```shell
# Interactive planning → parallel execution
claude /batch "Migrate logging in src/ to the new structured logger"
```

/batch operates in three stages:
- Interactive planning: Decomposes the task through conversation with the user
- Parallel execution: Runs each subtask in an independent worktree in parallel
- PR creation: Each agent opens an individual PR after its tests pass
Boris’s team case: 6 parallel agents migrating logging across 14 files. Total: 11 minutes. 5 of 6 PRs merged without changes. The remaining one required human judgment on a conditional logging edge case.
This is the same multi-agent pipeline principle above — Planner → Coder × N → QA — packaged at product level.
Skills System — Packaged Instruction Tuning
Section titled “Skills System — Packaged Instruction Tuning”# Install a skill (example — verify actual URL from the skill distributor)mkdir -p ~/.claude/skills/boriscurl -L -o ~/.claude/skills/boris/SKILL.md \ https://example.com/skills/boris/SKILL.md
# Or write your own SKILL.md and place it directly# Load skill in a sessionclaude /skills borisThis extends the instruction tuning from Week 6 (adding constraints to PROMPT.md) into reusable packages. Boris’s own 42 tips are packaged as a single skill, loadable in any project.
Full Pipeline vs Native Tools — When to Use Which
| Aspect | Full Pipeline (Weeks 7-9) | Native Tools (Boris) |
|---|---|---|
| Setup cost | High — JSON schemas, agent code implementation | Low — .md files, CLI flags |
| Flexibility | Unlimited — custom handoff logic, feedback loops | Limited — within preset capabilities |
| Inter-agent comms | Artifact-based (JSON schema contracts) | None — each agent runs independently |
| Verification | QA agent runs integration tests + code review | /simplify catches structural issues only |
| Error recovery | Gated retries (3×) + human escalation | None — manual restart on failure |
| Best for | Complex multi-stage workflows, custom quality criteria | Large-scale parallel processing of repetitive tasks |
Anthropic Managed Agents — A Third Option
Launched in April 2026 as a public beta, Managed Agents offer a third choice between full pipelines and native tools. Agents run on Anthropic’s cloud infrastructure, eliminating the need to build your own agent loop, tool execution, or runtime.
| Aspect | Full Pipeline | Native Tools | Managed Agents |
|---|---|---|---|
| Infrastructure | Self-built | Claude Code CLI | Anthropic cloud |
| Cost | API tokens only | API tokens only | $0.08/session-hour + tokens |
| Isolation | Git worktrees | Local processes | Cloud sandbox |
| Best for | Custom quality criteria, complex workflows | Personal dev, repetitive tasks | Enterprise deployment, audit trails |
Early adopters: Notion, Asana, Sentry, Rakuten. Handles file I/O, command execution, web browsing, and code execution server-side.
Artifact-Based Handoff Design
The key to inter-agent communication is structured artifacts. Not natural-language messages, but schema-defined files that move between agents.
```markdown
---
id: REQ-023
title: "Add user authentication feature"
status: draft # draft → approved → in-progress → complete
deployable: true
created: 2026-04-14
updated: 2026-04-14
---

## Description
Implement a JWT-based user authentication system. Includes login, sign-up, and token refresh.

## Acceptance Criteria
- [ ] POST /auth/login endpoint works
- [ ] JWT token issuance and verification
- [ ] Password bcrypt hashing
- [ ] Automatic token refresh on expiry

## Assumptions
- Using PostgreSQL (leverages existing DB connection)
- Token validity: access 15 min, refresh 7 days

## Out of Scope
- OAuth2 social login (separate REQ)
- 2FA (separate REQ)
```

Generated by Planner Agent → Validator verifies → Architect Agent consumes.
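The status lifecycle in the frontmatter (draft → approved → in-progress → complete) can be enforced with a small transition table so an agent cannot jump stages. A sketch, with `advance_status` as an illustrative helper name:

```python
# Legal transitions for the frontmatter status field:
# draft → approved → in-progress → complete
TRANSITIONS = {
    "draft": {"approved"},
    "approved": {"in-progress"},
    "in-progress": {"complete"},
    "complete": set(),  # terminal state, no further moves
}

def advance_status(current: str, target: str) -> str:
    """Reject illegal jumps (e.g. draft → complete) before writing the file."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal status transition: {current} -> {target}")
    return target
```

Because the status lives in the artifact itself, the validator can check the transition at each gate without any shared runtime state.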
```markdown
---
id: TASK-001
title: "JWT token generation module"
status: draft
parent: REQ-023
created: 2026-04-14
updated: 2026-04-14
dependencies: [] # Tier 0: no dependencies → can run in parallel
---

## Files to Create/Modify
- `auth/jwt.py` — token creation/verification utility
- `tests/test_jwt.py` — unit tests

## Acceptance Criteria
- [ ] create_token(payload, secret) → JWT string
- [ ] verify_token(token, secret) → payload or exception
- [ ] TokenExpiredError raised when verifying an expired token
```

```markdown
---
id: TASK-003
title: "Login API endpoint"
status: draft
parent: REQ-023
dependencies: ["TASK-001", "TASK-002"] # Tier 1: waits for TASK-001, 002
---
```

Generated by Architect Agent → the dependencies array determines execution order.
```markdown
---
id: LESSON-042
title: "Rotating JWT secret invalidates existing tokens"
domain: "API" # broad area
component: "API/auth" # narrow area
tags: [security, jwt]
req: REQ-023
created: 2026-04-14
---

## What Happened
Rotating the secret key invalidated all previously issued tokens.

## Lesson
When rotating secrets, implement multi-key verification logic so tokens
issued with the previous key remain valid during a grace period.

## Applies When
JWT secret management, key rotation, authentication system changes
```

Generated by Wrapup Agent → future /spec and /architect grep by domain:API + component:API/auth to automatically reference this lesson.
Dependency DAG and Parallelization Strategy
The dependencies array in TASK files determines execution order:
```
Tier 0 (no dependencies):    TASK-001, TASK-002 → run concurrently
  ↓ wait for completion
Tier 1 (depends on Tier 0):  TASK-003, TASK-004 → run concurrently
  ↓ wait for completion
Tier 2 (depends on Tier 1):  TASK-005 → run alone
```

This tier-based parallelization operates in Phase 4 (Implementation) of the pipeline. Independent tasks run in parallel in separate worktrees; tasks with dependencies wait for their predecessors to complete.
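This layering is a topological sort of the dependencies arrays. A minimal sketch (`tiers` is an illustrative function name; input maps task id → its dependency list):

```python
def tiers(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into parallel tiers from their dependency lists.

    Tier 0 = tasks with no dependencies; tier N = tasks whose dependencies
    are all satisfied by tiers 0..N-1. Raises ValueError on a cycle.
    """
    done: set[str] = set()
    remaining = dict(tasks)
    result: list[list[str]] = []
    while remaining:
        # A task is ready when every dependency is already completed
        ready = sorted(t for t, deps in remaining.items() if set(deps) <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        result.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return result
```

The cycle check matters in practice: an Architect Agent that emits mutually dependent TASK files would otherwise deadlock Phase 4.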
Multi-Agent Code Review Design
A single reviewer has blind spots — a reviewer strong in security misses performance issues; focusing on architecture means overlooking edge cases. Production systems run 3 specialist reviewers in parallel:
| Reviewer | Review Area | Severity Criteria |
|---|---|---|
| Correctness Reviewer | Logic errors, race conditions, security vulnerabilities, edge cases | Critical: data loss / security violation |
| Quality Reviewer | Naming, pattern consistency, duplicate code, hardcoded config | Major: maintainability degradation |
| Architecture Reviewer | Layer separation, separation of concerns, test coverage, API compliance | Major: structural debt |
Severity scale: Critical > Major > Minor > Nit. Any Critical finding means FAIL — feedback is automatically sent back to the coder.
On top of this 3-parallel review pattern sits a 2-stage structure:
- /reflect (self-review): The coder agent reviews its own code first, catching obvious mistakes to reduce the burden on independent review.
- /review (independent review): Three reviewers in parallel, with no knowledge of the coder’s reasoning.
Boris’s /simplify is a lightweight version of this pattern — same parallel review principle, but catching only structural issues without domain specialization. This design is implemented in Python in Week 9.
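The verdict rule (any Critical finding ⇒ FAIL) reduces to a one-line aggregation over parallel reviewer output. A sketch, with the `Finding` shape as an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    reviewer: str  # "correctness" | "quality" | "architecture"
    severity: str  # "Critical" > "Major" > "Minor" > "Nit"
    message: str

def verdict(findings: list[Finding]) -> str:
    """Aggregate the three reviewers' findings: any Critical fails the gate.

    Major/Minor/Nit findings are reported back to the coder but do not
    block the pipeline on their own.
    """
    return "FAIL" if any(f.severity == "Critical" for f in findings) else "PASS"
```

On FAIL, the findings list itself becomes the feedback artifact handed back to the coder agent for the next fix attempt.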
Failure Modes and Risk Management
Multi-agent systems have unique failure modes absent in single-agent setups:
| Failure Mode | Description | Mitigation Strategy |
|---|---|---|
| Context Rot propagation | Context lost at each handoff (Week 5 reference) | Artifact-based handoffs — structured files preserve context |
| 17× error trap | Silent error compounding in unstructured agent networks | Centralized coordination + gated validation |
| Hallucination propagation | One agent’s hallucination becomes the next agent’s ground truth | Independent validation gate at each phase |
| Infinite refinement loop | QA→Coder→QA cycles without convergence | Retry cap (3×) + human escalation |
| State desynchronization | File conflicts between parallel agents | Git worktree isolation — each agent has an independent workspace |
| Cost explosion | Uncontrolled agent spawning | Concurrency cap (5 agents) + model tier routing (exploration: haiku, implementation: sonnet, review: opus) |
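The two cost-explosion mitigations from the table, a concurrency cap and model tier routing, can be sketched together. All names here (`MAX_AGENTS`, `MODEL_FOR_STAGE`, `run_agents`) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_AGENTS = 5  # concurrency cap from the mitigation column

# Model tier routing: cheap model for exploration, stronger models downstream
MODEL_FOR_STAGE = {
    "exploration": "haiku",
    "implementation": "sonnet",
    "review": "opus",
}

def run_agents(tasks, run_one):
    """Run agent tasks with a hard concurrency cap to bound spend.

    run_one(task) is whatever spawns a single agent; the pool guarantees
    at most MAX_AGENTS of them are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=MAX_AGENTS) as pool:
        return list(pool.map(run_one, tasks))
```

The cap is enforced structurally by the pool rather than by agent self-restraint, which is the point: an agent cannot spawn past the budget even if it tries.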
In-Class Discussion Questions
1. Explain the mechanism by which “bag of agents” amplifies errors 17.2× in the DeepMind study. Why does structured coordination reduce this to 4.4×?
2. On SWE-Bench Pro, the same model shows a score difference of 45.9%–55.4% depending on the scaffolding. Use this data to argue the claim that “the harness matters as much as the model.”
3. What are the trade-offs between passing natural-language messages between agents versus structured artifact (JSON/Markdown) handoffs? In which situations does each approach excel?
4. In the /proceed pipeline, what is the rationale for escalating to humans after a maximum of 3 retries at each gate? What problems arise if the retry count is raised to 10?
5. When applying the single-agent instruction tuning from Week 6 (CLAUDE.md) to a multi-agent system, how would you separate common rules from role-specific rules? Reference sdlc-toolkit’s conventions.md (common) and individual SKILL.md (role-specific) structure.
Practicum
1. Role Assignment Design: Given a project specification, design the roles, responsibilities, and MCP tool access permissions for 5 agents (Planner, Architect, Coder, QA, Wrapup).
2. Define Artifact Schemas: Define the schema for every artifact passed between agents. Minimum 3 types: requirement spec, task file (with dependency array), pipeline state.
3. Dependency DAG Design: Decompose a given requirement into TASK files, draw the dependency graph, and identify tiers that can run in parallel.
4. Validation Gate Design: Define the verification checklist for each phase transition. Customize the /validate checklist above to fit your project.
5. Error Recovery Scenarios: Document recovery strategies for 3 failure scenarios (test failure, gate exceeding 3 retries, merge conflict).
Assignment
Lab 07: Multi-Agent Pipeline Design
Submission deadline: 2026-04-21 23:59
Requirements:
- 5-stage multi-agent architecture diagram (roles, artifacts, gates included)
- JSON schema definitions for inter-agent artifacts (minimum 3 types)
- Dependency DAG design and parallelization tier analysis
- Validation gate checklist (per phase)
- Error recovery strategy document (3 scenarios)
Key Takeaways
- Multi-Agent SDLC = role separation + structured handoffs + gated validation: The core is not simply running multiple agents, but assigning each agent a clear role and artifact contract.
- Bag of agents is harmful: DeepMind research — an unstructured agent collection amplifies errors 17.2×. Central coordination reduces this to 4.4×.
- The harness matters as much as the model: On SWE-Bench Pro, the same model shows a 10-percentage-point performance difference depending on scaffolding.
- Artifacts replace messages: Instead of direct messages between agents, structured files (requirement.md, TASK-xxx.md, pipeline-state.json) carry the handoffs.
- Gated pipeline: A validation gate at each phase transition. Maximum 3 retries before human escalation.
- Parallelization is controlled by the dependency DAG: Tier 0 (no dependencies) runs in parallel; Tier N (waits for Tier N-1) runs sequentially.
- Knowledge management completes the feedback loop: The domain/component tags in LESSON files automatically inject past lessons into future specs and architectures.
- 3-layer protocol stack: MCP (agent↔tools) + A2A (agent↔agent) + AG-UI (agent↔user) = the TCP/IP of agentic AI. MCP is transitioning to Streamable HTTP (SSE deprecated 2026-06-30).
- Managed Agents as a third option: Anthropic cloud-hosted ($0.08/session-hour). Full pipeline (custom) vs native tools (lightweight) vs Managed Agents (enterprise) — a three-way spectrum.
Further Reading
- Towards a Science of Scaling Agent Systems — DeepMind + MIT (2025) — Empirical study demonstrating error amplification in unstructured agent collections (17.2×) and the effect of centralized coordination (4.4×)
- MetaGPT: Meta Programming for Multi-Agent Collaborative Framework (ICLR 2024) — 5-role SOP-based multi-agent framework. The prototype of structured document handoffs.
- ChatDev: Communicative Agents for Software Development (ACL 2024) — Demonstrates the superiority of role specialization via chat-based phase execution
- A2A Protocol Specification — Google (2025) — Agent-to-Agent communication standard. Task/Artifact/Agent Card structure.
- Model Context Protocol — Anthropic — 97M+ monthly SDK downloads. Industry standard for agent → tool access.
- SWE-Bench Pro — Agent coding benchmark. Quantitatively measures performance differences between scaffoldings for the same model.
- AG-UI Protocol — Agent↔user UI real-time streaming protocol. Started at CopilotKit, integrated with LangGraph/CrewAI/Microsoft
- Anthropic Managed Agents — Cloud-hosted agent infrastructure. $0.08/session-hour, public beta (2026-04)
- Claude Code Agent Teams — Native team mode. Independent contexts, direct communication, split-pane display