
Week 7: Multi-Agent SDLC Design

Phase 3 · Week 7 · Advanced · Lecture: 2026-04-14

Through Week 6 we covered single agents — regulating behavior with CLAUDE.md (Week 6), preventing Context Rot (Week 5), and securing iterative quality with the Ralph Loop (Week 4). The question now is: where does the single-agent approach hit its limits?

A joint study by DeepMind and MIT, “Towards a Science of Scaling Agent Systems” (December 2025), provides decisive data:

An unstructured collection of agents (“bag of agents”) amplifies errors by 17.2×. In contrast, centralized coordination reduces error amplification to 4.4×.

The same pattern appears on SWE-Bench Pro — the same model (Claude Opus 4.5) shows a score range from 45.9% to 55.4% depending on the scaffolding. It is the system design wrapping the model, not the model itself, that determines performance.

This week covers the architecture and design principles of multi-agent SDLC. Code implementation happens in Week 8 (Planner Agent) and Week 9 (QA Agent).


| Traditional Role | Agentic Equivalent | Tool Access (MCP) | Output Artifact |
| --- | --- | --- | --- |
| Product Manager | Planner Agent | Web search, document reading | requirement.md |
| Software Architect | Architect Agent | Repository mapping, dependency analysis | architecture.md, TASK files |
| Developer | Coder Agent (Ralph Loop) | File editing, compiler, tests | Code changes, PR |
| QA Engineer | QA Agent | pytest, diff viewer, linter | Review results, severity report |
| DevOps | Deploy Agent | Docker, CI/CD, monitoring | Deploy results, smoke tests |
| Release Manager | Completion Agent | Git merge, tagging | ship-summary, release notes |
| Knowledge Manager | Retrospective Agent | File read/write | LESSON files, assumption verification |

This role separation is a pattern validated in academia as well:

  • MetaGPT (ICLR 2024): Connects PM, Architect, Project Manager, Engineer, and QA through SOP (Standard Operating Procedure)-based structured documents. Structured document handoffs between roles — not natural-language chat — are the key.
  • ChatDev (ACL 2024, v2.0 January 2026): Demonstrated via chat-based phase execution that role specialization consistently outperforms monolithic prompting.

MULTI-AGENT PIPELINE

Human (HIC) — requirements input
  ↓
Planner Agent
  • Parse requirements
  • Generate spec.md
  • Determine priorities
  ↓ passes spec.md
Initializer Agent
  • Analyze codebase
  • Decompose subtasks
  • Generate init.sh
  ↓ passes task_queue.json
Coder Agent × N (Ralph Loop)
  • Execute tasks in parallel
  • Must pass local tests
  ↓ creates PR
QA Agent
  • Independent code review
  • Run integration tests
  • Regression verification
  ↓ approve/reject
Deploy Agent
  • Staging deployment
  • E2E tests
  • Human final approval (Hard Interrupt)
  ↓
Production Deployment

What does the diagram above look like when implemented as a real production system? The diagram below visualizes the full pipeline of sdlc-toolkit — knowledge feedback loops, validation gates, and lesson capture.

SDLC Pipeline

Spec-based development lifecycle with knowledge feedback loops:

1. /spec (references lessons) — writes a requirements spec based on the feature request.
   ↳ Gate /validate — validates spec quality before architectural design begins.
2. /architect (references lessons) — designs the architecture and breaks it into detailed tasks (TASKs).
   ↳ Gate /validate — validates the quality of the architecture and tasks.
3. Implement — codes tasks in dependency order; independent tasks are processed in parallel.
4. /reflect (references lessons) — conducts a self-review after implementation is complete.
5. /review — performs a multi-agent code review to ensure quality and correctness.
6. Create & Merge PR — opens a pull request, passes final review, then merges.
7. /wrapup — updates deployment and artifacts, then captures lessons learned and assumptions from development.

Supporting mechanisms:

  • Lessons Learned (.sdlc/knowledge/lessons/): captured via /wrapup at the end of every feature development cycle. Each lesson records what happened, why it matters, and when it applies.
  • Feedback Loop: creates a continuous improvement cycle by reading lessons before performing work at three key stages (/spec, /architect, /reflect).
  • Validation Gates: quality checks run between major stages. Up to 3 automatic fix retries are performed before halting the pipeline.
  • Assumptions (.sdlc/knowledge/assumptions/): tracked continuously alongside lessons. /architect references this content when making architectural design decisions.
  • /proceed REQ-xxx: automatically runs the entire pipeline above in sequence, including validation gates and automatic fix retries.
  • /bugfix: lightweight path — skips the spec and architecture stages for fast bug fixes.

The /proceed pipeline of sdlc-toolkit implements a 9-stage gated execution.

| Phase | Name | Agent | Gate |
| --- | --- | --- | --- |
| 0 | Create Worktree | Orchestrator | Branch isolation check |
| 1 | Validate Spec | Validator | Requirements completeness |
| 2 | Architecture + Task Decomposition | Architect | Dependency DAG validity |
| 3 | Validate Architecture | Validator | Pattern compatibility, task coverage |
| 4 | Implement (parallel) | Coder × N | Each task AC satisfied |
| 5 | Verify (Reflect + Review) | QA | PASS/FAIL verdict |
| 6 | Create PR | Orchestrator | CI passing |
| 7 | PR Cleanup + CI | Orchestrator | Lint/test passing |
| 8 | Wrapup (merge, deploy, knowledge capture) | Wrapup | LESSON file created |

Core principle: Each phase only starts after explicitly confirming completion of the previous phase. No skipping allowed.
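The gating rule above can be sketched in a few lines of Python — a hypothetical `run_pipeline` that re-attempts each phase up to three times before escalating. The names (`Phase`, `GateFailed`, `run_pipeline`) are illustrative, not sdlc-toolkit's actual code:

```python
# Minimal sketch of gated phase execution with a retry cap and human escalation.
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 3  # per-gate cap before the pipeline halts and escalates

@dataclass
class Phase:
    name: str
    run: Callable[[], None]   # the agent's work for this phase
    gate: Callable[[], bool]  # validation gate: True = pass

class GateFailed(Exception):
    """Raised when a gate still fails after MAX_RETRIES attempts."""

def run_pipeline(phases: list[Phase]) -> None:
    for phase in phases:
        for _attempt in range(1, MAX_RETRIES + 1):
            phase.run()       # attempt (or re-attempt) the work
            if phase.gate():
                break         # gate passed — only now may the next phase start
        else:
            # retries exhausted: stop the pipeline, hand control to a human
            raise GateFailed(f"{phase.name}: escalating after {MAX_RETRIES} retries")

# Demo: a phase whose gate only passes on the second attempt
state = {"attempts": 0}
run_pipeline([
    Phase("Implement",
          lambda: state.update(attempts=state["attempts"] + 1),
          lambda: state["attempts"] >= 2),
])
print(state["attempts"])  # 2
```

The `for/else` makes the no-skipping rule explicit: control only falls through to escalation when every retry has failed the gate.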


How are agents in a multi-agent system coordinated? There are three fundamental topologies:

| Topology | Structure | Examples | Error Amplification |
| --- | --- | --- | --- |
| Centralized | Single orchestrator controls sequencing | sdlc-toolkit /proceed, Claude Code Agent Tool | 4.4× (DeepMind) |
| Hierarchical | Orchestrator of orchestrators | sdlc-toolkit /sprint (spawns 5 /proceed in parallel) | 4.4× × management overhead |
| Distributed (peer-to-peer) | Agents communicate directly with each other | “bag of agents” | 17.2× (DeepMind) |

How agents access external tools and how agents communicate with each other are different problems:

| Protocol | Purpose | Scale | Core Structure |
| --- | --- | --- | --- |
| MCP (Anthropic, 2024) | Agent → tool access | 97M+ monthly SDK downloads, 5,800+ servers | Server/Client, Tool/Resource |
| A2A (Google, 2025) | Agent → agent delegation | v0.2, 150+ partner orgs | Task, Artifact, Agent Card |
| AG-UI (CopilotKit, 2025) | Agent → user UI | LangGraph, CrewAI, MS integration | ~16 event types, streaming |
| Artifact Handoff (this week) | Agent → agent (file-based) | Project local | Markdown/JSON files |

The first three of these — MCP, A2A, and AG-UI — form the agentic AI protocol stack, often called “the TCP/IP of agentic AI”:

AG-UI ← Agent ↔ User (real-time streaming, approval UI)
A2A ← Agent ↔ Agent (discovery, delegation, task management)
MCP ← Agent ↔ Tools (tool invocation, data source access)

MCP was donated to AAIF under the Linux Foundation (December 2025) and is now the industry standard. A2A v0.2 supports stateless interactions and was enhanced at Google I/O with Agent Engine integration. AG-UI, originated from CopilotKit, is an event-driven protocol standardizing bidirectional streaming (SSE/WebSocket) between agent backends and user frontends.

The artifact handoff covered this week is the simplest yet most deterministic approach — the filesystem serves as the communication channel, making everything debuggable, auditable, and reproducible.
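A filesystem handoff can be as small as one agent writing JSON and the next reading it back. The file name and fields below are illustrative, not the real sdlc-toolkit schema:

```python
# Sketch: file-based artifact handoff between two agents. The filesystem is
# the channel, so every handoff can be inspected with `cat` or `git diff`.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def planner_writes(workdir: Path) -> Path:
    """Planner Agent emits a structured artifact instead of a chat message."""
    artifact = {
        "id": "REQ-023",
        "status": "approved",
        "acceptance_criteria": ["POST /auth/login endpoint works"],
    }
    path = workdir / "pipeline-state.json"
    path.write_text(json.dumps(artifact, indent=2))
    return path

def architect_reads(path: Path) -> dict:
    """The next agent consumes the file; a simple assertion acts as a gate."""
    artifact = json.loads(path.read_text())
    assert artifact["status"] == "approved", "gate: only approved specs proceed"
    return artifact

with TemporaryDirectory() as d:
    spec = architect_reads(planner_writes(Path(d)))
    print(spec["id"])  # REQ-023
```

Because the artifact persists on disk, a failed run can be replayed from the exact same input — that is what makes the approach deterministic and auditable.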


Claude Code Native Multi-Agent — A Lightweight Alternative


Before building the full pipeline above from scratch, let’s first understand the lightweight multi-agent tools built into Claude Code. These tools, revealed by Boris Cherny in February 2026, replace each pipeline stage with a single CLI flag.

```shell
# Press Shift+Tab to enter Plan Mode
# Draft plan → user confirmation → auto-execute
```

Pressing Shift+Tab makes Claude Code draft a plan before writing any code. Once the plan is confirmed, it automatically proceeds to implementation. Boris: “Claude 1-shots the implementation when the plan is right.”

This performs what the Planner Agent above does — requirements parsing, spec.md generation, priority assignment — in an interactive conversational flow. When you build PlannerAgent from scratch in Week 8, you’ll understand the internal structure of this process.

Custom Agents — Declarative Role Specialization


Add Markdown files to the .claude/agents/ directory to define specialized agents:

```md
---
name: code-simplifier
description: Code simplification specialist agent
tools: [Read, Edit, Grep, Glob]
---
Review changed code to:
1. Leverage existing reusable functions
2. Remove unnecessary complexity
3. Apply consistent patterns
```

Agent Teams — Native Team Mode (Experimental)


If Custom Agents define individual roles, Agent Teams provide team coordination. An experimental feature enabled via CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.

| Aspect | Subagents (Agent Tool) | Agent Teams |
| --- | --- | --- |
| Execution | Child process within parent session | Independent context windows |
| Communication | Reports only to parent | Direct messaging between teammates |
| Display | Results returned to parent only | Split-pane view with each teammate visible |

Subagents are like microservice calls; Agent Teams are like a Slack channel — teammates see each other’s progress and communicate directly when needed. This is the closest native tool for implementing this week’s multi-agent pipeline in Claude Code.

Custom Agents implement the role assignment design above — Coder Agent, QA Agent — declaratively in a single .md file. The same principle of MCP-governed tool access applies via the tools field. The 11 skills of sdlc-toolkit (/spec, /architect, /validate, /review, etc.) are production examples that leverage exactly this mechanism.

/simplify — Parallel Code Review

```shell
# Auto-review after code changes
claude /simplify
```

Parallel agents review changed code simultaneously across three dimensions: reuse, quality, and efficiency. Boris: “It catches the structural issues a senior engineer would flag in the first five minutes of code review.”

/batch — Large-Scale Parallel Execution Engine

```shell
# Interactive planning → parallel execution
claude /batch "Migrate logging in src/ to the new structured logger"
```

/batch operates in three stages:

  1. Interactive planning: Decomposes the task through conversation with the user
  2. Parallel execution: Runs each subtask in an independent worktree in parallel
  3. PR creation: Each agent opens an individual PR after its tests pass

Boris’s team case: 6 parallel agents migrating logging across 14 files. Total: 11 minutes. 5 of 6 PRs merged without changes. The remaining one required human judgment on a conditional logging edge case.

This is the same multi-agent pipeline principle above — Planner → Coder × N → QA — packaged at product level.

Skills System — Packaged Instruction Tuning

```shell
# Install a skill (example — verify actual URL from the skill distributor)
mkdir -p ~/.claude/skills/boris
curl -L -o ~/.claude/skills/boris/SKILL.md \
  https://example.com/skills/boris/SKILL.md

# Or write your own SKILL.md and place it directly

# Load the skill in a session
claude /skills boris
```

This extends the instruction tuning from Week 6 (adding constraints to PROMPT.md) into reusable packages. Boris’s own 42 tips are packaged as a single skill, loadable in any project.

Full Pipeline vs Native Tools — When to Use Which

| Aspect | Full Pipeline (Weeks 7-9) | Native Tools (Boris) |
| --- | --- | --- |
| Setup cost | High — JSON schemas, agent code implementation | Low — .md files, CLI flags |
| Flexibility | Unlimited — custom handoff logic, feedback loops | Limited — within preset capabilities |
| Inter-agent comms | Artifact-based (JSON schema contracts) | None — each agent runs independently |
| Verification | QA agent runs integration tests + code review | /simplify catches structural issues only |
| Error recovery | Gated retries (3×) + human escalation | None — manual restart on failure |
| Best for | Complex multi-stage workflows, custom quality criteria | Large-scale parallel processing of repetitive tasks |

Anthropic Managed Agents — A Third Option


Launched in April 2026 as a public beta, Managed Agents offer a third choice between full pipelines and native tools. Agents run on Anthropic’s cloud infrastructure, eliminating the need to build your own agent loop, tool execution, or runtime.

| Aspect | Full Pipeline | Native Tools | Managed Agents |
| --- | --- | --- | --- |
| Infrastructure | Self-built | Claude Code CLI | Anthropic cloud |
| Cost | API tokens only | API tokens only | $0.08/session-hour + tokens |
| Isolation | Git worktrees | Local processes | Cloud sandbox |
| Best for | Custom quality criteria, complex workflows | Personal dev, repetitive tasks | Enterprise deployment, audit trails |

Early adopters: Notion, Asana, Sentry, Rakuten. Handles file I/O, command execution, web browsing, and code execution server-side.


The key to inter-agent communication is structured artifacts. Not natural-language messages, but schema-defined files that move between agents.

```md
---
id: REQ-023
title: "Add user authentication feature"
status: draft # draft → approved → in-progress → complete
deployable: true
created: 2026-04-14
updated: 2026-04-14
---
## Description
Implement a JWT-based user authentication system. Includes login, sign-up, and token refresh.

## Acceptance Criteria
- [ ] POST /auth/login endpoint works
- [ ] JWT token issuance and verification
- [ ] Password bcrypt hashing
- [ ] Automatic token refresh on expiry

## Assumptions
- Using PostgreSQL (leverages existing DB connection)
- Token validity: access 15 min, refresh 7 days

## Out of Scope
- OAuth2 social login (separate REQ)
- 2FA (separate REQ)
```

Generated by Planner Agent → Validator verifies → Architect Agent consumes
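To illustrate how a consuming agent might read such an artifact, here is a stdlib-only parser sketch. It is hypothetical — a real pipeline would use a YAML library plus schema validation — but it shows why schema-defined files beat chat messages: the fields are machine-checkable.

```python
# Sketch: parse the frontmatter and acceptance-criteria checkboxes of a
# requirement.md file (illustrative; not the actual sdlc-toolkit parser).
import re

def parse_requirement(text: str) -> dict:
    # Split the "---"-delimited frontmatter from the Markdown body
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.split("#")[0].strip()  # drop inline comments
    # Acceptance criteria: "- [ ]" = open, "- [x]" = done
    criteria = re.findall(r"- \[( |x)\] (.+)", body)
    meta["acceptance_criteria"] = [
        {"done": mark == "x", "text": item} for mark, item in criteria
    ]
    return meta

doc = """---
id: REQ-023
status: draft # draft → approved → in-progress → complete
---
## Acceptance Criteria
- [ ] POST /auth/login endpoint works
- [x] JWT token issuance and verification
"""
req = parse_requirement(doc)
print(req["id"], req["status"], len(req["acceptance_criteria"]))  # REQ-023 draft 2
```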

Dependency DAG and Parallelization Strategy


The dependencies array in TASK files determines execution order:

Tier 0 (no dependencies): TASK-001, TASK-002 → run concurrently
↓ wait for completion
Tier 1 (depends on Tier 0): TASK-003, TASK-004 → run concurrently
↓ wait for completion
Tier 2 (depends on Tier 1): TASK-005 → run alone

This tier-based parallelization operates in Phase 4 (Implementation) of the pipeline. Independent tasks run in parallel in separate worktrees; tasks with dependencies wait for their predecessors to complete.
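The tier computation itself is a level-by-level topological sort: repeatedly collect every task whose dependencies are all complete. A minimal sketch (illustrative, not the toolkit's actual scheduler):

```python
# Sketch: compute parallelization tiers from TASK dependency arrays
# (Kahn-style level-by-level topological grouping).
def tiers(deps: dict[str, list[str]]) -> list[list[str]]:
    remaining = dict(deps)
    done: set[str] = set()
    levels: list[list[str]] = []
    while remaining:
        # a task is ready when all of its dependencies are already done
        ready = sorted(t for t, d in remaining.items() if set(d) <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)  # everything in this tier may run in parallel
        done.update(ready)
        for t in ready:
            del remaining[t]
    return levels

tasks = {
    "TASK-001": [], "TASK-002": [],
    "TASK-003": ["TASK-001"], "TASK-004": ["TASK-002"],
    "TASK-005": ["TASK-003", "TASK-004"],
}
print(tiers(tasks))
# [['TASK-001', 'TASK-002'], ['TASK-003', 'TASK-004'], ['TASK-005']]
```

Note the cycle check: a malformed DAG is caught before any agent is spawned, which is exactly what the Phase 2 "Dependency DAG validity" gate verifies.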


A single reviewer has blind spots — a reviewer strong in security misses performance issues; focusing on architecture means overlooking edge cases. Production systems run 3 specialist reviewers in parallel:

| Reviewer | Review Area | Severity Criteria |
| --- | --- | --- |
| Correctness Reviewer | Logic errors, race conditions, security vulnerabilities, edge cases | Critical: data loss / security violation |
| Quality Reviewer | Naming, pattern consistency, duplicate code, hardcoded config | Major: maintainability degradation |
| Architecture Reviewer | Layer separation, separation of concerns, test coverage, API compliance | Major: structural debt |

Severity scale: Critical > Major > Minor > Nit. Any Critical finding means FAIL — feedback is automatically sent back to the coder.

On top of this 3-parallel review pattern sits a 2-stage structure:

  1. /reflect (self-review): The coder agent reviews its own code first, catching obvious mistakes to reduce the burden on independent review.
  2. /review (independent review): Three reviewers in parallel, with no knowledge of the coder’s reasoning.

Boris’s /simplify is a lightweight version of this pattern — same parallel review principle, but catching only structural issues without domain specialization. This design is implemented in Python in Week 9.
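The FAIL-on-any-Critical rule can be sketched as a small aggregation function over reviewer findings. The field names here are illustrative, using the severity scale above:

```python
# Sketch: aggregate findings from the three parallel reviewers into a
# single PASS/FAIL verdict. Any Critical finding fails the review.
SEVERITY_ORDER = ["Nit", "Minor", "Major", "Critical"]  # ascending severity

def verdict(findings: list[dict]) -> str:
    worst = max(
        (f["severity"] for f in findings),
        key=SEVERITY_ORDER.index,
        default="Nit",  # no findings at all → trivially PASS
    )
    return "FAIL" if worst == "Critical" else "PASS"

findings = [
    {"reviewer": "correctness", "severity": "Critical", "msg": "race in token refresh"},
    {"reviewer": "quality", "severity": "Minor", "msg": "duplicated helper"},
    {"reviewer": "architecture", "severity": "Major", "msg": "layer violation"},
]
print(verdict(findings))  # FAIL
```

On FAIL, the Critical findings would be routed back to the coder agent as structured feedback rather than free-form chat.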


Multi-agent systems have unique failure modes absent in single-agent setups:

| Failure Mode | Description | Mitigation Strategy |
| --- | --- | --- |
| Context Rot propagation | Context lost at each handoff (Week 5 reference) | Artifact-based handoffs — structured files preserve context |
| 17× error trap | Silent error compounding in unstructured agent networks | Centralized coordination + gated validation |
| Hallucination propagation | One agent’s hallucination becomes the next agent’s ground truth | Independent validation gate at each phase |
| Infinite refinement loop | QA→Coder→QA cycles without convergence | Retry cap (3×) + human escalation |
| State desynchronization | File conflicts between parallel agents | Git worktree isolation — each agent has an independent workspace |
| Cost explosion | Uncontrolled agent spawning | Concurrency cap (5 agents) + model tier routing (exploration: haiku, implementation: sonnet, review: opus) |
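Two of these mitigations — the concurrency cap and model tier routing — can be sketched together. `spawn_agent` is a hypothetical stand-in for the actual agent call; only the control structure matters:

```python
# Sketch: cost-explosion guards — cap concurrent agents with a semaphore
# and route each phase to an appropriate model tier.
import asyncio

MODEL_TIERS = {"exploration": "haiku", "implementation": "sonnet", "review": "opus"}
MAX_CONCURRENT_AGENTS = 5  # hard cap on simultaneously running agents

async def spawn_agent(task: str, phase: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT_AGENTS run at once
        model = MODEL_TIERS[phase]
        await asyncio.sleep(0)  # placeholder for the actual (hypothetical) agent call
        return f"{task} handled by {model}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_AGENTS)
    # 6 tasks contend for 5 slots; the 6th waits for a slot to free up
    jobs = [spawn_agent(f"TASK-{i:03d}", "implementation", sem) for i in range(1, 7)]
    return await asyncio.gather(*jobs)

print(asyncio.run(main())[0])  # TASK-001 handled by sonnet
```

Routing cheap models to exploration and expensive ones to review keeps spend proportional to the value of each phase's output.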

  1. Explain the mechanism by which “bag of agents” amplifies errors 17.2× in the DeepMind study. Why does structured coordination reduce this to 4.4×?

  2. On SWE-Bench Pro, the same model shows a score difference of 45.9%–55.4% depending on the scaffolding. Use this data to argue the claim that “the harness matters as much as the model.”

  3. What are the trade-offs between passing natural-language messages between agents versus structured artifact (JSON/Markdown) handoffs? In which situations does each approach excel?

  4. In the /proceed pipeline, what is the rationale for escalating to humans after a maximum of 3 retries at each gate? What problems arise if the retry count is raised to 10?

  5. When applying the single-agent instruction tuning from Week 6 (CLAUDE.md) to a multi-agent system, how would you separate common rules from role-specific rules? Reference sdlc-toolkit’s conventions.md (common) and individual SKILL.md (role-specific) structure.


  1. Role Assignment Design

    Given a project specification, design the roles, responsibilities, and MCP tool access permissions for 5 agents (Planner, Architect, Coder, QA, Wrapup).

  2. Define Artifact Schemas

    Define the schema for every artifact passed between agents. Minimum 3 types: requirement spec, task file (with dependency array), pipeline state.

  3. Dependency DAG Design

    Decompose a given requirement into TASK files, draw the dependency graph, and identify tiers that can run in parallel.

  4. Validation Gate Design

    Define the verification checklist for each phase transition. Customize the /validate checklist above to fit your project.

  5. Error Recovery Scenarios

    Document recovery strategies for 3 failure scenarios (test failure, gate exceeding 3 retries, merge conflict).

Submission deadline: 2026-04-21 23:59

Requirements:

  1. 5-stage multi-agent architecture diagram (roles, artifacts, gates included)
  2. JSON schema definitions for inter-agent artifacts (minimum 3 types)
  3. Dependency DAG design and parallelization tier analysis
  4. Validation gate checklist (per phase)
  5. Error recovery strategy document (3 scenarios)

  1. Multi-Agent SDLC = role separation + structured handoffs + gated validation: The core is not simply running multiple agents, but assigning each agent a clear role and artifact contract.
  2. Bag of agents is harmful: DeepMind research — an unstructured agent collection amplifies errors 17.2×. Central coordination reduces this to 4.4×.
  3. The harness matters as much as the model: On SWE-Bench Pro, the same model shows a 10-percentage-point performance difference depending on scaffolding.
  4. Artifacts replace messages: Instead of direct messages between agents, structured files (requirement.md, TASK-xxx.md, pipeline-state.json) carry the handoffs.
  5. Gated pipeline: A validation gate at each phase transition. Maximum 3 retries before human escalation.
  6. Parallelization is controlled by the dependency DAG: Tier 0 (no dependencies) runs in parallel; Tier N (waits for Tier N-1) runs sequentially.
  7. Knowledge management completes the feedback loop: The domain/component tags in LESSON files automatically inject past lessons into future specs and architectures.
  8. 3-layer protocol stack: MCP (agent↔tools) + A2A (agent↔agent) + AG-UI (agent↔user) = the TCP/IP of agentic AI. MCP transitioning to Streamable HTTP (SSE deprecated 2026-06-30).
  9. Managed Agents as a third option: Anthropic cloud-hosted ($0.08/session-hour). Full pipeline (custom) vs native tools (lightweight) vs Managed Agents (enterprise) — a three-way spectrum.