Module 09 · Lesson 06

Multi-Agent Orchestration, Safety & Evaluation

Reading time: 20 minutes Track: Advanced Claude Setup · Universal Prerequisites: Module 09 · Lesson 05

Why multi-agent matters now

A single Claude session with a 1M token context window and adaptive thinking is already powerful. Multi-agent architectures are the next step: multiple Claude instances working in parallel, each specialized for a sub-task, coordinated by an orchestrating agent.

Anthropic's own research uses a multi-agent architecture where Claude Opus 4.8 orchestrates Claude Sonnet 4.6 workers running in parallel. This architecture consistently outperforms single-agent Claude on complex, multi-step research tasks.

For biotech teams: this is how you build AI systems that can process an entire regulatory dossier, run multiple literature searches simultaneously, or draft and review a document in a single automated workflow.

01 · Orchestrator-worker pattern

The standard multi-agent architecture:

Orchestrator (Opus 4.8 at xhigh effort):

Receives the high-level task
Decomposes it into parallel subtasks
Delegates each subtask to a worker with: objective, output format, tool access, task boundaries
Synthesizes worker results into a final output

Workers (Sonnet 4.6 at medium effort):

Each receives a well-specified subtask
Has access to appropriate tools for that subtask only
Returns structured output in the format specified by the orchestrator
Operates independently — no direct communication between workers

Example decomposition for a regulatory intelligence task:

Orchestrator prompt:

You are orchestrating a competitive regulatory landscape analysis. Decompose this into parallel research tasks. For each task, specify: the exact search query, the expected output format, and what tools to use.

Task: Analyze the FDA approval pathway landscape for CAR-T therapies in pediatric hematological malignancies, 2020-present.

Orchestrator spawns workers for:

Worker 1: Search FDA approval database for relevant approvals
Worker 2: Search published literature for emerging pathway discussions
Worker 3: Extract key regulatory precedents from public FDA guidance documents
Worker 4: Analyze press releases and investor communications for pending submissions

Workers run in parallel, return structured JSON results, orchestrator synthesizes.

02 · Context management across sessions

Complex tasks often span multiple Claude sessions. State management is the hardest part.

Three-tier state management:

git log                     ← what was done (immutable, chronological)
/state/progress.txt         ← what's happening now (freeform, current)
/state/tasks.json           ← what's left to do (structured, machine-readable)

Example tasks.json:

{
  "completed": [
    {"id": 1, "task": "Search FDA database", "status": "done", "output_file": "fda_results.json"},
    {"id": 2, "task": "Literature review", "status": "done", "output_file": "lit_review.md"}
  ],
  "in_progress": [
    {"id": 3, "task": "Synthesize findings", "status": "active"}
  ],
  "pending": [
    {"id": 4, "task": "Prepare executive summary", "status": "waiting"}
  ]
}

Multi-context window workflow:

First context window: Set up the scaffolding — create task.json, initialize output directories, write the test criteria, build setup scripts.

Subsequent context windows: Start each new session by reading state:

Call git log --oneline -10. Review progress.txt and tasks.json. Run init.sh to verify environment. Then continue from the last completed task.

System prompt for continuation sessions:

Your context window will be automatically compacted as it approaches its limit, allowing you to continue working indefinitely. Do not stop tasks early due to token concerns. Before the context window refreshes, save your current progress to progress.txt and update tasks.json.

Biotech application: For overnight regulatory document processing pipelines, use this pattern. Each session picks up where the last left off. The pipeline can be interrupted and resumed without loss of work.

03 · Guarding against overengineering

Claude 4.6 and 4.8 models have a strong tendency to create additional files, add abstractions that weren't requested, and build for hypothetical future requirements. This is a known pattern in agentic coding contexts.

For biotech document workflows where predictability and auditability matter, add this to your system prompt:

Avoid over-engineering. Only make changes that are directly requested or clearly necessary. Keep solutions focused:
- Scope: Don't add features, refactor, or make improvements beyond what was asked
- Documentation: Don't add comments to code you didn't change
- Abstractions: Don't create helpers for one-time operations
- Temporary files: If you create temporary files during iteration, clean them up at the end of the task

Similarly, Claude Opus 4.8 spawns fewer subagents by default than 4.6. If your orchestration architecture requires explicit subagent spawning, guide it:

Spawn multiple subagents in the same turn when fanning out across independent items or reading multiple files simultaneously. Do not spawn a subagent for work you can complete directly in a single response.

04 · Safety architecture: Constitutional Classifiers

Anthropic's enterprise safety system uses a layered approach called Constitutional Classifiers:

Layer 1: All requests pass through a fast probe that examines Claude's internal activations. This screens traffic with minimal latency.

Layer 2: Suspicious exchanges escalate to a full classifier that analyzes both sides of the conversation. This classifier has near-zero universal jailbreak rate.

Layer 3: Asynchronous monitoring that runs detailed evaluation on non-real-time-sensitive traffic using more powerful models.

For enterprise deployments, this is why Anthropic-hosted Claude is preferred over self-hosted alternatives for regulated use cases — the safety architecture is maintained by Anthropic and updated continuously based on red-teaming.

Your application-level guardrails should also include:

Confirm before destructive operations. Add to your system prompt:

Consider the reversibility of your actions. For operations that are hard to reverse — deleting files, modifying production databases, pushing to main, sending external communications — confirm with the user before proceeding.

Enumerate the categories requiring confirmation:

Actions requiring confirmation before proceeding:
- Deleting any file or database record
- Modifying shared infrastructure
- Sending any external message (email, Slack, API call to external service)
- Force-pushing or amending published commits

05 · Evaluation framework

Before you can confidently deploy a Claude-powered workflow in a regulated biotech context, you need a way to measure whether it's actually working. This is the most commonly skipped step in enterprise AI deployments.

Define success criteria before building:

Bad: "The model should classify documents accurately" Good: "The classifier should achieve ≥0.92 F1 on our held-out 500-document test set, with <0.5% false-negative rate on documents containing PHI-adjacent content, and 95th-percentile response time <2 seconds"

Success criteria must be: Specific, Measurable, Achievable, Relevant.

Choose your grading method by complexity:

Task type	Grading method
Classification, yes/no, extraction with known answer	Code-based: exact match or string match
Summarization, translation	ROUGE-L or cosine similarity against reference outputs
Tone, style, regulatory appropriateness	LLM-graded Likert scale
Safety properties (PHI presence, toxicity)	LLM-graded binary classification

LLM-as-judge best practices:

Use a different model for evaluation than generation. Don't grade Opus 4.8 output with Opus 4.8.
Give the judge a detailed rubric, not a vague instruction.
Make the judge produce a reasoning trace before its verdict — this improves evaluation quality: "Think in <thinking> tags, then output 'compliant' or 'non-compliant' in <verdict> tags."
Calibrate your judge against human-labeled examples before trusting it at scale.

Anti-patterns in evaluation:

Using the generation model as the evaluation model — produces systematically biased results
Evaluating on the training distribution only — won't catch edge case failures
Treating passing tests as equivalent to feature-correct behavior — they're not the same

Minimum viable eval for a biotech document workflow: 50-100 representative documents with known correct outputs, evaluated weekly. Track three metrics: task success rate, PHI leakage rate (should be 0%), human reviewer agreement rate. If any metric degrades >5% week-over-week, investigate before scaling.

06 · Module summary

You've covered the five pillars of optimal Claude setup:

Model selection & effort — choose the right model tier; use effort before prompt engineering as the intelligence lever
Context engineering — XML tags, document-first placement, quote-extraction grounding, system prompt contracts
Cost engineering — prompt caching structure, TTL choices, cache monitoring, batch API
Advanced reasoning & tool use — adaptive thinking, interleaved thinking, parallel tool calling, thinking display
Claude Code power setup — CLAUDE.md, settings hierarchy, MCP security, Agent Skills, deterministic hooks, managed enforcement

Multi-agent orchestration, safety architecture, and evaluation are the operational context that makes these pillars work at scale in a regulated environment.

07 · Knowledge check

Q1. You're building a multi-agent pipeline where Claude Opus 4.8 orchestrates three Sonnet 4.6 workers running literature searches in parallel. The orchestrator receives 10,000 tokens of static task context plus 2,000 tokens of dynamic query parameters on each run. What is the best caching strategy?

a) No caching — the query parameters change each run, so nothing can be cached b) Cache the 10,000-token static context with a breakpoint before the dynamic parameters c) Cache the entire prompt including dynamic parameters using a 5-minute TTL d) Cache using the Batch API to reduce costs by 50%

Q2. Your Claude Code deployment needs to prevent any file write to paths containing "patient" or "phi" in the filename, for all developers, with no override. Which mechanism enforces this?

a) CLAUDE.md instruction: "Never write files to paths containing 'patient' or 'phi'" b) A PreToolUse hook in .claude/settings.json committed to the repo c) A PreToolUse hook in managed settings with allowManagedHooksOnly: true d) MCP server configuration restricting filesystem access

Q3. After deploying a regulatory document classification system, you notice the LLM-graded evaluation shows unusually high accuracy. You suspect the evaluation is biased. What is the most likely cause?

a) The evaluation test set is too small b) You're using the same Claude model to both generate and evaluate the classifications c) The LLM judge is using too strict a rubric d) The evaluation is running on the wrong document type

Answers: Q1: b · Q2: c · Q3: b

End of Lesson 06 and Module 09.