The decision that used to take six weeks
Last week I needed to decide the architecture for an AI agent: single-agent or multi-agent? The old way would be to build both, deploy an A/B test, measure metrics for 2 weeks, refactor — totalling 6–8 weeks of engineering.
The new way: spawn 3 subagent variants in Claude Code, run the same input, compare outputs side-by-side — the whole thing took an afternoon. The winner went to production.
This isn't a shortcut. It's a methodology that works across AI coding tools — Claude Code, Cursor, OpenCode, Cline. The tool matters less than the principle.
Why the traditional loop wastes weeks
When building an AI agent system, most devs follow this pattern:
- 01 Design architecture: 1 week
- 02 Build backend infrastructure: 2–3 weeks
- 03 Build prompt logic: 1 week
- 04 Deploy + test: 1 week
- 05 Find issues → refactor: 2–4 weeks
- 06 Repeat: ∞
- Total time: 6–12 weeks per decision
- Risk: sunk-cost bias when wrong
The core problem: confusing experimentation with production. Production code is hard to change — schema migrations, test infrastructure, deployment pipelines, rollback complexity. Experimentation needs to be fast, cheap, disposable. These two concerns MUST be separated.
Two runtimes, two jobs
AI coding tools (Claude Code, Cursor, OpenCode) fit the experimentation phase well. A production web service is the scaling runtime, built only once the patterns are proven.
Experimentation runtime
- Free subagent spawning
- Tool calling built-in
- File I/O instant
- Context management automatic
- Disposable experiments
- Real-time iteration
Subagent spawning in subscription-based coding tools isn't billed per token the way a production API is, so the marginal cost of an iteration is ≈ $0. Note: pricing models vary by tool and may change over time.
Production runtime
- Built only when patterns proven
- Cheap models with distilled prompts
- 24/7 uptime
- Multi-tenant
- Code-level validators
- Cost-optimized per call
Rule: Validate in experimentation runtime first. Build for production only after patterns are stable.
Four steps from idea to production
Validate workflow with an AI coding tool
Open Claude Code / Cursor / OpenCode. Create a CLAUDE.md with context, build 5–10 simple tools (Python functions), spawn a subagent to run the workflow end-to-end, iterate prompts in real-time.
What you need:
- Markdown files (skills / SOPs)
- Python functions (tools)
- Sample data (5–10 real examples)
What you can skip at this stage:
- Backend infrastructure
- Frontend UI
- Database
- Deployment
A/B test with subagents
For architecture decisions: spawn variants in parallel, compare outputs on the same inputs, and pick the pattern that holds up most consistently. No production code is touched during this phase.
- Variant A: Single mega-agent
- Variant B: Multi-agent chain
- Variant C: Self-critique loop
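A minimal sketch of the comparison harness, assuming each variant can be wrapped as a function that takes a prospect record and returns email text. In a real session each function would trigger a subagent spawn; here `STUB_VARIANTS`, `run_variants`, and `side_by_side` are hypothetical names, and the stubs only return placeholder strings.

```python
from typing import Callable

# A variant takes one prospect record and returns the drafted email text.
Variant = Callable[[dict], str]


def run_variants(
    variants: dict[str, Variant], prospects: list[dict]
) -> dict[str, dict[str, str]]:
    """Run every variant on the same prospects; key outputs by variant, then prospect."""
    return {
        variant_name: {p["name"]: variant(p) for p in prospects}
        for variant_name, variant in variants.items()
    }


def side_by_side(results: dict[str, dict[str, str]], prospect_name: str) -> str:
    """Render all variant outputs for one prospect as a single comparison block."""
    lines = [f"=== {prospect_name} ==="]
    for variant_name, outputs in results.items():
        lines.append(f"[{variant_name}] {outputs[prospect_name]}")
    return "\n".join(lines)


# Prospect records are illustrative; the variant functions are hypothetical stubs.
PROSPECTS = [
    {"name": "Joshua", "role": "VP Engineering", "industry": "SaaS"},
    {"name": "Evan", "role": "Founder", "industry": "logistics"},
    {"name": "Sandra", "role": "Head of Growth", "industry": "fintech"},
]
STUB_VARIANTS: dict[str, Variant] = {
    "A_single_mega_agent": lambda p: f"Hi {p['name']}, drafted by one prompt.",
    "B_multi_agent_chain": lambda p: f"Hi {p['name']}, drafted by a chain.",
    "C_self_critique_loop": lambda p: f"Hi {p['name']}, drafted then critiqued.",
}

if __name__ == "__main__":
    results = run_variants(STUB_VARIANTS, PROSPECTS)
    for prospect in PROSPECTS:
        print(side_by_side(results, prospect["name"]))
```

Keeping the inputs fixed and varying only the architecture is what makes the outputs comparable; the harness itself stays disposable.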
Distill with the best model
Use Opus / GPT-5 for the LEARNING phase: run 100+ workflow executions, document what works / what doesn't, extract minimum prompts, build eval sets.
Why does distillation work? The reasoning Opus does during the learning phase gets packaged into a condensed prompt, so the cheaper model inherits it through instructions instead of re-reasoning from scratch. This is knowledge distillation at the prompt layer, not the model-weights layer. You pay the reasoning cost once (learning) and apply the result cheaply on every production call.
Note: I use 'distillation' informally — compressing expensive workflow reasoning into reusable prompts. This is not model-weight distillation in the ML sense.
Migrate to production
Once validated: replace AI coding tool → SDK + API, replace expensive model → cheap model with distilled prompts, add code-level validators, deploy infrastructure.
18 spawns in a single session
Question: Cold email agent architecture for project D2 — 3-agent chain vs 1-agent monster vs critique loop?
How we evaluated each variant
Dataset: 3 real prospects — Joshua (VP Engineering, SaaS), Evan (Founder, logistics), Sandra (Head of Growth, fintech). Each variant ran on all 3. Metrics per output:
| Metric | How measured | Type |
|---|---|---|
| Fabrication count | Regex scan for unverifiable claims (numbers, titles) | Deterministic |
| Word-count compliance | len(output.split()) within the [180, 260] word range | Deterministic |
| Schema validity | JSON keys present + non-empty | Deterministic |
| Tactical framing | Separate agent: does hook address a real pain point? | Semantic |
Sandra (fintech) broke 2 variants that passed Joshua and Evan — which is why you need more than 1 test case before declaring a winner.
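The three deterministic metrics in the table can be sketched as plain Python functions. This is an illustration under assumptions, not the checks actually used: the `NUMERIC_CLAIM` pattern, the `known_facts` parameter, and the `REQUIRED_KEYS` schema are all hypothetical, and the fabrication check here only covers numeric claims, not titles.

```python
import json
import re

WORD_RANGE = (180, 260)              # range taken from the metrics table
REQUIRED_KEYS = ("subject", "body")  # hypothetical schema for an email draft

# Hypothetical pattern: numbers, optionally with a leading "$" or trailing "%".
NUMERIC_CLAIM = re.compile(r"\$?\d+(?:[.,]\d+)*%?")


def fabrication_count(text: str, known_facts: set[str]) -> int:
    """Count numeric claims in the text that are not in the verified fact set."""
    return sum(1 for claim in NUMERIC_CLAIM.findall(text) if claim not in known_facts)


def word_count_ok(text: str, low: int = WORD_RANGE[0], high: int = WORD_RANGE[1]) -> bool:
    """Check that the number of whitespace-separated words is within range."""
    return low <= len(text.split()) <= high


def schema_ok(raw_json: str, required: tuple[str, ...] = REQUIRED_KEYS) -> bool:
    """Check that the output parses as a JSON object with non-empty string fields."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(k), str) and data[k].strip() for k in required)


if __name__ == "__main__":
    draft = "We cut onboarding time 40% for 12 clients last year."
    print(fabrication_count(draft, known_facts={"40%"}))
    print(word_count_ok(draft))
    print(schema_ok('{"subject": "Quick idea", "body": "Hello"}'))
```

Because these checks are deterministic, they can run on every output of every variant at no token cost.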
Findings from 18 spawns
The subagent self-scores "pass" while the output objectively fails. Don't trust agents to self-evaluate — you need an external validator or an independent second agent.
Enforcing anti-fabrication shrinks the email; loosening it causes the agent to invent numbers. Can't be solved by prompt alone — need a code validator after each stage.
Critical rules buried in a mega-prompt get ignored. Critical rules must live in a dedicated space at the start of the context — not buried mid-instruction.
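One way to give critical rules a dedicated space is to assemble the prompt in code so the rules always come first. A minimal sketch, assuming a three-part prompt; the section labels and the `build_prompt` helper are hypothetical, not a format any tool requires.

```python
def build_prompt(critical_rules: list[str], instructions: str, context: str) -> str:
    """Assemble a prompt with the critical rules in their own block at the top."""
    rules_block = "\n".join(
        f"{i}. {rule}" for i, rule in enumerate(critical_rules, start=1)
    )
    return (
        "CRITICAL RULES (these override everything below):\n"
        f"{rules_block}\n\n"
        f"INSTRUCTIONS:\n{instructions}\n\n"
        f"CONTEXT:\n{context}"
    )


if __name__ == "__main__":
    print(
        build_prompt(
            critical_rules=[
                "Never state a number that is not in CONTEXT.",
                "Keep the email between 180 and 260 words.",
            ],
            instructions="Write a cold email with one specific hook.",
            context="Prospect: Head of Growth at a fintech company.",
        )
    )
```

Building the prompt from parts also keeps the rules in one reviewable list instead of scattered through the instructions.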
Hook quality comes from specific wording in the prompt, not from the number of agents. A single agent with a great prompt beats a 3-agent chain with a lazy one.
Single-agent: context overflow. Multi-agent: coordination overhead. Critique loop: over-polished output loses natural tone. None is universally clean — know the failure modes upfront to design hedges.
Production decision
- 3-agent chain
- Anti-fabrication rules in the BASE prompt
- Python validators between stages
- QA agent handles semantic checks only
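The "validators between stages" part of this decision can be sketched as a small chain runner. This is an assumed shape, not the production code: `run_chain`, `StageValidationError`, and the stub stages are hypothetical, and real stages would call a model instead of transforming strings.

```python
from typing import Callable

Stage = Callable[[str], str]
Validator = Callable[[str], list[str]]  # returns problems; an empty list means pass


class StageValidationError(Exception):
    """Raised when a stage's output fails its code-level validator."""


def run_chain(stages: list[tuple[str, Stage, Validator]], initial_input: str) -> str:
    """Run stages in order, validating each output before passing it on."""
    payload = initial_input
    for name, stage, validate in stages:
        payload = stage(payload)
        problems = validate(payload)
        if problems:
            raise StageValidationError(f"{name}: {'; '.join(problems)}")
    return payload


def no_digits(text: str) -> list[str]:
    """Hypothetical anti-fabrication check: reject any digit in the output."""
    return ["contains an unverified number"] if any(c.isdigit() for c in text) else []


if __name__ == "__main__":
    # Stub stages standing in for the researcher, personalizer, and QA agents.
    chain = [
        ("research", lambda s: s + " | pain point: slow onboarding", no_digits),
        ("personalize", lambda s: s + " | hook drafted", no_digits),
        ("qa", lambda s: s + " | approved", no_digits),
    ]
    print(run_chain(chain, "Prospect: Sandra"))
```

Failing fast at the stage that introduced the problem is the point: the QA agent never sees output that a cheap code check could have rejected.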
Evaluation is the bottleneck
Generation is straightforward — any tool can spawn variants. The harder problem is deciding which one is actually better, and on what criteria.
In those 18 spawns, the most time-consuming part wasn't spawning variants — it was evaluating them. Generating 7 email variants took about 20 minutes. Deciding which one was best took the other 6 hours.
The evaluation methods we actually used, in order of effort:
- Deterministic checks: word count, fabrication regex, schema key checks. Fast, cheap, and good at catching obvious failures. These ran first; anything failing here got cut before manual review.
- Semantic check: a second agent asked, "Does this hook address a real, specific pain point for this prospect?" Slower and costs tokens, but catches subtler failure modes the regex misses.
- Human read-through: read the surviving variants out loud. The one that doesn't make you cringe is the winner. Unscalable, but for high-stakes output it's irreplaceable.
The lesson: any tool can spawn variants. What separates good experimentation from brute-force prompting is having a way to measure 'better'. No eval = no signal.
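The ordering described above (cheap checks first, the token-costing judge second, a human last) can be sketched as a funnel. This assumes the semantic check can be wrapped as a function returning a score; `evaluation_funnel` and the `judge` callable are hypothetical stand-ins for the second agent.

```python
from typing import Callable


def evaluation_funnel(
    candidates: dict[str, str],
    checks: list[Callable[[str], bool]],
    judge: Callable[[str], float],
    shortlist_size: int = 2,
) -> list[str]:
    """Return the names of the variants worth a human read, best first."""
    # Stage 1: deterministic gate. Anything failing a check is cut here.
    survivors = {
        name: text
        for name, text in candidates.items()
        if all(check(text) for check in checks)
    }
    # Stage 2: the judge only scores survivors, so it is called as rarely as possible.
    scores = {name: judge(text) for name, text in survivors.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Stage 3 (human read-through) happens outside the code, on this shortlist.
    return ranked[:shortlist_size]


if __name__ == "__main__":
    drafts = {
        "variant_1": "Hi",
        "variant_2": "Hi Sandra, your onboarding funnel caught my eye",
        "variant_3": "Hi Sandra, quick note about growth",
    }
    min_words = lambda text: len(text.split()) >= 5
    # Stub judge: a real one would ask a second agent about the hook.
    stub_judge = lambda text: float("onboarding" in text)
    print(evaluation_funnel(drafts, [min_words], stub_judge))
```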
Pick what feels comfortable
The method works with all of them. A compact comparison:
| Tool | Vendor | Best for | Price |
|---|---|---|---|
| Claude Code | Anthropic | Claude-centric, async workflows | $20–200/mo |
| Cursor | Anysphere | Developers in IDE | $20/mo Pro |
| OpenCode | Open Source | Privacy / custom needs | Free + your API |
| Cline | VS Code Ext. | VS Code users | Free + API |
| Aider | Open Source | CLI lovers | Free + API |
From experimentation to production
Project layout:

    project/
    ├── CLAUDE.md                 # or .cursorrules, AGENTS.md
    ├── tools/
    │   ├── apollo_search.py
    │   ├── email_send.py
    │   └── storage.py
    ├── skills/
    │   ├── outbound-research.md
    │   └── personalization.md
    └── data/
        └── test_prospects.json

A tool is a plain Python function:

    def apollo_search(
        industry: str,
        size_range: str,
        location: str,
    ) -> list[dict]:
        """Search Apollo for prospects matching criteria."""
        prospects: list[dict] = []  # filled in by the real Apollo lookup
        # ... implementation
        return prospects

Prompt for the parallel comparison:

    Spawn 3 subagents in parallel:

    Agent A (multi-stage):
      skills/researcher.md → skills/personalizer.md → skills/qa.md
      → outputs/variant_a.json

    Agent B (mega-prompt):
      skills/all-in-one.md
      → outputs/variant_b.json

    Agent C (critique loop):
      skills/draft-critique-rewrite.md
      → outputs/variant_c.json

When to apply (and when not to)
✅ Good fit:
- Prompt template changes (any modification)
- Agent architecture decisions (single vs multi)
- LLM provider switches (Claude vs GPT vs MiniMax)
- Tool definition shapes
- Output format choices (JSON vs tool call vs file)
- Workflow ordering (sequential vs parallel)
- Context window strategies
❌ Not a fit:
- Performance at scale (latency, cost at high volume)
- Multi-user concurrent testing
- Real-time integration testing
- Production database constraints
- Long-running stateful workflows
- Infrastructure-specific quirks
- Safety-critical systems
- Financial correctness guarantees
- Concurrency + distributed-systems correctness
→ For ❌ cases: validate prompt logic in the experimentation runtime first, then test infrastructure in a separate staging environment.
Failure modes of this method
Things that have actually gone wrong, or that I watch for:
- Benchmark overfitting — tuned for Joshua and Evan, failed on Sandra (prospect 3). Test on more than 2 examples before declaring done.
- Weak evaluation — if eval criteria are loose, the "winner" is just whichever variant looks most convincing. Garbage eval → garbage decision.
- Local optimum — the best of 7 variants tried ≠ the globally best variant. You're sampling, not exhausting the search space.
- Sim/prod mismatch — runs smoothly in coding tool ≠ performs at scale, under concurrency, or with real API rate limits.
- Prompt spaghetti — fast iteration breeds prompts nobody can maintain. Document why each rule exists while you still remember.
- Architecture cargo-culting — copying 3-agent because someone else uses it, not because your data says so.
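The benchmark-overfitting failure above is easier to spot when pass rates on the examples used for tuning are reported separately from pass rates on held-out examples. A minimal sketch; `holdout_report` is a hypothetical helper and the pass/fail values in the demo are illustrative, not the recorded results.

```python
def _rate(flags: list[bool]) -> float:
    """Fraction of True values; refuses to report on an empty split."""
    if not flags:
        raise ValueError("no examples in this split")
    return sum(flags) / len(flags)


def holdout_report(
    results: dict[str, dict[str, bool]], holdout: set[str]
) -> dict[str, dict[str, float]]:
    """For each variant, compare pass rate on tuning examples vs held-out examples."""
    report = {}
    for variant, by_example in results.items():
        tuning = [ok for name, ok in by_example.items() if name not in holdout]
        held = [ok for name, ok in by_example.items() if name in holdout]
        report[variant] = {"tuning": _rate(tuning), "holdout": _rate(held)}
    return report


if __name__ == "__main__":
    # Illustrative values: a variant that looks perfect until the held-out prospect.
    results = {
        "variant_x": {"Joshua": True, "Evan": True, "Sandra": False},
        "variant_y": {"Joshua": True, "Evan": False, "Sandra": True},
    }
    for variant, rates in holdout_report(results, holdout={"Sandra"}).items():
        print(variant, rates)
```

A large gap between the two rates is the signal that a prompt was tuned to its test cases and not to the task.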
When the production build is worth it
Build for production when:
- Prompt validated across ≥10 real examples
- Output quality stable across variations
- Edge cases catalogued + handled
- Customer is EXTERNAL (not just yourself)
- Need 24/7 uptime
- Volume ≥50 calls/day
- Cost optimization needed
Stay in the experimentation runtime when:
- Low volume (Claude Code session is enough)
- Internal use (you operate it yourself)
- Workflow still evolving
- Variants not yet stable
- Customer feedback still shaping
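The criteria above that argue for a production build can be written down as an explicit gate, so the decision is checked rather than felt. This sketch assumes every criterion is required, which the checklist does not state; `WorkflowStatus` and `production_blockers` are hypothetical names, and only the thresholds (10 examples, 50 calls/day) come from the list.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStatus:
    validated_examples: int
    output_stable: bool
    edge_cases_handled: bool
    external_customer: bool
    needs_24_7_uptime: bool
    calls_per_day: int


def production_blockers(status: WorkflowStatus) -> list[str]:
    """List the reasons to stay in the experimentation runtime; empty means build."""
    blockers = []
    if status.validated_examples < 10:
        blockers.append("prompt validated on fewer than 10 real examples")
    if not status.output_stable:
        blockers.append("output quality not yet stable across variations")
    if not status.edge_cases_handled:
        blockers.append("edge cases not catalogued and handled")
    if not status.external_customer:
        blockers.append("only internal use so far")
    if not status.needs_24_7_uptime:
        blockers.append("no 24/7 uptime requirement")
    if status.calls_per_day < 50:
        blockers.append("volume below 50 calls/day")
    return blockers


if __name__ == "__main__":
    status = WorkflowStatus(
        validated_examples=3,
        output_stable=True,
        edge_cases_handled=False,
        external_customer=True,
        needs_24_7_uptime=True,
        calls_per_day=12,
    )
    for reason in production_blockers(status):
        print("-", reason)
```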
Why this method wins
- A/B test: 30 min (vs 2 weeks)
- Architecture validation: from weeks of build-and-measure to a few hours of structured comparison
- Experimentation: ≈ $0 marginal cost (subagent spawns are covered by the subscription)
- Distillation: $1–3K of Opus usage
- Production: 70–90% cheaper per call (based on published token pricing between frontier and cheap models)
- Decisions grounded in observed outputs, not assumptions
- Better edge-case coverage
- Lower regression risk
- Failed experiments cost only the time to run them
- Pivot freely
- Production = only validated patterns
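The cost figures above imply a break-even point: the one-time distillation spend is recovered only after enough production calls have run on the cheaper model. A back-of-the-envelope sketch; the $0.05 per-call frontier cost in the demo is a hypothetical number chosen for illustration, while the $2,000 and 80% values are picked from inside the ranges quoted in the list.

```python
def break_even_calls(
    distill_cost: float, frontier_cost_per_call: float, savings_fraction: float
) -> float:
    """Number of production calls needed before distillation pays for itself."""
    if distill_cost < 0 or frontier_cost_per_call <= 0:
        raise ValueError("costs must be positive")
    if not 0 < savings_fraction <= 1:
        raise ValueError("savings_fraction must be in (0, 1]")
    saved_per_call = frontier_cost_per_call * savings_fraction
    return distill_cost / saved_per_call


if __name__ == "__main__":
    calls = break_even_calls(
        distill_cost=2_000.0,         # within the $1–3K range above
        frontier_cost_per_call=0.05,  # hypothetical per-call cost on the frontier model
        savings_fraction=0.80,        # within the 70–90% range above
    )
    print(f"break-even after about {calls:,.0f} calls")
```

Under these assumed numbers the break-even is roughly 50,000 calls, so the expected call volume is worth estimating before paying for a large distillation run.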
The principle is bigger than AI agents — "push uncertainty into the cheapest environment before making an expensive commitment". Like test before deploy, prototype before build. For AI: subagent before production code.
If you have a prompt decision that's been sitting unresolved, this is a low-cost way to get signal. Spawn a few variants, run them on real inputs, and see what the data suggests. I'd be curious to hear what you find.