Agentic Notes

01 Opening

The decision that used to take six weeks

Last week I needed to decide on the architecture for an AI agent: single-agent or multi-agent? The old way would have been to build both, deploy an A/B test, measure metrics for 2 weeks, then refactor, for a total of 6–8 weeks of engineering.

The new way: spawn 3 subagent variants in Claude Code, run them on the same inputs, and compare the outputs side by side. The whole thing took an afternoon, and the winner went to production.

This isn't a shortcut. It's a methodology that works across AI coding tools — Claude Code, Cursor, OpenCode, Cline. The tool matters less than the principle.

02 The problem

Why the traditional loop wastes weeks

When building an AI agent system, most devs follow this pattern:

Step	Activity	Duration
01	Design architecture	1 week
02	Build backend infrastructure	2–3 weeks
03	Build prompt logic	1 week
04	Deploy + test	1 week
05	Find issues → refactor	2–4 weeks
06	Repeat	∞

Total time: 6–12 weeks per decision
Risk: sunk-cost bias when the decision turns out to be wrong

The core problem: confusing experimentation with production. Production code is hard to change — schema migrations, test infrastructure, deployment pipelines, rollback complexity. Experimentation needs to be fast, cheap, disposable. These two concerns MUST be separated.

03 Core insight

Two runtimes, two jobs

AI coding tools (Claude Code, Cursor, OpenCode) fit the experimentation phase well. A production web service is the scaling runtime, built only once the patterns are proven.

AI coding tools

Experimentation runtime

Free subagent spawning
Tool calling built-in
File I/O instant
Context management automatic
Disposable experiments
Real-time iteration

Subagent spawning in subscription-based coding tools isn't billed per token the way a production API is, so the marginal cost of an iteration is ≈ $0 as long as you stay within the plan's limits. Note: pricing models vary by tool and change over time.

SDK + webservice

Production runtime

Built only when patterns are proven
Cheap models with distilled prompts
24/7 uptime
Multi-tenant
Code-level validators
Cost-optimized per call

Rule: Validate in the experimentation runtime first. Build for production only after the patterns are stable.

04 The method

Four steps from idea to production

1. Validate workflow with an AI coding tool

Open Claude Code / Cursor / OpenCode. Create a CLAUDE.md with context, build 5–10 simple tools (Python functions), spawn a subagent to run the workflow end-to-end, iterate prompts in real-time.

Need

Markdown files (skills / SOPs)
Python functions (tools)
Sample data (5–10 real examples)

Skip

Backend infrastructure
Frontend UI
Database
Deployment

2. A/B test with subagents

For architecture decisions: spawn variants in parallel, compare outputs on the same inputs, and pick the pattern that holds up most consistently. No production code is touched during this phase.

Variant A: Single mega-agent
Variant B: Multi-agent chain
Variant C: Self-critique loop
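
The comparison above can be sketched roughly as follows. This is a minimal illustration, not the harness used for this post: each variant is modelled as a plain Python callable (in a coding tool it would wrap a subagent run), and `passes` is a hypothetical stand-in for whatever check you trust. All names are assumptions.

```python
from typing import Callable

Variant = Callable[[str], str]  # hypothetical: takes an input, returns the variant's output


def pick_most_consistent(
    variants: dict[str, Variant],
    inputs: list[str],
    passes: Callable[[str], bool],
) -> tuple[str, dict[str, float]]:
    """Run every variant on the same inputs and return the one with the highest pass rate."""
    if not inputs:
        raise ValueError("need at least one input to compare variants")
    pass_rates: dict[str, float] = {}
    for name, run in variants.items():
        outcomes = [passes(run(item)) for item in inputs]
        pass_rates[name] = sum(outcomes) / len(outcomes)
    # max() keeps the first variant on ties; add a tie-breaker if that matters to you
    winner = max(pass_rates, key=lambda name: pass_rates[name])
    return winner, pass_rates


# Stand-in variants for illustration only; real ones would call a subagent.
demo_variants: dict[str, Variant] = {
    "A_single_mega_agent": lambda text: text.upper(),
    "B_multi_agent_chain": lambda text: text.strip(),
}
demo_inputs = [" hello ", "world", " hi"]
winner, rates = pick_most_consistent(
    demo_variants, demo_inputs, passes=lambda out: out == out.strip()
)
```

Keeping the per-variant pass rates, not just the winner, makes it easier to see when two variants are too close to call on a small input set.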

3. Distill with the best model

Use Opus / GPT-5 for the LEARNING phase: run 100+ workflow executions, document what works / what doesn't, extract minimum prompts, build eval sets.

Output: Production-grade system prompts
Cost: covered by your coding-tool subscription ($20–200/mo), a flat fee rather than metered API tokens. Note: subscription limits and pricing vary by tool and change over time.

Why does distillation work? The frontier model's reasoning gets packaged into a condensed prompt, so the cheaper model inherits it via instruction instead of re-reasoning from scratch. This happens at the prompt layer, not the model-weights layer. You pay the reasoning cost once (learning) and apply the result cheaply on every production call.

Note: I use 'distillation' informally — compressing expensive workflow reasoning into reusable prompts. This is not model-weight distillation in the ML sense.
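
One way the eval set from the learning phase might be reused: replay it against the cheaper model with the distilled prompt and list the cases that regress. This is a sketch under assumptions. `EvalCase`, the `must_contain` criterion, and `fake_cheap_model` are hypothetical names, and a real check would be richer than phrase matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    must_contain: tuple[str, ...]  # hypothetical pass criterion: required phrases


def failing_cases(cases: list[EvalCase], run_prompt: Callable[[str], str]) -> list[str]:
    """Return the ids of eval cases whose output misses any required phrase."""
    failures = []
    for case in cases:
        output = run_prompt(case.input_text).lower()
        if not all(phrase.lower() in output for phrase in case.must_contain):
            failures.append(case.case_id)
    return failures


# Stand-in for "cheap model + distilled prompt"; a real version would call an SDK.
def fake_cheap_model(input_text: str) -> str:
    return f"Hi, I noticed {input_text} and wanted to reach out."


eval_set = [
    EvalCase("case-1", "your logistics backlog", ("logistics",)),
    EvalCase("case-2", "your fintech growth team", ("fintech", "pricing")),
]
regressions = failing_cases(eval_set, fake_cheap_model)
```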

4. Migrate to production

Once validated: replace AI coding tool → SDK + API, replace expensive model → cheap model with distilled prompts, add code-level validators, deploy infrastructure.

Cost reduction: 70–90% per call (based on published token pricing between frontier and cheap models)
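
The arithmetic behind a per-call saving can be sketched like this. The prices below are placeholders chosen for round numbers, not any vendor's actual price list, and the function names are assumptions; plug in the published prices for the models you are comparing.

```python
def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


def reduction(frontier_cost: float, cheap_cost: float) -> float:
    """Fractional saving when moving from the frontier model to the cheap one."""
    return 1 - cheap_cost / frontier_cost


# Placeholder prices (USD per million tokens), NOT any vendor's real price list.
frontier = cost_per_call(2_000, 500, input_price_per_mtok=10.0, output_price_per_mtok=40.0)
cheap = cost_per_call(2_000, 500, input_price_per_mtok=2.0, output_price_per_mtok=8.0)
saving = reduction(frontier, cheap)
```

With these placeholder numbers the frontier call costs $0.04, the cheap call $0.008, and the saving is 80%. A shorter distilled prompt would lower `input_tokens` on the cheap side and increase the saving further.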

05 Real example

18 spawns in a single session

Question: Cold email agent architecture for project D2 — 3-agent chain vs 1-agent monster vs critique loop?

How we evaluated each variant

Dataset: 3 real prospects — Joshua (VP Engineering, SaaS), Evan (Founder, logistics), Sandra (Head of Growth, fintech). Each variant ran on all 3. Metrics per output:

Metric	How measured	Type
Fabrication count	Regex scan for unverifiable claims (numbers, titles)	Deterministic
Word-count compliance	len(output.split()) within the [180, 260] range	Deterministic
Schema validity	JSON keys present + non-empty	Deterministic
Tactical framing	Separate agent: does hook address a real pain point?	Semantic
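
The three deterministic metrics in the table could look roughly like this in Python. This is a sketch, not the validators used for this post: `REQUIRED_KEYS` is a hypothetical schema, and the fabrication check covers numeric claims only (job titles would need a separate pattern or lookup).

```python
import json
import re

# Hypothetical schema for one email; the post does not list the actual keys.
REQUIRED_KEYS = ("subject", "body")

# Matches numeric claims such as "40%", "$1,200" or "3.5".
NUMERIC_CLAIM = re.compile(r"\$?\d+(?:[.,]\d+)*%?")


def word_count_ok(text: str, low: int = 180, high: int = 260) -> bool:
    """Word-count compliance: count words, not characters."""
    return low <= len(text.split()) <= high


def schema_ok(raw_json: str, required: tuple[str, ...] = REQUIRED_KEYS) -> bool:
    """Schema validity: parses as a JSON object with every key present and non-empty."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), str) and data[key].strip() for key in required)


def fabrication_count(text: str, verified: set[str]) -> int:
    """Count numeric claims that do not appear in the verified facts for the prospect."""
    return sum(1 for claim in NUMERIC_CLAIM.findall(text) if claim not in verified)
```

For example, `fabrication_count("We cut costs by 40% for 12 teams", {"12"})` flags only the `40%`, because `12` is in the verified set.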

Sandra (fintech) broke 2 variants that passed Joshua and Evan — which is why you need more than 1 test case before declaring a winner.

Findings from 18 spawns

Self-QA bias is real

The subagent self-scores "pass" while the output objectively fails. Don't trust agents to self-evaluate — you need an external validator or an independent second agent.
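
To see how large the self-QA bias is in your own runs, one option is to log the agent's self-score next to an external validator's verdict and count the disagreements. A minimal sketch, assuming hypothetical record keys `self_score` and `external_ok`; the records below are illustrative, not the results from the 18 spawns.

```python
def false_pass_rate(results: list[dict]) -> float:
    """Share of outputs the agent scored 'pass' that the external validator rejected."""
    self_passed = [r for r in results if r["self_score"] == "pass"]
    if not self_passed:
        return 0.0
    rejected = [r for r in self_passed if not r["external_ok"]]
    return len(rejected) / len(self_passed)


# Illustrative records only; these are not the results from the 18 spawns.
demo_results = [
    {"variant": "A", "self_score": "pass", "external_ok": True},
    {"variant": "B", "self_score": "pass", "external_ok": False},
    {"variant": "C", "self_score": "fail", "external_ok": False},
    {"variant": "D", "self_score": "pass", "external_ok": False},
    {"variant": "E", "self_score": "pass", "external_ok": True},
]
rate = false_pass_rate(demo_results)
```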

Anti-fab ↔ word-count tension

Enforcing anti-fabrication shrinks the email; loosening it causes the agent to invent numbers. This can't be solved by prompting alone; it needs a code validator after each stage.

Context dilution effect

Critical rules buried in a mega-prompt get ignored. They must live in a dedicated block at the start of the context, not mid-instruction.
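
One simple way to enforce that placement is to assemble the prompt in code so the critical rules always come first. A sketch with hypothetical section labels and function name; the exact wording of the headers is an assumption.

```python
def build_prompt(critical_rules: list[str], instructions: str, prospect_context: str) -> str:
    """Assemble a prompt with the critical rules in their own block at the very start."""
    rules_block = "\n".join(
        f"{number}. {rule}" for number, rule in enumerate(critical_rules, start=1)
    )
    return (
        "CRITICAL RULES (these override everything below):\n"
        f"{rules_block}\n\n"
        f"INSTRUCTIONS:\n{instructions}\n\n"
        f"PROSPECT:\n{prospect_context}\n"
    )


prompt = build_prompt(
    critical_rules=["Never invent numbers or job titles.", "Stay within 180-260 words."],
    instructions="Write a cold email with one specific hook.",
    prospect_context="Head of Growth at a fintech company.",
)
```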

Prompt > Architecture

Hook quality comes from specific wording in the prompt, not from the number of agents. A single agent with a great prompt beats a 3-agent chain with a lazy one.

Each architecture has DIFFERENT failure modes

Single-agent: context overflow. Multi-agent: coordination overhead. Critique loop: over-polished output loses natural tone. None is universally clean — know the failure modes upfront to design hedges.

Production decision

3-agent chain
Anti-fabrication rules in the BASE prompt
Python validators between stages
QA agent handles semantic checks only
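
A chain with code-level validators between stages might be wired up like this. It is a sketch under assumptions: each stage is a plain callable standing in for an agent call, `no_digits` is a toy anti-fabrication check, and all names are hypothetical.

```python
from typing import Callable

Stage = Callable[[str], str]            # hypothetical: one agent call
Validator = Callable[[str], list[str]]  # returns a list of problems; empty means OK


class StageValidationError(Exception):
    """Raised when a stage's output fails its code-level validator."""


def run_chain(stages: list[tuple[str, Stage, Validator]], initial: str) -> str:
    """Run stages in order, validating each output before the next stage sees it."""
    data = initial
    for name, stage, validate in stages:
        data = stage(data)
        problems = validate(data)
        if problems:
            raise StageValidationError(f"{name}: " + "; ".join(problems))
    return data


def no_digits(text: str) -> list[str]:
    """Toy anti-fabrication validator: flags any number in the text."""
    return ["contains a number"] if any(ch.isdigit() for ch in text) else []


# Stand-in stages; real ones would call the research and personalization agents.
clean_chain = [
    ("research", lambda text: text + " | pain: slow onboarding", no_digits),
    ("personalize", lambda text: "Hi Sandra, " + text, no_digits),
]
result = run_chain(clean_chain, "fintech growth")
```

Failing fast at the stage that introduced the problem keeps a fabricated number from being polished by later stages, and the error message names the stage to fix.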

06 Key insight

Evaluation is the bottleneck

Generation is straightforward — any tool can spawn variants. The harder problem is deciding which one is actually better, and on what criteria.

In those 18 spawns, the most time-consuming part wasn't spawning variants — it was evaluating them. Generating 7 email variants took about 20 minutes. Deciding which one was best took the other 6 hours.

The evaluation methods we actually used, in order of effort:

1. Deterministic validator (Python)

Word count, fabrication regex, schema key checks. Fast, cheap, catches obvious failures. Ran first — anything failing here got cut before manual review.

2. Semantic check (separate agent)

A second agent asked: "Does this hook address a real, specific pain point for this prospect?" Slower and costs tokens, but catches subtler failure modes the regex misses.

3. Manual side-by-side

Read the surviving variants out loud. The one that doesn't make you cringe is the winner. Unscalable, but for high-stakes output it's irreplaceable.
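
The first two methods can be chained into a funnel so the expensive check only sees what the cheap one lets through, leaving a shortlist for the manual read. A minimal sketch; `fake_semantic_check` is a stand-in for the second agent, and the check functions are hypothetical.

```python
from typing import Callable

Check = Callable[[str], bool]


def evaluation_funnel(
    outputs: dict[str, str],
    deterministic_checks: list[Check],
    semantic_check: Check,
) -> list[str]:
    """Return the variants that survive both automated stages, ready for manual review."""
    # Stage 1: cheap deterministic checks cut obvious failures first.
    survivors = [
        name for name, text in outputs.items()
        if all(check(text) for check in deterministic_checks)
    ]
    # Stage 2: the semantic check costs tokens, so it only sees stage-1 survivors.
    return [name for name in survivors if semantic_check(outputs[name])]


semantic_calls: list[str] = []


def fake_semantic_check(text: str) -> bool:
    """Stand-in for the second agent; records what it was asked to judge."""
    semantic_calls.append(text)
    return "onboarding" in text


demo_outputs = {
    "variant_a": "short",
    "variant_b": "your onboarding flow loses users at step two",
    "variant_c": "we admire your impressive company and its vision",
}
shortlist = evaluation_funnel(
    demo_outputs,
    deterministic_checks=[lambda text: len(text.split()) >= 5],
    semantic_check=fake_semantic_check,
)
```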

The lesson: any tool can spawn variants. What separates good experimentation from brute-force prompting is having a way to measure 'better'. No eval = no signal.

And here is the part this post only points at: that evaluation IS a loss function, and someone has to define it. The model can't; it falls back to a generic average. You define it up front, and you keep correcting it. The validation step is where you pick the starting point and seed that first definition of 'good'. What comes after, turning that seed into an agent that actually improves, is its own loop. I wrote it up separately: Agent-Building Is Training.

07 Tool options

Pick what feels comfortable

The method works with all of them. A compact comparison:

Tool	Vendor	Best for	Price	Notable
Claude Code	Anthropic	Claude-centric, async workflows	$20–200/mo	Native subagent (Task tool); Skills folder pattern; extensive MCP support
Cursor	Anysphere	Developers in IDE	$20/mo Pro	IDE-integrated; Composer multi-file; Tab autocomplete + chat
OpenCode	Open Source	Privacy / custom needs	Free + your API	Self-hostable; provider flexibility; customizable workflows
Cline	VS Code Ext.	VS Code users	Free + API	Plan / Act modes; bring your own key; free extension
Aider	Open Source	CLI lovers	Free + API	Terminal-based; Git-aware; multi-file edits

08 Code examples

From experimentation to production

Project structure (text)

project/
├── CLAUDE.md          # or .cursorrules, AGENTS.md
├── tools/
│   ├── apollo_search.py
│   ├── email_send.py
│   └── storage.py
├── skills/
│   ├── outbound-research.md
│   └── personalization.md
└── data/
    └── test_prospects.json

tools/apollo_search.py (Python)

def apollo_search(
    industry: str,
    size_range: str,
    location: str,
) -> list[dict]:
    """Search Apollo for prospects matching criteria."""
    prospects: list[dict] = []  # filled in by the elided implementation below
    # ... implementation
    return prospects

A/B test prompt (text)

Spawn 3 subagents in parallel:

Agent A (multi-stage):
  skills/researcher.md → skills/personalizer.md → skills/qa.md
  → outputs/variant_a.json

Agent B (mega-prompt):
  skills/all-in-one.md
  → outputs/variant_b.json

Agent C (critique loop):
  skills/draft-critique-rewrite.md
  → outputs/variant_c.json

09 Scope

When to apply (and when not to)

Apply for

Prompt template changes (any modification)
Agent architecture decisions (single vs multi)
LLM provider switches (Claude vs GPT vs MiniMax)
Tool definition shapes
Output format choices (JSON vs tool call vs file)
Workflow ordering (sequential vs parallel)
Context window strategies

Does not apply for

Performance at scale (latency, cost at high volume)
Multi-user concurrent testing
Real-time integration testing
Production database constraints
Long-running stateful workflows
Infrastructure-specific quirks
Safety-critical systems
Financial correctness guarantees
Concurrency + distributed-systems correctness

For the "does not apply" cases: validate prompt logic in the experimentation runtime first, then test infrastructure in a separate staging environment.

10 Self-critique

Failure modes of this method

Things that have actually gone wrong, or that I watch for:

Benchmark overfitting — tuned for Joshua and Evan, failed on Sandra (prospect 3). Test on more than 2 examples before declaring done.
Weak evaluation — if eval criteria are loose, the "winner" is just whichever variant looks most convincing. Garbage eval → garbage decision.
Local optimum — the best of 7 variants tried ≠ the globally best variant. You're sampling, not exhausting the search space.
Sim/prod mismatch — runs smoothly in coding tool ≠ performs at scale, under concurrency, or with real API rate limits.
Prompt spaghetti — fast iteration breeds prompts nobody can maintain. Document why each rule exists while you still remember.
Architecture cargo-culting — copying 3-agent because someone else uses it, not because your data says so.

11 Migration criteria

When the production build is worth it

Ready (checklist)

Prompt validated across ≥10 real examples
Output quality stable across variations
Edge cases catalogued + handled
Customer is EXTERNAL (not just yourself)
Need 24/7 uptime
Volume ≥50 calls/day
Cost optimization needed

Don't migrate yet

Low volume (Claude Code session is enough)
Internal use (you operate it yourself)
Workflow still evolving
Variants not yet stable
Customer feedback still shaping
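
The two checklists can be turned into a small gate that reports what is still blocking a migration. This sketch makes a strong assumption the post does not state: that every checked item must hold before migrating. The field and function names are hypothetical, and only some checklist items are modelled.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStatus:
    validated_examples: int
    calls_per_day: int
    external_customer: bool
    needs_24_7_uptime: bool
    workflow_still_evolving: bool


def migration_blockers(status: WorkflowStatus) -> list[str]:
    """List the checklist items that are not yet met; an empty list means 'ready'."""
    blockers = []
    if status.validated_examples < 10:
        blockers.append("prompt validated on fewer than 10 real examples")
    if status.calls_per_day < 50:
        blockers.append("volume below 50 calls/day")
    if not status.external_customer:
        blockers.append("internal use only")
    if not status.needs_24_7_uptime:
        blockers.append("no 24/7 uptime requirement")
    if status.workflow_still_evolving:
        blockers.append("workflow still evolving")
    return blockers


early = WorkflowStatus(3, 5, False, False, True)
mature = WorkflowStatus(25, 400, True, True, False)
```

Returning the list of blockers, not a bare yes/no, keeps the reason for waiting visible when the question comes up again.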

12 Closing

Why this method wins

Velocity

weeks → hours

iteration cycle

A/B run: about 30 min to spawn and collect variants (vs 2 weeks); evaluation takes longer
Architecture validation: from weeks of build-and-measure to a few hours of structured comparison

Cost

frontier → cheap

model swap after distillation

Experimentation: ≈ $0 marginal cost (subagents run inside the flat subscription)
Distillation runs inside the same subscription — no separate per-token bill
70–90% cheaper per call (based on published token pricing between frontier and cheap models)

Quality

Day 1

production-ready prompts

Decisions grounded in observed outputs, not assumptions
Better edge-case coverage
Lower regression risk

Risk

sunk cost

Failed experiments cost only the time to run them
Pivot freely
Production = only validated patterns

Honest limit: This method shortens the decision/validation phase — it does NOT replace production testing. Latency, concurrency, cost-at-scale, multi-tenant isolation still need to be tested in real staging. It helps you build the right thing, faster — not skip all of engineering.

Takeaway

The principle is bigger than AI agents — "push uncertainty into the cheapest environment before making an expensive commitment". Like test before deploy, prototype before build. For AI: subagent before production code.

If you have a prompt decision that's been sitting unresolved, this is a low-cost way to get signal. Spawn a few variants, run them on real inputs, and see what the data suggests. I'd be curious to hear what you find.