Agentic Notes
May 28, 2026
Essay · Methodology · 11 min read

Validate AI Workflows Before You Build: No-Build-First Method

A tool-agnostic methodology to separate experimentation runtime from production runtime — saving weeks of engineering before writing a single line of production code.

[Figure: Experimentation runtime (variants A, B, C · 3 variants · 30 min · $0) → distill validated patterns → Production runtime (variant A · 24/7 · cheap model · distilled prompts · scale)]

The decision that used to take six weeks

Last week I needed to decide the architecture for an AI agent: single-agent or multi-agent? The old way would be to build both, deploy an A/B test, measure metrics for 2 weeks, refactor — totalling 6–8 weeks of engineering.

The new way: spawn 3 subagent variants in Claude Code, run the same input, compare outputs side-by-side — the whole thing took an afternoon. The winner went to production.

This isn't a shortcut. It's a methodology that works across AI coding tools — Claude Code, Cursor, OpenCode, Cline. The tool matters less than the principle.

Why the traditional loop wastes weeks

When building an AI agent system, most devs follow this pattern:

  1. Design architecture (1 week)
  2. Build backend infrastructure (2–3 weeks)
  3. Build prompt logic (1 week)
  4. Deploy + test (1 week)
  5. Find issues → refactor (2–4 weeks)
  6. Repeat

Total time: 6–12 weeks per decision
Risk: sunk-cost bias when wrong

The core problem: confusing experimentation with production. Production code is hard to change — schema migrations, test infrastructure, deployment pipelines, rollback complexity. Experimentation needs to be fast, cheap, disposable. These two concerns MUST be separated.

Two runtimes, two jobs

AI coding tools (Claude Code, Cursor, OpenCode) fit the experimentation phase well. The production web service is the scaling runtime, built only once patterns are proven.

AI coding tools

Experimentation runtime

  • Free subagent spawning
  • Tool calling built-in
  • File I/O instant
  • Context management automatic
  • Disposable experiments
  • Real-time iteration

Subagent spawning in subscription-based coding tools isn't billed per token the way a production API is, so the marginal cost of each iteration is ≈ $0. Note: pricing models vary by tool and may change over time.

SDK + webservice

Production runtime

  • Built only when patterns proven
  • Cheap models with distilled prompts
  • 24/7 uptime
  • Multi-tenant
  • Code-level validators
  • Cost-optimized per call

Rule: Validate in experimentation runtime first. Build for production only after patterns are stable.

Four steps from idea to production

01

Validate workflow with an AI coding tool

Open Claude Code / Cursor / OpenCode. Create a CLAUDE.md with context, build 5–10 simple tools (Python functions), spawn a subagent to run the workflow end-to-end, iterate prompts in real-time.

Need
  • Markdown files (skills / SOPs)
  • Python functions (tools)
  • Sample data (5–10 real examples)
Skip
  • Backend infrastructure
  • Frontend UI
  • Database
  • Deployment
02

A/B test with subagents

For architecture decisions: spawn variants in parallel, compare outputs on the same inputs, and pick the pattern that holds up most consistently. No production code is touched during this phase.

  • Variant A: Single mega-agent
  • Variant B: Multi-agent chain
  • Variant C: Self-critique loop
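
One way to tally "holds up most consistently" is a pass rate per variant across the shared inputs. The sketch below assumes each variant's output on each input has already been reduced to pass/fail by some check; the function names and the `results` data are hypothetical placeholders, not measurements from this article.

```python
from typing import Dict


def pass_rate(outcomes: Dict[str, bool]) -> float:
    """Fraction of inputs on which a variant passed."""
    return sum(outcomes.values()) / len(outcomes)


def pick_most_consistent(results: Dict[str, Dict[str, bool]]) -> str:
    """Return the variant with the highest pass rate across all inputs.

    Ties are broken by variant name so the choice is reproducible.
    """
    return min(results, key=lambda v: (-pass_rate(results[v]), v))


# Illustrative placeholder data only (hypothetical, not real results).
results = {
    "A_single_mega_agent": {"input_1": True, "input_2": True, "input_3": False},
    "B_multi_agent_chain": {"input_1": True, "input_2": True, "input_3": True},
    "C_self_critique_loop": {"input_1": True, "input_2": False, "input_3": False},
}

winner = pick_most_consistent(results)
print(winner)
```

A pass rate over three inputs is a coarse signal; it is mainly useful for ruling variants out, not for crowning one.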
03

Distill with the best model

Use Opus / GPT-5 for the LEARNING phase: run 100+ workflow executions, document what works / what doesn't, extract minimum prompts, build eval sets.

Output: Production-grade system prompts
Cost: $1–3K total Opus usage

Why does distillation work? Opus's reasoning gets packaged into a condensed prompt, so the cheaper model inherits it via instruction instead of re-reasoning from scratch. This is distillation at the prompt layer, not the model-weights layer. You pay the reasoning cost once (learning) and then apply the result cheaply on every production call.

Note: I use 'distillation' informally — compressing expensive workflow reasoning into reusable prompts. This is not model-weight distillation in the ML sense.

04

Migrate to production

Once validated: replace AI coding tool → SDK + API, replace expensive model → cheap model with distilled prompts, add code-level validators, deploy infrastructure.

Cost reduction: 70–90% per call (based on published token pricing between frontier and cheap models)
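
The arithmetic behind a per-call saving can be sketched as follows. The token counts and prices are made-up placeholders, not real vendor pricing; substitute the published per-million-token prices of the two models you are comparing.

```python
def per_call_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Cost in dollars of one call, given prices per million tokens."""
    return (
        input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    ) / 1_000_000


def reduction(frontier_cost: float, cheap_cost: float) -> float:
    """Fractional saving from swapping the frontier model for the cheap one."""
    return 1 - cheap_cost / frontier_cost


# Hypothetical prices in USD per million tokens (NOT real vendor pricing).
frontier = per_call_cost(2_000, 500, price_in_per_mtok=10.0, price_out_per_mtok=40.0)
cheap = per_call_cost(2_000, 500, price_in_per_mtok=2.0, price_out_per_mtok=8.0)

print(f"frontier: ${frontier:.4f}  cheap: ${cheap:.4f}  saving: {reduction(frontier, cheap):.0%}")
```

With these placeholder numbers the cheap model costs one fifth as much per call, an 80% saving. A distilled prompt that is also shorter than the exploratory one would lower `input_tokens` and widen the gap further.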

18 spawns in a single session

Question: Cold email agent architecture for project D2 — 3-agent chain vs 1-agent monster vs critique loop?

[Figure: Old way vs. new way. OLD WAY: build all variants (weeks of work) → deploy + A/B test (infra required) → measure (~2 weeks, wait and watch) → refactor if wrong (back to start). Weeks of engineering time · production code touched on every iteration. NEW WAY: experimentation runtime (Var A, Var B, Var C: spawn + eval, compare in hours) → distill (winner → prompt, frontier model, learning phase) → production (build once, cheap model, distilled prompts, validated winner only). Hours · experimentation cost ≈ $0 · production code only for the validated winner.]

How we evaluated each variant

Dataset: 3 real prospects — Joshua (VP Engineering, SaaS), Evan (Founder, logistics), Sandra (Head of Growth, fintech). Each variant ran on all 3. Metrics per output:

Metric | How measured | Type
Fabrication count | Regex scan for unverifiable claims (numbers, titles) | Deterministic
Word-count compliance | len(output.split()) within [180, 260] | Deterministic
Schema validity | JSON keys present + non-empty | Deterministic
Tactical framing | Separate agent: does hook address a real pain point? | Semantic
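
The three deterministic metrics could look roughly like this. This is a sketch under assumptions, not the validators used in the experiment: `REQUIRED_KEYS` is a hypothetical schema, and the regex only flags numeric claims (percentages, dollar amounts, multipliers), not job titles.

```python
import json
import re
from typing import Set

REQUIRED_KEYS = ("subject", "body")  # hypothetical schema
WORD_RANGE = (180, 260)              # range from the metrics table
# Hypothetical pattern: dollar amounts, percentages, multipliers such as "3x".
NUMERIC_CLAIM = re.compile(
    r"\$\d[\d,.]*|\b\d+(?:\.\d+)?\s*%|\b\d+(?:\.\d+)?x\b", re.IGNORECASE
)


def word_count_ok(text: str) -> bool:
    """True if the number of whitespace-separated words is within range."""
    low, high = WORD_RANGE
    return low <= len(text.split()) <= high


def schema_ok(raw_json: str) -> bool:
    """True if the output parses as JSON and every required key is a non-empty string."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(key), str) and bool(data[key].strip())
        for key in REQUIRED_KEYS
    )


def fabrication_count(text: str, allowed_facts: Set[str]) -> int:
    """Count numeric claims that are not in the set of verified facts."""
    return sum(
        1 for claim in NUMERIC_CLAIM.findall(text) if claim.strip() not in allowed_facts
    )
```

Passing the verified facts in as `allowed_facts` keeps the check deterministic: a number is acceptable only if it appears verbatim in the source data for that prospect.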

Sandra (fintech) broke 2 variants that passed Joshua and Evan — which is why you need more than 1 test case before declaring a winner.

Findings from 18 spawns

Self-QA bias is real

The subagent self-scores "pass" while the output objectively fails. Don't trust agents to self-evaluate — you need an external validator or an independent second agent.

Anti-fab ↔ word-count tension

Enforcing anti-fabrication shrinks the email; loosening it causes the agent to invent numbers. This can't be solved by prompting alone: it needs a code validator after each stage.

Context dilution effect

Critical rules buried in a mega-prompt get ignored. Critical rules must live in a dedicated space at the start of the context — not buried mid-instruction.
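
A minimal sketch of giving critical rules their own block at the start of the context. The section labels and the `build_prompt` helper are hypothetical, and the example rules simply restate constraints mentioned in this article.

```python
from typing import Sequence


def build_prompt(critical_rules: Sequence[str], instructions: str, context: str) -> str:
    """Assemble a prompt with critical rules in a dedicated block at the very top."""
    rules = "\n".join(
        f"{i}. {rule}" for i, rule in enumerate(critical_rules, start=1)
    )
    return (
        "CRITICAL RULES (these override everything below):\n"
        f"{rules}\n\n"
        "INSTRUCTIONS:\n"
        f"{instructions.strip()}\n\n"
        "CONTEXT:\n"
        f"{context.strip()}\n"
    )


prompt = build_prompt(
    critical_rules=[
        "Never state a number or job title that is not in CONTEXT.",
        "Keep the email between 180 and 260 words.",
    ],
    instructions="Write a cold email with a hook tied to one pain point.",
    context="Prospect: Head of Growth at a fintech company.",
)
print(prompt)
```

Keeping the rules as a separate argument also makes them easy to reuse across every stage of a chain instead of copy-pasting them into each prompt.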

Prompt > Architecture

Hook quality comes from specific wording in the prompt, not from the number of agents. A single agent with a great prompt beats a 3-agent chain with a lazy one.

Each architecture has DIFFERENT failure modes

Single-agent: context overflow. Multi-agent: coordination overhead. Critique loop: over-polished output loses natural tone. None is universally clean — know the failure modes upfront to design hedges.

Production decision

  • 3-agent chain
  • Anti-fab in BASE prompt
  • Python validators between stages
  • QA semantic only

Evaluation is the bottleneck

Generation is straightforward — any tool can spawn variants. The harder problem is deciding which one is actually better, and on what criteria.

In those 18 spawns, the most time-consuming part wasn't spawning variants — it was evaluating them. Generating 7 email variants took about 20 minutes. Deciding which one was best took the other 6 hours.

The evaluation methods we actually used, in order of effort:

1. Deterministic validator (Python)

Word count, fabrication regex, schema key checks. Fast, cheap, catches obvious failures. Ran first — anything failing here got cut before manual review.

2. Semantic check (separate agent)

A second agent asked: "Does this hook address a real, specific pain point for this prospect?" Slower and costs tokens, but catches subtler failure modes the regex misses.

3. Manual side-by-side

Read the surviving variants out loud. The one that doesn't make you cringe is the winner. Unscalable, but for high-stakes output it's irreplaceable.
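
The ordering of the first two tiers can be sketched as a funnel: cheap deterministic checks run first, and only the survivors reach the slower semantic check. The check functions below are hypothetical placeholders; in particular `fake_semantic_judge` is a keyword test standing in for a call to a second agent.

```python
from typing import Callable, Dict, List


def evaluation_funnel(
    candidates: Dict[str, str],
    deterministic_checks: List[Callable[[str], bool]],
    semantic_check: Callable[[str], bool],
) -> List[str]:
    """Return names of candidates that survive both automated tiers."""
    survivors = [
        name
        for name, text in candidates.items()
        if all(check(text) for check in deterministic_checks)
    ]
    # The semantic check only sees candidates that passed the cheap tier.
    return [name for name in survivors if semantic_check(candidates[name])]


def has_no_percent(text: str) -> bool:
    return "%" not in text


def short_enough(text: str) -> bool:
    return len(text.split()) <= 260


def fake_semantic_judge(text: str) -> bool:
    # Placeholder: a real version would ask a separate agent.
    return "pain point" in text


candidates = {
    "v1": "Generic intro with a 40% claim.",
    "v2": "Opens on a specific pain point for the prospect.",
    "v3": "Friendly note with no clear hook.",
}
shortlist = evaluation_funnel(
    candidates, [has_no_percent, short_enough], fake_semantic_judge
)
print(shortlist)  # the shortlist goes on to manual side-by-side review
```

The function deliberately stops at a shortlist; the third tier stays manual.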

The lesson: any tool can spawn variants. What separates good experimentation from brute-force prompting is having a way to measure 'better'. No eval = no signal.

Pick what feels comfortable

The method works with all of them. A compact comparison:

Tool | Vendor | Best for | Price | Notable
Claude Code | Anthropic | Claude-centric, async workflows | $20–200/mo | Native subagent (Task tool); Skills folder pattern; Extensive MCP support
Cursor | Anysphere | Developers in IDE | $20/mo Pro | IDE-integrated; Composer multi-file; Tab autocomplete + chat
OpenCode | Open Source | Privacy / custom needs | Free + your API | Self-hostable; Provider flexibility; Customizable workflows
Cline | VS Code Ext. | VS Code users | Free + API | Plan / Act modes; Bring your own key; Free extension
Aider | Open Source | CLI lovers | Free + API | Terminal-based; Git-aware; Multi-file edits

From experimentation to production

Project structure (text)
project/
├── CLAUDE.md          # or .cursorrules, AGENTS.md
├── tools/
│   ├── apollo_search.py
│   ├── email_send.py
│   └── storage.py
├── skills/
│   ├── outbound-research.md
│   └── personalization.md
└── data/
    └── test_prospects.json
tools/apollo_search.py (Python)
def apollo_search(
    industry: str,
    size_range: str,
    location: str,
) -> list:
    """Search Apollo for prospects matching criteria."""
    prospects: list = []  # placeholder so the stub runs; the implementation fills it
    # ... implementation
    return prospects
A/B test prompt (text)
Spawn 3 subagents in parallel:

Agent A (multi-stage):
  skills/researcher.md → skills/personalizer.md → skills/qa.md
  → outputs/variant_a.json

Agent B (mega-prompt):
  skills/all-in-one.md
  → outputs/variant_b.json

Agent C (critique loop):
  skills/draft-critique-rewrite.md
  → outputs/variant_c.json

When to apply (and when not to)

Apply for
  • Prompt template changes (any modification)
  • Agent architecture decisions (single vs multi)
  • LLM provider switches (Claude vs GPT vs MiniMax)
  • Tool definition shapes
  • Output format choices (JSON vs tool call vs file)
  • Workflow ordering (sequential vs parallel)
  • Context window strategies
Does not apply for
  • Performance at scale (latency, cost at high volume)
  • Multi-user concurrent testing
  • Real-time integration testing
  • Production database constraints
  • Long-running stateful workflows
  • Infrastructure-specific quirks
  • Safety-critical systems
  • Financial correctness guarantees
  • Concurrency + distributed-systems correctness

For the cases where the method does not apply: validate prompt logic in the experimentation runtime first, then test infrastructure in a separate staging environment.

Failure modes of this method

Things that have actually gone wrong, or that I watch for:

  • Benchmark overfitting — tuned for Joshua and Evan, failed on Sandra (prospect 3). Test on more than 2 examples before declaring done.
  • Weak evaluation — if eval criteria are loose, the "winner" is just whichever variant looks most convincing. Garbage eval → garbage decision.
  • Local optimum — the best of 7 variants tried ≠ the globally best variant. You're sampling, not exhausting the search space.
  • Sim/prod mismatch — runs smoothly in coding tool ≠ performs at scale, under concurrency, or with real API rate limits.
  • Prompt spaghetti — fast iteration breeds prompts nobody can maintain. Document why each rule exists while you still remember.
  • Architecture cargo-culting — copying 3-agent because someone else uses it, not because your data says so.

When the production build is worth it

Ready (checklist)
  • Prompt validated across ≥10 real examples
  • Output quality stable across variations
  • Edge cases catalogued + handled
  • Customer is EXTERNAL (not just yourself)
  • Need 24/7 uptime
  • Volume ≥50 calls/day
  • Cost optimization needed
Don't migrate yet
  • Low volume (Claude Code session is enough)
  • Internal use (you operate it yourself)
  • Workflow still evolving
  • Variants not yet stable
  • Customer feedback still shaping

Why this method wins

Velocity: weeks → hours (iteration cycle)
  • A/B test: 30 min (vs 2 weeks)
  • Architecture validation: from weeks of build-and-measure to a few hours of structured comparison
Cost: frontier → cheap (model swap after distillation)
  • Experimentation: $0 (subagent free)
  • Distillation: $1–3K Opus
  • 70–90% cheaper per call (based on published token pricing between frontier and cheap models)
Quality: Day 1 production-ready prompts
  • Decisions grounded in observed outputs, not assumptions
  • Better edge-case coverage
  • Lower regression risk
Risk: 0 sunk cost
  • Failed experiments cost only the time to run them
  • Pivot freely
  • Production = only validated patterns
Honest limit: This method shortens the decision/validation phase — it does NOT replace production testing. Latency, concurrency, cost-at-scale, multi-tenant isolation still need to be tested in real staging. It helps you build the right thing, faster — not skip all of engineering.
Takeaway

The principle is bigger than AI agents — "push uncertainty into the cheapest environment before making an expensive commitment". Like test before deploy, prototype before build. For AI: subagent before production code.

If you have a prompt decision that's been sitting unresolved, this is a low-cost way to get signal. Spawn a few variants, run them on real inputs, and see what the data suggests. I'd be curious to hear what you find.


◆ ◆ ◆