plan-orchestrate · live test + v2 design · 2026-06-23

AI Usage Funnel shipped — and what it taught us about the orchestrator

A real feature was built, deployed (v088→v089) and QA'd end-to-end as a live test of the plan-orchestrate super-skill; an Opus agent then analysed the run and proposed a single-prompt, self-continuing v2.

🟢 shipped & verified live 🟡 proposal / not yet built 🔵 human gate (stays manual)

1 · What shipped this run · live

The parked AI Usage Funnel plan was executed to completion: per-skill and per-category AI spend with unit economics ($/min of TTS audio), shown on /dash#usage. Branch: plan-orchestrate/ai-usage-funnel.

| Area | Change | Proof |
|---|---|---|
| app/usage.py | Recovered 5 dropped PROMPT_KEYS + unknown; record(*, category, unit, unit_kind) keyword-only & back-compat; read-time CATEGORIES map | AU1, AU2, AU9, AU10 |
| app/config.py | TTS audio-output pricing ($20/1M), ordered before gemini-2.5-pro so prefix-match wins | AU4; ai_cost(…tts,100,1500)=0.0301 |
| app/dash.py | Token-guarded POST /dash/api/usage/ingest; api_usage emits per_category + server-side derived ($/min·$/lead·$/call) | AU3, AU6, AU7 |
| app/dash_js.py | "📊 By category" table + unit column; wrapped overflow-x:auto (the visual-QA fix) | AU8 + visual-QA |
| Laptop TTS skill | synth.py/run.py capture usage_metadata & POST fail-open via new _shared/usage_post.py | AU5 |
| DASHBOARD_GUIDE.md | Monthly manual reconciliation recipe (Chrome DevTools MCP, never Puppeteer) | AU11 |
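
The app/config.py change depends on lookup order: a prefix-keyed price table must list the more specific key first, or the shorter prefix wins. A minimal sketch of that idea follows; it is not the project's actual config. Only the $20/1M TTS output rate comes from the table. The model ids, the table layout, the other rates, and the $1/1M TTS input rate are assumptions, the last one chosen so that the result reproduces the 0.0301 figure quoted as proof.

```python
# Ordered (prefix, input $/1M tokens, output $/1M tokens); most specific first.
# ASSUMPTION: only the 20.00 TTS output rate is from the source; every other
# number and both model ids are illustrative placeholders.
PRICES = [
    ("gemini-2.5-pro-preview-tts", 1.00, 20.00),
    ("gemini-2.5-pro", 1.25, 10.00),
]


def check_prefix_order(prices):
    """Guard assert: no earlier prefix may shadow a later, more specific one."""
    for i, (earlier, _, _) in enumerate(prices):
        for later, _, _ in prices[i + 1:]:
            assert not later.startswith(earlier), f"{earlier!r} shadows {later!r}"


def ai_cost(model, tokens_in, tokens_out, prices=PRICES):
    """Return the dollar cost using the first prefix that matches the model id."""
    for prefix, rate_in, rate_out in prices:
        if model.startswith(prefix):
            return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    raise KeyError(f"no price configured for model {model!r}")


check_prefix_order(PRICES)  # run once at import time so a bad ordering fails fast
print(ai_cost("gemini-2.5-pro-preview-tts", 100, 1500))
```

Running the guard at import time means a reordered table fails immediately instead of silently pricing TTS calls at the generic rate.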

QA: project skill ai-usage-qa — all AU1–AU11 PASS live (v088). Visual-QA at 1280 / 768 / 390 PASS after one fix (a 7-column table overflowed mobile → wrapped each table in an overflow-x:auto container, redeployed v089, re-verified). Live tab shows real spend: 7 days = 1084 calls · $1.85; TTS $/min = $0.0301.

2 · Verdict on the orchestrator · feasible

The single-prompt loop is feasible today for the plan→techspec→QA-scaffold→execute spine — it just ran. The blocker is not the AI; it is the live-environment seam.

The one biggest blocker: snap_deploy.sh is not git-branch-isolated (one live Modal app), and the laptop store (.tmp/deals.json) is not prod (modal.Dict) — so all live QA seeding must go through the deployed ingest endpoint. Every friction this run hit was solved by hand and never written back into the reusable skills, so the next run re-derives them. v2's durable win is removing that re-derivation tax, not rewriting the orchestration.
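
Because prod state is reachable only through the deployed ingest endpoint, laptop skills and QA seeds share one path: an HTTPS POST. The sketch below shows what a fail-open poster with the certifi-then-unverified SSL fallback (guardrail H1) might look like. It is not the contents of _shared/usage_post.py. The bearer-token header, the function names, and the timeout are assumptions; the payload keys follow the shape used in the next-instance prompt.

```python
import json
import ssl
import urllib.request


def build_ssl_context() -> ssl.SSLContext:
    """Prefer certifi's CA bundle; fall back to an unverified context."""
    try:
        import certifi  # optional third-party dependency

        return ssl.create_default_context(cafile=certifi.where())
    except Exception:
        # Last resort for machines without a usable CA bundle.
        # This disables certificate verification, so use it knowingly.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx


def post_usage(url: str, token: str, payload: dict, timeout: float = 5.0) -> bool:
    """POST one usage record. Fail-open: never raise, just report success."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: the real endpoint's token header is not shown in the source.
            "Authorization": f"Bearer {token}",
        },
    )
    try:
        with urllib.request.urlopen(
            request, timeout=timeout, context=build_ssl_context()
        ) as response:
            return 200 <= response.status < 300
    except Exception:
        return False  # usage reporting must never break the calling skill
```

A caller would ignore a False result and carry on, for example `post_usage(url, token, {"source": "graphify", "category": "Graphify", "model": model, "in": n_in, "out": n_out})`.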

3 · The proposed v2 loop 🟡 design

Not a rewrite — the existing orchestrator plus four additions. The human touches the loop exactly twice.

① One prompt → ≤4 questions 🔵
Phase-0 orient (graphify + chat-graph + PROJECT_STATE) then the upfront intent batch → intent.json. Human stop #1. (On this run Instance 1 needed zero rounds — keep verbatim.)
② Plan → techspec → QA scaffold
File-bus ⇄ re-spawn (Opus per instance; orchestrator self-answers from graph + Sonnet research). New: declare target_kind: card|state so the QA scaffold fits API-driven tabs.
③ Execute + visual-qa-ultra 🔵
Edits → snap_deploy → seed-via-ingest → checks green; inherits the pre-flight defaults pack. Audit mode always-auto; journey grade gated on a signed EBO. Human stop #2.
④ Emit handoff → next prompt
Closing stage emits a house-style HANDOFF.html (with honest Limitations) carrying a pastable next-instance prompt → the next instance picks the next-highest-leverage task. The flywheel.
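
Stage ④ could be sketched as a small renderer that writes the review doc and embeds the next prompt verbatim. The proposal names emit_handoff.py but does not show it, so the function names, the HTML skeleton, and the section titles below are assumptions. The one point the sketch is meant to illustrate is that the prompt is HTML-escaped, so it survives copy-paste unchanged.

```python
import html
from pathlib import Path


def render_handoff(title: str, limitations: list[str], next_prompt: str) -> str:
    """Build a minimal handoff page: title, honest limitations, pastable prompt."""
    items = "\n".join(f"    <li>{html.escape(item)}</li>" for item in limitations)
    return (
        "<!doctype html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        "<body>\n"
        f"  <h1>{html.escape(title)}</h1>\n"
        "  <h2>Limitations</h2>\n"
        f"  <ul>\n{items}\n  </ul>\n"
        "  <h2>Next-instance prompt</h2>\n"
        # Escaping keeps characters such as < and & intact when pasted back.
        f"  <pre>{html.escape(next_prompt)}</pre>\n"
        "</body></html>\n"
    )


def emit_handoff(path: str, title: str, limitations: list[str], next_prompt: str) -> Path:
    """Write the rendered page to disk and return its path."""
    target = Path(path)
    target.write_text(render_handoff(title, limitations, next_prompt), encoding="utf-8")
    return target
```

Keeping rendering separate from the file write makes the output easy to check in a unit test before any deploy.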

4 · Pre-flight hardening: bake the frictions in · 10 guardrails

The guardrails distilled from this run so they are never re-derived. Highest-leverage first.

| # | Guardrail | Why (this run) |
|---|---|---|
| H2 ⭐ | Never networkidle → use domcontentloaded + ready-selector + retry; keep nav inside the per-check try | It aborted the whole --all survey, and it is STILL in the global qa-harness (assert.py:207, capture.py:230), so fixing it once helps every future project |
| H1 | SSL context for laptop POSTs (certifi → unverified fallback) | macOS Python has no CA bundle → CERTIFICATE_VERIFY_FAILED |
| H3 | Accumulation-robust live asserts; exact values only in local unit tests | record() accumulates → "unit==42" held only on a single seed |
| H4 | Prefix-match ordering for prefix-keyed lookups + a guard assert | '…pro-preview-tts'.startswith('gemini-2.5-pro') |
| H7 | Live seeds via the deployed ingest endpoint, never flows.store() | Local store ≠ prod store |
| H8 | Never two deploys / run_matrix against the one Modal app concurrently | Shared live app; same-tree edits merge-conflict |
| H5/H6/H9/H10 | $status is reserved in zsh · inspect disk on a 529 (not failure) · read live source for prices/SDK shapes · cap rounds 5 + research fan-out 3 | Each a real event of this + the planning run |
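
Guardrail H2 can be sketched as follows, assuming the harness drives a Playwright page through the sync Python API (`page.goto` and `page.wait_for_selector` are real Playwright calls). The function names, retry counts, and result dictionary are illustrative assumptions and not the code at assert.py:207 or capture.py:230.

```python
import time


def goto_ready(page, url, ready_selector, attempts=3, timeout_ms=15_000, backoff_s=1.0):
    """Navigate without networkidle: wait for DOM content, then a ready selector."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            page.wait_for_selector(ready_selector, timeout=timeout_ms)
            return None  # success
        except Exception as exc:  # Playwright raises on timeouts and nav errors
            last_error = exc
            if attempt < attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    return last_error


def run_check(page, name, url, ready_selector, check, **nav_kwargs):
    """Keep navigation inside the per-check try so one bad page cannot abort a survey."""
    try:
        error = goto_ready(page, url, ready_selector, **nav_kwargs)
        if error is not None:
            return {"name": name, "status": "FAIL", "detail": f"navigation: {error}"}
        check(page)  # the check signals failure by raising, e.g. AssertionError
        return {"name": name, "status": "PASS", "detail": ""}
    except Exception as exc:
        return {"name": name, "status": "FAIL", "detail": str(exc)}
```

A survey loop can then call `run_check` for every tab and collect the results. A page that never goes network-idle costs one FAIL row at most, not the whole run.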

5 · The self-continuing handoff 🟡

Every run ends by emitting a human-review doc that contains the prompt for the next run. Here is the filled-in next-instance prompt the proposal generated — the next-highest-leverage task (instrument graphify spend, reusing the rails this run built):

/plan-orchestrate

FEATURE: Instrument the `graphify` laptop skill's Gemini spend into the AI Usage tab — the
last uninstrumented high-volume AI spender. The AI Usage Funnel (branch
plan-orchestrate/ai-usage-funnel, AU1–AU11 green @ v088) already built the rails:
- token-guarded POST /dash/api/usage/ingest (fail-open),
- the shared poster ~/.claude/skills/_shared/usage_post.py,
- read-time category map app/usage.py:CATEGORIES + the "By category" tab block.
graphify was DEFERRED (ADR-0005 / HANDOFF §9.4) because whether its CLI exposes Gemini token
counts cheaply is UNCONFIRMED — RESOLVE THAT FIRST. If counts are exposed, POST
{source:'graphify', category:'Graphify', model, in, out} fail-open; else log a per-run estimate.

CONSTRAINTS (hard): ONE Gemini key; laptop POST fail-open; ZZ-gated + send-blocklisted QA via a
new ai-graphify-qa skill (state target → seed --via-ingest, no exact-count live asserts); light
theme; deploy via ./scripts/snap_deploy.sh to the ONE live Modal app (sequence, no concurrent
run_matrix); human gate intact.

Run the chain hands-off on branch plan-orchestrate/graphify-usage, /visual-qa-ultra it (audit
always; journey only if I sign the EBO), then emit the house-style HANDOFF.html with the next prompt.
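
The "no exact-count live asserts" constraint in the prompt above (guardrail H3) could be expressed as a delta check: read the live counter, seed through the ingest endpoint, read it again, and require growth of at least the seeded amount. The helper below is a minimal sketch with a hypothetical name. How the before and after values are read from the live API is left to the caller.

```python
def assert_seed_landed(before: float, after: float, seeded: float, tol: float = 1e-9) -> float:
    """Accumulation-robust live assert.

    record() accumulates and the live app may receive other traffic, so the
    counter is required to grow by AT LEAST the seeded amount, not to equal it.
    Exact-value assertions belong in local unit tests with a fresh store.
    """
    delta = after - before
    assert delta >= seeded - tol, (
        f"expected growth of at least {seeded}, saw {delta} "
        f"(before={before}, after={after})"
    )
    return delta
```

A check that seeds 42 units passes whether the live total moves from 0 to 42 or from 310 to 357, and it still fails if the seed never arrived.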

6 · Build-first & honest risks

Smallest change, most autonomy gained — then where a single-prompt run still needs you.

| Build first | Effort | Honest risk that remains |
|---|---|---|
| H2: patch networkidle in the global qa-harness | ½ day | EBO signature: a green journey grade requires a human sign-off by design; autonomy caps at PASSED_WITH_GAPS until then 🔵 |
| Closing handoff-emit stage (emit_handoff.py → HANDOFF.html + next prompt) | 1 day | A thin/wrong brief ships a wrong feature confidently; the upfront questionnaire is the only real defence |
| target_kind card/state branch in build-qa-skill | 1 day | Non-reversible deploy: autonomy rests entirely on ZZ-gate + send-blocklist + AUTOSEND-off |
| Pre-flight defaults pack (promote seeder/check patterns) | ½ day | Price / external-API drift silently corrupts unit economics with green checks (H9 mitigates) |