plan-orchestrate · live test + v2 design · 2026-06-23
A real feature was built, deployed (v088→v089) and QA'd end-to-end as a live test of the plan-orchestrate super-skill; an Opus agent then analysed the run and proposed a single-prompt, self-continuing v2.
The parked AI Usage Funnel plan was executed to done: per-skill / per-category AI spend with unit economics ($/min of TTS audio), on /dash#usage. Branch plan-orchestrate/ai-usage-funnel.
| Area | Change | Proof |
|---|---|---|
| app/usage.py | Recovered 5 dropped PROMPT_KEYS + unknown; record(*, category, unit, unit_kind) keyword-only & back-compat; read-time CATEGORIES map | AU1, AU2, AU9, AU10 |
| app/config.py | TTS audio-output pricing ($20/1M tokens), ordered before gemini-2.5-pro so prefix-match wins | AU4: ai_cost(…tts,100,1500)=0.0301 |
| app/dash.py | Token-guarded POST /dash/api/usage/ingest; api_usage emits per_category + server-side derived ($/min·$/lead·$/call) | AU3, AU6, AU7 |
| app/dash_js.py | "📊 By category" table + unit column; tables wrapped in overflow-x:auto containers (the visual-QA fix) | AU8 + visual-QA |
| Laptop TTS skill | synth.py/run.py capture usage_metadata & POST fail-open via new _shared/usage_post.py | AU5 |
| DASHBOARD_GUIDE.md | Monthly manual reconciliation recipe (Chrome DevTools MCP, never Puppeteer) | AU11 |
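The AU4 row depends on lookup order: a TTS model name that begins with gemini-2.5-pro must hit its own price row first. The sketch below is a minimal illustration of that idea, not the contents of app/config.py. Only the $20/1M output price and the 0.0301 result come from the run. The $1/1M input price is inferred from that result, the full TTS model string stands in for the elided "…tts" name, and the gemini-2.5-pro prices are placeholders.

```python
# Ordered (prefix, input $/1M tokens, output $/1M tokens): most specific prefix first.
# ASSUMPTIONS: the model strings and every price except the $20/1M TTS output
# price are illustrative; the $1/1M input price is inferred from the 0.0301 proof.
PRICES = [
    ("gemini-2.5-pro-preview-tts", 1.00, 20.00),  # stand-in for the elided "…tts" name
    ("gemini-2.5-pro", 1.25, 10.00),              # placeholder prices
]


def price_for(model, prices=PRICES):
    """Return (input, output) prices of the first row whose prefix matches."""
    for prefix, price_in, price_out in prices:
        if model.startswith(prefix):
            return price_in, price_out
    raise KeyError(f"no price row for {model!r}")


def ai_cost(model, tokens_in, tokens_out, prices=PRICES):
    """Cost in dollars for one call, given per-1M-token prices."""
    price_in, price_out = price_for(model, prices)
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def check_prefix_order(prices=PRICES):
    """Guard assert: an earlier prefix must never shadow a later, longer one."""
    prefixes = [row[0] for row in prices]
    for i, early in enumerate(prefixes):
        for late in prefixes[i + 1:]:
            assert not late.startswith(early), f"{early!r} shadows {late!r}"


check_prefix_order()
print(ai_cost("gemini-2.5-pro-preview-tts", 100, 1500))
```

With these rows, 100 input tokens cost $0.0001 and 1500 output tokens cost $0.03, which reproduces the 0.0301 quoted in the table. Reversing the two rows makes check_prefix_order fail, which is the point of the guard.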
QA: project skill ai-usage-qa — all AU1–AU11 PASS live (v088). Visual-QA at 1280 / 768 / 390 PASS after one fix (a 7-column table overflowed mobile → wrapped each table in an overflow-x:auto container, redeployed v089, re-verified). Live tab shows real spend: 7 days = 1084 calls · $1.85; TTS $/min = $0.0301.
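The derived figures on the tab are ratios of summed cost to a count or unit total. A minimal sketch of that read-time aggregation follows; the record field names (category, cost, unit) and the function names are hypothetical and are not taken from app/dash.py.

```python
from collections import defaultdict


def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def per_category(records):
    """Sum calls, cost and units per category, then derive $/call and $/unit."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0, "unit": 0.0})
    for rec in records:
        bucket = totals[rec["category"]]
        bucket["calls"] += 1
        bucket["cost"] += rec["cost"]
        bucket["unit"] += rec.get("unit", 0.0)  # e.g. minutes of TTS audio
    return {
        cat: {
            **t,
            "per_call": safe_div(t["cost"], t["calls"]),
            "per_unit": safe_div(t["cost"], t["unit"]),
        }
        for cat, t in totals.items()
    }


# The 7-day totals quoted above: $1.85 over 1084 calls.
print(round(safe_div(1.85, 1084), 4))
```

Applied to the 7-day totals, the ratio comes to roughly $0.0017 per call. Returning None for an empty denominator keeps a category with no recorded units from showing a misleading zero.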
The single-prompt loop is feasible today for the plan→techspec→QA-scaffold→execute spine — it just ran. The blocker is not the AI; it is the live-environment seam.
snap_deploy.sh is not git-branch-isolated (there is one live Modal app), and the laptop store (.tmp/deals.json) is not the prod store (modal.Dict), so all live QA seeding must go through the deployed ingest endpoint. Every point of friction this run hit was solved by hand and never written back into the reusable skills, so the next run would have to re-derive each fix. v2's durable win is removing that re-derivation tax, not rewriting the orchestration.
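Seeding through the deployed endpoint comes down to a small fail-open POST. The sketch below is one possible shape for it, not the contents of _shared/usage_post.py. The Bearer header, the function names and the injectable opener are assumptions; the payload keys follow the {source, category, model, in, out} shape used in the next-instance prompt.

```python
import json
import ssl
import urllib.request


def build_ssl_context():
    """Prefer certifi's CA bundle; fall back to an unverified context."""
    try:
        import certifi  # optional third-party package
        return ssl.create_default_context(cafile=certifi.where())
    except Exception:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False       # must be disabled before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
        return ctx


def post_usage(url, token, payload, timeout=5.0, opener=urllib.request.urlopen):
    """POST one usage record. Returns True on 2xx, False on any failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # ASSUMPTION: token header scheme
        },
    )
    try:
        with opener(request, timeout=timeout, context=build_ssl_context()) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Fail-open: a usage-logging failure must never break the calling skill.
        return False
```

A caller would pass something like {"source": "graphify", "category": "Graphify", "model": model_name, "in": 100, "out": 1500} and ignore a False return. The opener parameter exists only so the function can be exercised without a network.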
v2 is not a rewrite: it is the existing orchestrator plus four additions. The human touches the loop exactly twice.
- intent.json. Human stop #1. (On this run Instance 1 needed zero rounds; keep verbatim.)
- target_kind: card|state so the QA scaffold fits API-driven tabs.
- snap_deploy → seed-via-ingest → checks green; inherits the pre-flight defaults pack. Audit mode always-auto; journey grade gated on a signed EBO. Human stop #2.
- HANDOFF.html (with honest Limitations) carrying a pastable next-instance prompt → the next instance picks the next-highest-leverage task. The flywheel.

The guardrails distilled from this run so they are never re-derived. Highest-leverage first.
| # | Guardrail | Why (this run) |
|---|---|---|
| H2 ⭐ | Never networkidle → use domcontentloaded + ready-selector + retry; keep nav inside the per-check try | It aborted the whole --all survey — and it is STILL in the global qa-harness (assert.py:207, capture.py:230), so fixing it once helps every future project |
| H1 | SSL context for laptop POSTs (certifi → unverified fallback) | macOS Python has no CA bundle → CERTIFICATE_VERIFY_FAILED |
| H3 | Accumulation-robust live asserts; exact values only in local unit tests | record() accumulates → "unit==42" held only on a single seed |
| H4 | Prefix-match ordering for prefix-keyed lookups + a guard assert | '…pro-preview-tts'.startswith('gemini-2.5-pro') |
| H7 | Live seeds via the deployed ingest endpoint, never flows.store() | Local store ≠ prod store |
| H8 | Never two deploys / run_matrix against the one Modal app concurrently | Shared live app; same-tree edits merge-conflict |
| H5/H6/H9/H10 | $status is reserved in zsh · inspect disk on a 529 (not failure) · read live source for prices/SDK shapes · cap rounds 5 + research fan-out 3 | Each a real event of this + the planning run |
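Guardrails H2 and H3 can be sketched as follows. This is an illustration under stated assumptions, not the qa-harness code. The page object is assumed to expose Playwright's sync `goto(url, wait_until=..., timeout=...)` and `wait_for_selector(selector, timeout=...)` methods, and all function names here are hypothetical.

```python
import time


def goto_ready(page, url, ready_selector, attempts=3, timeout_ms=15_000, pause_s=1.0):
    """H2: wait for DOM load plus a ready selector, with retries; never networkidle."""
    last_error = None
    for attempt in range(max(1, attempts)):
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            page.wait_for_selector(ready_selector, timeout=timeout_ms)
            return
        except Exception as exc:
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(pause_s)
    raise last_error


def run_checks(page, checks):
    """H2: each check, navigation included, runs in its own try block."""
    results = {}
    for name, check in checks.items():
        try:
            check(page)
            results[name] = "PASS"
        except Exception as exc:
            results[name] = f"FAIL: {exc}"  # one failure does not abort the survey
    return results


def assert_grew_by_at_least(before, after, seeded):
    """H3: live counters accumulate, so assert on the delta, not an exact total."""
    assert after - before >= seeded, f"expected +{seeded}, got +{after - before}"
```

A live check would read the counter, seed through the ingest endpoint, read it again and call assert_grew_by_at_least(before, after, 42). An exact assertion such as unit == 42 belongs only in local unit tests, where the store starts empty.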
Every run ends by emitting a human-review doc that contains the prompt for the next run. Here is the filled-in next-instance prompt the proposal generated — the next-highest-leverage task (instrument graphify spend, reusing the rails this run built):
/plan-orchestrate
FEATURE: Instrument the `graphify` laptop skill's Gemini spend into the AI Usage tab — the
last uninstrumented high-volume AI spender. The AI Usage Funnel (branch
plan-orchestrate/ai-usage-funnel, AU1–AU11 green @ v088) already built the rails:
- token-guarded POST /dash/api/usage/ingest (fail-open),
- the shared poster ~/.claude/skills/_shared/usage_post.py,
- read-time category map app/usage.py:CATEGORIES + the "By category" tab block.
graphify was DEFERRED (ADR-0005 / HANDOFF §9.4) because whether its CLI exposes Gemini token
counts cheaply is UNCONFIRMED — RESOLVE THAT FIRST. If counts are exposed, POST
{source:'graphify', category:'Graphify', model, in, out} fail-open; else log a per-run estimate.
CONSTRAINTS (hard): ONE Gemini key; laptop POST fail-open; ZZ-gated + send-blocklisted QA via a
new ai-graphify-qa skill (state target → seed --via-ingest, no exact-count live asserts); light
theme; deploy via ./scripts/snap_deploy.sh to the ONE live Modal app (sequence, no concurrent
run_matrix); human gate intact.
Run the chain hands-off on branch plan-orchestrate/graphify-usage, /visual-qa-ultra it (audit
always; journey only if I sign the EBO), then emit the house-style HANDOFF.html with the next prompt.
Smallest change, most autonomy gained — then where a single-prompt run still needs you.
| Build first | Effort | Honest risk that remains |
|---|---|---|
| H2: patch networkidle in the global qa-harness | ½ day | EBO signature: a green journey grade requires a human sign-off by design; autonomy caps at PASSED_WITH_GAPS until then 🔵 |
| Closing handoff-emit stage (emit_handoff.py → HANDOFF.html + next prompt) | 1 day | A thin/wrong brief ships a wrong feature confidently; the upfront questionnaire is the only real defence |
| target_kind card/state branch in build-qa-skill | 1 day | Non-reversible deploy: autonomy rests entirely on ZZ-gate + send-blocklist + AUTOSEND-off |
| Pre-flight defaults pack (promote seeder/check patterns) | ½ day | Price / external-API drift silently corrupts unit economics with green checks (H9 mitigates) |