Build Plan

As of: 2026-05-31 · For: Sebastian Friedrich / plan.ai · Status: pre-code, foundation phase

This is the executable build plan for the system specified in the Master Playbook. The playbook says what to build and why; this plan says in what order, by whom (human vs. coding agent), and with what exit criteria. It was produced by a four-voice LLM council (Architect / Skeptic / Pragmatist / Critic) with web research via codex.

One-line strategy (council-unanimous): Build a thin vertical slice through the whole pipeline on a single hardcoded ticker, with every deterministic safety gate real and adversarially tested before the first paper order, and all LLM “brains” stubbed. Add intelligence only after the safety spine is proven.

v1 — Goal & Scope (set 2026-05-31, council-vetted)

v1 goal (one line): Prove the machine, not the edge — a trustworthy, observable, leak-free paper-trading loop that survives two weeks unattended (long-only baseline; multipliers added incrementally) on real point-in-time market data.

Full statement. By the end of a two-week unattended run on real market data, the Trading-Agency is a believable, safe, fully-auditable long-only paper trader: it ingests real point-in-time data, lets the real LLM brains propose, routes every candidate through the complete deterministic safety stack + mandatory HITL, places paper orders, monitors/exits them, and records every decision and every BLOCK as a trace-linked, replayable audit record — with zero ungated orders, zero safety violations, and a verified absence of look-ahead / leakage. Alpha / profitability is explicitly NOT a v1 goal.

v1 success criteria (all must hold):

1. Runs unattended for the two-week window (~10 trading days): data -> signal -> gate -> HITL -> paper order -> monitor/exit -> audit, with clean crash/outage recovery and no manual babysitting.
2. 100% of orders pass the deterministic gates (long-only, position caps, kill-switch, paper double-gate, ethics, stale-data) and HITL; zero ungated orders, zero bypasses, zero double-orders.
3. Every action, trade and BLOCK alike, is trace-linked in SQLite + append-only JSONL and replayable from the exact point-in-time input snapshot.
4. Entries are human-gated (signed token, timeout = HOLD). This is the v1 baseline; autonomy is earned as the eval proves the agent.
5. Clean recovery from at least one induced failure (data-feed/API outage, process restart); daily position reconciliation vs. broker.
6. Verified no look-ahead / leakage / survivorship bias (point-in-time enforced; running bar excluded). This is the Architect’s top risk and a classic silent killer: a v1 that looks perfect but leaks is a failed v1.
7. End-of-run report: proposals, fills, slippage vs. expected, gate/audit coverage = 100%.

Explicitly OUT of v1: profitability / alpha claims · live money · shorts / leverage / options / crypto · multi-broker · portfolio optimization · strategy auto-tuning or self-modification · broad universe expansion · backtest-based go/no-go · any relaxation of HITL or audit to force a trade.

Biggest v1 risk: mistaking “realistic paper trading” for “alpha validation,” and silent look-ahead / leakage / broker-state drift producing a loop that looks perfect on paper but is broken (non-reproducible) for real capital. Mitigation: label v1 “operational — alpha unproven”; make leakage-absence a hard success criterion (#6); v2 is the edge-proving milestone.

Council provenance (2026-05-31): Claude Opus 4.8 (synthesizer), OpenAI codex, Cursor Gemini 3.1 Pro, Cursor GPT-5.5.

Consensus (codex, GPT-5.5, Architect): “prove the machine, not the edge”, meaning a two-week paper loop on the safety + audit spine, with alpha and scope creep excluded.

Dissent (Gemini, Critic): scoped v1 smaller, as a single stubbed-brain trade that proves the deterministic spine, and named silent alpaca-mcp-server failure (e.g. deprecated PDT fields) as the top risk. The session’s MCP decision (direct broker adapter + watchdog->HOLD, Decision 2 below) directly mitigates that.

Reconciliation: the two views reconcile by sequencing. Gemini’s stubbed-spine proof is Phase 1; the two-week real-brain loop is the v1 destination.

Top risk named by each voice: realism-mistaken-for-alpha + premature scope (codex); scope creep (GPT-5.5); MCP silent failure (Gemini); look-ahead / leakage (Architect).

Session decisions (2026-05-31)

1. Eval window: two weeks unattended paper run (a lab eval, NOT the playbook’s >=3-month live-go bar, which applies only if/when real money is ever considered).
2. alpaca-mcp-server SPOF -> best practice: take the MCP off the capital-critical path. Orders/exits go through a direct, deterministic Alpaca SDK/REST broker adapter (right of place_order); the MCP server serves LLM-facing/read tools only, is supervised (auto-restart), and a watchdog defaults the system to HOLD + Telegram-escalates on MCP failure. (Refines Role 7 / playbook F.1 #2; matches the council’s “direct broker interface” verdict.)
3. replace_order wash-trade edge case: verify in paper with real 403-monitoring before it can matter. (Confirmed; proceed.)
4. Scope = paper-only, real data, as realistic as possible. No live execution -> the FMA “gewerblicher Handel” (commercial trading) question is moot for v1; realism effort goes into the Role 2 point-in-time data layer.
5. Framing: plan.ai labs experiment. universe_policy.yaml PLTR/Planet-Labs stays at conservative exclude by default, treated as an experiment knob.

Direction extension — Leverage & Cockpit (2026-05-31, council-vetted)

A deliberate plan.ai-labs mandate expansion (supersedes playbook E.1 for the lab):

All multipliers — margin + leveraged ETFs (2x/3x) + long options (calls/debit spreads), up to broker capability (bounded by deterministic caps below it) — paper-first, architected for an easy live switch. Implemented as a deterministic Risk/Exposure Engine (extends Role 6), not a loosened gate: caps are deterministic, portfolio-level, drawdown-halting, and below the broker max. Enabled one instrument at a time, only after the engine + accounting rework are green (Phase 2). Full spec: Leverage & Multipliers.
Critical amendment: under leverage, HITL gates ENTRIES only; exits / de-leverage / drawdown-halt fire automatically without a human tap (approval latency = risk). Resolves F.2 #10.
Fail-safe, not just fail-closed: every leveraged position carries broker-side resting protective orders (GTC stop/bracket) placed atomically at entry, so it survives a total stack outage — fail-closed stops new risk, fail-safe protects existing leveraged risk. (Closes the council’s top leverage gap; see Leverage & Multipliers.)
Frontend: public Landing (static) + private Cockpit (SSR/islands, authenticated) that observes + approves only, outside place_order; reads state Hermes writes; a second HITL channel. Build Landing now, Cockpit after RiskState exists. Full spec: Cockpit & Frontend.
Unanimous council risk: max leverage + options + unproven edge = fast, correlated ruin; eval must gate on drawdown with leverage ON, start tiny.

0. How to use this document

If you are a human (Sebastian): Work top-to-bottom. Each phase has a Human decisions box — those are yours and block the phase. Do not let an agent guess them. The 🔴 items in §1.4 must be answered before live trading is ever discussed.

If you are a coding agent: Read §1 (Ground rules) in full before writing any code. Then execute the current phase’s tasks. Obey the Definition of Done literally — a phase is not complete until every checkbox and every test gate passes. When uncertain, stop and ask; never relax a non-negotiable to make progress. Your default action under any ambiguity, error, or missing input is HOLD / no-trade, exactly as the playbook mandates (Teil E.8).

Source of truth: The playbook is canonical. Where this plan and the playbook disagree on a number or schema, the playbook wins and you flag the conflict. Where they disagree on build order, this plan wins.

1. Ground rules (read before any code)

1.1 The one rule that overrides all others: the `place_order` boundary

   LEFT of place_order              │            RIGHT of place_order
   (LLMs may THINK / PROPOSE)       │      (deterministic, unit-tested Python ONLY)
   ───────────────────────────────  │  ───────────────────────────────────────────
   R3 screening, R4 debate,         │   R6 sizing, R7 execution gate, R8 exit logic,
   R5 optional judge, R2 sentiment, │   R9 compliance gate, R11 orchestration gates,
   finance-skills (Teil D)          │   R10 computation, the audit writer

No LLM output ever reaches the broker without passing through deterministic gates. “Safety is code, not prompt.” (Playbook Teil C.4 / E.5.) Raw broker order tools are internal_only and invisible to LLMs — only the gated place_order is exposed (use ALPACA_TOOLSETS to enforce this).

1.2 Non-negotiables — NEVER stub, weaken, or defer these (Playbook Teil E)

| # | Principle | Concrete enforcement |
|---|---|---|
| 1 | Directional policy | `place_order` blocks `side != "buy"` as its first check. v1 baseline is long-only; shorting stays out, but multipliers (margin, leveraged ETFs, defined-risk long options) are added incrementally via the deterministic Risk/Exposure Engine; see Leverage & Multipliers. |
| 2 | Paper-first double gate | `ALPACA_PAPER_TRADE=True` AND `LIVE_TRADING_ENABLED=false` must both hold. R7 re-verifies `account_type` from a fresh live API response, never the cached `.env`. |
| 3 | Mandatory HITL | Fresh signed Telegram approval token (≤15 min) before every order. *Timeout = HOLD; auto-approval is earned, not default* (graduated autonomy via the eval). |
| 4 | Point-in-time | Features use only `available_at <= decision_as_of`; exclude the running bar; `.shift(1)` indicators. |
| 5 | Ethics exclusion | Zero-tolerance (weapons/fossil/tobacco/gambling) enforced at R1 + R6 + R7 (+ R10 assert). Single source: `universe_policy.yaml`. |
| 6 | Audit trail | SQLite + append-only JSONL, fail-closed: a decision is invalid if its audit write fails. Every BLOCK is logged too. |
| 7 | Graceful failure → HOLD | Any exception/timeout/stale data/budget-stop/kill-switch → `HOLD`/`no_trade`. No “best-effort buy”, no retry-to-order loops. |
| 8 | Pinned models | Dated model versions, temperature 0 where determinism is required. Log `model_pin`/`prompt_hash`/`code_git_sha` per decision. |
| 9 | Global kill-switch | File `/opt/trading/KILL_SWITCH_ACTIVE`, checked as the first step of every trading job (works even if DB/network is down). |

Agent rule: if a task seems to require violating any row above, it is the task that is wrong. Stop and surface it.
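
Rows 2 and 9 are small enough to sketch directly. The block below is a minimal illustration under stated assumptions, not the project's implementation: the `Hold` exception, the function names, and the `fresh_account_type` argument (assumed to be fetched from the broker by the caller immediately before the check, with the literal value `"paper"`) are hypothetical. Only the kill-switch path and the two environment flag names come from the table.

```python
from pathlib import Path

# Path taken from row 9 of the table; everything else is illustrative.
KILL_SWITCH_PATH = Path("/opt/trading/KILL_SWITCH_ACTIVE")


class Hold(Exception):
    """Raised to abort a job; callers treat it as HOLD / no_trade."""


def check_kill_switch(path: Path = KILL_SWITCH_PATH) -> None:
    # A plain file check needs neither the database nor the network.
    if path.exists():
        raise Hold("kill_switch_active")


def assert_paper(env: dict, fresh_account_type: str) -> None:
    """Both env flags must hold AND the freshly fetched account must be paper.

    ASSUMPTION: the caller obtained fresh_account_type from a live broker
    call just before this check; the value "paper" is a placeholder.
    """
    paper_flag = env.get("ALPACA_PAPER_TRADE", "").strip().lower() == "true"
    live_flag = env.get("LIVE_TRADING_ENABLED", "").strip().lower()
    # An unset flag is NOT treated as false: it must be explicitly "false".
    if not paper_flag or live_flag != "false":
        raise Hold("double_gate_env")
    if fresh_account_type != "paper":
        raise Hold("double_gate_account_type")
```

Treating an unset `LIVE_TRADING_ENABLED` as a block rather than as an implicit `false` keeps the gate fail-closed: a missing configuration value can never open the path to the broker.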

1.3 Foundational stack — pinned dependencies (verify before locking)

Web research (codex, ~2026-05) located the real upstream projects. Pin these versions in a lockfile; re-verify the exact version/commit on first install — treat the numbers below as “last reported”, not gospel.

| Component | Repo / source | License | Reported version | Role | Notes |
|---|---|---|---|---|---|
| Hermes Agent (orchestrator) | `github.com/NousResearch/hermes-agent` | MIT | v0.15.2 (2026-05-29) | R11 + all crons/subagents | The agent framework (not the Hermes LLM models). Primitives: tools, `delegate_task` (subagent), `cronjob`, MCP. Fast-moving → pin hard. |
| alpaca-mcp-server (execution + data) | `github.com/alpacahq/alpaca-mcp-server` (PyPI `alpaca-mcp-server`) | MIT | 2.0.2 (2026-05-20) | R2 data, R7 execution | V2 ≠ V1 (breaking). `ALPACA_PAPER_TRADE` defaults `true`. Use `ALPACA_TOOLSETS` to expose only what each role needs. |
| ai-hedge-fund (persona screening) | `github.com/virattt/ai-hedge-fund` | MIT | 2026.5.14 | R3 | Research-only, does not trade. CLI/subprocess-invokable. Needs LLM key + `FINANCIAL_DATASETS_API_KEY`. |
| TradingAgents (Bull/Bear/Risk debate) | `github.com/TauricResearch/TradingAgents` | Apache-2.0 | v0.2.5 (2026-05-11) | R4 | LangGraph multi-agent. Python: `TradingAgentsGraph(...).propagate(ticker, date)`. Nondeterministic → treat as proposal only. |
| financialdatasets.ai (fundamentals) | `docs.financialdatasets.ai` | Commercial | - | R1, R2 | Segmented financials: `GET /financials/income-statements/segments`, auth `X-API-KEY`. Low tiers ≈ 3yr history → matters for backtests. |
| Alpaca Market/Trading API | `docs.alpaca.markets` | - | - | R2, R7 | ⚠️ PDT fields (`pattern_day_trader`, `daytrade_count`, `daytrading_buying_power`) removed 2026-07-06 (deprecated 06-04). Never gate on them; use `buying_power`. The spec’s “IMD” = FINRA Intraday Margin Rule, effective 2026-06-04. |

1.4 🔴 Human-only decisions that block live trading (from Playbook Teil F, P0)

These are not for agents. Sebastian must answer them, in writing, before any live-go conversation:

1. Accept a statistically under-powered first live-go with reduced initial capital? (MinTRL ~750 days is unreachable.)
2. alpaca-mcp-server is a single point of failure on the whole order/exit path: define the healthcheck/restart SLA.
3. Confirm a replace_order on a pending entry leg is never flagged as a MAR wash-trade while the stop leg is live (verify with real 403-monitoring in paper).
4. Legal: at what frequency/volume does this cross into “gewerblicher Handel” (commercial trading; FMA)? Proprietary-only is non-negotiable.
5. PLTR / Planet-Labs conditional-include line: set explicitly in universe_policy.yaml (default: conservative EXCLUDE).

2. Build phases

Each phase: goal → tasks (owner) → Definition of Done (DoD) → test gate. Do not start phase N+1 until phase N’s DoD is green.

Phase 0 — Foundation & safety primitives (no trading yet)

Goal: A repo, a reproducible environment, and the deterministic safety primitives that every later phase depends on — plus the missing foundational guide written as a contract.

Tasks:

(Human) Provision the hardened Ubuntu VM; install Hermes Agent (pinned); create Alpaca paper keys; create the Telegram bot; create financialdatasets.ai + LLM API keys. Store all secrets via Docker secrets / env — never in the repo.
(Agent) Scaffold repo: pyproject.toml with pinned deps (§1.3), pre-commit, pytest, ruff, mypy, a lockfile. Python per upstream constraints (TradingAgents needs 3.13; ai-hedge-fund uses Poetry — isolate it as a subprocess/venv, don’t merge dep trees).
(Agent) Write Hermes_Trading_Build_Guide.md as a contract-only doc (the missing foundation): the orchestrator primitives, the place_order boundary rules, and the concrete failure semantics (what HOLD, fail-closed, and kill-switch each do, step by step). Keep it minimal; do not invent requirements beyond the playbook. (Council: Skeptic + Critic — this file may encode the only spec of failure semantics; it gates the safety layer.)
(Agent) Implement + unit-test the safety primitives as standalone modules: kill_switch.check(), double_gate.assert_paper() (reads fresh API account_type), audit.write() (SQLite + append-only JSONL, fail-closed), config loader with policy_hash/config_version, structured logging with trace_id.
(Agent) Define all pipeline artifacts as typed schemas (pydantic) from Playbook Teil C: universe_db.json, feature_snapshot.v1, screen_result, debate_thesis, reconciled_signal, sized_order, ApprovedOrderRequest, ExecutionResult, position_health. Use the canonical names (Teil C.2) — no synonyms.
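
The fail-closed audit primitive named in the task list can be sketched with the standard library alone. This is an illustration, assuming a hypothetical `audit_write` signature, a hypothetical two-column `audit` table, and a hypothetical `AuditWriteError`; the playbook's actual schema is not reproduced here.

```python
import json
import os
import sqlite3
from pathlib import Path


class AuditWriteError(Exception):
    """The caller must treat this as: decision invalid, default to HOLD."""


def audit_write(db_path: Path, jsonl_path: Path, record: dict) -> None:
    """Write one record to both sinks; raise if either write fails."""
    if not record.get("trace_id"):
        raise AuditWriteError("missing trace_id")
    try:
        line = json.dumps(record, sort_keys=True)
        # Mode "a" only ever appends; fsync pushes the line to disk.
        with open(jsonl_path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS audit "
                    "(trace_id TEXT NOT NULL, payload TEXT NOT NULL)"
                )
                conn.execute(
                    "INSERT INTO audit (trace_id, payload) VALUES (?, ?)",
                    (record["trace_id"], line),
                )
        finally:
            conn.close()
    except (OSError, sqlite3.Error, TypeError, ValueError) as exc:
        raise AuditWriteError(str(exc)) from exc
```

The two sinks are not written atomically in this sketch: if the SQLite insert fails after the JSONL append, an orphan line remains. What the sketch does guarantee is that a failure in either sink surfaces as an exception, so the calling gate can reject the decision.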

DoD / test gate:

kill_switch, double_gate, audit, schemas: 100% unit-test coverage, all green.
Adversarial tests pass: corrupt audit write → decision rejected; kill-switch file present → every job aborts at step 1; double-gate with LIVE_TRADING_ENABLED unset/true → blocked.
Hermes_Trading_Build_Guide.md exists and the human has signed off on its failure-semantics section.

Phase 1 — Vertical slice MVP: stubbed-brain paper loop

Goal: One real paper order placed end-to-end, through the full gate stack, on a single hardcoded ticker, with all LLM brains stubbed. This proves the spine and the artifact contracts — not alpha.

What is REAL in slice #1: double-gate, kill-switch, long-only check, ethics gate (real enforcement against a hardcoded denylist), HITL Telegram approval, place_order (paper) through the direct Alpaca SDK/REST broker adapter (per the session’s MCP decision, which takes alpaca-mcp-server off the order path), fail-closed audit, ExecutionResult, reading back position_health.

What is STUBBED in slice #1: R1 universe (one symbol), R2 data (canned feature_snapshot), R3 screening + R4 debate + R5 reconcile (return a canned reconciled_signal=buy JSON), R6 sizing may be a fixed small qty initially then real in Phase 2.

Tasks (Agent):

Wire the canonical pipeline end-to-end with stubs left of place_order, real gates right of it.
Implement place_order as the single exposed execution tool: order of checks = kill-switch → long-only (side=="buy") → double-gate (fresh account_type) → ethics → fresh HITL token (≤15 min) → audit → submit GTC bracket → audit result.
Implement the Telegram HITL flow: request → approve/deny → signed token; timeout → HOLD; warn on rubber-stamp (approve <5 s).
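
The check order in the `place_order` task can be sketched as a short-circuiting chain of gates. This is a hedged illustration: the `Gate` type, the injected `audit` and `submit` callables, the event names, and the result dictionaries are all hypothetical, and the real gates (kill-switch, double-gate, ethics, HITL token) are passed in rather than implemented here.

```python
from typing import Callable, List, Optional, Tuple

# A gate inspects the order and returns None (pass) or a reason code (block).
Gate = Callable[[dict], Optional[str]]


def place_order(
    order: dict,
    gates: List[Tuple[str, Gate]],
    audit: Callable[[dict], None],
    submit: Callable[[dict], dict],
) -> dict:
    """Run the gates in list order; stop at the first block; audit everything."""
    trace_id = order.get("trace_id")
    try:
        for name, gate in gates:
            reason = gate(order)
            if reason is not None:
                # Every BLOCK is audited before returning.
                audit({"trace_id": trace_id, "event": "BLOCK",
                       "gate": name, "reason": reason})
                return {"status": "BLOCKED", "gate": name, "reason": reason}
        # Fail-closed: if this audit write raises, nothing is submitted.
        audit({"trace_id": trace_id, "event": "SUBMIT_INTENT"})
    except Exception as exc:
        # Any gate or audit error before submission degrades to HOLD.
        return {"status": "HOLD", "reason": type(exc).__name__}
    # Errors from here on are deliberately not swallowed: the order may
    # already exist at the broker and needs reconciliation, not a HOLD.
    result = submit(order)
    audit({"trace_id": trace_id, "event": "RESULT", "result": result})
    return {"status": "SUBMITTED", "result": result}


def long_only(order: dict) -> Optional[str]:
    """Long-only gate: anything other than side == 'buy' is blocked."""
    return None if order.get("side") == "buy" else "side_not_buy"
```

Passing the gates as an ordered list of `(name, gate)` pairs keeps the sequence itself testable: an adversarial test can assert that a blocked order never reaches `submit` and that the BLOCK record names the gate that fired.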

DoD / test gate:

A canned buy proposal results in a real paper fill on the test ticker, with a complete trace_id-linked audit chain (proposal → gates → HITL token → fill).
Adversarial tests (Critic-mandated) all pass:
- Attempt a live order (LIVE_TRADING_ENABLED=true in code path) → blocked + logged.
- Telegram bot down / no response → order HOLDs, never auto-approves.
- side="sell"/short attempt → blocked first.
- Kill-switch toggled mid-loop → loop aborts.
The slice is explicitly labelled “plumbing-proven, alpha-unproven.” (Council: the 30-trade live-eval clock does NOT start here.)
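
The "fresh signed token (≤15 min), timeout = HOLD" requirement can be illustrated with an HMAC over the trace id and issue time. This is a sketch under assumptions: the token layout `trace_id:issued_at:signature`, the function names, and the use of HMAC-SHA256 are illustrative choices, not the playbook's specified scheme. Only the 15-minute limit comes from the plan.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_TOKEN_AGE_S = 15 * 60  # the 15-minute freshness limit from the plan


def issue_token(secret: bytes, trace_id: str, issued_at: int) -> str:
    """Bind an approval to one trace_id and one issue time."""
    msg = f"{trace_id}:{issued_at}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{trace_id}:{issued_at}:{sig}"


def verify_token(secret: bytes, token: Optional[str], trace_id: str,
                 now: Optional[float] = None) -> bool:
    """Return True only for a valid, matching, fresh token.

    A missing token (no reply, bot down, timeout) returns False, which the
    caller maps to HOLD; nothing here ever auto-approves.
    """
    if not isinstance(token, str):
        return False
    now = time.time() if now is None else now
    try:
        tok_trace, issued_str, sig = token.rsplit(":", 2)
        issued_at = int(issued_str)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{tok_trace}:{issued_str}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    if tok_trace != trace_id:
        return False
    age = now - issued_at
    return 0 <= age <= MAX_TOKEN_AGE_S
```

Binding the signature to the `trace_id` means an approval for one proposal cannot be replayed for another, and rejecting negative ages guards against a token stamped in the future.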

Phase 2 — Harden the deterministic right side (R6, R7, R8, R9)

Goal: Everything right of place_order becomes the real spec implementation, fully tested. Still stubbed brains.

Tasks (Agent):

R6 Risk Manager: real ATR-1% sizing, position caps (10%/30%/80%), correlation limit, stop computation, hard rejection_reason on any limit breach. R6 is the single source of sizing.
R7 Execution Trader: real GTC bracket orders, idempotency via client_order_id, spread re-check (<0.1% hard gate at order time), HARD_MAX_NOTIONAL emergency brake, slippage capture. Integer qty only (brackets don’t support fractionals).
R8 Portfolio Monitor & Exit: Chandelier stop ratchet, max-hold, weekend/earnings gap protection, ExitRequest generation. Sells ≤ existing long qty.
R9 Compliance & Ethics: HITL workflow ownership, MAR/wash-trade checks, audit-record + tax-log, the 15-min token issuance.

DoD / test gate: Each role has unit tests + property tests for its gates (e.g. sizing never exceeds caps; R7 never lets a non-buy through; exit never sells more than held). Fault-injection tests for each. Staggered escalation verified: 8% (R5 HITL) < 10% (R6 cap) < 12% (R6 assert).
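
One possible reading of the R6 sizing rule, with the property "sizing never exceeds caps" made checkable, is sketched below. The interpretation is an assumption: "ATR-1%" is read here as risking 1% of equity per trade with a stop placed a multiple of ATR below entry, `atr_mult=2.0` is a hypothetical default, and only a 10% per-position cap is modelled (what the 30% and 80% caps apply to is not stated in this plan). The `SizedOrder` fields are illustrative, not the playbook schema.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SizedOrder:
    qty: int
    stop_price: float
    rejection_reason: Optional[str] = None


def size_position(equity: float, price: float, atr: float,
                  risk_frac: float = 0.01, atr_mult: float = 2.0,
                  max_position_frac: float = 0.10) -> SizedOrder:
    """Integer share count limited by both risk budget and position cap."""
    if equity <= 0 or price <= 0 or atr <= 0:
        return SizedOrder(0, 0.0, "invalid_input")
    stop_distance = atr_mult * atr
    if stop_distance >= price:
        return SizedOrder(0, 0.0, "stop_not_positive")
    # Shares such that a stop-out loses at most risk_frac of equity.
    qty_by_risk = math.floor(equity * risk_frac / stop_distance)
    # Shares such that notional stays within the per-position cap.
    qty_by_cap = math.floor(equity * max_position_frac / price)
    qty = min(qty_by_risk, qty_by_cap)
    if qty < 1:
        return SizedOrder(0, 0.0, "qty_below_one")
    return SizedOrder(qty, round(price - stop_distance, 2))
```

With equity 100,000, price 50 and ATR 2, the risk budget is 1,000 and the stop distance is 4, giving 250 shares by risk; the 10% cap allows 10,000 / 50 = 200 shares, so the cap binds and the result is 200 shares with a stop at 46.0. Rounding down to an integer also matches the integer-quantity constraint on bracket orders noted for R7.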

Phase 3 — Real data & universe (R2, R1)

Goal: Replace data/universe stubs with the real, point-in-time, auditable layer.

Tasks:

(Agent) R2: feature_snapshot from Alpaca (primary) + financialdatasets.ai (fundamentals); yfinance fallback for plausibility only; full PiT filtering + data-quality gates → HOLD_DATA_ISSUE on any failure.
(Agent) R1: universe build (ethics → theme → liquidity gates), immutable PiT snapshots, exclusion_audit.sqlite, universe_policy.yaml as single source of exclusion lists.
(Human) Decide the PLTR/Planet-Labs line.

DoD / test gate: Ethics recall = 100% on the known-dirty test suite (AVAV/SAIC/BAH/LMT/XOM…) — every false-negative is a critical bug. Build determinism: same input/day → same policy_hash. No look-ahead: a test that feeds future data must produce HOLD.
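
The no-look-ahead test in the gate above can be made concrete with a small point-in-time guard. This is a sketch, assuming a hypothetical `Bar` record with `bar_end` and `available_at` timestamps; only the rule `available_at <= decision_as_of`, the exclusion of the running bar, and the `HOLD_DATA_ISSUE` code come from the plan.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Bar:
    bar_end: datetime       # when the bar closes (hypothetical field name)
    available_at: datetime  # when the value became knowable
    close: float


def is_visible(bar: Bar, decision_as_of: datetime) -> bool:
    # Knowable at decision time AND already closed (running bar excluded).
    return bar.available_at <= decision_as_of and bar.bar_end <= decision_as_of


def visible_bars(bars: List[Bar], decision_as_of: datetime) -> List[Bar]:
    """Filter used when BUILDING features from a raw feed."""
    return [b for b in bars if is_visible(b, decision_as_of)]


def check_snapshot(bars: List[Bar], decision_as_of: datetime) -> str:
    """Guard used when CONSUMING a snapshot: any future row means HOLD."""
    if not bars:
        return "HOLD_DATA_ISSUE"  # missing input also defaults to HOLD
    if any(not is_visible(b, decision_as_of) for b in bars):
        return "HOLD_DATA_ISSUE"
    return "OK"
```

The two functions are deliberately separate: filtering while building a snapshot is normal operation, whereas finding future data inside a finished snapshot means something upstream is broken, so the guard refuses the whole snapshot rather than quietly dropping rows.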

Phase 4 — Real brains, left of the boundary (R3, R4, R5)

Goal: Swap stubs for real intelligence — strictly as proposals. The live-eval clock starts when this phase ships.

Tasks (Agent): R3 wrap ai-hedge-fund as a headless subprocess producing screen_result (5 personas, T=0, grounded). R4 wrap TradingAgents per shortlisted candidate producing debate_thesis. R5 deterministic reconciler (supermajority ≥2/3 + calibration) → reconciled_signal; optional LLM veto-judge only. Adopt finance-skills (Teil D) stripped (triggers + code removed, static-pinned, PiT-wrapped) — never touching place_order.

DoD / test gate: All LLM outputs are schema-validated and pass through the Phase-2 gates unchanged. Models pinned + temperature-0 where required. Look-ahead guard: LLMs receive only the snapshot + “say INSUFFICIENT_DATA otherwise.” Audit logs model_pin/prompt_hash/input_snapshot_hash per decision.
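
The supermajority part of the R5 reconciler is simple enough to sketch deterministically. This is an illustration under assumptions: the vote vocabulary, the `reconcile` name, and the choice to count `INSUFFICIENT_DATA` votes in the denominator are hypothetical, and the calibration step mentioned in the task is not modelled.

```python
from typing import List

# Hypothetical vote vocabulary; anything else invalidates the whole round.
VALID_VOTES = {"buy", "hold", "INSUFFICIENT_DATA"}


def reconcile(votes: List[str]) -> str:
    """Return 'buy' only on a >= 2/3 supermajority of buy votes."""
    if not votes:
        return "hold"  # no input defaults to HOLD
    if any(v not in VALID_VOTES for v in votes):
        return "hold"  # schema violation defaults to HOLD
    buys = sum(1 for v in votes if v == "buy")
    # Integer comparison avoids floating-point edge cases at exactly 2/3.
    return "buy" if 3 * buys >= 2 * len(votes) else "hold"
```

With five personas, four buy votes clear the bar (12 ≥ 10) while three do not (9 < 10). Counting `INSUFFICIENT_DATA` in the denominator is the conservative choice: an abstaining brain makes a buy harder to reach, never easier.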

Phase 5 — Validator & orchestration (R10, R11)

Goal: The epistemic gate and the operational nervous system.

Tasks (Agent): R10 backtest harness (direction='longonly', survivorship-aware), overfitting trias (PBO/RC), go_no_go_verdict — computation is deterministic, council read-only optional. R11 Hermes crons (UTC server + exchange_calendars, no critical jobs 01–03 ET), budget/cost guards, secrets, watchdog, 4-level kill-switch escalation, per-decision reproducibility manifest.

DoD / test gate: Backtests run survivorship-free on archived PiT snapshots; assert_ethics_clean runs before every backtest. All crons TZ/DST-tested. Kill-switch escalation L1→L4 tested; deterministic monitoring/exit checks keep running under budget-block.

Phase 6 — Paper evaluation → Go/No-Go

Goal: Accumulate the evidence the playbook requires before live is even discussed. No code goal — this is an observation + decision phase.

Gate to live (all required, Playbook F.1): ≥3 months paper · ≥30 completed trades · passed overfitting trias · 0 secret leaks · 0 ungated orders · 0 budget violations · passed fault-injection · explicit human Go/No-Go. Then, and only then, the manual 3-step live flip — with reduced initial capital per §1.4.1.
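
Because every criterion in the gate is a hard threshold, the check itself can be a pure function that reports which criteria fail. The sketch below assumes a hypothetical `PaperEvidence` record and field names; the thresholds are the ones listed in the gate, and the human Go/No-Go remains an input that no code can set on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PaperEvidence:
    paper_months: float
    completed_trades: int
    overfitting_trias_passed: bool
    secret_leaks: int
    ungated_orders: int
    budget_violations: int
    fault_injection_passed: bool
    human_go: bool  # explicit, written decision by the human


def failed_criteria(e: PaperEvidence) -> List[str]:
    """List every unmet criterion; an empty list means the gate is open."""
    checks = [
        ("paper_months>=3", e.paper_months >= 3),
        ("completed_trades>=30", e.completed_trades >= 30),
        ("overfitting_trias_passed", e.overfitting_trias_passed),
        ("secret_leaks==0", e.secret_leaks == 0),
        ("ungated_orders==0", e.ungated_orders == 0),
        ("budget_violations==0", e.budget_violations == 0),
        ("fault_injection_passed", e.fault_injection_passed),
        ("human_go", e.human_go),
    ]
    return [name for name, ok in checks if not ok]


def live_gate_open(e: PaperEvidence) -> bool:
    return not failed_criteria(e)
```

Returning the list of failures rather than a bare boolean gives the end-of-phase report something concrete to show: the gate is closed, and these are the reasons.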

3. Cross-cutting rules for coding agents (every phase)

Never stub anything right of place_order. Stubbing the brains is free and safe; stubbing a gate breaks everything.
A too-permissive stub hides schema bugs (Pragmatist warning). Make stubs return realistically constrained data that the real agent could plausibly produce.
Every order attempt — including every BLOCK — is audited with trace_id + data_snapshot_hash + reason code.
Write the test before/with the gate. Untested safety code is worse than none (Critic). Each gate needs an adversarial test that tries to defeat it.
Pin everything — model versions, dep versions, container digests. No runtime npx/unpinned pulls (supply-chain).
When blocked or uncertain, HOLD and ask the human. Do not improvise around a non-negotiable.

4. Definition of “the agency exists”

The agency is built when Phases 0–5 are green and Phase 6 is running (paper loop live, eval clock ticking, full audit accumulating). It is trusted for live only when the Phase-6 gate passes and the human signs the Go/No-Go. Until then it is, by design and by mandate, a paper-only system that defaults to HOLD.

Build plan derived from Agency_Master_Playbook.md (Council-Edition, 2026-05-31) via a four-voice decision council + codex web research. Internal research/engineering only — not investment, legal, or tax advice. Proprietary trading, Austria/EU (WAG 2018, FMA, MiFID II, MAR).