Stage 11
Monitoring, incident response, and rollout
A first-class system fails safely, explains itself clearly, and improves after each incident.
Challenge we are solving
Most expensive failures are not obvious logic bugs. They are stale data, silent drift, broken heartbeats, poor rollouts, and state divergence that nobody notices in time.
roadmap position
Where this stage sits in the build plan
This runtime stage is advanced by 2 build phases. The earliest is the one that first wires it; later phases extend it. See the full plan for the exit gate, infra tasks, and bots in each phase.
Open the plan at Phase 3 →
What this stage does
Collects metrics, logs, traces, and alerts. Pauses unsafe scopes automatically. Moves each change through the rollout ladder: replay → shadow → paper → canary → broader live.
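The ladder and the auto-pause can be sketched as a small state transition. This is a minimal illustration, not the reference implementation: `ScopeState`, `promote`, `pause`, and the `gatePassed` flag are hypothetical names, and the rule that a paused scope never advances is an assumption drawn from "pauses unsafe scopes automatically".

```typescript
// Hypothetical sketch: names here are illustrative, not the reference API.
const LADDER = ["replay", "shadow", "paper", "canary", "live"] as const;
type RolloutStage = (typeof LADDER)[number];

interface ScopeState {
  stage: RolloutStage;
  paused: boolean; // set by auto-pause; how it is cleared is out of scope here
}

// Advance one rung only when the scope is not paused and its gate passed.
// A scope already at "live" stays at "live".
function promote(state: ScopeState, gatePassed: boolean): ScopeState {
  if (state.paused || !gatePassed) return state;
  const i = LADDER.indexOf(state.stage);
  const next = LADDER[Math.min(i + 1, LADDER.length - 1)] ?? state.stage;
  return { ...state, stage: next };
}

// Auto-pause freezes the scope at its current rung.
function pause(state: ScopeState): ScopeState {
  return { ...state, paused: true };
}
```

Keeping the ladder as an ordered constant means a change can only move one rung at a time, so skipping from paper straight to live is not expressible.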
Why this stage exists
It turns a clever system into an operable one.
Flow
Metrics: fill ratio · slippage err · drift
→ Alerts: stale book · heartbeat · canary delta
→ Auto-pause: scope freeze
→ Rollout: replay → shadow → paper → canary → live
What the backend should expose
data freshness, gap counts
fill ratio, slippage error (predicted vs real)
model drift (canary delta · baseline delta)
heartbeat health (per bot, per market)
incident queue (open · acknowledged · resolved)
rollout stage (replay · shadow · paper · canary · live)
rollback controls (per scope, audit-logged)
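One way to picture the read side of this list is a single snapshot type. The sketch below is an assumption about shape only: every field name is hypothetical, the sign convention for slippage error is a guess, and rollback controls are omitted because they are actions rather than readings.

```typescript
// Hypothetical shape; field names are illustrative, not the backend's schema.
type IncidentStatus = "open" | "acknowledged" | "resolved";
type RolloutStage = "replay" | "shadow" | "paper" | "canary" | "live";

interface HealthSnapshot {
  dataFreshnessMs: number; // time since last accepted update
  gapCount: number;        // gaps detected in the window
  fillRatio: number;       // 0..1
  slippageError: number;   // predicted minus real (sign convention assumed)
  canaryDelta: number;
  baselineDelta: number;
  heartbeats: { bot: string; market: string; healthy: boolean }[];
  incidents: { id: string; status: IncidentStatus }[];
  rolloutStage: RolloutStage;
}

// Count incidents in a given state, e.g. for an "open incidents" badge.
function countIncidents(s: HealthSnapshot, status: IncidentStatus): number {
  return s.incidents.filter((incident) => incident.status === status).length;
}

// List the (bot, market) pairs whose heartbeat is currently unhealthy.
function unhealthyHeartbeats(s: HealthSnapshot): string[] {
  return s.heartbeats
    .filter((h) => !h.healthy)
    .map((h) => `${h.bot}@${h.market}`);
}
```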
Maths we expect here
Every formula below is implemented in packages/polytraders-bots/ or packages/polytraders-runner/. Treat the worked example as the unit-test sanity check you should be able to reproduce locally.
\[SLO = \frac{healthy\_events}{total\_events} \ge target\]
Symbol | Meaning | Units / range
\(healthy\_events\) | Events within the success criteria (fresh, in-budget, on-spec) | count
\(total\_events\) | All eligible events in the window | count
\(target\) | SLO target (e.g. 0.995) | 0..1
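worked example \[\tfrac{99{,}640}{100{,}000}=0.9964 \ge 0.995 \;\Rightarrow\; SLO\;met,\;no\;SLO\_BURN\;alert\]
Had only 99,400 of the 100,000 events been healthy, the ratio 0.9940 < 0.995 would raise the SLO_BURN alert.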
Each SLO has an owner (registry, ingestion, OMS, etc.). Burn alerts trigger automatic pauses on the affected scope.
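A minimal sketch of the SLO check, assuming a simple counter pair per window. `SloWindow` and `evaluateSlo` are hypothetical names, and treating an empty window as "met" is an assumption, not documented behaviour. With 99,640 healthy events out of 100,000 and a 0.995 target, the ratio is 0.9964 and the SLO holds; with 99,400 it does not.

```typescript
// Hypothetical names; not taken from the reference packages.
interface SloWindow {
  healthyEvents: number; // events meeting the success criteria
  totalEvents: number;   // all eligible events in the window
}

interface SloResult {
  ratio: number;
  met: boolean; // false means a burn alert should be raised
}

function evaluateSlo(w: SloWindow, target: number): SloResult {
  if (w.totalEvents <= 0) {
    // Assumption: an empty window counts as met rather than dividing by zero.
    return { ratio: 1, met: true };
  }
  const ratio = w.healthyEvents / w.totalEvents;
  return { ratio, met: ratio >= target };
}

const holding = evaluateSlo({ healthyEvents: 99640, totalEvents: 100000 }, 0.995);
const burning = evaluateSlo({ healthyEvents: 99400, totalEvents: 100000 }, 0.995);
```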
\[\text{alert} \iff stale\_duration > threshold(scope)\]
Symbol | Meaning | Units / range
\(stale\_duration\) | Time since last accepted update for the scope | seconds
\(threshold\) | Per-scope SLO (book: 2s, registry: 10s, model: 24h) | seconds
worked example \[stale\_duration_{book} = 4{,}600\,\text{ms} > 2{,}000\,\text{ms} \;\Rightarrow\; pause\;trading\;on\;market\]
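The staleness rule reduces to a per-scope lookup and a comparison. The sketch below uses the three thresholds named in the table above, converted to milliseconds; `Scope`, `STALE_THRESHOLD_MS`, and `isStale` are hypothetical names chosen for illustration.

```typescript
// Hypothetical names; thresholds are the per-scope values listed above.
type Scope = "book" | "registry" | "model";

const STALE_THRESHOLD_MS: Record<Scope, number> = {
  book: 2000,                 // 2 s
  registry: 10000,            // 10 s
  model: 24 * 60 * 60 * 1000, // 24 h
};

// True when the time since the last accepted update exceeds the scope's threshold.
function isStale(scope: Scope, lastAcceptedMs: number, nowMs: number): boolean {
  return nowMs - lastAcceptedMs > STALE_THRESHOLD_MS[scope];
}
```

A book last updated 4,600 ms ago is stale, while a registry entry of the same age is not, because the registry threshold is 10 s.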
3 Canary delta vs. baseline
\[canary\_delta = \mathrm{mean}(metric_{new}) - \mathrm{mean}(metric_{baseline})\]
Symbol | Meaning | Units / range
\(metric_{new}\) | Metric from canary cohort | depends on metric (e.g. Brier, slip RMSE)
\(metric_{baseline}\) | Metric from baseline cohort over the same window | same
worked example \[\text{Brier}_{new}=0.179,\;\text{Brier}_{baseline}=0.181 \;\Rightarrow\; canary\_delta=-0.002 \;\text{(better; promote eligible)}\]
Canary must not be worse than baseline on any leading metric. Promotion uses a fixed window and a one-sided significance bar.
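A sketch of the delta itself, under the assumption that each cohort's metric arrives as an array of per-sample values. `canaryDelta` and `notWorse` are hypothetical names. The one-sided significance bar is deliberately left out: this only computes the point estimate and its direction.

```typescript
// Hypothetical helpers; the significance test mentioned above is not modelled.
function mean(xs: readonly number[]): number {
  if (xs.length === 0) throw new Error("mean of empty sample");
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// canary_delta = mean(metric_new) - mean(metric_baseline)
function canaryDelta(canary: readonly number[], baseline: readonly number[]): number {
  return mean(canary) - mean(baseline);
}

// For error metrics (Brier, slip RMSE) lower is better, so a delta <= 0 is
// "not worse". For metrics where higher is better the sign flips.
function notWorse(delta: number, lowerIsBetter: boolean): boolean {
  return lowerIsBetter ? delta <= 0 : delta >= 0;
}
```

Passing `lowerIsBetter` explicitly avoids a silent sign error when the same gate is reused for a metric such as fill ratio, where a higher value is the improvement.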
\[drift = \mathrm{KL}\big(P_{features}^{recent}\;\|\;P_{features}^{training}\big)\]
Symbol | Meaning | Units / range
\(P_{features}^{recent}\) | Feature distribution over the last 24h | distribution
\(P_{features}^{training}\) | Feature distribution at training time | distribution
worked example \[drift = 0.09 < 0.20\;threshold \;\Rightarrow\; healthy\]
When drift exceeds threshold, stage 11 forces a re-evaluation against stage 10 before any further promotion.
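For a single feature binned into a histogram, the KL term can be sketched as below. Several things here are assumptions rather than documented behaviour: the binning, the natural-log units, and the small floor `eps` applied to training-side bins so an empty bin does not produce an infinite result. `klDivergence` is a hypothetical name.

```typescript
// Hypothetical sketch: KL(P_recent || P_training) over matching histogram bins.
// Inputs are raw bin counts; they are normalised to probabilities here.
function klDivergence(
  recent: readonly number[],
  training: readonly number[],
  eps = 1e-12,
): number {
  if (recent.length === 0 || recent.length !== training.length) {
    throw new Error("histograms must be non-empty and the same length");
  }
  const sumRecent = recent.reduce((a, b) => a + b, 0);
  const sumTraining = training.reduce((a, b) => a + b, 0);
  if (sumRecent <= 0 || sumTraining <= 0) {
    throw new Error("histograms must contain at least one observation");
  }
  let kl = 0;
  recent.forEach((count, i) => {
    const p = count / sumRecent;
    // Floor the training probability so empty bins do not divide by zero.
    const q = Math.max((training[i] ?? 0) / sumTraining, eps);
    if (p > 0) kl += p * Math.log(p / q); // bins with p = 0 contribute nothing
  });
  return kl;
}
```

Identical histograms give zero drift. A recent split of 9:1 against a training split of 5:5 gives roughly 0.368 nats, which would exceed the 0.20 threshold used in the worked example.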
How a developer codes this stage
Reference TypeScript implementation lives in packages/polytraders-* at the repository root. Stage owners maintain these files — read them before writing new code.
packages/polytraders-bots/src/governance · Reference monitoring bots: freshnesssentinel, driftdetector, rolloutgatekeeper.
packages/polytraders-runner/src/pipeline.js · Pipeline emits health events at every stage; monitoring consumes them, not the raw bots.
packages/polytraders-contracts/src/ReportEnvelope.ts · Incident envelopes use the same schema as trade envelopes: one log, one auditor.
See it in the platform mock
The platform mock is the source of truth for what each stage's UI exposes. Open these alongside the code references.
Reason codes emitted at this stage
GOV_* (GOV): governance, accounting, replay, monitoring
RISK_* (RISK): guardrails, caps, kill-switch
Hover or tap any reason code on this page (or anywhere on the site) to see its canonical short description. Full registry: /standards/reason-codes.