Stage 11

Monitoring, incident response, and rollout

A first-class system fails safely, explains itself clearly, and improves after each incident.

Challenge we are solving

The most expensive failures are not obvious logic bugs. They are stale data, silent drift, broken heartbeats, poor rollouts, and state divergence that nobody notices in time.

What this stage does

Collects metrics, logs, traces, and alerts. Pauses unsafe scopes automatically. Moves each change through the rollout ladder: replay → shadow → paper → canary → broader live.

Why this stage exists

It turns a clever system into an operable one.

Flow

Metrics: fill ratio · slippage error · drift

Alerts: stale book · heartbeat · canary delta

Auto-pause: scope freeze

Rollout: replay → shadow → paper → canary → live
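
The rollout ladder in the flow above can be sketched as a small state machine. This is a minimal illustration, not the repository's implementation: `GateResult`, `promote`, and `rollback` are hypothetical names, and the one-rung-at-a-time rule for both promotion and rollback is an assumption.

```typescript
// Rollout ladder from this stage: replay → shadow → paper → canary → live.
type RolloutStage = "replay" | "shadow" | "paper" | "canary" | "live";

const LADDER: readonly RolloutStage[] = ["replay", "shadow", "paper", "canary", "live"];

// Hypothetical gate result: passed is true only when every check for the
// current rung succeeded (the checks themselves are out of scope here).
interface GateResult {
  passed: boolean;
  reason?: string;
}

// Promote by exactly one rung when the gate passed; otherwise stay put.
function promote(current: RolloutStage, gate: GateResult): RolloutStage {
  if (!gate.passed) return current;
  const next = LADDER[LADDER.indexOf(current) + 1];
  return next ?? current; // "live" is the top rung
}

// Roll back by one rung (assumed policy); "replay" is the floor.
function rollback(current: RolloutStage): RolloutStage {
  const prev = LADDER[LADDER.indexOf(current) - 1];
  return prev ?? current;
}
```

Keeping the ladder as an ordered array means a change can never skip a rung: a failed gate simply leaves the stage where it is.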

What the backend should expose

data freshness, gap counts
fill ratio, slippage error (predicted vs real)
model drift (canary delta · baseline delta)
heartbeat health (per bot, per market)
incident queue (open · acknowledged · resolved)
rollout stage (replay · shadow · paper · canary · live)
rollback controls (per scope, audit-logged)
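
One way to picture the list above is as a typed payload. The sketch below is illustrative only: every field name, the `ScopeHealth` and `Incident` shapes, and the `countByStatus` helper are assumptions, not the backend's actual contract.

```typescript
type IncidentStatus = "open" | "acknowledged" | "resolved";
type RolloutStageName = "replay" | "shadow" | "paper" | "canary" | "live";

// Hypothetical per-scope health record mirroring the list above.
interface ScopeHealth {
  scope: string;              // e.g. a bot or market identifier (assumed)
  dataFreshnessMs: number;    // time since last accepted update
  gapCount: number;           // gaps detected in the window
  fillRatio: number;          // 0..1
  slippageError: number;      // predicted vs real (units assumed)
  canaryDelta: number;        // model drift vs canary cohort
  baselineDelta: number;      // model drift vs baseline cohort
  heartbeatOk: boolean;       // per bot, per market
  rolloutStage: RolloutStageName;
}

// Hypothetical incident record for the incident queue.
interface Incident {
  id: string;
  scope: string;
  status: IncidentStatus;
}

// Summarise the incident queue by status for a dashboard tile.
function countByStatus(queue: readonly Incident[]): Record<IncidentStatus, number> {
  const counts: Record<IncidentStatus, number> = { open: 0, acknowledged: 0, resolved: 0 };
  for (const incident of queue) counts[incident.status] += 1;
  return counts;
}
```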

Maths we expect here

Every formula below is implemented in packages/polytraders-bots/ or packages/polytraders-runner/. Treat the worked example as the unit-test sanity check you should be able to reproduce locally.

SLO compliance

\[SLO = \frac{healthy\_events}{total\_events} \ge target\]

Symbol	Meaning	Units / range
\(healthy\_events\)	Events within the success criteria (fresh, in-budget, on-spec)	count
\(total\_events\)	All eligible events in the window	count
\(target\)	SLO target (e.g. 0.995)	0..1

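Worked example: \[\tfrac{99{,}640}{100{,}000}=0.9964 \ge 0.995 \;\Rightarrow\; SLO\;met,\;no\;alert\] If the ratio falls below the target (fewer than 99,500 healthy events out of 100,000 at a 0.995 target), raise the SLO_BURN alert.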

Each SLO has an owner (registry, ingestion, OMS, etc.). Burn alerts trigger automatic pauses on the affected scope.
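
The compliance check is a ratio compared against a target. A minimal sketch follows, assuming a hypothetical `sloCompliance` helper and assuming that an empty window is treated as an error rather than as compliant.

```typescript
interface SloResult {
  ratio: number;      // healthy_events / total_events
  compliant: boolean; // ratio >= target
}

// Compute SLO compliance for one window.
// Throwing on an empty window is an assumption; a real system might
// instead treat "no events" as a staleness problem.
function sloCompliance(healthyEvents: number, totalEvents: number, target: number): SloResult {
  if (totalEvents <= 0) throw new RangeError("totalEvents must be positive");
  if (healthyEvents < 0 || healthyEvents > totalEvents) {
    throw new RangeError("healthyEvents must be within 0..totalEvents");
  }
  const ratio = healthyEvents / totalEvents;
  return { ratio, compliant: ratio >= target };
}
```

With 99,640 healthy events out of 100,000 and a target of 0.995, the ratio is 0.9964 and this sketch reports the SLO as met. At 99,400 healthy events the ratio drops to 0.994, which is below target and would be the point to raise a burn alert and pause the affected scope.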

Staleness alert

\[\text{alert} \iff stale\_duration > threshold(scope)\]

Symbol	Meaning	Units / range
\(stale\_duration\)	Time since last accepted update for the scope	seconds
\(threshold(scope)\)	Per-scope SLO (book: 2s, registry: 10s, model: 24h)	seconds

Worked example: \[book\;stale = 4{,}600\,\text{ms} > 2{,}000\,\text{ms}\;(2\,\text{s}) \;\Rightarrow\; pause\;trading\;on\;market\]
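
The staleness rule can be sketched with a per-scope threshold table. The thresholds come from the table above, converted to milliseconds; the `StaleScope` type, the `STALE_THRESHOLD_MS` constant, and the `isStale` helper are hypothetical names.

```typescript
type StaleScope = "book" | "registry" | "model";

// Per-scope thresholds in milliseconds (book: 2s, registry: 10s, model: 24h).
const STALE_THRESHOLD_MS: Record<StaleScope, number> = {
  book: 2_000,
  registry: 10_000,
  model: 24 * 60 * 60 * 1_000,
};

// A scope is stale when the time since its last accepted update
// strictly exceeds its threshold.
function isStale(scope: StaleScope, lastAcceptedAtMs: number, nowMs: number): boolean {
  const staleDurationMs = nowMs - lastAcceptedAtMs;
  return staleDurationMs > STALE_THRESHOLD_MS[scope];
}
```

A book last updated 4,600 ms ago is stale under this rule, while a registry entry of the same age is not.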

Canary delta vs. baseline

\[canary\_delta = \mathrm{mean}(metric_{new}) - \mathrm{mean}(metric_{baseline})\]

Symbol	Meaning	Units / range
\(metric_{new}\)	Metric from canary cohort	depends on metric (e.g. Brier, slip RMSE)
\(metric_{baseline}\)	Metric from baseline cohort over the same window	same

Worked example: \[\text{Brier}_{new}=0.179,\;\text{Brier}_{baseline}=0.181 \;\Rightarrow\; canary\_delta=-0.002 \;\text{(lower Brier is better; eligible for promotion)}\]

Canary must not be worse than baseline on any leading metric. Promotion uses a fixed window and a one-sided significance bar.
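
The delta itself is a difference of means over the same window. The sketch below assumes hypothetical helpers `mean`, `canaryDelta`, and `isNotWorse`; it covers only the sign check and deliberately leaves out the one-sided significance test, whose details this page does not specify.

```typescript
// Arithmetic mean; throws on an empty sample rather than returning NaN.
function mean(xs: readonly number[]): number {
  if (xs.length === 0) throw new RangeError("sample must not be empty");
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// canary_delta = mean(metric_new) - mean(metric_baseline)
function canaryDelta(canary: readonly number[], baseline: readonly number[]): number {
  return mean(canary) - mean(baseline);
}

// Sign check only: for loss-like metrics (Brier, slippage RMSE) lower is
// better, so a non-positive delta means the canary is not worse.
function isNotWorse(delta: number, lowerIsBetter: boolean): boolean {
  return lowerIsBetter ? delta <= 0 : delta >= 0;
}
```

For cohort means of 0.179 (canary) and 0.181 (baseline) on Brier score, the delta is about -0.002, so the sign check passes. Whether that difference clears the significance bar is a separate decision.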

Drift score (model)

\[drift = \mathrm{KL}\big(P_{features}^{recent}\;\|\;P_{features}^{training}\big)\]

Symbol	Meaning	Units / range
\(P_{features}^{recent}\)	Feature distribution over the last 24h	distribution
\(P_{features}^{training}\)	Feature distribution at training time	distribution

Worked example: \[drift = 0.09 < 0.20\;threshold \;\Rightarrow\; healthy\]

When drift exceeds threshold, stage 11 forces a re-evaluation against stage 10 before any further promotion.
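
For a single binned feature, the KL drift score can be sketched as below. This is an illustration under stated assumptions: features are already bucketed into matching histogram bins, natural logarithms are used, and zero-probability training bins are floored at a small `eps` to keep the result finite. `normalize`, `klDivergence`, and `needsReevaluation` are hypothetical names.

```typescript
// Turn raw bin counts into a probability distribution.
function normalize(counts: readonly number[]): number[] {
  const total = counts.reduce((acc, c) => acc + c, 0);
  if (total <= 0) throw new RangeError("counts must sum to a positive value");
  return counts.map((c) => c / total);
}

// KL(P || Q) = sum_i p_i * ln(p_i / q_i), with P = recent, Q = training.
// Bins where p_i = 0 contribute nothing; q_i is floored at eps (assumption).
function klDivergence(p: readonly number[], q: readonly number[], eps = 1e-12): number {
  if (p.length !== q.length) throw new RangeError("distributions must share bins");
  let kl = 0;
  p.forEach((pi, i) => {
    if (pi <= 0) return;
    const qi = Math.max(q[i] ?? 0, eps);
    kl += pi * Math.log(pi / qi);
  });
  return kl;
}

// Drift above the threshold forces a re-evaluation before further promotion.
function needsReevaluation(drift: number, threshold = 0.2): boolean {
  return drift > threshold;
}
```

A drift of 0.09 against the 0.20 threshold leaves the model healthy; identical recent and training distributions give a drift of 0.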

How a developer codes this stage

Reference TypeScript implementation lives in packages/polytraders-* at the repository root. Stage owners maintain these files — read them before writing new code.

packages/polytraders-bots/src/governance: Reference monitoring bots: freshnesssentinel, driftdetector, rolloutgatekeeper.
packages/polytraders-runner/src/pipeline.js: Pipeline emits health events at every stage. Monitoring consumes them, not the raw bots.
packages/polytraders-contracts/src/ReportEnvelope.ts: Incident envelopes use the same schema as trade envelopes. One log, one auditor.

See it in the platform mock

The platform mock is the source of truth for what each stage's UI exposes. Open these alongside the code references.

Ops dashboard: /mock-app/ops-dashboard.html
Incident board: /mock-app/incidents.html
Auto-paused scopes: /mock-app/risk-killswitches.html
Stage 11, pipeline view: /ops/monitoring.html
Runbooks: /ops/incident-playbooks.html

Reason codes emitted at this stage

GOV_*: governance, accounting, replay, monitoring
RISK_*: guardrails, caps, kill-switch

Each reason code has a canonical short description. Full registry: /standards/reason-codes.