Polytraders Dev Guide
internal


Stage 11

Monitoring, incident response, and rollout

A first-class system fails safely, explains itself clearly, and improves after each incident.

Challenge we are solving

The most expensive failures are rarely obvious logic bugs. They are stale data, silent drift, broken heartbeats, poor rollouts, and state divergence that nobody notices in time.

What this stage does

Collects metrics, logs, traces, and alerts. Pauses unsafe scopes automatically. Moves each change through the rollout ladder: replay → shadow → paper → canary → broader live.

Why this stage exists

It turns a clever system into an operable one.

Flow

  • Metrics: fill ratio · slippage err · drift
  • Alerts: stale book · heartbeat · canary delta
  • Auto-pause: scope freeze
  • Rollout: replay → shadow → paper → canary → live
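
The rollout ladder in the flow above can be sketched as an ordered list with a gated promotion step. This is a minimal illustration, not the reference implementation: `GateCheck`, `nextRolloutStage`, and the three boolean checks are hypothetical names chosen for this example.

```typescript
// Rollout ladder from this stage, in promotion order.
const ROLLOUT_LADDER = ["replay", "shadow", "paper", "canary", "live"] as const;
type RolloutStage = (typeof ROLLOUT_LADDER)[number];

// Hypothetical gate input: the signals a gatekeeper might need before promoting.
interface GateCheck {
  sloCompliant: boolean;   // no SLO burn on the scope
  canaryNotWorse: boolean; // canary is not worse than baseline on any leading metric
  driftHealthy: boolean;   // drift score is below its threshold
}

// Returns the next rung, or the current rung if any check fails
// or the change is already at the top of the ladder.
function nextRolloutStage(current: RolloutStage, checks: GateCheck): RolloutStage {
  const idx = ROLLOUT_LADDER.indexOf(current);
  const allPass = checks.sloCompliant && checks.canaryNotWorse && checks.driftHealthy;
  if (!allPass || idx === ROLLOUT_LADDER.length - 1) return current;
  return ROLLOUT_LADDER[idx + 1];
}
```

Keeping the ladder as a single ordered constant means a change can only move one rung at a time, so no stage can be skipped by accident.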

What the backend should expose

  • data freshness, gap counts
  • fill ratio, slippage error (predicted vs real)
  • model drift (canary delta · baseline delta)
  • heartbeat health (per bot, per market)
  • incident queue (open · acknowledged · resolved)
  • rollout stage (replay · shadow · paper · canary · live)
  • rollback controls (per scope, audit-logged)
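
One way to picture the read side of this list is a single per-scope snapshot type. The sketch below is an assumption for illustration only: `MonitoringSnapshot`, its field names, and `countOpenIncidents` are hypothetical and are not taken from the contracts package. Rollback controls are actions rather than state, so they are left out.

```typescript
type IncidentState = "open" | "acknowledged" | "resolved";
type LadderStage = "replay" | "shadow" | "paper" | "canary" | "live";

// Hypothetical shape: one snapshot per scope (for example per bot or per market).
interface MonitoringSnapshot {
  scope: string;
  dataFreshnessMs: number;   // time since the last accepted update
  gapCount: number;          // gaps detected in the window
  fillRatio: number;         // 0..1
  slippageError: number;     // predicted vs real; sign convention is an assumption
  canaryDelta: number;       // model drift relative to the canary cohort
  baselineDelta: number;     // model drift relative to the baseline cohort
  heartbeatHealthy: boolean;
  incidents: { id: string; state: IncidentState }[];
  rolloutStage: LadderStage;
}

// Counts incidents that nobody has acknowledged yet.
function countOpenIncidents(snapshot: MonitoringSnapshot): number {
  return snapshot.incidents.filter((incident) => incident.state === "open").length;
}
```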

Maths we expect here

Every formula below is implemented in packages/polytraders-bots/ or packages/polytraders-runner/. Treat the worked example as the unit-test sanity check you should be able to reproduce locally.

1

SLO compliance

\[SLO = \frac{healthy\_events}{total\_events} \ge target\]
Symbol | Meaning | Units / range
\(healthy\_events\) | Events within the success criteria (fresh, in-budget, on-spec) | count
\(total\_events\) | All eligible events in the window | count
\(target\) | SLO target (e.g. 0.995) | 0..1
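Worked example: \[\tfrac{99{,}640}{100{,}000}=0.9964 \ge 0.995 \;\Rightarrow\; compliant,\;no\;SLO\_BURN\;alert\]
A ratio below the target, for example \(\tfrac{99{,}400}{100{,}000}=0.994 < 0.995\), raises the SLO_BURN alert.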

Each SLO has an owner (registry, ingestion, OMS, etc.). Burn alerts trigger automatic pauses on the affected scope.
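
The compliance ratio above can be sketched as a small pure function. This is an illustrative sketch under assumptions: `SloWindow` and `evaluateSlo` are hypothetical names, and only the `SLO_BURN` alert name comes from this page.

```typescript
// Hypothetical input: event counts for one SLO over one evaluation window.
interface SloWindow {
  healthyEvents: number; // events within the success criteria
  totalEvents: number;   // all eligible events in the window
  target: number;        // SLO target in 0..1, e.g. 0.995
}

interface SloResult {
  ratio: number;
  compliant: boolean;
  alert: "SLO_BURN" | null; // alert to raise, if any
}

// Computes healthy / total and compares it against the target.
function evaluateSlo(window: SloWindow): SloResult {
  if (window.totalEvents <= 0) {
    throw new RangeError("totalEvents must be positive");
  }
  const ratio = window.healthyEvents / window.totalEvents;
  const compliant = ratio >= window.target;
  return { ratio, compliant, alert: compliant ? null : "SLO_BURN" };
}

// 99,640 / 100,000 = 0.9964, which meets a 0.995 target.
const withinTarget = evaluateSlo({ healthyEvents: 99640, totalEvents: 100000, target: 0.995 });
// 99,400 / 100,000 = 0.994, which breaches it.
const breached = evaluateSlo({ healthyEvents: 99400, totalEvents: 100000, target: 0.995 });
```

An empty window throws instead of returning a ratio, so a scope with no eligible events is never silently reported as compliant.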

2

Staleness alert

\[\text{alert} \iff stale\_duration > threshold(scope)\]
Symbol | Meaning | Units / range
\(stale\_duration\) | Time since last accepted update for the scope | seconds
\(threshold(scope)\) | Per-scope SLO (book: 2 s, registry: 10 s, model: 24 h) | seconds
Worked example: \[book\;stale = 4{,}600\,\text{ms} > 2{,}000\,\text{ms} \;\Rightarrow\; pause\;trading\;on\;market\]

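
The staleness rule reduces to one comparison per scope. The sketch below uses the thresholds listed in the table; `STALE_THRESHOLD_MS`, `StaleScope`, and `isStale` are hypothetical names, and working in milliseconds is an assumption made for this example.

```typescript
// Per-scope thresholds from the table above, converted to milliseconds.
const STALE_THRESHOLD_MS = {
  book: 2000,                    // 2 s
  registry: 10000,               // 10 s
  model: 24 * 60 * 60 * 1000,    // 24 h
} as const;
type StaleScope = keyof typeof STALE_THRESHOLD_MS;

// True when the time since the last accepted update strictly exceeds the scope threshold.
function isStale(scope: StaleScope, lastAcceptedAtMs: number, nowMs: number): boolean {
  const staleDurationMs = nowMs - lastAcceptedAtMs;
  return staleDurationMs > STALE_THRESHOLD_MS[scope];
}

// A book last updated 4,600 ms ago is stale; a registry of the same age is not.
const bookIsStale = isStale("book", 0, 4600);
const registryIsStale = isStale("registry", 0, 4600);
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check deterministic under replay.
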
3

Canary delta vs. baseline

\[canary\_delta = \mathrm{mean}(metric_{new}) - \mathrm{mean}(metric_{baseline})\]
Symbol | Meaning | Units / range
\(metric_{new}\) | Metric from canary cohort | depends on metric (e.g. Brier, slip RMSE)
\(metric_{baseline}\) | Metric from baseline cohort over the same window | same
Worked example: \[\text{Brier}_{new}=0.179,\;\text{Brier}_{baseline}=0.181 \;\Rightarrow\; canary\_delta=-0.002 \;\text{(lower Brier is better; eligible for promotion)}\]

Canary must not be worse than baseline on any leading metric. Promotion uses a fixed window and a one-sided significance bar.
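
A minimal sketch of the delta and the "not worse" check follows. `meanOf`, `canaryDelta`, and `canaryNotWorse` are hypothetical names, and the `lowerIsBetter` flag is an assumption added to handle metrics such as Brier and RMSE where a negative delta is an improvement. The one-sided significance bar is deliberately left out; this only compares means.

```typescript
// Arithmetic mean; throws on an empty sample so a missing cohort is never read as zero.
function meanOf(values: number[]): number {
  if (values.length === 0) throw new RangeError("empty sample");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// canary_delta = mean(metric_new) - mean(metric_baseline)
function canaryDelta(canary: number[], baseline: number[]): number {
  return meanOf(canary) - meanOf(baseline);
}

// For lower-is-better metrics (Brier, slip RMSE) the canary is not worse when delta <= 0.
// For higher-is-better metrics the canary is not worse when delta >= 0.
function canaryNotWorse(delta: number, lowerIsBetter: boolean): boolean {
  return lowerIsBetter ? delta <= 0 : delta >= 0;
}

// Brier 0.179 (canary) vs 0.181 (baseline): the delta is about -0.002.
const brierDelta = canaryDelta([0.179], [0.181]);
const brierOk = canaryNotWorse(brierDelta, true);
```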

4

Drift score (model)

\[drift = \mathrm{KL}\big(P_{features}^{recent}\;\|\;P_{features}^{training}\big)\]
Symbol | Meaning | Units / range
\(P_{features}^{recent}\) | Feature distribution over the last 24h | distribution
\(P_{features}^{training}\) | Feature distribution at training time | distribution
Worked example: \[drift = 0.09 < 0.20\;threshold \;\Rightarrow\; healthy\]

When drift exceeds threshold, stage 11 forces a re-evaluation against stage 10 before any further promotion.
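
For a single feature, the KL term can be sketched over binned histograms. This is an illustration under assumptions rather than the reference implementation: the binning, the `eps` smoothing constant, the natural-log base, and the name `klDivergence` are all choices made for this example, and how scores are combined across features is not covered.

```typescript
// KL(P_recent || P_training) over two histograms that share the same bins.
// Inputs are raw bin counts (or probabilities); both are smoothed and normalised first.
function klDivergence(recent: number[], training: number[], eps = 1e-9): number {
  if (recent.length === 0 || recent.length !== training.length) {
    throw new RangeError("histograms must be non-empty and share the same bins");
  }
  // Add eps to every bin so empty bins do not produce log(0) or division by zero.
  const normalise = (counts: number[]): number[] => {
    const smoothed = counts.map((c) => c + eps);
    const total = smoothed.reduce((sum, c) => sum + c, 0);
    return smoothed.map((c) => c / total);
  };
  const p = normalise(recent);
  const q = normalise(training);
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    kl += p[i] * Math.log(p[i] / q[i]); // natural log, so the result is in nats
  }
  return kl;
}

// Identical histograms give zero drift; diverging ones give a positive score.
const noDrift = klDivergence([10, 20, 30], [10, 20, 30]);
const someDrift = klDivergence([50, 50], [90, 10]);
```

KL divergence is asymmetric, so the argument order matters: the recent distribution comes first, matching the formula above.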

How a developer codes this stage

Reference TypeScript implementation lives in packages/polytraders-* at the repository root. Stage owners maintain these files — read them before writing new code.

  • packages/polytraders-bots/src/governance: Reference monitoring bots: freshnesssentinel, driftdetector, rolloutgatekeeper.
  • packages/polytraders-runner/src/pipeline.js: Pipeline emits health events at every stage; monitoring consumes these events, not the raw bots.
  • packages/polytraders-contracts/src/ReportEnvelope.ts: Incident envelopes use the same schema as trade envelopes, so there is one log and one auditor.

See it in the platform mock

The platform mock is the source of truth for what each stage's UI exposes. Open these alongside the code references.

Reason codes emitted at this stage

  • GOV_*: governance, accounting, replay, monitoring
  • RISK_*: guardrails, caps, kill-switch

Hover or tap any reason code on this page (or anywhere on the site) to see its canonical short description. Full registry: /standards/reason-codes.