Pipeline view for Stage 11 — Monitoring, incident response & rollout.
Monitoring
What we watch, where it surfaces, and what alerts a human at 03:00.
Dashboards
- System health — per-bot heartbeat, mode, error rate, decision rate. Heartbeat older than 60 s is a P1.
- Risk dashboard — intraday and weekly drawdown, exposure by market, exposure by theme, KillSwitch state.
- Execution dashboard — intent-to-order latency, fill rate, partial-fill rate, realised slippage vs estimate, builderCode coverage.
- Reconcile dashboard — on-chain vs internal state diff, oldest unreconciled event.
- Data dashboard — per-source freshness, per-source error rate, MarketQualityRanker top-decile coverage.
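The heartbeat rule on the System health dashboard (older than 60 s is a P1) can be sketched as a small check. This is a minimal illustration, not the production evaluator: `BotHeartbeat`, `bot_id`, and `stale_bots` are hypothetical names, and the sketch assumes "older than" means strictly greater than the threshold.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Threshold taken from the dashboard rule: heartbeat older than 60 s is a P1.
HEARTBEAT_P1_THRESHOLD_MS = 60_000


@dataclass(frozen=True)
class BotHeartbeat:
    bot_id: str            # hypothetical identifier for the bot
    heartbeat_age_ms: int  # same unit as the heartbeat_age_ms metric


def stale_bots(
    heartbeats: Iterable[BotHeartbeat],
    threshold_ms: int = HEARTBEAT_P1_THRESHOLD_MS,
) -> List[str]:
    """Return ids of bots whose heartbeat is strictly older than the threshold."""
    return [h.bot_id for h in heartbeats if h.heartbeat_age_ms > threshold_ms]
```

Each id returned would be raised as a P1 under the escalation policy in the next section.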
Alerts and escalation
| Severity | Examples | Page who | Within |
|---|---|---|---|
| P0 | KillSwitch active, reconcile mismatch, wallet drained | Risk on-call + Security on-call + Head of Engineering | 1 minute |
| P1 | Heartbeat lost, severe API degradation, drawdown warning | Risk on-call | 5 minutes |
| P2 | Slippage breach, fee overrun, config drift | Trading-ops | 30 minutes |
| P3 | Single bot error rate above 1%, model drift warning | Owning team | Next business day |
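The escalation table can also be held as data so that routing code and documentation do not drift apart. The sketch below mirrors the table rows; the `Escalation` type, the `ESCALATION` mapping, and `escalation_for` are hypothetical names, and representing "next business day" as `None` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Escalation:
    page: Tuple[str, ...]          # who gets paged
    within_seconds: Optional[int]  # None means "next business day" (assumption)


# Values copied from the escalation table above.
ESCALATION = {
    "P0": Escalation(("Risk on-call", "Security on-call", "Head of Engineering"), 60),
    "P1": Escalation(("Risk on-call",), 5 * 60),
    "P2": Escalation(("Trading-ops",), 30 * 60),
    "P3": Escalation(("Owning team",), None),
}


def escalation_for(severity: str) -> Escalation:
    """Look up the escalation policy; unknown severities are an error, not a default."""
    try:
        return ESCALATION[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Failing loudly on an unknown severity is a deliberate choice in this sketch: a silent fallback could route a mislabelled P0 to the wrong queue.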
Metrics every bot must emit
- decisions_total{reason_code,verdict}
- decision_latency_ms{p50,p95,p99}
- heartbeat_age_ms
- errors_total{reason_code}
- config_version (gauge)
- mode (gauge with label)
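As a rough illustration of what emitting these metrics involves, here is an in-memory stand-in for a metrics client. It is a sketch under stated assumptions: the `BotMetrics` class and its method names are hypothetical, a real bot would use whatever metrics library the platform provides, and the nearest-rank percentile is one reasonable choice rather than the method the pipeline necessarily uses.

```python
import math
from collections import Counter
from typing import List, Optional


class BotMetrics:
    """In-memory stand-in for a metrics client; field names follow the metric names above."""

    def __init__(self, config_version: int, mode: str) -> None:
        self.decisions_total: Counter = Counter()  # keyed by (reason_code, verdict)
        self.errors_total: Counter = Counter()     # keyed by reason_code
        self.decision_latencies_ms: List[float] = []  # raw samples
        self.heartbeat_age_ms = 0
        self.config_version = config_version       # gauge
        self.mode = mode                            # gauge with label

    def record_decision(self, reason_code: str, verdict: str, latency_ms: float) -> None:
        self.decisions_total[(reason_code, verdict)] += 1
        self.decision_latencies_ms.append(latency_ms)

    def record_error(self, reason_code: str) -> None:
        self.errors_total[reason_code] += 1

    def latency_percentile(self, p: float) -> Optional[float]:
        """Nearest-rank percentile of recorded latencies; None if no samples yet."""
        samples = sorted(self.decision_latencies_ms)
        if not samples:
            return None
        rank = max(1, math.ceil(p / 100 * len(samples)))
        return samples[rank - 1]
```

Every counter is keyed by a reason code rather than a message string, which keeps the emitted series bounded and queryable.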
What we deliberately don't alert on
Free-text log lines. They produce noise without action. The rule is: if it's worth alerting on, it's worth a reason code in the registry.
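The rule can be sketched as a lookup: an event is alertable only if it carries a reason code present in the registry, and free-text messages never match. The registry contents, the code names, and the `alert_severity` function below are hypothetical, with severities borrowed from the examples in the escalation table.

```python
from typing import Mapping, Optional

# Hypothetical registry entries; the real registry is the source of truth.
REASON_CODE_REGISTRY = {
    "RECONCILE_MISMATCH": "P0",
    "HEARTBEAT_LOST": "P1",
    "SLIPPAGE_BREACH": "P2",
}


def alert_severity(event: Mapping[str, str]) -> Optional[str]:
    """Return the severity for a registered reason code, or None if not alertable.

    Events without a reason_code (for example plain log lines) and events with
    an unregistered code both yield None.
    """
    return REASON_CODE_REGISTRY.get(event.get("reason_code"))
```

For example, `alert_severity({"message": "something odd happened"})` returns `None`, while `alert_severity({"reason_code": "HEARTBEAT_LOST"})` returns `"P1"`.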