Monitoring

What we watch, where it surfaces, and what alerts a human at 03:00.

Dashboards

System health — per-bot heartbeat, mode, error rate, decision rate. Heartbeat older than 60 s is a P1.
Risk dashboard — intraday and weekly drawdown, exposure by market, exposure by theme, KillSwitch state.
Execution dashboard — intent-to-order latency, fill rate, partial-fill rate, realised slippage vs estimate, builderCode coverage.
Reconcile dashboard — on-chain vs internal state diff, oldest unreconciled event.
Data dashboard — per-source freshness, per-source error rate, MarketQualityRanker top decile coverage.
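
The heartbeat rule on the System health dashboard can be sketched as a simple staleness check. This is a minimal illustration, not the production check: the function name `stale_bots`, the bot ids, and the shape of the heartbeat map are assumptions; only the 60 s threshold comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the System health rule above.
HEARTBEAT_MAX_AGE = timedelta(seconds=60)


def stale_bots(last_heartbeat: dict[str, datetime], now: datetime) -> list[str]:
    """Return the ids of bots whose last heartbeat is older than 60 s."""
    return sorted(
        bot_id
        for bot_id, seen_at in last_heartbeat.items()
        if now - seen_at > HEARTBEAT_MAX_AGE
    )


# Hypothetical bot ids and timestamps, for illustration only.
now = datetime(2024, 1, 1, 3, 0, 0, tzinfo=timezone.utc)
heartbeats = {
    "bot-a": now - timedelta(seconds=10),  # healthy
    "bot-b": now - timedelta(seconds=61),  # stale, would raise a P1
    "bot-c": now - timedelta(seconds=60),  # exactly at the limit, not yet stale
}
print(stale_bots(heartbeats, now))  # ['bot-b']
```

The comparison is strict, so a heartbeat that is exactly 60 s old does not page; whether the real check treats the boundary this way is an assumption.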

Severity	Examples	Who is paged	Respond within
P0	KillSwitch active, reconcile mismatch, wallet drained	Risk on-call + Security on-call + Head of Engineering	1 minute
P1	Heartbeat lost, API degraded hard, drawdown warning	Risk on-call	5 minutes
P2	Slippage breach, fee overrun, config drift	Trading-ops	30 minutes
P3	Single bot error rate above 1%, model drift warning	Owning team	Next business day
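
One way to keep the severity table machine-readable is a small lookup structure like the sketch below. The recipients and response windows mirror the table; the names `PagingPolicy`, `PAGING`, and `policy_for`, and the use of `None` for "next business day", are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PagingPolicy:
    recipients: tuple[str, ...]
    # None stands for "next business day" (no fixed window in minutes).
    respond_within: Optional[timedelta]


# Values copied from the severity table; the structure itself is hypothetical.
PAGING = {
    "P0": PagingPolicy(
        ("Risk on-call", "Security on-call", "Head of Engineering"),
        timedelta(minutes=1),
    ),
    "P1": PagingPolicy(("Risk on-call",), timedelta(minutes=5)),
    "P2": PagingPolicy(("Trading-ops",), timedelta(minutes=30)),
    "P3": PagingPolicy(("Owning team",), None),
}


def policy_for(severity: str) -> PagingPolicy:
    """Look up who is paged and how fast; raises KeyError for unknown severities."""
    return PAGING[severity]


print(policy_for("P1").recipients)  # ('Risk on-call',)
```

Failing loudly on an unknown severity is a deliberate choice in this sketch: a typo in a severity label should not silently result in nobody being paged.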

We do not alert on free-text log lines; they produce noise without action. The rule is: if it's worth alerting on, it's worth a reason code in the registry.
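
The reason-code rule can be illustrated with a registry lookup that decides whether an event alerts at all. Everything named here is hypothetical: the registry contents, the code strings, the `reason_code` field, and the function `severity_for` are assumptions, since the actual registry format is not described in this section.

```python
from typing import Optional

# Hypothetical registry mapping reason codes to severities.
# The code strings are illustrative, not the real registry entries.
REASON_CODES = {
    "KILLSWITCH_ACTIVE": "P0",
    "HEARTBEAT_LOST": "P1",
    "SLIPPAGE_BREACH": "P2",
    "MODEL_DRIFT_WARNING": "P3",
}


def severity_for(event: dict) -> Optional[str]:
    """Return the alert severity for an event, or None if it must not alert.

    Events without a registered reason code (for example plain log lines)
    never alert, no matter what their message text says.
    """
    return REASON_CODES.get(event.get("reason_code"))


print(severity_for({"reason_code": "HEARTBEAT_LOST"}))      # P1
print(severity_for({"message": "something looks wrong"}))   # None
```

Under this scheme, turning a noisy log line into an alert requires adding a reason code to the registry first, which is the behaviour the rule above asks for.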