Observability standards
Every bot ships with the same observability surface. If a bot is in production and you cannot find its metrics, logs, and alerts in under 30 seconds, the bot is not done.
The four pillars
| Pillar | Standard | Storage |
|---|---|---|
| Metrics | Prometheus exposition format. Counter, gauge, histogram only. | Prometheus 30 days, Mimir 13 months. |
| Logs | Structured JSON, one event per line. Mandatory keys: ts, level, bot_id, intent_id (when applicable), reason_code (for non-INFO). | Loki 14 days hot, S3 archive 1 year. |
| Traces | OpenTelemetry. Every OrderIntent gets a trace; every bot consult is a span. | Tempo 7 days. |
| Audits | Every config change, manual override, and reconciliation is a GovernanceLog entry. | Postgres + S3 evidence URI; retained 7 years. |
Mandatory metrics — every bot exposes these
# Counter
polytraders_bot_invocations_total{bot_id, decision, reason_code}
# Counter
polytraders_bot_errors_total{bot_id, error_class}
# Histogram (seconds)
polytraders_bot_latency_seconds_bucket{bot_id, le}
# Gauge
polytraders_bot_inputs_stale{bot_id, input_name} # 1 if any required input is past staleness threshold
# Gauge
polytraders_bot_dependency_unavailable{bot_id, dep} # 1 if a declared dependency is failing
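One way to enforce this list is a compliance check run against a bot's scrape output. The sketch below is a minimal illustration, not an existing tool: the function name `missing_mandatory_metrics` is hypothetical, and it only scans the sample lines of a Prometheus text exposition payload for the five metric names listed above.

```python
import re
from typing import Set

# The five mandatory metric names, copied from the list above.
MANDATORY_METRICS = {
    "polytraders_bot_invocations_total",
    "polytraders_bot_errors_total",
    "polytraders_bot_latency_seconds_bucket",
    "polytraders_bot_inputs_stale",
    "polytraders_bot_dependency_unavailable",
}

# A sample line starts with the metric name, followed by labels or a value.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def missing_mandatory_metrics(exposition_text: str) -> Set[str]:
    """Return the mandatory metric names that have no sample in the payload.

    Hypothetical helper: it ignores comment lines (# HELP / # TYPE) and does
    not validate labels or values.
    """
    seen = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            seen.add(match.group(1))
    return MANDATORY_METRICS - seen
```

An empty result means every mandatory metric has at least one sample in the payload; it says nothing about whether the labels are correct.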
Bot-specific metrics
Each bot may add its own metrics in its Metrics & logs section. Conventions:
- Prefix with polytraders_<layer>_<bot>_.
- Use seconds for time, bytes for size, ratios as 0..1 unitless.
- Cardinality budget per metric: 1000 series per bot. Exceed at your peril.
- Never put market_id or wallet address in a label — too high-cardinality. Put them in logs and traces instead.
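The conventions above lend themselves to a lint step. The following is a rough sketch under stated assumptions: `check_bot_metric` is a hypothetical helper, the forbidden label names for wallet addresses (`wallet`, `wallet_address`) are guesses at how such a label might be spelled, and the series count is assumed to be supplied by the caller. Unit conventions are not checked.

```python
from typing import Iterable, List

# Assumption: label names that would carry a market ID or a wallet address.
FORBIDDEN_LABELS = {"market_id", "wallet", "wallet_address"}

# Cardinality budget per metric, per bot, from the conventions above.
SERIES_BUDGET = 1000


def check_bot_metric(
    name: str,
    labels: Iterable[str],
    layer: str,
    bot: str,
    series_count: int,
) -> List[str]:
    """Return a list of convention violations for one bot-specific metric."""
    violations = []

    prefix = f"polytraders_{layer}_{bot}_"
    if not name.startswith(prefix):
        violations.append(f"name must start with {prefix}")

    bad_labels = sorted(FORBIDDEN_LABELS & set(labels))
    if bad_labels:
        violations.append(
            "high-cardinality labels not allowed: " + ", ".join(bad_labels)
        )

    if series_count > SERIES_BUDGET:
        violations.append(
            f"{series_count} series exceeds budget of {SERIES_BUDGET}"
        )

    return violations
```

For example, a metric named `liquidity_depth` with a `market_id` label and 5000 series would return three violations, one per broken convention.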
Mandatory logs
Every decision a bot makes produces exactly one decision log with this shape:
{
"ts": "2026-05-09T05:51:12Z",
"level": "INFO",
"bot_id": "risk.liquidity_guard",
"intent_id": "01HZS3Q5PWY",
"decision": "RESHAPE_REQUIRED",
"reason_code": "INSUFFICIENT_VISIBLE_DEPTH",
"severity": "RESHAPE",
"inputs_used": ["clob.book.top50", "internal.spread.median30d"],
"metrics": { "visible_depth_usd": 5200, "requested_size_usd": 1850, "pct_of_depth": 0.356 },
"trace_id": "8a3...",
"span_id": "f12..."
}
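A minimal sketch of how a bot might build such a line, using only the standard library. The function name `decision_log_line` and its signature are hypothetical; the sketch enforces the rule that `reason_code` is required for non-INFO events and emits one JSON object on a single line.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


def decision_log_line(
    bot_id: str,
    decision: str,
    level: str = "INFO",
    intent_id: Optional[str] = None,
    reason_code: Optional[str] = None,
    **extra: Any,
) -> str:
    """Serialize one decision event as a single-line JSON string.

    Hypothetical helper: extra keyword arguments (severity, inputs_used,
    metrics, trace_id, span_id, ...) are passed through unchanged.
    """
    if level != "INFO" and not reason_code:
        raise ValueError("reason_code is mandatory for non-INFO events")

    event = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "bot_id": bot_id,
        "decision": decision,
    }
    if intent_id is not None:
        event["intent_id"] = intent_id
    if reason_code is not None:
        event["reason_code"] = reason_code
    event.update(extra)

    # json.dumps without indent never emits raw newlines, so the result
    # is always one event per line.
    return json.dumps(event, separators=(",", ":"))
```

Calling it with `severity="RESHAPE"` and the other fields from the example above yields a line with the same keys as that example.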
Mandatory alerts — every bot
| Alert | Condition | Severity |
|---|---|---|
| BotDown | No invocations in 5 min during a period the bot should be active. | P1 |
| BotErrorRateHigh | 5xx ratio > 1% over 5 min. | P1 |
| BotInputStale | polytraders_bot_inputs_stale = 1 for > 2 min. | P1 |
| BotLatencyP99High | P99 latency > 2× SLO over 5 min. | P2 |
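To make the four conditions concrete, here is a rough sketch that evaluates them over a pre-aggregated window. It is illustrative only and not how the alerting rules are actually deployed. The `BotWindow` type and its fields are hypothetical, the "5xx ratio" is interpreted here as errors divided by invocations, and the SLO value is supplied by the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BotWindow:
    """Hypothetical summary of one bot's last 5 minutes; not a real schema."""
    should_be_active: bool      # whether the bot is expected to be running
    invocations_5m: int         # invocations in the last 5 min
    errors_5m: int              # errors in the last 5 min
    input_stale_seconds: float  # how long polytraders_bot_inputs_stale has been 1
    p99_latency_5m: float       # seconds
    latency_slo: float          # seconds


def firing_alerts(w: BotWindow) -> List[str]:
    """Return the names of the mandatory alerts whose condition holds."""
    alerts = []
    if w.should_be_active and w.invocations_5m == 0:
        alerts.append("BotDown")
    # Assumption: error ratio = errors / invocations over the window.
    if w.invocations_5m > 0 and w.errors_5m / w.invocations_5m > 0.01:
        alerts.append("BotErrorRateHigh")
    if w.input_stale_seconds > 120:
        alerts.append("BotInputStale")
    if w.p99_latency_5m > 2 * w.latency_slo:
        alerts.append("BotLatencyP99High")
    return alerts
```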
Tracing
Every OrderIntent opens a root span named order_intent. Every bot consulted is a child span named <layer>.<bot>. The trace ends when the order is rejected, filled, or cancelled. Post-fill events (builder attribution and reconciliation) join the same trace via links.
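The span structure can be pictured with a small stand-in model. This sketch deliberately does not use the OpenTelemetry SDK; the `Span` dataclass and the three helper functions are hypothetical, and it assumes one reading of "via links": a post-fill event starts its own span and carries a link to the root span's (trace ID, span ID) pair.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Simplified stand-in for a tracing span; not the OpenTelemetry API."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    # Each link is a (trace_id, span_id) pair pointing at another span.
    links: List[Tuple[str, str]] = field(default_factory=list)


def start_order_intent() -> Span:
    """Open the root span for a new OrderIntent."""
    return Span(name="order_intent", trace_id=uuid.uuid4().hex)


def consult_bot(root: Span, layer: str, bot: str) -> Span:
    """Open a child span named <layer>.<bot> under the root span."""
    return Span(
        name=f"{layer}.{bot}",
        trace_id=root.trace_id,
        parent_id=root.span_id,
    )


def post_fill_event(root: Span, name: str) -> Span:
    """Open a span for a post-fill event, linked back to the order's root span.

    Assumption: the event runs in its own trace and is connected by a link.
    """
    return Span(
        name=name,
        trace_id=uuid.uuid4().hex,
        links=[(root.trace_id, root.span_id)],
    )
```

With this model, `consult_bot(root, "risk", "liquidity_guard")` yields a span named `risk.liquidity_guard` that shares the root's trace ID.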