6.17 APIDegradationMonitor
Watches every external API surface Polytraders depends on (CLOB v2 REST, CLOB WebSocket, Polymarket metadata REST, Ethereum RPC, builder fee oracle) and publishes a per-surface health envelope (latency p50/p99, error rate, last_success_ts_ms). Risk and Strategy bots consume this envelope to decide whether to operate normally, degrade, or pause.
1. Bot Identity
| Layer | Governance |
|---|---|
| Bot class | Governance |
| Authority | Observe |
| Status | PLANNED |
| Readiness | Spec ready |
| Runs before | risk.killswitch, exec.smart_router |
| Runs after | — |
| Applies to | Continuous |
| Default mode | shadow |
| User-visible | Yes |
| Developer owner | Governance pod |
Operational profile
| Ownership | Governance pod · on-call gov-oncall · #polytraders-gov · escalates to Head of Governance · P1 |
|---|---|
| Latency budget | p50: 50ms · p99: 250ms |
| Modes supported | off · shadow · advisory · enforced |
| Data freshness | max_market_data_age_ms=10000 · max_orderbook_age_ms=10000 · max_external_feed_age_ms=10000 · on stale → Emit status=UNKNOWN — never silently report OK. |
| Human override | no · by — · logs — · time-bound: — · scope: — · single approver |
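The data-freshness rule in the table above can be sketched as a guard that runs before any classification. This is a minimal illustration, not the bot's actual code: the function name `freshness_status` and its arguments are hypothetical, and treating an age of exactly 10000 ms as still fresh is an assumption. Only the 10000 ms limit and the UNKNOWN status come from the table.

```python
from typing import Optional

MAX_EXTERNAL_FEED_AGE_MS = 10_000  # from the Data freshness row


def freshness_status(now_ms: int, newest_sample_ts_ms: Optional[int]) -> Optional[str]:
    """Return 'UNKNOWN' when the newest sample is missing or too old, else None.

    None means "fresh enough, continue with normal classification".
    """
    if newest_sample_ts_ms is None:
        return "UNKNOWN"  # no data at all: never report OK
    age_ms = now_ms - newest_sample_ts_ms
    if age_ms > MAX_EXTERNAL_FEED_AGE_MS:
        return "UNKNOWN"  # stale data: never silently report OK
    return None
```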
2. Purpose
Watches every external API surface Polytraders depends on (CLOB v2 REST, CLOB WebSocket, Polymarket metadata REST, Ethereum RPC, builder fee oracle) and publishes a per-surface health envelope (latency p50/p99, error rate, last_success_ts_ms). Risk and Strategy bots consume this envelope to decide whether to operate normally, degrade, or pause.
3. Why This Bot Matters
Cascading failures from a single dead dependency
Without an explicit health signal, every bot infers liveness from its own latest call — producing inconsistent retreat behaviour across the system.
Silent degradations
An API can stay up but slow to 30-second responses; bots without an explicit threshold keep blocking on it instead of failing fast.
Postmortem confusion
Without a health timeline, postmortems cannot answer 'what was the actual external latency at 14:23?'.
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
4. Required Polymarket Inputs
| Input | Source | Required? | Use |
|---|---|---|---|
| CLOB REST + WebSocket | Polymarket | Yes | Probe latency and error rates. |
| Polymarket metadata REST | Polymarket | Yes | Health probe. |
| Ethereum RPC | RPC provider | Yes | Latency + block-tip lag probe. |
5. Required Internal Inputs
| Input | Source | Required? | Use |
|---|---|---|---|
| Real outbound traffic latency samples | Every bot | Yes | Passive observation in addition to active probes. |
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|---|---|---|---|
| probe_interval_ms | 5000 | — | — | How often each surface is actively probed. |
| warn_p99_ms | 750 | 750 | — | p99 latency at which the surface is marked DEGRADED. |
| fail_p99_ms | 5000 | — | 5000 | p99 latency at which the surface is marked DOWN. |
| fail_error_rate_pct | 25 | 10 | 25 | Error rate at which the surface is marked DOWN regardless of latency. |
7. Detailed Parameter Instructions
probe_interval_ms
What it means
How often each surface is actively probed.
Default
{ "probe_interval_ms": 5000 }
Why this default matters
5s gives quick detection without flooding upstreams.
Threshold logic
| Condition | Action |
|---|---|
| 5000 | Default |
Developer check
schedule.every(p.probe_interval_ms).do(probe);
User-facing English
(Internal.)
warn_p99_ms
What it means
p99 latency at which the surface is marked DEGRADED.
Default
{ "warn_p99_ms": 750 }
Why this default matters
750ms p99 is the empirical breakpoint where downstream pipelines start to tail out.
Threshold logic
| Condition | Action |
|---|---|
| ≤ 750ms | OK |
| > 750ms | DEGRADED |
Developer check
if (p99 > p.warn_p99_ms) status = 'DEGRADED';
User-facing English
(Internal.)
fail_p99_ms
What it means
p99 latency at which the surface is marked DOWN.
Default
{ "fail_p99_ms": 5000 }
Why this default matters
5s p99 means almost every operation is timing out.
Threshold logic
| Condition | Action |
|---|---|
| ≤ 5000ms | OK or DEGRADED, as decided by warn_p99_ms |
| > 5000ms | DOWN |
Developer check
if (p99 > p.fail_p99_ms) status = 'DOWN';
User-facing English
(Internal.)
fail_error_rate_pct
What it means
Error rate at which the surface is marked DOWN regardless of latency.
Default
{ "fail_error_rate_pct": 25 }
Why this default matters
25% errors over a 1-minute window is an obvious outage.
Threshold logic
| Condition | Action |
|---|---|
| < 10% | OK |
| 10–25% | DEGRADED |
| > 25% | DOWN |
Developer check
if (errRate > p.fail_error_rate_pct) status = 'DOWN';
User-facing English
(Internal.)
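Taken together, the four threshold tables in this section suggest a single classification step. The sketch below is one possible reading, not the shipped logic: `Params` and `classify` are illustrative names, and `warn_error_rate_pct` is a hypothetical field that holds the value 10 from the Warning column (the spec does not define it as a configurable parameter).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    warn_p99_ms: float = 750
    fail_p99_ms: float = 5000
    warn_error_rate_pct: float = 10  # hypothetical name; 10 is the Warning column value
    fail_error_rate_pct: float = 25


def classify(p99_ms: float, error_rate_pct: float, p: Params = Params()) -> str:
    # Check the DOWN conditions first so a worse status is never downgraded.
    if p99_ms > p.fail_p99_ms or error_rate_pct > p.fail_error_rate_pct:
        return "DOWN"
    # 10% up to and including 25% errors, or p99 above the warn level, is DEGRADED.
    if p99_ms > p.warn_p99_ms or error_rate_pct >= p.warn_error_rate_pct:
        return "DEGRADED"
    return "OK"
```

Checking the fail conditions before the warn conditions makes the precedence explicit: either signal alone (latency or error rate) is enough to mark a surface DOWN.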
8. Default Configuration
{
  "probe_interval_ms": 5000,
  "warn_p99_ms": 750,
  "fail_p99_ms": 5000,
  "fail_error_rate_pct": 25
}
9. Implementation Flow
— not yet authored —
10. Reference Implementation
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.
for each surface s:
    samples = window(s, 60_000)
    p50, p99 = quantiles(samples)
    err = error_rate(samples)
    status = classify(p99, err, p)
    emit('ApiHealthReport', s, status, p50, p99, err, last_success_ts_ms[s])
11. Wire Examples
Input — what arrives on the wire
{
  "surface": "clob_v2_rest",
  "samples": [
    {
      "ts_ms": 1715260000000,
      "latency_ms": 220,
      "ok": true
    }
  ]
}
Output — what the bot emits
{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "OK",
  "p50_ms": 220,
  "p99_ms": 220,
  "error_rate_pct": 0
}
12. Decision Logic
APPROVE
Sample active probes + passive traffic. Latch DOWN status until two consecutive OK windows.
RESHAPE_REQUIRED
This bot does not reshape orders.
REJECT
No reject path defined for this bot — it is observe-only.
WARNING_ONLY
Apply warn/fail thresholds.
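The latch described under APPROVE ("Latch DOWN status until two consecutive OK windows") could look roughly like the sketch below. The class name `DownLatch` is hypothetical. Two behaviours are assumptions the spec does not settle: a DEGRADED window while latched resets the OK streak, and the latch releases straight to OK.

```python
class DownLatch:
    """Holds DOWN until `required_ok` consecutive OK windows are seen."""

    def __init__(self, required_ok: int = 2) -> None:
        self.required_ok = required_ok
        self.latched = False
        self.ok_streak = 0

    def update(self, raw_status: str) -> str:
        """Take the raw status of one window and return the published status."""
        if raw_status == "DOWN":
            self.latched = True
            self.ok_streak = 0
            return "DOWN"
        if not self.latched:
            return raw_status
        # Latched: count consecutive OK windows; anything else resets the streak.
        self.ok_streak = self.ok_streak + 1 if raw_status == "OK" else 0
        if self.ok_streak >= self.required_ok:
            self.latched = False
            self.ok_streak = 0
            return "OK"
        return "DOWN"
```

Feeding the raw sequence OK, DOWN, OK, DEGRADED, OK, OK through `update` publishes OK, DOWN, DOWN, DOWN, DOWN, OK: the single OK window after the outage is not enough to release the latch.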
13. Standard Decision Output
This bot emits an OperationsReport object with kind=ApiHealthReport, as declared under Dependencies. It is observe-only, so it does not return a RiskVote.
{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "DEGRADED",
  "p50_ms": 220,
  "p99_ms": 980,
  "error_rate_pct": 4.1,
  "last_success_ts_ms": 1715260000000
}
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|---|---|---|---|
| GOV_API_OK | P3 | Surface is healthy: p99 latency and error rate are within the warning thresholds. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly. |
| GOV_API_DEGRADED | P3 | p99 latency is above warn_p99_ms, or the error rate is in the 10–25% band. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly. |
| GOV_API_DOWN | P3 | p99 latency is above fail_p99_ms, or the error rate is above fail_error_rate_pct. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly. |
| GOV_API_UNKNOWN | P3 | Input data is stale or the monitor itself has failed; consumers must treat this as DEGRADED. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly. |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|---|---|---|---|
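| api_p50_ms | histogram | ms | bot_id | Median (p50) latency of a surface over the sample window. |
| api_p99_ms | histogram | ms | bot_id | p99 latency of a surface over the sample window. |
| api_error_rate_pct | gauge | % | bot_id | Percentage of failed samples for a surface over the sample window. |
| api_status_changes_total | counter | event | bot_id | Number of status transitions, for example OK to DEGRADED. |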
Dashboards
- 6.17 overview dashboard
16. Developer Reporting
Per emission: surface, status, p50, p99, error_rate, sample_count.
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|---|
| When this bot acts | The system briefly slowed down because one of the data sources we depend on was responding slowly. |
18. Failure-Mode Block
| main_failure_mode | Calling a surface DOWN when only the active probe is failing but real traffic is fine (or vice versa). |
|---|---|
| false_positive_risk | Active probe hits an old endpoint not used in production; mitigation: probes mirror real traffic shape. |
| false_negative_risk | Surface only fails on writes; passive read samples mask the issue; mitigation: write-side probes count separately. |
| safe_fallback | If the monitor itself fails, emit a synthetic ApiHealthReport with status=UNKNOWN and a non-stale ts_ms. Consumers must treat UNKNOWN as DEGRADED. |
| required_dependencies | — |
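The safe_fallback row has two halves: what the monitor emits when it fails, and how consumers must read it. A minimal sketch of both follows. The function names are hypothetical, and the report carries only the fields named in the row plus `kind` and `surface` from the wire examples.

```python
import time
from typing import Any, Dict


def synthetic_unknown_report(surface: str) -> Dict[str, Any]:
    """Build the fallback report emitted when the monitor itself has failed."""
    return {
        "kind": "ApiHealthReport",
        "surface": surface,
        "status": "UNKNOWN",
        "ts_ms": int(time.time() * 1000),  # fresh timestamp so the report is not stale
    }


def effective_status(report: Dict[str, Any]) -> str:
    """Consumer-side view: UNKNOWN must be handled as DEGRADED."""
    status = report["status"]
    return "DEGRADED" if status == "UNKNOWN" else status
```

Keeping the UNKNOWN to DEGRADED mapping on the consumer side preserves the original status in the feed, so a postmortem can still tell a monitor failure apart from a genuinely slow surface.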
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|---|---|---|
| Probe responses dropped | Drop probe responses for 60s and assert status flips DOWN. | Bot detects within its latency budget and emits the corresponding reason code (GOV_API_DOWN). | Remove the injected fault; bot returns to healthy state within one debounce window. |
| Probe scheduler disconnected | Disconnect the probe scheduler and assert UNKNOWN is emitted within one probe interval. | Bot detects within its latency budget and emits the corresponding reason code (GOV_API_UNKNOWN). | Remove the injected fault; bot returns to healthy state within one debounce window. |
20. State & Persistence
Per-surface rolling sample buffer + last status. In-memory; reseeds on restart.
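One way to picture the per-surface rolling sample buffer is a time-ordered deque that is pruned to the 60-second window used in the reference pseudocode. The sketch below is illustrative only: `SampleWindow` is a hypothetical name, the nearest-rank quantile method is an assumption (the spec does not say how quantiles are computed), and an empty window raises so that the caller can decide to emit UNKNOWN.

```python
import math
from collections import deque
from typing import Deque, Tuple

WINDOW_MS = 60_000  # window length used by the reference pseudocode


class SampleWindow:
    """Rolling buffer of (ts_ms, latency_ms, ok) samples for one surface."""

    def __init__(self) -> None:
        self.samples: Deque[Tuple[int, float, bool]] = deque()

    def add(self, ts_ms: int, latency_ms: float, ok: bool) -> None:
        self.samples.append((ts_ms, latency_ms, ok))

    def prune(self, now_ms: int) -> None:
        # Drop samples older than the window; assumes samples arrive in time order.
        while self.samples and self.samples[0][0] < now_ms - WINDOW_MS:
            self.samples.popleft()

    def quantile(self, q: float) -> float:
        # Nearest-rank quantile over the latencies currently in the window.
        if not self.samples:
            raise ValueError("empty window")
        latencies = sorted(s[1] for s in self.samples)
        rank = max(1, math.ceil(q * len(latencies)))
        return latencies[rank - 1]

    def error_rate_pct(self) -> float:
        if not self.samples:
            raise ValueError("empty window")
        failures = sum(1 for s in self.samples if not s[2])
        return 100.0 * failures / len(self.samples)
```

With the single sample from the wire example (220 ms, ok), this sketch yields p50 = p99 = 220 and an error rate of 0, which matches the wire output.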
State stores
| Name | Kind | Key | Value shape | TTL | Durability |
|---|---|---|---|---|---|
| api_degradation_monitor_state | in-memory + fast KV mirror | bot_id | Per-surface rolling sample buffer + last status. In-memory; reseeds on restart. | 24h | crash-safe via KV mirror |
Cold-start recovery
Cold-start hydrates from fast KV; missing keys default to safe fallback.
On restart
All in-flight decisions are re-evaluated; no bot decision is trusted across restart without re-emit.
21. Concurrency & Idempotency
| Aspect | Specification |
|---|---|
| Execution model | One worker per surface; emits to a single status feed. |
| Max in-flight | 32 |
| Idempotency key | order_intent_id |
| Replay-safe | True |
| Deduplication | By idempotency_key within a 60s window. |
| Ordering guarantees | Per-market_id FIFO; cross-market unordered. |
| Per-call timeout (ms) | 250 |
| Backpressure strategy | Bounded queue; oldest-dropped with metric increment when full. |
| Locking / mutual exclusion | Per-market_id mutex; no global locks. |
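The backpressure row (bounded queue, oldest-dropped with metric increment when full) can be illustrated with a standard-library deque. This is a sketch under assumptions: `BoundedQueue` and `dropped_total` are hypothetical names, and the counter stands in for whatever metrics client the bot actually uses.

```python
from collections import deque
from typing import Any, Deque

MAX_IN_FLIGHT = 32  # from the Max in-flight row


class BoundedQueue:
    """FIFO queue that drops the oldest item when full and counts each drop."""

    def __init__(self, maxlen: int = MAX_IN_FLIGHT) -> None:
        self.items: Deque[Any] = deque(maxlen=maxlen)
        self.dropped_total = 0  # hypothetical stand-in for a metrics counter

    def push(self, item: Any) -> None:
        if len(self.items) == self.items.maxlen:
            # deque(maxlen=...) evicts the oldest item on append; record the drop.
            self.dropped_total += 1
        self.items.append(item)

    def pop(self) -> Any:
        return self.items.popleft()
```

Dropping the oldest entry rather than the newest fits a health monitor, where the most recent samples matter most for the published status.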
22. Dependencies
Emits to (downstream consumers)
| Bot | Why | Contract |
|---|---|---|
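| risk.killswitch | Consumes the ApiHealthReport feed. | OperationsReport(kind=ApiHealthReport) |
| exec.smart_router | Consumes the ApiHealthReport feed. | OperationsReport(kind=ApiHealthReport) |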
Required before (graph.required_before)
risk.killswitch, exec.smart_router
| Consumes | ProbeSample, TrafficSample |
|---|---|
| Emits | OperationsReport(kind=ApiHealthReport) |
| Blocks orders | no |
23. Security Surfaces
Probe credentials are read-only API keys with no order-placement scope.
Signing surface
None — bot does not sign or submit.
Mitigations
- Rate-limit per source
- Audit-log every override
- Require role-based authz on admin paths
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|---|
| CLOB version | V2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | yes |
| Aware of negative-risk markets | yes |
| Multi-chain ready | yes |
| SDK used | Polymarket CLOB V2 SDK |
| Settlement contract | CTFExchangeV2 |
| Notes | Surface 'clob_v2_rest' specifically targets V2 endpoints. |
25. Versioning & Migration
| Field | Value |
|---|---|
| current | 0.1.0 |
| contract_version | 1.0.0 |
| last_breaking_change | none |
| deprecation_window_days | 30 |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|---|---|
| p99 = warn_p99_ms + 1 → DEGRADED. | Synthetic fixture per template. | Behaviour matches the rule described in the test name. |
| Error rate = fail_error_rate_pct + 1 → DOWN. | Synthetic fixture per template. | Behaviour matches the rule described in the test name. |
Integration Tests
| Test | Expected result |
|---|---|
| Inject a slow-loris response on the clob_v2_rest probe → status flips to DEGRADED within 2 probe intervals. | End-to-end behaviour matches the spec without manual intervention. |
Property Tests
| Property | Required behaviour |
|---|---|
| Status transitions within a single window move only between adjacent states (OK ↔ DEGRADED ↔ DOWN); no state is skipped. | Always true across all generated inputs. |
27. Operational Runbook
If a surface is stuck in DEGRADED with no obvious cause, temporarily increase probe_interval_ms and check the upstream provider's status page.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|---|---|---|---|
| 6.17_anomaly | Open the bot's reporting page and confirm the alert is real (not a metric hiccup). | Inspect developer log entries for the affected market_id over the last 30 minutes. | Force-clear via Admin UI if the rule is clearly stale; otherwise leave engaged and notify owner. | Governance pod |
Manual overrides
polytraders bot pause 6.17: Disables the bot's enforcement layer; downstream consumers fall back to safe defaults.
Healthcheck
GET /healthz/api_degradation_monitor → 200 if last successful evaluation < 60s ago.
28. Promotion Gates
A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.
Promote to Shadow
| Gate | How measured | Threshold |
|---|---|---|
| Stub | probe-suite passes against synthetic surfaces. | Documented threshold met for the full window. |
Promote to Limited live
| Gate | How measured | Threshold |
|---|---|---|
| Shadow | 14 days; status feed compared with the upstream's own status page. | Documented threshold met for the full window. |
| Advisory | 7 days. | Documented threshold met for the full window. |
Promote to General live
| Gate | How measured | Threshold |
|---|---|---|
| Enforced | KillSwitch and SmartRouter consume the feed. | Documented threshold met for the full window. |
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |