Polytraders Dev Guide
internal

6.17 APIDegradationMonitor

Governance · Observe · PLANNED · Spec ready · capital: Indirect · P2 · Data normalisation pending · stub

Watches every external API surface Polytraders depends on (CLOB v2 REST, CLOB WebSocket, Polymarket metadata REST, Ethereum RPC, builder fee oracle) and publishes a per-surface health envelope (latency p50/p99, error rate, last_success_ts_ms). Risk and Strategy bots consume this envelope to decide whether to operate normally, degrade, or pause.

v3 readiness

Docs: 27/27 (done)
Impl: 0/15 (pending)
Backtest: 0/4 (pending)
Runtime: 0/8 (pending)

A bot is done when all four scores are complete.

1. Bot Identity

Layer: Governance
Bot class: Governance
Authority: Observe
Status: PLANNED
Readiness: Spec ready
Runs before: risk.killswitch, exec.smart_router
Runs after: (none listed)
Applies to: Continuous
Default mode: shadow
User-visible: Yes
Developer owner: Governance pod

Operational profile

Ownership: Governance pod · on-call gov-oncall · #polytraders-gov · escalates to Head of Governance · P1
Latency budget: p50 50 ms · p99 250 ms
Modes supported: off, shadow, advisory, enforced
Data freshness: max_market_data_age_ms=10000 · max_orderbook_age_ms=10000 · max_external_feed_age_ms=10000 · on stale: emit status=UNKNOWN, never silently report OK.
Human override: no · by · logs · time-bound: n/a · scope: n/a · single approver
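
The data-freshness rule above can be sketched as a small guard applied before a status is published. This is a minimal illustration, not the bot's implementation: the function name `status_with_freshness` and its signature are hypothetical, and only the 10000 ms limit and the "emit UNKNOWN on stale" rule come from the profile.

```python
from typing import Optional

# Freshness limit from the operational profile (all three limits are 10000 ms).
MAX_EXTERNAL_FEED_AGE_MS = 10_000


def status_with_freshness(computed_status: str,
                          last_sample_ts_ms: Optional[int],
                          now_ms: int,
                          max_age_ms: int = MAX_EXTERNAL_FEED_AGE_MS) -> str:
    """Return UNKNOWN when the newest sample is missing or too old.

    Hypothetical helper: the spec states the rule, not this signature.
    """
    if last_sample_ts_ms is None:
        return "UNKNOWN"  # no data at all: never report OK
    if now_ms - last_sample_ts_ms > max_age_ms:
        return "UNKNOWN"  # stale data: never silently report OK
    return computed_status
```

The guard only ever downgrades to UNKNOWN; it never upgrades a computed DEGRADED or DOWN.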

2. Purpose

Watches every external API surface Polytraders depends on (CLOB v2 REST, CLOB WebSocket, Polymarket metadata REST, Ethereum RPC, builder fee oracle) and publishes a per-surface health envelope (latency p50/p99, error rate, last_success_ts_ms). Risk and Strategy bots consume this envelope to decide whether to operate normally, degrade, or pause.

3. Why This Bot Matters

  • Cascading failures from a single dead dependency

    Without an explicit health signal, every bot infers liveness from its own latest call — producing inconsistent retreat behaviour across the system.

  • Silent degradations

    An API can stay up but slow to 30-second responses; bots without an explicit threshold keep blocking on it instead of failing fast.

  • Postmortem confusion

    Without a health timeline, postmortems cannot answer 'what was the actual external latency at 14:23?'.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
CLOB REST + WebSocket | Polymarket | Yes | Probe latency and error rates.
Polymarket metadata REST | Polymarket | Yes | Health probe.
Ethereum RPC | RPC provider | Yes | Latency + block-tip lag probe.

5. Required Internal Inputs

Input | Source | Required? | Use
Real outbound traffic latency samples | Every bot | Yes | Passive observation in addition to active probes.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
probe_interval_ms | 5000 | - | - | How often each surface is actively probed.
warn_p99_ms | 750 | 750 | - | p99 latency at which the surface is marked DEGRADED.
fail_p99_ms | 5000 | - | 5000 | p99 latency at which the surface is marked DOWN.
fail_error_rate_pct | 25 | 10 | 25 | Error rate at which the surface is marked DOWN regardless of latency.

7. Detailed Parameter Instructions

probe_interval_ms

What it means

How often each surface is actively probed.

Default

{ "probe_interval_ms": 5000 }

Why this default matters

5s gives quick detection without flooding upstreams.

Threshold logic

Condition | Action
5000 | Default

Developer check

schedule.every(p.probe_interval_ms).do(probe);
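
A minimal asyncio sketch of the probing schedule, under stated assumptions: `run_probes` and its `rounds` parameter are hypothetical names introduced here so the loop can terminate in a test, and a real monitor would presumably loop until cancelled.

```python
import asyncio
from typing import Awaitable, Callable


async def run_probes(probe: Callable[[], Awaitable[None]],
                     interval_ms: int,
                     rounds: int) -> None:
    """Call `probe`, then wait `interval_ms`, for a fixed number of rounds."""
    for _ in range(rounds):
        await probe()
        # asyncio.sleep takes seconds, so convert from milliseconds.
        await asyncio.sleep(interval_ms / 1000)
```

Because the sleep starts after each probe returns, the effective period is the probe duration plus `interval_ms`. A fixed-rate schedule would need to subtract the elapsed probe time.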

User-facing English

(Internal.)

warn_p99_ms

What it means

p99 latency at which the surface is marked DEGRADED.

Default

{ "warn_p99_ms": 750 }

Why this default matters

750ms p99 is the empirical breakpoint where downstream pipelines start to tail out.

Threshold logic

Condition | Action
≤ 750 ms | OK
> 750 ms | DEGRADED

Developer check

if (p99 > p.warn_p99_ms) status = 'DEGRADED';

User-facing English

(Internal.)

fail_p99_ms

What it means

p99 latency at which the surface is marked DOWN.

Default

{ "fail_p99_ms": 5000 }

Why this default matters

5s p99 means almost every operation is timing out.

Threshold logic

Condition | Action
≤ 5000 ms | Not DOWN on latency alone (OK or DEGRADED, depending on warn_p99_ms)
> 5000 ms | DOWN

Developer check

if (p99 > p.fail_p99_ms) status = 'DOWN';

User-facing English

(Internal.)

fail_error_rate_pct

What it means

Error rate at which the surface is marked DOWN regardless of latency.

Default

{ "fail_error_rate_pct": 25 }

Why this default matters

25% errors over a 1-minute window is an obvious outage.

Threshold logic

Condition | Action
< 10% | OK
10–25% | DEGRADED
> 25% | DOWN

Developer check

if (errRate > p.fail_error_rate_pct) status = 'DOWN';

User-facing English

(Internal.)

8. Default Configuration

{
  "probe_interval_ms": 5000,
  "warn_p99_ms": 750,
  "fail_p99_ms": 5000,
  "fail_error_rate_pct": 25
}
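
One way to load and sanity-check this configuration is a frozen dataclass. This is a sketch only: the class name `MonitorParams`, the helper `load_params`, and the specific validation rules (positive interval, warn below fail, percentage in range) are assumptions, not part of the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorParams:
    # Defaults mirror the default configuration above.
    probe_interval_ms: int = 5000
    warn_p99_ms: int = 750
    fail_p99_ms: int = 5000
    fail_error_rate_pct: float = 25.0

    def __post_init__(self) -> None:
        # Assumed sanity rules; the spec does not state them explicitly.
        if self.probe_interval_ms <= 0:
            raise ValueError("probe_interval_ms must be positive")
        if not 0 < self.warn_p99_ms < self.fail_p99_ms:
            raise ValueError("expected 0 < warn_p99_ms < fail_p99_ms")
        if not 0 < self.fail_error_rate_pct <= 100:
            raise ValueError("fail_error_rate_pct must be in (0, 100]")


def load_params(raw: dict) -> MonitorParams:
    """Build params from a parsed JSON object; unknown keys raise TypeError."""
    return MonitorParams(**raw)
```

Rejecting unknown keys catches misspelled parameter names at load time rather than silently falling back to defaults.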

9. Implementation Flow

— not yet authored —

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

for each surface s:
  samples = window(s, 60_000)
  p50, p99 = quantiles(samples)
  err = error_rate(samples)
  status = classify(p99, err, p)
  emit('ApiHealthReport', s, status, p50, p99, err, last_success_ts_ms[s])
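
The pseudocode above leaves `quantiles`, `error_rate`, and `classify` undefined. The following Python sketch fills them in under explicit assumptions: nearest-rank percentiles (the spec does not name a method), a hypothetical `warn_error_rate_pct` of 10 taken from the Warning column of the parameter table (it is not a key in the default configuration), and UNKNOWN for an empty window.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    ts_ms: int
    latency_ms: float
    ok: bool


def percentile(sorted_vals: List[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method) over a non-empty sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def classify(p99: float, err_pct: float,
             warn_p99_ms: float = 750, fail_p99_ms: float = 5000,
             fail_error_rate_pct: float = 25,
             warn_error_rate_pct: float = 10) -> str:
    # Fail conditions win over warn conditions.
    if p99 > fail_p99_ms or err_pct > fail_error_rate_pct:
        return "DOWN"
    if p99 > warn_p99_ms or err_pct >= warn_error_rate_pct:
        return "DEGRADED"
    return "OK"


def evaluate(samples: List[Sample], now_ms: int, window_ms: int = 60_000) -> dict:
    """Classify one surface from the samples inside the rolling window."""
    recent = [s for s in samples if now_ms - s.ts_ms <= window_ms]
    if not recent:
        return {"status": "UNKNOWN"}  # assumed: no data means UNKNOWN, not OK
    lat = sorted(s.latency_ms for s in recent)
    p50, p99 = percentile(lat, 50), percentile(lat, 99)
    err = 100 * sum(not s.ok for s in recent) / len(recent)
    return {"status": classify(p99, err), "p50_ms": p50,
            "p99_ms": p99, "error_rate_pct": err}
```

Fed the single 220 ms successful sample from the wire example in section 11, `evaluate` returns status OK with p50 and p99 both 220 and an error rate of 0.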

11. Wire Examples

Input — what arrives on the wire

{
  "surface": "clob_v2_rest",
  "samples": [
    {
      "ts_ms": 1715260000000,
      "latency_ms": 220,
      "ok": true
    }
  ]
}

Output — what the bot emits

{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "OK",
  "p50_ms": 220,
  "p99_ms": 220,
  "error_rate_pct": 0,
  "last_success_ts_ms": 1715260000000
}

12. Decision Logic

APPROVE

Sample active probes + passive traffic. Latch DOWN status until two consecutive OK windows.

RESHAPE_REQUIRED

This bot does not reshape orders.

REJECT

No reject path defined for this bot — it is observe-only.

WARNING_ONLY

Apply warn/fail thresholds.
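
The latch rule under APPROVE ("latch DOWN status until two consecutive OK windows") can be sketched as a small state machine. The class name `DownLatch` is hypothetical, and one behaviour is an assumption the spec does not settle: while latched, a DEGRADED or UNKNOWN window resets the OK streak.

```python
class DownLatch:
    """Holds DOWN until `required_ok` consecutive OK windows are seen."""

    def __init__(self, required_ok: int = 2) -> None:
        self.required_ok = required_ok
        self.latched = False
        self.ok_streak = 0

    def update(self, raw_status: str) -> str:
        """Take the raw status of one window, return the published status."""
        if raw_status == "DOWN":
            self.latched = True
            self.ok_streak = 0
            return "DOWN"
        if not self.latched:
            return raw_status
        # Latched: only OK windows count toward release.
        if raw_status == "OK":
            self.ok_streak += 1
            if self.ok_streak >= self.required_ok:
                self.latched = False
                self.ok_streak = 0
                return "OK"
        else:
            # Assumed: a non-OK window restarts the count.
            self.ok_streak = 0
        return "DOWN"
```

The latch keeps consumers from flapping between pause and resume when a surface recovers for a single window and then fails again.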

13. Standard Decision Output

This bot is observe-only, so it does not cast a RiskVote. It emits an OperationsReport with kind=ApiHealthReport, for example:

{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "DEGRADED",
  "p50_ms": 220,
  "p99_ms": 980,
  "error_rate_pct": 4.1,
  "last_success_ts_ms": 1715260000000
}

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
GOV_API_OK | P3 | Surface is within all thresholds. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly.
GOV_API_DEGRADED | P3 | p99 latency is above warn_p99_ms, or the error rate is in the warning band. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly.
GOV_API_DOWN | P3 | p99 latency is above fail_p99_ms, or the error rate is above fail_error_rate_pct. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly.
GOV_API_UNKNOWN | P3 | Input data is stale or the monitor itself has failed. | See decision output and developer log for context. | The system briefly slowed down because one of the data sources we depend on was responding slowly.

15. Metrics & Logs

Metrics emitted

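Metric | Type | Unit | Labels | Meaning
api_p50_ms | histogram | ms | bot_id | p50 latency over the rolling sample window.
api_p99_ms | histogram | ms | bot_id | p99 latency over the rolling sample window.
api_error_rate_pct | gauge | % | bot_id | Error rate over the rolling sample window.
api_status_changes_total | counter | event | bot_id | Number of status transitions.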

Dashboards

  • 6.17 overview dashboard

16. Developer Reporting

"Per emission: surface, status, p50, p99, error_rate, sample_count."

17. Plain-English Reporting

Situation | User-facing explanation
When this bot acts | The system briefly slowed down because one of the data sources we depend on was responding slowly.

18. Failure-Mode Block

main_failure_mode: Calling a surface DOWN when only the active probe is failing but real traffic is fine (or vice versa).
false_positive_risk: Active probe hits an old endpoint not used in production; mitigation: probes mirror real traffic shape.
false_negative_risk: Surface only fails on writes; passive read samples mask the issue; mitigation: write-side probes count separately.
safe_fallback: If the monitor itself fails, emit a synthetic ApiHealthReport with status=UNKNOWN and a non-stale ts_ms. Consumers must treat UNKNOWN as DEGRADED.
required_dependencies: (none listed)
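
The safe fallback has two halves: the monitor side produces a synthetic UNKNOWN report, and the consumer side maps UNKNOWN to DEGRADED. A minimal sketch follows; both function names are hypothetical, and treating a report with no status field as UNKNOWN is an added assumption.

```python
import time
from typing import Optional


def synthetic_unknown_report(surface: str, now_ms: Optional[int] = None) -> dict:
    """Monitor side: build a fallback report with a fresh timestamp."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    return {"kind": "ApiHealthReport", "surface": surface,
            "status": "UNKNOWN", "ts_ms": ts}


def effective_status(report: dict) -> str:
    """Consumer side: UNKNOWN is handled as DEGRADED."""
    # Assumed: a report without a status is also treated as UNKNOWN.
    status = report.get("status", "UNKNOWN")
    return "DEGRADED" if status == "UNKNOWN" else status
```

Keeping the mapping in one consumer-side helper means every consumer applies the same rule instead of each one interpreting UNKNOWN independently.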

19. Failure-Injection Recipes

Scenario | How to inject | Expected behaviour | Recovery
Probe responses dropped | Drop probe responses for 60s and assert status flips DOWN. | Bot detects within its latency budget and emits the corresponding reason code. | Remove the injected fault; bot returns to healthy state within one debounce window.
Probe scheduler disconnected | Disconnect the probe scheduler and assert UNKNOWN is emitted within one probe interval. | Bot detects within its latency budget and emits the corresponding reason code. | Remove the injected fault; bot returns to healthy state within one debounce window.

20. State & Persistence

Per-surface rolling sample buffer + last status. In-memory; reseeds on restart.

State stores

Name | Kind | Key | Value shape | TTL | Durability
api_degradation_monitor_state | in-memory + fast KV mirror | bot_id | Per-surface rolling sample buffer + last status. In-memory; reseeds on restart. | 24h | crash-safe via KV mirror

Cold-start recovery

Cold-start hydrates from fast KV; missing keys default to safe fallback.

On restart

All in-flight decisions are re-evaluated; no bot decision is trusted across restart without re-emit.

21. Concurrency & Idempotency

Aspect | Specification
Execution model | One worker per surface; emits to a single status feed.
Max in-flight | 32
Idempotency key | order_intent_id
Replay-safe | True
Deduplication | By idempotency_key within a 60s window.
Ordering guarantees | Per-market_id FIFO; cross-market unordered.
Per-call timeout (ms) | 250
Backpressure strategy | Bounded queue; oldest-dropped with metric increment when full.
Locking / mutual exclusion | Per-market_id mutex; no global locks.
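
The backpressure strategy ("bounded queue; oldest-dropped with metric increment when full") can be sketched as below. The class name `DropOldestQueue` is hypothetical, the `dropped_total` attribute stands in for a real metrics counter, and the queue bound is left to the caller because the spec does not state it.

```python
from collections import deque


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest item when full."""

    def __init__(self, maxlen: int) -> None:
        self._items = deque()
        self._maxlen = maxlen
        self.dropped_total = 0  # stand-in for a real metrics counter

    def push(self, item) -> None:
        if len(self._items) >= self._maxlen:
            self._items.popleft()    # evict the oldest entry
            self.dropped_total += 1  # record the drop
        self._items.append(item)

    def pop(self):
        """Return the oldest item, or None when the queue is empty."""
        return self._items.popleft() if self._items else None
```

`collections.deque(maxlen=...)` also evicts the oldest item automatically, but it does so silently. The explicit check is used here so each drop can be counted.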

22. Dependencies

Emits to (downstream consumers)

Required before (graph.required_before)

risk.killswitch, exec.smart_router

Consumes: ProbeSample, TrafficSample
Emits: OperationsReport(kind=ApiHealthReport)
Blocks orders: no

23. Security Surfaces

Probe credentials are read-only API keys with no order-placement scope.

Signing surface

None — bot does not sign or submit.

Mitigations

  • Rate-limit per source
  • Audit-log every override
  • Require role-based authz on admin paths

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | V2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | yes
Aware of negative-risk markets | yes
Multi-chain ready | yes
SDK used | Polymarket CLOB V2 SDK
Settlement contract | CTFExchangeV2
Notes | Surface 'clob_v2_rest' specifically targets V2 endpoints.

25. Versioning & Migration

Field | Value
current | 0.1.0
contract_version | 1.0.0
last_breaking_change | none
deprecation_window_days | 30

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
p99 = warn_p99_ms + 1 → DEGRADED. | Synthetic fixture per template. | Behaviour matches the rule described in the test name.
Error rate = fail_error_rate_pct + 1 → DOWN. | Synthetic fixture per template. | Behaviour matches the rule described in the test name.

Integration Tests

Test | Expected result
Inject a slow-loris response on the clob_v2_rest probe → status flips to DEGRADED within 2 probe intervals. | End-to-end behaviour matches the spec without manual intervention.

Property Tests

Property | Required behaviour
Status transitions are monotonic within a single window: OK ↔ DEGRADED ↔ DOWN, no skip. | Always true across all generated inputs.

27. Operational Runbook

If a surface is stuck in DEGRADED with no obvious cause, temporarily increase probe_interval_ms (so the surface is probed less often) and check the upstream provider's status page.

On-call actions

Alert | First step | Diagnosis | Mitigation | Escalate to
6.17_anomaly | Open the bot's reporting page and confirm the alert is real (not a metric hiccup). | Inspect developer log entries for the affected market_id over the last 30 minutes. | Force-clear via Admin UI if the rule is clearly stale; otherwise leave engaged and notify owner. | Governance pod

Manual overrides

  • polytraders bot pause 6.17 — Disables the bot's enforcement layer; downstream consumers fall back to safe defaults.

Healthcheck

GET /healthz/api_degradation_monitor → 200 if last successful evaluation < 60s ago.
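
The healthcheck rule reduces to a single age comparison. In this sketch the function name `healthz_status_code` is hypothetical, and returning 503 for the unhealthy case is an assumption: the spec only defines when the endpoint returns 200.

```python
from typing import Optional


def healthz_status_code(last_eval_ts_ms: Optional[int],
                        now_ms: int,
                        max_age_ms: int = 60_000) -> int:
    """200 if the last successful evaluation is under 60s old, else 503 (assumed)."""
    if last_eval_ts_ms is None:
        return 503  # never evaluated successfully
    return 200 if now_ms - last_eval_ts_ms < max_age_ms else 503
```

The comparison is strict, matching "< 60s ago": an evaluation exactly 60 seconds old is reported unhealthy.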

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
Stub | probe-suite passes against synthetic surfaces. | Documented threshold met for the full window.

Promote to Limited live

Gate | How measured | Threshold
Shadow | 14 days; status feed compared with the upstream's own status page. | Documented threshold met for the full window.
Advisory | 7 days. | Documented threshold met for the full window.

Promote to General live

Gate | How measured | Threshold
Enforced | KillSwitch and SmartRouter consume the feed. | Documented threshold met for the full window.

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done