
6.17 APIDegradationMonitor

Governance · Observe · PLANNED · Spec ready · capital · Indirect · P2 · Data normalisation · ○ pending stub

Watches every external API surface Polytraders depends on (CLOB v2 REST, CLOB WebSocket, Polymarket metadata REST, Ethereum RPC, builder fee oracle) and publishes a per-surface health envelope (latency p50/p99, error rate, last_success_ts_ms). Risk and Strategy bots consume this envelope to decide whether to operate normally, degrade, or pause.

v3 readiness

Score	Result	State
Docs	27/27	done
Impl	0/15	pending
Backtest	0/4	pending
Runtime	0/8	pending

A bot is done when all four scores are.


1. Bot Identity

Layer	Governance
Bot class	Governance
Authority	Observe
Status	PLANNED
Readiness	Spec ready
Runs before	risk.killswitch, exec.smart_router
Runs after	—
Applies to	Continuous
Default mode	`shadow`
User-visible	Yes
Developer owner	Governance pod

Operational profile

Ownership	Governance pod · on-call gov-oncall · #polytraders-gov · escalates to Head of Governance · P1
Latency budget	p50: 50ms · p99: 250ms
Modes supported	off · shadow · advisory · enforced
Data freshness	max_market_data_age_ms=10000 · max_orderbook_age_ms=10000 · max_external_feed_age_ms=10000 · on stale → Emit status=UNKNOWN — never silently report OK.
Human override	no · by — · logs — · time-bound: — · scope: — · single approver
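
The freshness rule in the profile above (on stale, emit UNKNOWN rather than silently reporting OK) might be enforced with a small guard like the one below. This is a minimal sketch: `applyFreshness` and `MAX_EXTERNAL_FEED_AGE_MS` are hypothetical names, and it assumes "stale" means the newest sample is older than the 10000 ms limit or that no sample exists at all.

```typescript
// Hypothetical names; the spec only fixes the 10000 ms limits and the UNKNOWN rule.
type Status = "OK" | "DEGRADED" | "DOWN" | "UNKNOWN";

const MAX_EXTERNAL_FEED_AGE_MS = 10_000; // max_external_feed_age_ms from the profile

// Returns the computed status only when the newest sample is fresh enough.
// Assumption: no samples at all also counts as stale.
function applyFreshness(
  computed: Status,
  newestSampleTsMs: number | null,
  nowMs: number,
  maxAgeMs: number = MAX_EXTERNAL_FEED_AGE_MS
): Status {
  if (newestSampleTsMs === null) return "UNKNOWN";
  const ageMs = nowMs - newestSampleTsMs;
  return ageMs > maxAgeMs ? "UNKNOWN" : computed;
}
```

Running the guard as the last step before emission keeps a stale OK from leaking out, whatever the classifier computed.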

2. Purpose

3. Why This Bot Matters

Cascading failures from a single dead dependency
Without an explicit health signal, every bot infers liveness from its own latest call — producing inconsistent retreat behaviour across the system.
Silent degradations
An API can stay up but slow to 30-second responses; bots without an explicit threshold keep blocking on it instead of failing fast.
Postmortem confusion
Without a health timeline, postmortems cannot answer 'what was the actual external latency at 14:23?'.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input	Source	Required?	Use
CLOB REST + WebSocket	`Polymarket`	Yes	Probe latency and error rates.
Polymarket metadata REST	`Polymarket`	Yes	Health probe.
Ethereum RPC	`RPC provider`	Yes	Latency + block-tip lag probe.

5. Required Internal Inputs

Input	Source	Required?	Use
Real outbound traffic latency samples	`Every bot`	Yes	Passive observation in addition to active probes.

6. Parameter Guide

Parameter	Default	Warning	Hard	What it controls
probe_interval_ms	`5000`	`—`	`—`	How often each surface is actively probed.
warn_p99_ms	`750`	`750`	`—`	p99 latency at which the surface is marked DEGRADED.
fail_p99_ms	`5000`	`—`	`5000`	p99 latency at which the surface is marked DOWN.
fail_error_rate_pct	`25`	`10`	`25`	Error rate at which the surface is marked DOWN regardless of latency.

7. Detailed Parameter Instructions

probe_interval_ms

What it means

How often each surface is actively probed.

Default

{ "probe_interval_ms": 5000 }

Why this default matters

5s gives quick detection without flooding upstreams.

Threshold logic

Condition	Action
5000	Default

Developer check

setInterval(probe, p.probe_interval_ms);

User-facing English

(Internal.)

warn_p99_ms

What it means

p99 latency at which the surface is marked DEGRADED.

Default

{ "warn_p99_ms": 750 }

Why this default matters

750ms p99 is the empirical breakpoint where downstream pipelines start to tail out.

Threshold logic

Condition	Action
≤ 750ms	OK
> 750ms	DEGRADED

Developer check

if (p99 > p.warn_p99_ms) status = 'DEGRADED';

User-facing English

(Internal.)

fail_p99_ms

What it means

p99 latency at which the surface is marked DOWN.

Default

{ "fail_p99_ms": 5000 }

Why this default matters

5s p99 means almost every operation is timing out.

Threshold logic

Condition	Action
≤ 5000ms	Not DOWN on latency alone (OK or DEGRADED per warn_p99_ms)
> 5000ms	DOWN

Developer check

if (p99 > p.fail_p99_ms) status = 'DOWN';

User-facing English

(Internal.)

fail_error_rate_pct

What it means

Error rate at which the surface is marked DOWN regardless of latency.

Default

{ "fail_error_rate_pct": 25 }

Why this default matters

25% errors over a 1-minute window is an obvious outage.

Threshold logic

Condition	Action
< 10%	OK
10–25%	DEGRADED
> 25%	DOWN

Developer check

if (errRate > p.fail_error_rate_pct) status = 'DOWN';

User-facing English

(Internal.)

8. Default Configuration

{
  "probe_interval_ms": 5000,
  "warn_p99_ms": 750,
  "fail_p99_ms": 5000,
  "fail_error_rate_pct": 25
}
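
One way to load this configuration is to merge operator overrides onto the defaults and reject combinations that cannot work. The sketch below assumes a few sanity rules that the spec does not state (positive probe interval, warn threshold below fail threshold, error rate within 0 to 100); `MonitorParams`, `DEFAULT_PARAMS` and `validateParams` are hypothetical names.

```typescript
interface MonitorParams {
  probe_interval_ms: number;
  warn_p99_ms: number;
  fail_p99_ms: number;
  fail_error_rate_pct: number;
}

// Defaults copied from the configuration above.
const DEFAULT_PARAMS: MonitorParams = {
  probe_interval_ms: 5000,
  warn_p99_ms: 750,
  fail_p99_ms: 5000,
  fail_error_rate_pct: 25,
};

// Merges overrides onto the defaults and lists any problems (empty list means valid).
// The individual checks are assumptions, not rules taken from the spec.
function validateParams(
  overrides: Partial<MonitorParams>
): { params: MonitorParams; problems: string[] } {
  const params: MonitorParams = { ...DEFAULT_PARAMS, ...overrides };
  const problems: string[] = [];
  if (!(params.probe_interval_ms > 0)) {
    problems.push("probe_interval_ms must be positive");
  }
  if (!(params.warn_p99_ms < params.fail_p99_ms)) {
    problems.push("warn_p99_ms must be below fail_p99_ms");
  }
  if (!(params.fail_error_rate_pct > 0 && params.fail_error_rate_pct <= 100)) {
    problems.push("fail_error_rate_pct must be in (0, 100]");
  }
  return { params, problems };
}
```

Returning a list of problems instead of throwing lets a caller log every misconfiguration at once before it refuses to start.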

9. Implementation Flow

— not yet authored —

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

for each surface s:
  samples = window(s, 60_000)
  p50, p99 = quantiles(samples)
  err = error_rate(samples)
  status = classify(p99, err, p)
  emit('ApiHealthReport', s, status, p50, p99, err, last_success_ts_ms[s])
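
The pseudocode above could translate to TypeScript roughly as follows. This is a sketch under stated assumptions: quantiles use the nearest-rank method, `warn_error_rate_pct` is a hypothetical name for the 10% warning level in the parameter guide, an empty window is reported as UNKNOWN, and `evaluate` returns the report instead of emitting it.

```typescript
// Sketch only: function names and Params.warn_error_rate_pct are assumptions.
type Status = "OK" | "DEGRADED" | "DOWN" | "UNKNOWN";

interface Sample { ts_ms: number; latency_ms: number; ok: boolean }

interface Params {
  warn_p99_ms: number;
  fail_p99_ms: number;
  warn_error_rate_pct: number; // hypothetical name for the 10% warning level
  fail_error_rate_pct: number;
}

interface ApiHealthReport {
  kind: "ApiHealthReport";
  surface: string;
  status: Status;
  p50_ms: number | null;
  p99_ms: number | null;
  error_rate_pct: number | null;
  last_success_ts_ms: number | null;
}

const WINDOW_MS = 60_000;

// Nearest-rank quantile over an ascending, non-empty array.
function quantile(sortedAsc: number[], q: number): number {
  const idx = Math.max(0, Math.ceil(q * sortedAsc.length) - 1);
  return sortedAsc[idx];
}

// Fail thresholds win over warn thresholds; either signal alone can trigger a state.
function classify(p99: number, errPct: number, p: Params): Status {
  if (errPct > p.fail_error_rate_pct || p99 > p.fail_p99_ms) return "DOWN";
  if (errPct >= p.warn_error_rate_pct || p99 > p.warn_p99_ms) return "DEGRADED";
  return "OK";
}

function evaluate(surface: string, all: Sample[], nowMs: number, p: Params): ApiHealthReport {
  // Keep only samples inside the trailing 60 s window.
  const samples = all.filter((s) => s.ts_ms > nowMs - WINDOW_MS && s.ts_ms <= nowMs);
  // Last success is taken over all known samples, not just the window.
  const successes = all.filter((s) => s.ok).map((s) => s.ts_ms);
  const lastSuccess = successes.length > 0 ? successes.reduce((a, b) => Math.max(a, b)) : null;
  if (samples.length === 0) {
    // Assumption: an empty window is reported as UNKNOWN rather than OK.
    return {
      kind: "ApiHealthReport", surface, status: "UNKNOWN",
      p50_ms: null, p99_ms: null, error_rate_pct: null, last_success_ts_ms: lastSuccess,
    };
  }
  const latencies = samples.map((s) => s.latency_ms).sort((a, b) => a - b);
  const p50 = quantile(latencies, 0.5);
  const p99 = quantile(latencies, 0.99);
  const errPct = (100 * samples.filter((s) => !s.ok).length) / samples.length;
  return {
    kind: "ApiHealthReport", surface, status: classify(p99, errPct, p),
    p50_ms: p50, p99_ms: p99, error_rate_pct: errPct, last_success_ts_ms: lastSuccess,
  };
}
```

With a single 220 ms successful sample in the window, both quantiles resolve to 220 and the error rate to 0, so the surface classifies as OK under the default thresholds.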

11. Wire Examples

Input — what arrives on the wire

{
  "surface": "clob_v2_rest",
  "samples": [
    {
      "ts_ms": 1715260000000,
      "latency_ms": 220,
      "ok": true
    }
  ]
}

Output — what the bot emits

{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "OK",
  "p50_ms": 220,
  "p99_ms": 220,
  "error_rate_pct": 0,
  "last_success_ts_ms": 1715260000000
}
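
Because the input arrives as JSON from many producers, a consumer may want to validate its shape before feeding it into the rolling window. The sketch below checks only the fields shown in the input example above; `SampleBatch`, `isSample` and `parseSampleBatch` are hypothetical names, and the real payload may carry more fields.

```typescript
interface Sample { ts_ms: number; latency_ms: number; ok: boolean }
interface SampleBatch { surface: string; samples: Sample[] }

// Type guard for one sample; checks only the three fields from the wire example.
function isSample(v: unknown): v is Sample {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return (
    typeof r["ts_ms"] === "number" &&
    typeof r["latency_ms"] === "number" &&
    typeof r["ok"] === "boolean"
  );
}

// Parses and validates a raw JSON batch; throws on malformed input.
function parseSampleBatch(raw: string): SampleBatch {
  const v: unknown = JSON.parse(raw);
  if (typeof v !== "object" || v === null) {
    throw new Error("batch must be an object");
  }
  const r = v as Record<string, unknown>;
  const surface = r["surface"];
  const samples = r["samples"];
  if (typeof surface !== "string") {
    throw new Error("surface must be a string");
  }
  if (!Array.isArray(samples) || !samples.every(isSample)) {
    throw new Error("samples must be an array of {ts_ms, latency_ms, ok}");
  }
  return { surface, samples: samples as Sample[] };
}
```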

12. Decision Logic

APPROVE

Sample active probes + passive traffic. Latch DOWN status until two consecutive OK windows.

RESHAPE_REQUIRED

This bot does not reshape orders.

REJECT

No reject path defined for this bot — it is observe-only.

WARNING_ONLY

Apply warn/fail thresholds.
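
The latch described under APPROVE (hold DOWN until two consecutive OK windows) can be sketched as a small state machine. The class name `DownLatch` is hypothetical, and the sketch assumes that a DEGRADED window seen while latched resets the OK streak and keeps the published status at DOWN; the spec does not say how that case is handled.

```typescript
type Status = "OK" | "DEGRADED" | "DOWN";

class DownLatch {
  private latched = false;
  private okStreak = 0;

  // Feed the raw status of each completed window; returns the status to publish.
  next(raw: Status): Status {
    if (raw === "DOWN") {
      this.latched = true;
      this.okStreak = 0;
      return "DOWN";
    }
    if (!this.latched) return raw;
    // While latched, only an unbroken run of OK windows counts toward release.
    this.okStreak = raw === "OK" ? this.okStreak + 1 : 0;
    if (this.okStreak >= 2) {
      this.latched = false;
      this.okStreak = 0;
      return "OK";
    }
    return "DOWN";
  }
}
```

Feeding the windows OK, DOWN, OK, DEGRADED, OK, OK through one latch publishes OK, DOWN, DOWN, DOWN, DOWN, OK: the single OK after the outage is not enough to release it.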

13. Standard Decision Output

This bot emits an OperationsReport with kind=ApiHealthReport, as declared under Dependencies.

{
  "kind": "ApiHealthReport",
  "surface": "clob_v2_rest",
  "status": "DEGRADED",
  "p50_ms": 220,
  "p99_ms": 980,
  "error_rate_pct": 4.1,
  "last_success_ts_ms": 1715260000000
}

14. Reason Codes

Code	Severity	Meaning	Action	User-facing message
`GOV_API_OK`	P3	Surface is within all warn and fail thresholds.	See decision output and developer log for context.	The system briefly slowed down because one of the data sources we depend on was responding slowly.
`GOV_API_DEGRADED`	P3	Surface p99 latency is above warn_p99_ms, or its error rate is in the 10–25% band.	See decision output and developer log for context.	The system briefly slowed down because one of the data sources we depend on was responding slowly.
`GOV_API_DOWN`	P3	Surface p99 latency is above fail_p99_ms, or its error rate is above fail_error_rate_pct.	See decision output and developer log for context.	The system briefly slowed down because one of the data sources we depend on was responding slowly.
`GOV_API_UNKNOWN`	P3	Health data is stale or the monitor itself failed; consumers must treat this as DEGRADED.	See decision output and developer log for context.	The system briefly slowed down because one of the data sources we depend on was responding slowly.

15. Metrics & Logs

Metrics emitted

Metric	Type	Unit	Labels	Meaning
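`api_p50_ms`	histogram	ms	bot_id	Median (p50) latency of a monitored surface over the rolling window.
`api_p99_ms`	histogram	ms	bot_id	p99 latency of a monitored surface over the rolling window.
`api_error_rate_pct`	gauge	%	bot_id	Share of failed samples in the rolling window, in percent.
`api_status_changes_total`	counter	event	bot_id	Number of status transitions (for example OK to DEGRADED).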

Dashboards

6.17 overview dashboard

16. Developer Reporting

"Per emission: surface, status, p50, p99, error_rate, sample_count."

17. Plain-English Reporting

Situation	User-facing explanation
When this bot acts	The system briefly slowed down because one of the data sources we depend on was responding slowly.

18. Failure-Mode Block

main_failure_mode	Calling a surface DOWN when only the active probe is failing but real traffic is fine (or vice versa).
false_positive_risk	Active probe hits an old endpoint not used in production; mitigation: probes mirror real traffic shape.
false_negative_risk	Surface only fails on writes; passive read samples mask the issue; mitigation: write-side probes count separately.
safe_fallback	If the monitor itself fails, emit a synthetic ApiHealthReport with status=UNKNOWN and a non-stale ts_ms. Consumers must treat UNKNOWN as DEGRADED.
required_dependencies	—
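
The safe fallback has two halves: the monitor emits a synthetic UNKNOWN report, and each consumer maps UNKNOWN to DEGRADED. A minimal sketch of both follows. The function names are hypothetical, and representing the unavailable metrics as `null` is an assumption, since the spec does not say what the synthetic report carries in those fields.

```typescript
type Status = "OK" | "DEGRADED" | "DOWN" | "UNKNOWN";

interface ApiHealthReport {
  kind: "ApiHealthReport";
  surface: string;
  status: Status;
  ts_ms: number;
  p50_ms: number | null;
  p99_ms: number | null;
  error_rate_pct: number | null;
}

// Monitor side: built when the monitor cannot evaluate a surface.
// ts_ms is the current time so that the report itself is not stale.
function syntheticUnknown(surface: string, nowMs: number): ApiHealthReport {
  return {
    kind: "ApiHealthReport",
    surface,
    status: "UNKNOWN",
    ts_ms: nowMs,
    p50_ms: null, // assumption: metrics are null when nothing was measured
    p99_ms: null,
    error_rate_pct: null,
  };
}

// Consumer side: UNKNOWN must be handled exactly like DEGRADED.
function effectiveStatus(report: ApiHealthReport): "OK" | "DEGRADED" | "DOWN" {
  return report.status === "UNKNOWN" ? "DEGRADED" : report.status;
}
```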

19. Failure-Injection Recipes

Scenario	How to inject	Expected behaviour	Recovery
`Drop probe responses for 60s and assert status flips DOWN`	Drop probe responses for 60s and assert status flips DOWN.	Bot detects within its latency budget and emits the corresponding reason code.	Remove the injected fault; bot returns to healthy state within one debounce window.
`Disconnect the probe scheduler and assert UNKNOWN is emitted within one probe interval`	Disconnect the probe scheduler and assert UNKNOWN is emitted within one probe interval.	Bot detects within its latency budget and emits the corresponding reason code.	Remove the injected fault; bot returns to healthy state within one debounce window.

20. State & Persistence

Per-surface rolling sample buffer + last status. In-memory; reseeds on restart.
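
A per-surface rolling buffer could look like the sketch below. It assumes samples arrive in timestamp order and that the window is the 60-second span used in the reference implementation; `RollingWindow` is a hypothetical name, and mirroring to the KV store is left out.

```typescript
interface Sample { ts_ms: number; latency_ms: number; ok: boolean }

class RollingWindow {
  private samples: Sample[] = [];
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Appends a sample and evicts anything that has fallen out of the window.
  // Assumption: samples are pushed in non-decreasing ts_ms order.
  push(s: Sample): void {
    this.samples.push(s);
    this.evict(s.ts_ms);
  }

  // Returns a copy of the samples still inside the window at nowMs.
  snapshot(nowMs: number): Sample[] {
    this.evict(nowMs);
    return this.samples.slice();
  }

  private evict(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    while (this.samples.length > 0 && this.samples[0].ts_ms <= cutoff) {
      this.samples.shift();
    }
  }
}
```

One instance per surface matches the one-worker-per-surface execution model, so the buffer needs no locking of its own.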

State stores

Name	Kind	Key	Value shape	TTL	Durability
`api_degradation_monitor_state`	in-memory + fast KV mirror	bot_id	Per-surface rolling sample buffer + last status. In-memory; reseeds on restart.	24h	crash-safe via KV mirror

Cold-start recovery

Cold-start hydrates from fast KV; missing keys default to safe fallback.

On restart

All in-flight decisions are re-evaluated; no bot decision is trusted across restart without re-emit.

21. Concurrency & Idempotency

Aspect	Specification
Execution model	`One worker per surface; emits to a single status feed.`
Max in-flight	`32`
Idempotency key	`order_intent_id`
Replay-safe	`True`
Deduplication	`By idempotency_key within a 60s window.`
Ordering guarantees	`Per-market_id FIFO; cross-market unordered.`
Per-call timeout (ms)	`250`
Backpressure strategy	`Bounded queue; oldest-dropped with metric increment when full.`
Locking / mutual exclusion	`Per-market_id mutex; no global locks.`
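
The backpressure row above (bounded queue, oldest entry dropped, metric incremented when full) can be illustrated with a short sketch. `BoundedQueue` and `droppedTotal` are hypothetical names; a real implementation would export the counter through the metrics pipeline instead of keeping it as a field.

```typescript
class BoundedQueue<T> {
  private items: T[] = [];
  private capacity: number;
  droppedTotal = 0; // stand-in for the "dropped" metric

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  // Enqueues an item; when full, the oldest item is discarded first.
  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.droppedTotal += 1;
    }
    this.items.push(item);
  }

  // Dequeues the oldest remaining item, or undefined when empty.
  shift(): T | undefined {
    return this.items.shift();
  }

  size(): number {
    return this.items.length;
  }
}
```

Dropping the oldest entry suits a health feed, where a newer sample is more useful than an older one; sizing the queue to the max in-flight value of 32 would be one option.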

22. Dependencies

Emits to (downstream consumers)

Bot	Why	Contract
risk.killswitch
exec.smart_router

Required before (graph.required_before)

risk.killswitch exec.smart_router

Consumes	`ProbeSample` `TrafficSample`
Emits	`OperationsReport(kind=ApiHealthReport)`
Blocks orders	no

23. Security Surfaces

Probe credentials are read-only API keys with no order-placement scope.

Signing surface

None — bot does not sign or submit.

Mitigations

Rate-limit per source
Audit-log every override
Require role-based authz on admin paths

24. Polymarket V2 Compatibility

Aspect	Value
CLOB version	`V2`
Collateral asset	`pUSD`
EIP-712 Exchange domain version	`2`
Aware of builderCode field	yes
Aware of negative-risk markets	yes
Multi-chain ready	yes
SDK used	`Polymarket CLOB V2 SDK`
Settlement contract	`CTFExchangeV2`
Notes	`Surface 'clob_v2_rest' specifically targets V2 endpoints.`

25. Versioning & Migration

Field	Value
current	`0.1.0`
contract_version	`1.0.0`
last_breaking_change	`none`
deprecation_window_days	`30`

26. Acceptance Tests

Unit Tests

Test	Setup	Expected result
p99 = warn_p99_ms + 1 → DEGRADED.	Synthetic fixture per template.	Behaviour matches the rule described in the test name.
Error rate = fail_error_rate_pct + 1 → DOWN.	Synthetic fixture per template.	Behaviour matches the rule described in the test name.

Integration Tests

Test	Expected result
Inject a slow-loris response on the clob_v2_rest probe → status flips to DEGRADED within 2 probe intervals.	End-to-end behaviour matches the spec without manual intervention.

Property Tests

Property	Required behaviour
status transitions are monotonic within a single window: OK ↔ DEGRADED ↔ DOWN, no skip.	Always true across all generated inputs.

27. Operational Runbook

If a surface is stuck DEGRADED with no obvious cause, temporarily increase probe_interval_ms and check the upstream provider's status page.

On-call actions

Alert	First step	Diagnosis	Mitigation	Escalate to
`6.17_anomaly`	Open the bot's reporting page and confirm the alert is real (not a metric hiccup).	Inspect developer log entries for the affected market_id over the last 30 minutes.	Force-clear via Admin UI if the rule is clearly stale; otherwise leave engaged and notify owner.	Governance pod

Manual overrides

polytraders bot pause 6.17 — Disables the bot's enforcement layer; downstream consumers fall back to safe defaults.

Healthcheck

GET /healthz/api_degradation_monitor → 200 if last successful evaluation < 60s ago.
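
The healthcheck rule reduces to a single comparison, sketched below as a pure function so it can be tested without an HTTP server. The spec only defines the 200 condition; returning 503 otherwise, and the name `healthzStatus`, are assumptions.

```typescript
const HEALTHZ_MAX_AGE_MS = 60_000;

// Returns the HTTP status for the healthcheck endpoint.
// 200 when the last successful evaluation is less than 60 s old.
// Assumption: 503 in every other case, including "never evaluated".
function healthzStatus(
  lastSuccessfulEvalTsMs: number | null,
  nowMs: number,
  maxAgeMs: number = HEALTHZ_MAX_AGE_MS
): 200 | 503 {
  if (lastSuccessfulEvalTsMs === null) return 503;
  return nowMs - lastSuccessfulEvalTsMs < maxAgeMs ? 200 : 503;
}
```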

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate	How measured	Threshold
Stub	probe-suite passes against synthetic surfaces.	Documented threshold met for the full window.

Promote to Limited live

Gate	How measured	Threshold
Shadow	14 days; status feed compared with the upstream's own status page.	Documented threshold met for the full window.
Advisory	7 days.	Documented threshold met for the full window.

Promote to General live

Gate	How measured	Threshold
Enforced	KillSwitch and SmartRouter consume the feed.	Documented threshold met for the full window.

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement	Status
Purpose defined	✓ done
Required inputs listed	✓ done
Parameters defined	✓ done
Defaults defined	✓ done
Warning thresholds defined	✓ done
Hard thresholds defined	✓ done
Safe fallback defined	✓ done
Structured output defined	✓ done
Developer log defined	✓ done
Plain-English explanation	✓ done
Unit tests defined	✓ done
Integration tests defined	✓ done
Property tests defined	✓ done
Failure-mode block complete	✓ done
Reference implementation pseudocode	✓ done
Wire examples (input + output)	✓ done
Reason codes listed	✓ done
Metrics & logs defined	✓ done
State & persistence defined	✓ done
Concurrency & idempotency defined	✓ done
Dependencies declared	✓ done
Security surfaces declared	✓ done
Polymarket V2 compatibility declared	✓ done
Version & migration history declared	✓ done
Operational runbook defined	✓ done
Promotion gates defined	✓ done
Failure-injection recipes defined	✓ done
