1. Bot Identity
| Layer | Security Security |
|---|
| Bot class | Guardrail |
|---|
| Authority | RejectPause |
|---|
| Status | PLANNED |
|---|
| Readiness | Spec started |
|---|
| Runs before | Any bot that makes on-chain read calls |
|---|
| Runs after | System startup; continuous background probe |
|---|
| Applies to | All Polygon RPC endpoints in the configured provider pool |
|---|
| Default mode | shadow_only |
|---|
| User-visible | Advanced details only |
|---|
| Developer owner | Polytraders core |
|---|
Operational profile
| Modes supported | quarantine |
|---|
2. Purpose
Probe RPC providers continuously and fail over before a stale endpoint poisons our chain view.
3. Why This Bot Matters
Single RPC endpoint goes stale
All bots reading chain state see an outdated block, causing mispriced or incorrectly-scoped orders.
No quorum check across providers
A forked or malicious RPC can poison chain state views used for contract address and balance checks.
No auto-quarantine of degraded provider
A slow or error-prone endpoint keeps being polled, adding latency to every on-chain check.
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|
| max_block_lag | 3 | Block lag >= 2 for any provider | Block lag >= max_block_lag for primary provider | Maximum tolerated block height difference before a provider is quarantined. |
| min_providers_quorum | 2 | Only min_providers_quorum providers healthy | Fewer than min_providers_quorum healthy providers available | Minimum number of healthy providers required before any chain read is trusted. |
7. Detailed Parameter Instructions
max_block_lag
What it means
Maximum tolerated block height difference before a provider is quarantined.
Default
{ "max_block_lag": 3 }
Why this default matters
3-block lag on Polygon (~6s) is the threshold where stale data becomes operationally dangerous.
Threshold logic
| Condition | Action |
|---|
| lag < max_block_lag | APPROVE — provider healthy |
| lag >= max_block_lag AND auto_quarantine=true | Quarantine provider; failover to next in pool |
| lag >= max_block_lag AND no healthy provider | REJECT — RPC_QUORUM_LOST |
Developer check
if (lag >= p.max_block_lag && p.auto_quarantine) quarantine(provider);
User-facing English
The network connection is degraded. Orders are paused until connectivity is restored.
min_providers_quorum
What it means
Minimum number of healthy providers required before any chain read is trusted.
Default
{ "min_providers_quorum": 2 }
Why this default matters
Quorum of 2 prevents a single compromised provider from poisoning chain state.
Threshold logic
| Condition | Action |
|---|
| healthy_count >= min_providers_quorum | APPROVE |
| healthy_count < min_providers_quorum | REJECT — RPC_QUORUM_LOST |
Developer check
if (healthyProviders.length < p.min_providers_quorum) return reject('RPC_QUORUM_LOST');
User-facing English
Not enough network providers are available. Orders are paused.
8. Default Configuration
{
"bot_id": "sec.rpc_failover_manager",
"version": "0.1.0",
"mode": "hard_guard",
"defaults": {
"max_block_lag": 3,
"min_providers_quorum": 2,
"auto_quarantine": true,
"probe_interval_s": 5
}
}
9. Implementation Flow
- On startup: load provider pool from Admin UI config.
- Background loop every probe_interval_s: call eth_blockNumber on all providers.
- Compute block height divergence across providers.
- For providers with lag >= max_block_lag and auto_quarantine=true: mark quarantined.
- Check healthy provider count; if < min_providers_quorum: REJECT(RPC_QUORUM_LOST) on all pending chain reads.
- Elect primary provider as the one with highest block height and lowest latency.
- On order arrival: return current primary provider endpoint for chain reads.
- Periodically re-probe quarantined providers; restore if lag normalises.
10. Reference Implementation
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.
// RPCFailoverManager
STATE = { providers: [], primary: null, quarantined: [] }
// Background probe loop
EVERY params.probe_interval_s:
heights = []
FOR provider IN STATE.providers:
h = FETCH(provider.eth_blockNumber())
IF h == null: quarantine(provider); CONTINUE
heights.append({provider, h, latency})
max_h = MAX(heights.map(x => x.h))
FOR entry IN heights:
lag = max_h - entry.h
IF lag >= params.max_block_lag AND params.auto_quarantine:
quarantine(entry.provider)
healthy = heights.filter(x => (max_h - x.h) < params.max_block_lag)
IF healthy.count < params.min_providers_quorum:
EMIT RiskVote(DENY, RPC_QUORUM_LOST)
ELSE:
STATE.primary = healthy.sort_by(latency).first()
EMIT RiskVote(APPROVE)
// Per-request provider lookup
FUNCTION getPrimaryProvider():
IF STATE.primary == null: return DENY(RPC_QUORUM_LOST)
RETURN STATE.primary
SDK calls used
provider.eth_blockNumber()internal.killswitch.status()
Complexity: O(p) per probe where p = provider pool size (small constant)
11. Wire Examples
Input — what arrives on the wire
Probe result from provider pool — onchain
{
"provider": "alchemy-polygon-1",
"block_number": 58420100,
"latency_ms": 45,
"timestamp_ms": 1746768672000
}
Output — what the bot emits
RiskVote — APPROVE with primary
{
"vote_id": "sec.rpc_failover_manager.20260509T170000Z",
"decision": "APPROVE",
"reason_code": null,
"evidence": {
"primary_provider": "alchemy-polygon-1",
"healthy_count": 3,
"max_lag_blocks": 1
},
"checked_at": "2026-05-09T17:00:00Z"
}
12. Decision Logic
APPROVE
At least min_providers_quorum healthy providers with lag < max_block_lag; primary elected.
RESHAPE_REQUIRED
Not applicable — manager either provides a healthy endpoint or rejects.
REJECT
Fewer than min_providers_quorum healthy providers (RPC_QUORUM_LOST).
WARNING_ONLY
Warn when only min_providers_quorum providers remain healthy.
13. Standard Decision Output
This bot returns a RiskVote object. See RiskVote schema.
{
"vote_id": "sec.rpc_failover_manager.20260509T170000Z",
"decision": "APPROVE",
"reason_code": null,
"evidence": {
"primary_provider": "alchemy-polygon-1",
"healthy_count": 3,
"quarantined_count": 0,
"max_lag_blocks": 1
},
"checked_at": "2026-05-09T17:00:00Z"
}
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|
KILL_SWITCH_ACTIVE | HARD_REJECT | Global kill switch is active. | Immediately return DENY. | Trading is currently paused. |
RPC_QUORUM_LOST | HARD_REJECT | Fewer than min_providers_quorum healthy RPC providers available. | Return DENY on all chain reads until quorum restored. | The network connection is degraded. Orders are paused. |
RPC_PROVIDER_LAGGING | WARN | A provider's block height lags by 2 blocks; approaching quarantine threshold. | Log warn; keep provider active; increase probe frequency. | Network connectivity is slightly degraded. |
RPC_QUORUM_WARN | WARN | Only min_providers_quorum providers remain healthy; one more failure triggers reject. | Emit warn; notify ops. | Network connectivity is limited. |
RPC_FAILOVER_INFO | INFO | Primary provider switched to a new endpoint. | Log info; no action needed. | Network provider updated automatically. |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|
polytraders_sec_rpcfailovermanager_healthy_providers | gauge | count | | Number of currently healthy providers. |
polytraders_sec_rpcfailovermanager_block_lag | gauge | blocks | provider | Current block lag per provider relative to maximum observed. |
polytraders_sec_rpcfailovermanager_failovers_total | counter | count | | Number of primary provider switches. |
polytraders_sec_rpcfailovermanager_probe_latency_ms | histogram | ms | provider | Latency of eth_blockNumber probe per provider. |
Alerts
| Alert | Condition | Severity | Runbook |
|---|
RPCQuorumLost | polytraders_sec_rpcfailovermanager_healthy_providers < min_providers_quorum | P0 | #runbook-rpc-quorum-lost |
RPCHighFailoverRate | rate(polytraders_sec_rpcfailovermanager_failovers_total[5m]) > 2 | P1 | #runbook-rpc-failover-rate |
16. Developer Reporting
{
"bot_id": "sec.rpc_failover_manager",
"decision": "APPROVE",
"inputs_used": [
"onchain.eth_blockNumber",
"config.provider_pool"
],
"checked_at": "2026-05-09T17:00:00Z"
}
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|
| Orders paused — RPC quorum lost | The network connection is degraded. Orders are paused until connectivity is restored. |
| Provider failover | The primary network provider was switched automatically. No action needed. |
| Provider quarantined | One of the network providers was temporarily taken offline. Others are being used. |
18. Failure-Mode Block
| main_failure_mode | All providers simultaneously stale, causing chain state blindness for all on-chain checks. |
|---|
| false_positive_risk | A brief network hiccup quarantines providers unnecessarily, causing trading pause until re-probe succeeds. |
|---|
| false_negative_risk | Two providers both stale at the same height pass the quorum check but provide wrong data. |
|---|
| safe_fallback | If fewer than min_providers_quorum healthy providers: fail-closed on all chain reads; emit RPC_QUORUM_LOST. |
|---|
| required_dependencies | Configured RPC provider pool (at least 3 recommended), Admin UI config, KillSwitch |
|---|
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|
PRIMARY_PROVIDER_STALE | Stop block production on primary provider (simulate stale) | | Automatic when provider resumes producing blocks. |
QUORUM_LOST | Quarantine all but 1 provider | | Providers recover; quorum restored on next probe. |
ALL_PROVIDERS_DOWN | Block all RPC endpoints | | Manual provider pool update or network restoration. |
20. State & Persistence
Cold-start recovery
Re-probe all providers on restart; no persistent state required.
21. Concurrency & Idempotency
| Aspect | Specification |
|---|
| Execution model | background probe loop + sync per-request lookup |
| Max in-flight | 10 |
| Idempotency key | probe_timestamp |
| Per-call timeout (ms) | 1000 |
| Backpressure strategy | drop probe if previous not complete |
| Locking / mutual exclusion | read-write lock on provider health state |
22. Dependencies
Depends on (must run first)
Emits to (downstream consumers)
Sibling bots (same OrderIntent)
Used by (auto-aggregated)
5.2 5.7
External services
| Service | Endpoint | SLA assumed | On failure |
|---|
| Polygon RPC pool | Configured provider endpoints | best-effort per provider | Quarantine and failover; DENY if quorum lost. |
23. Security Surfaces
Abuse vectors considered
- Compromised RPC provider returning fraudulent block heights to pass quorum
- BGP hijack routing traffic to malicious RPC node
Mitigations
- Quorum of min_providers_quorum providers required; single provider cannot pass unilaterally
- Provider pool configured in Admin UI with TLS-pinned endpoints
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|
| CLOB version | v2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | no |
| Aware of negative-risk markets | no |
| Multi-chain ready | no |
| SDK used | py-clob-client-v2 |
| Settlement contract | CTFExchangeV2 |
| Notes | Manages Polygon RPC provider health to ensure CTFExchangeV2 and pUSD contract reads are from a fresh block. |
API surfaces declared
onchaininternal
Networks supported
polygon
25. Versioning & Migration
| Field | Value |
|---|
| spec | 2.0.0 |
| implementation | 0.1.0 |
| schema | 2 |
| released | None |
| planned_release | Q3-2026 |
Migration history
| Date | From | To | Reason | Action taken |
|---|
| 2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain) |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|
| Approve when primary provider has lag < max_block_lag | provider lag=1, max_block_lag=3 | APPROVE; primary returned |
| Quarantine provider with lag >= max_block_lag | provider lag=4, max_block_lag=3, auto_quarantine=true | Provider quarantined; failover to secondary |
| Reject when healthy providers < min_providers_quorum | only 1 healthy provider, min_providers_quorum=2 | DENY(RPC_QUORUM_LOST) |
Integration Tests
| Test | Expected result |
|---|
| Quarantined provider restored after lag normalises | Provider un-quarantined on next probe with lag < max_block_lag |
| KillSwitch halts all probes and rejects chain reads | DENY(KILL_SWITCH_ACTIVE) on all reads |
Property Tests
| Property | Required behaviour |
|---|
| healthy_count < min_providers_quorum always produces DENY | Always true |
| Primary provider always has lowest lag among healthy set | Always true |
27. Operational Runbook
RPCQuorumLost is a P0 event — all chain-dependent checks are blocked. Restore provider connectivity immediately.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|
RPCQuorumLost | | | | |
RPCHighFailoverRate | | | | |
Manual overrides
Healthcheck
GET /internal/health/rpcfailovermanager → green if At least min_providers_quorum healthy providers; primary elected within last probe_interval_s.; red if healthy_count < min_providers_quorum or no probe completed in last 30s.
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |