Polytraders Dev Guide
internal

5.6 RPCFailoverManager

Security · Guardrail · RejectPause · PLANNED · Spec started · capital · Direct P5 · Execution rails pending stub

Probe RPC providers continuously and fail over before a stale endpoint poisons our chain view.

v3 readiness

Docs: 27/27 (done)
Impl: 0/15 (pending)
Backtest: 0/4 (pending)
Runtime: 0/8 (pending)

A bot is done when all four scores are complete.

1. Bot Identity

Layer: Security
Bot class: Guardrail
Authority: RejectPause
Status: PLANNED
Readiness: Spec started
Runs before: Any bot that makes on-chain read calls
Runs after: System startup; continuous background probe
Applies to: All Polygon RPC endpoints in the configured provider pool
Default mode: shadow_only
User-visible: Advanced details only
Developer owner: Polytraders core

Operational profile

Modes supported: quarantine

2. Purpose

Probe RPC providers continuously and fail over before a stale endpoint poisons our chain view.

3. Why This Bot Matters

  • Single RPC endpoint goes stale

    All bots reading chain state see an outdated block, causing mispriced or incorrectly-scoped orders.

  • No quorum check across providers

    A forked or malicious RPC can poison chain state views used for contract address and balance checks.

  • No auto-quarantine of degraded provider

    A slow or error-prone endpoint keeps being polled, adding latency to every on-chain check.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
eth_blockNumber from each configured RPC provider | onchain | Yes | Measure block height divergence across providers to detect stale endpoints.
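
A minimal sketch of how the eth_blockNumber input could be turned into per-provider lag. It relies on the standard JSON-RPC behaviour that eth_blockNumber returns the height as a hex string; the function names (build_block_number_request, parse_block_number, block_lags) are hypothetical and not part of any Polytraders SDK, and transport (HTTP, timeouts) is left out.

```python
import json
from typing import Dict, Optional


def build_block_number_request(request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 eth_blockNumber request body."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": request_id}
    )


def parse_block_number(response_body: str) -> Optional[int]:
    """Return the block height, or None for an error or malformed response."""
    try:
        result = json.loads(response_body).get("result")
        # The result is a hex quantity such as "0x37b6b84".
        return int(result, 16) if isinstance(result, str) else None
    except (ValueError, AttributeError):
        return None


def block_lags(heights: Dict[str, int]) -> Dict[str, int]:
    """Lag of each provider relative to the highest observed block."""
    if not heights:
        return {}
    best = max(heights.values())
    return {name: best - height for name, height in heights.items()}
```

A provider whose response parses to None would be treated as a failed probe rather than as lag 0.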

5. Required Internal Inputs

Input | Source | Required? | Use
Configured RPC provider pool and probe_interval_s | Admin UI | Yes | Pool of providers to probe and failover thresholds.
KillSwitch active flag | KillSwitch | Yes | Halt all chain reads during global pause.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
max_block_lag | 3 | Block lag >= 2 for any provider | Block lag >= max_block_lag for primary provider | Maximum tolerated block height difference before a provider is quarantined.
min_providers_quorum | 2 | Only min_providers_quorum providers healthy | Fewer than min_providers_quorum healthy providers available | Minimum number of healthy providers required before any chain read is trusted.

7. Detailed Parameter Instructions

max_block_lag

What it means

Maximum tolerated block height difference before a provider is quarantined.

Default

{ "max_block_lag": 3 }

Why this default matters

A 3-block lag on Polygon (about 6 s at roughly 2 s per block) is the threshold where stale data becomes operationally dangerous.

Threshold logic

Condition | Action
lag < max_block_lag | APPROVE (provider healthy)
lag >= max_block_lag AND auto_quarantine=true | Quarantine provider; failover to next in pool
lag >= max_block_lag AND no healthy provider | REJECT (RPC_QUORUM_LOST)

Developer check

if (lag >= p.max_block_lag && p.auto_quarantine) quarantine(provider);
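
The threshold table and developer check can be sketched as a single classification function. This is an illustration only: the warn_lag parameter (taken from the "Block lag >= 2" warning in the Parameter Guide) and the returned labels are assumptions, not names defined by this spec.

```python
def classify_lag(
    lag: int,
    max_block_lag: int = 3,
    warn_lag: int = 2,  # assumed name for the warning threshold
    auto_quarantine: bool = True,
) -> str:
    """Map a provider's block lag to a health label."""
    if lag >= max_block_lag:
        # Hard threshold: quarantine only when auto_quarantine is enabled.
        return "QUARANTINE" if auto_quarantine else "UNHEALTHY"
    if lag >= warn_lag:
        # Still usable, but would surface as RPC_PROVIDER_LAGGING.
        return "WARN"
    return "HEALTHY"
```

With the defaults, a lag of 2 warns and a lag of 3 quarantines, so the warning band is exactly one block wide.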

User-facing English

The network connection is degraded. Orders are paused until connectivity is restored.

min_providers_quorum

What it means

Minimum number of healthy providers required before any chain read is trusted.

Default

{ "min_providers_quorum": 2 }

Why this default matters

Quorum of 2 prevents a single compromised provider from poisoning chain state.

Threshold logic

Condition | Action
healthy_count >= min_providers_quorum | APPROVE
healthy_count < min_providers_quorum | REJECT (RPC_QUORUM_LOST)

Developer check

if (healthyProviders.length < p.min_providers_quorum) return reject('RPC_QUORUM_LOST');
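
A hedged Python rendering of the same check, extended with the warning case from the Parameter Guide (exactly min_providers_quorum providers healthy). The function name and the (decision, reason_code) tuple shape are assumptions made for this sketch.

```python
from typing import Optional, Tuple


def quorum_status(
    healthy_count: int, min_providers_quorum: int = 2
) -> Tuple[str, Optional[str]]:
    """Return (decision, reason_code) for the current healthy provider count."""
    if healthy_count < min_providers_quorum:
        return ("DENY", "RPC_QUORUM_LOST")
    if healthy_count == min_providers_quorum:
        # One more failure would lose quorum, so approve with a warning.
        return ("APPROVE", "RPC_QUORUM_WARN")
    return ("APPROVE", None)
```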

User-facing English

Not enough network providers are available. Orders are paused.

8. Default Configuration

{
  "bot_id": "sec.rpc_failover_manager",
  "version": "0.1.0",
  "mode": "hard_guard",
  "defaults": {
    "max_block_lag": 3,
    "min_providers_quorum": 2,
    "auto_quarantine": true,
    "probe_interval_s": 5
  }
}
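
One way to load and sanity-check the defaults block is sketched below. The FailoverParams class and load_params function are hypothetical, and the validation bounds (each value at least 1 or positive) are assumptions rather than rules stated in this spec.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverParams:
    max_block_lag: int = 3
    min_providers_quorum: int = 2
    auto_quarantine: bool = True
    probe_interval_s: float = 5

    def __post_init__(self) -> None:
        # Assumed bounds: reject values that would make the checks meaningless.
        if self.max_block_lag < 1:
            raise ValueError("max_block_lag must be >= 1")
        if self.min_providers_quorum < 1:
            raise ValueError("min_providers_quorum must be >= 1")
        if self.probe_interval_s <= 0:
            raise ValueError("probe_interval_s must be > 0")


def load_params(raw_config: str) -> FailoverParams:
    """Build params from the JSON document; missing keys fall back to defaults."""
    defaults = json.loads(raw_config).get("defaults", {})
    return FailoverParams(**defaults)
```

An unknown key inside "defaults" raises TypeError here, which surfaces config typos at startup instead of silently ignoring them.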

9. Implementation Flow

  1. On startup: load provider pool from Admin UI config.
  2. Background loop every probe_interval_s: call eth_blockNumber on all providers.
  3. Compute block height divergence across providers.
  4. For providers with lag >= max_block_lag and auto_quarantine=true: mark quarantined.
  5. Check healthy provider count; if < min_providers_quorum: REJECT(RPC_QUORUM_LOST) on all pending chain reads.
  6. Elect primary provider as the one with highest block height and lowest latency.
  7. On order arrival: return current primary provider endpoint for chain reads.
  8. Periodically re-probe quarantined providers; restore if lag normalises.
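
Step 6 leaves open how "highest block height" and "lowest latency" are combined. The sketch below assumes block height takes priority and latency only breaks ties; the Probe type and elect_primary function are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Probe(NamedTuple):
    provider: str
    block_number: int
    latency_ms: float


def elect_primary(healthy: Sequence[Probe]) -> Optional[Probe]:
    """Pick the freshest healthy provider; break ties on lowest latency."""
    if not healthy:
        return None
    # Negating block_number makes min() prefer the highest block.
    return min(healthy, key=lambda p: (-p.block_number, p.latency_ms))
```

Ordering by height first keeps the election consistent with the property test that the primary always has the lowest lag in the healthy set.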

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

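// RPCFailoverManager
// quarantine() and restore() are helpers that move a provider in or out of STATE.quarantined
STATE = { providers: [], primary: null, quarantined: [] }

// Background probe loop
EVERY params.probe_interval_s:
  // Global pause: halt probes and stop serving chain reads (ACTIVE is an assumed status value)
  IF FETCH(internal.killswitch.status()) == ACTIVE:
    STATE.primary = null
    EMIT RiskVote(DENY, KILL_SWITCH_ACTIVE)
    CONTINUE
  heights = []
  // Probe every provider, including quarantined ones, so they can recover
  FOR provider IN STATE.providers:
    started = NOW()
    h = FETCH(provider.eth_blockNumber())
    latency = NOW() - started
    IF h == null: quarantine(provider); CONTINUE   // error or timeout
    heights.append({provider, h, latency})
  IF heights.count == 0:                            // MAX() of an empty list is undefined
    STATE.primary = null
    EMIT RiskVote(DENY, RPC_QUORUM_LOST)
    CONTINUE
  max_h = MAX(heights.map(x => x.h))
  FOR entry IN heights:
    lag = max_h - entry.h
    IF lag >= params.max_block_lag AND params.auto_quarantine:
      quarantine(entry.provider)
    ELSE IF lag < params.max_block_lag:
      restore(entry.provider)                       // no-op unless quarantined
  healthy = heights.filter(x => (max_h - x.h) < params.max_block_lag)
  IF healthy.count < params.min_providers_quorum:
    STATE.primary = null                            // never serve a stale primary
    EMIT RiskVote(DENY, RPC_QUORUM_LOST)
  ELSE:
    // Highest block height first, lowest latency as tie-breaker
    STATE.primary = healthy.sort_by(x => (-x.h, x.latency)).first().provider
    EMIT RiskVote(APPROVE)

// Per-request provider lookup
FUNCTION getPrimaryProvider():
  IF FETCH(internal.killswitch.status()) == ACTIVE: RETURN DENY(KILL_SWITCH_ACTIVE)
  IF STATE.primary == null: RETURN DENY(RPC_QUORUM_LOST)
  RETURN STATE.primary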

SDK calls used

  • provider.eth_blockNumber()
  • internal.killswitch.status()

Complexity: O(p) per probe where p = provider pool size (small constant)
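
As one possible translation of the probe loop into Python, the sketch below runs a single probe pass over injected probe functions. All names (Params, State, probe_cycle) are hypothetical; the KillSwitch check, warning codes, and scheduling are omitted, and probes run sequentially for simplicity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class Params:
    max_block_lag: int = 3
    min_providers_quorum: int = 2
    auto_quarantine: bool = True


@dataclass
class State:
    primary: Optional[str] = None
    quarantined: Set[str] = field(default_factory=set)


# A probe returns the provider's block height, or None on failure.
ProbeFn = Callable[[], Optional[int]]


def probe_cycle(
    providers: Dict[str, ProbeFn], params: Params, state: State
) -> Tuple[str, Optional[str]]:
    """Run one probe pass and return (decision, reason_code)."""
    results: Dict[str, Tuple[int, float]] = {}
    for name, probe in providers.items():
        started = time.monotonic()
        try:
            height = probe()
        except Exception:
            height = None  # treat transport errors like a failed probe
        if height is None:
            state.quarantined.add(name)
            continue
        results[name] = (height, time.monotonic() - started)

    if not results:
        state.primary = None
        return ("DENY", "RPC_QUORUM_LOST")

    max_height = max(height for height, _ in results.values())
    healthy = []
    for name, (height, latency) in results.items():
        lag = max_height - height
        if lag < params.max_block_lag:
            state.quarantined.discard(name)  # restore once lag normalises
            healthy.append((lag, latency, name))
        elif params.auto_quarantine:
            state.quarantined.add(name)

    if len(healthy) < params.min_providers_quorum:
        state.primary = None  # never keep serving a stale primary
        return ("DENY", "RPC_QUORUM_LOST")

    # Lowest lag first, then lowest latency.
    state.primary = min(healthy)[2]
    return ("APPROVE", None)
```

Passing probes in as plain callables keeps the pass testable with fixtures that return fixed heights, with no network access.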

11. Wire Examples

Input — what arrives on the wire

Probe result from provider pool (onchain)

{
  "provider": "alchemy-polygon-1",
  "block_number": 58420100,
  "latency_ms": 45,
  "timestamp_ms": 1746768672000
}

Output — what the bot emits

RiskVote — APPROVE with primary

{
  "vote_id": "sec.rpc_failover_manager.20260509T170000Z",
  "decision": "APPROVE",
  "reason_code": null,
  "evidence": {
    "primary_provider": "alchemy-polygon-1",
    "healthy_count": 3,
    "max_lag_blocks": 1
  },
  "checked_at": "2026-05-09T17:00:00Z"
}

12. Decision Logic

APPROVE

At least min_providers_quorum healthy providers with lag < max_block_lag; primary elected.

RESHAPE_REQUIRED

Not applicable — manager either provides a healthy endpoint or rejects.

REJECT

Fewer than min_providers_quorum healthy providers (RPC_QUORUM_LOST).

WARNING_ONLY

Warn (RPC_QUORUM_WARN) when exactly min_providers_quorum providers remain healthy, since one more failure would trigger a reject.

13. Standard Decision Output

This bot returns a RiskVote object. See RiskVote schema.

{
  "vote_id": "sec.rpc_failover_manager.20260509T170000Z",
  "decision": "APPROVE",
  "reason_code": null,
  "evidence": {
    "primary_provider": "alchemy-polygon-1",
    "healthy_count": 3,
    "quarantined_count": 0,
    "max_lag_blocks": 1
  },
  "checked_at": "2026-05-09T17:00:00Z"
}
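
A small sketch of how the object above might be assembled. The vote_id format (bot id plus a compact UTC timestamp) is inferred from the example and is an assumption, as is the build_risk_vote name; the authoritative field list is the RiskVote schema.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

BOT_ID = "sec.rpc_failover_manager"


def build_risk_vote(
    decision: str,
    reason_code: Optional[str],
    evidence: Dict[str, Any],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble a RiskVote-shaped dict with UTC timestamps."""
    now = now or datetime.now(timezone.utc)
    return {
        "vote_id": f"{BOT_ID}.{now.strftime('%Y%m%dT%H%M%SZ')}",
        "decision": decision,
        "reason_code": reason_code,
        "evidence": evidence,
        "checked_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```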

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
KILL_SWITCH_ACTIVE | HARD_REJECT | Global kill switch is active. | Immediately return DENY. | Trading is currently paused.
RPC_QUORUM_LOST | HARD_REJECT | Fewer than min_providers_quorum healthy RPC providers available. | Return DENY on all chain reads until quorum restored. | The network connection is degraded. Orders are paused.
RPC_PROVIDER_LAGGING | WARN | A provider's block height lags by at least 2 blocks but less than max_block_lag; approaching quarantine threshold. | Log warn; keep provider active; increase probe frequency. | Network connectivity is slightly degraded.
RPC_QUORUM_WARN | WARN | Only min_providers_quorum providers remain healthy; one more failure triggers reject. | Emit warn; notify ops. | Network connectivity is limited.
RPC_FAILOVER_INFO | INFO | Primary provider switched to a new endpoint. | Log info; no action needed. | Network provider updated automatically.
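
The table can be carried in code as a lookup so that severity and user-facing text stay in one place. This is an illustrative sketch; the Severity enum and helper names are assumptions, and the messages are copied from the table.

```python
from enum import Enum


class Severity(Enum):
    HARD_REJECT = "HARD_REJECT"
    WARN = "WARN"
    INFO = "INFO"


# code -> (severity, user-facing message)
REASON_CODES = {
    "KILL_SWITCH_ACTIVE": (Severity.HARD_REJECT, "Trading is currently paused."),
    "RPC_QUORUM_LOST": (
        Severity.HARD_REJECT,
        "The network connection is degraded. Orders are paused.",
    ),
    "RPC_PROVIDER_LAGGING": (Severity.WARN, "Network connectivity is slightly degraded."),
    "RPC_QUORUM_WARN": (Severity.WARN, "Network connectivity is limited."),
    "RPC_FAILOVER_INFO": (Severity.INFO, "Network provider updated automatically."),
}


def blocks_reads(code: str) -> bool:
    """Only HARD_REJECT codes stop chain reads."""
    return REASON_CODES[code][0] is Severity.HARD_REJECT


def user_message(code: str) -> str:
    """User-facing text for a reason code."""
    return REASON_CODES[code][1]
```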

15. Metrics & Logs

Metrics emitted

Metric | Type | Unit | Labels | Meaning
polytraders_sec_rpcfailovermanager_healthy_providers | gauge | count | (none) | Number of currently healthy providers.
polytraders_sec_rpcfailovermanager_block_lag | gauge | blocks | provider | Current block lag per provider relative to maximum observed.
polytraders_sec_rpcfailovermanager_failovers_total | counter | count | (none) | Number of primary provider switches.
polytraders_sec_rpcfailovermanager_probe_latency_ms | histogram | ms | provider | Latency of eth_blockNumber probe per provider.

Alerts

Alert | Condition | Severity | Runbook
RPCQuorumLost | polytraders_sec_rpcfailovermanager_healthy_providers < min_providers_quorum | P0 | #runbook-rpc-quorum-lost
RPCHighFailoverRate | rate(polytraders_sec_rpcfailovermanager_failovers_total[5m]) > 2 | P1 | #runbook-rpc-failover-rate

16. Developer Reporting

{
  "bot_id": "sec.rpc_failover_manager",
  "decision": "APPROVE",
  "inputs_used": [
    "onchain.eth_blockNumber",
    "config.provider_pool"
  ],
  "checked_at": "2026-05-09T17:00:00Z"
}

17. Plain-English Reporting

Situation | User-facing explanation
Orders paused (RPC quorum lost) | The network connection is degraded. Orders are paused until connectivity is restored.
Provider failover | The primary network provider was switched automatically. No action needed.
Provider quarantined | One of the network providers was temporarily taken offline. Others are being used.

18. Failure-Mode Block

main_failure_mode: All providers simultaneously stale, causing chain state blindness for all on-chain checks.
false_positive_risk: A brief network hiccup quarantines providers unnecessarily, causing trading pause until re-probe succeeds.
false_negative_risk: Two providers both stale at the same height pass the quorum check but provide wrong data.
safe_fallback: If fewer than min_providers_quorum healthy providers: fail-closed on all chain reads; emit RPC_QUORUM_LOST.
required_dependencies: Configured RPC provider pool (at least 3 recommended), Admin UI config, KillSwitch

19. Failure-Injection Recipes

Scenario | How to inject | Expected behaviour | Recovery
PRIMARY_PROVIDER_STALE | Stop block production on primary provider (simulate stale) | (not specified) | Automatic when provider resumes producing blocks.
QUORUM_LOST | Quarantine all but 1 provider | (not specified) | Providers recover; quorum restored on next probe.
ALL_PROVIDERS_DOWN | Block all RPC endpoints | (not specified) | Manual provider pool update or network restoration.

20. State & Persistence

Cold-start recovery

Re-probe all providers on restart; no persistent state required.

21. Concurrency & Idempotency

Aspect | Specification
Execution model | background probe loop + sync per-request lookup
Max in-flight | 10
Idempotency key | probe_timestamp
Per-call timeout (ms) | 1000
Backpressure strategy | drop probe if previous not complete
Locking / mutual exclusion | read-write lock on provider health state
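
The backpressure rule ("drop probe if previous not complete") can be sketched with a non-blocking lock acquire. ProbeScheduler and its members are hypothetical names; the read-write lock on provider health state is not covered here, since the Python standard library has no built-in read-write lock.

```python
import threading
from typing import Callable


class ProbeScheduler:
    """Runs at most one probe pass at a time; overlapping ticks are dropped."""

    def __init__(self, run_probe: Callable[[], None]) -> None:
        self._run_probe = run_probe
        self._in_progress = threading.Lock()
        self.dropped = 0  # count of ticks skipped due to an unfinished probe

    def tick(self) -> bool:
        """Return True if a probe pass ran, False if this tick was dropped."""
        if not self._in_progress.acquire(blocking=False):
            self.dropped += 1
            return False
        try:
            self._run_probe()
            return True
        finally:
            self._in_progress.release()
```

Dropping rather than queueing keeps probe results fresh: a queued probe would report heights measured against an already outdated schedule.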

22. Dependencies

Depends on (must run first)

Bot | Why | Contract
risk.kill_switch | KillSwitch halts all probes. | DENY(KILL_SWITCH_ACTIVE) on all reads.

Emits to (downstream consumers)

Bot | Why | Contract
sec.chain_state_verifier | Provides healthy RPC endpoint for chain state reads. | Primary provider elected by RPCFailoverManager used by ChainStateVerifier.
gov.builder_attribution | Log failover events. | GovernanceLog entry on each failover.

Sibling bots (same OrderIntent)

Used by (auto-aggregated)

5.2 5.7

External services

Service | Endpoint | SLA assumed | On failure
Polygon RPC pool | Configured provider endpoints | best-effort per provider | Quarantine and failover; DENY if quorum lost.

23. Security Surfaces

Abuse vectors considered

  • Compromised RPC provider returning fraudulent block heights to pass quorum
  • BGP hijack routing traffic to malicious RPC node

Mitigations

  • Quorum of min_providers_quorum providers required; single provider cannot pass unilaterally
  • Provider pool configured in Admin UI with TLS-pinned endpoints

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | v2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | no
Aware of negative-risk markets | no
Multi-chain ready | no
SDK used | py-clob-client-v2
Settlement contract | CTFExchangeV2
Notes | Manages Polygon RPC provider health to ensure CTFExchangeV2 and pUSD contract reads are from a fresh block.

API surfaces declared

onchain, internal

Networks supported

polygon

25. Versioning & Migration

Field | Value
spec | 2.0.0
implementation | 0.1.0
schema | 2
released | None
planned_release | Q3-2026

Migration history

Date | From | To | Reason | Action taken
2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
Approve when primary provider has lag < max_block_lag | provider lag=1, max_block_lag=3 | APPROVE; primary returned
Quarantine provider with lag >= max_block_lag | provider lag=4, max_block_lag=3, auto_quarantine=true | Provider quarantined; failover to secondary
Reject when healthy providers < min_providers_quorum | only 1 healthy provider, min_providers_quorum=2 | DENY(RPC_QUORUM_LOST)

Integration Tests

Test | Expected result
Quarantined provider restored after lag normalises | Provider un-quarantined on next probe with lag < max_block_lag
KillSwitch halts all probes and rejects chain reads | DENY(KILL_SWITCH_ACTIVE) on all reads

Property Tests

Property | Required behaviour
healthy_count < min_providers_quorum always produces DENY | Always true
Primary provider always has lowest lag among healthy set | Always true
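
Both properties can be exercised with randomized inputs using only the standard library, as in the sketch below. The decide function is a simplified stand-in for the real decision logic (it works on lags only), and every name here is hypothetical; a property-testing library could replace the hand-rolled loop.

```python
import random
from typing import List, Optional, Tuple


def decide(
    lags: List[int], max_block_lag: int, min_providers_quorum: int
) -> Tuple[str, Optional[int]]:
    """Return (decision, lag of elected primary) for a set of provider lags."""
    healthy = [lag for lag in lags if lag < max_block_lag]
    if len(healthy) < min_providers_quorum:
        return ("DENY", None)
    return ("APPROVE", min(healthy))


def check_properties(trials: int = 1000, seed: int = 0) -> int:
    """Assert both properties over random pools; return the trial count."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        lags = [rng.randint(0, 6) for _ in range(rng.randint(0, 5))]
        max_lag = rng.randint(1, 4)
        quorum = rng.randint(1, 3)
        decision, primary_lag = decide(lags, max_lag, quorum)
        healthy = [lag for lag in lags if lag < max_lag]
        if len(healthy) < quorum:
            assert decision == "DENY"
        else:
            assert decision == "APPROVE"
            assert all(primary_lag <= lag for lag in healthy)
    return trials
```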

27. Operational Runbook

RPCQuorumLost is a P0 event — all chain-dependent checks are blocked. Restore provider connectivity immediately.

On-call actions

Alert | First step | Diagnosis | Mitigation | Escalate to
RPCQuorumLost
RPCHighFailoverRate

Manual overrides

Healthcheck

GET /internal/health/rpcfailovermanager → green if at least min_providers_quorum providers are healthy and a primary was elected within the last probe_interval_s; red if healthy_count < min_providers_quorum or no probe has completed in the last 30s.
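
A hedged sketch of the healthcheck decision follows. The spec does not say what to report when the last probe is older than probe_interval_s but newer than 30s; this sketch treats only the two stated red conditions as red. The function and parameter names are assumptions.

```python
def health_status(
    healthy_count: int,
    min_providers_quorum: int,
    seconds_since_last_probe: float,
    max_probe_age_s: float = 30.0,  # the 30s staleness limit from the healthcheck
) -> str:
    """Return "red" for the two stated failure conditions, else "green"."""
    if healthy_count < min_providers_quorum:
        return "red"
    if seconds_since_last_probe > max_probe_age_s:
        return "red"
    return "green"
```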

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
Provider quarantine and failover logic tested with simulated lag | CI integration test | 100% pass

Promote to Limited live

Gate | How measured | Threshold
Quorum-lost injection test fires DENY and alert correctly | Failure injection test | Pass

Promote to General live

Gate | How measured | Threshold
Zero RPCQuorumLost alerts in 48h shadow with real provider pool | Grafana alert history | 0 alerts

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done