Polytraders Dev Guide
internal

6.2 Health & Heartbeat



v3 readiness

Docs: 27/27 (done)
Impl: 0/15 (pending)
Backtest: 0/4 (pending)
Runtime: 0/8 (pending)

A bot is done when all four scores are complete.

1. Bot Identity

Layer | Governance
Bot class | Governance Service
Authority | Explain
Status | LIVE
Readiness | General live
Runs before | Every bot lifecycle decision — HealthHeartbeat must confirm liveness before strategy logic executes
Runs after | System startup; triggered on CronRunner schedule (every heartbeat_interval_s)
Applies to | All 97 production bots across all layers
Default mode | general_live
User-visible | Advanced details only
Developer owner | Polytraders core — Governance pod

2. Purpose

HealthHeartbeat monitors the liveness of all 97 production bots by polling each bot's internal health endpoint at a configurable interval. If a bot misses missed_heartbeats_to_alert consecutive polls, HealthHeartbeat emits a page-severity alert and optionally triggers an auto-restart. It emits an OperationsReport after every sweep cycle summarising bot health across all layers. Internal-only — no external API surface.

3. Why This Bot Matters

  • A bot crashes silently without HealthHeartbeat running

    The dead bot's layer is unguarded. Risk votes, kill-switch checks, or execution guards may stop firing, allowing uncontrolled order flow.

  • Auto-restart fires for a bot in a crash-loop

    Repeated restarts mask a systemic failure and exhaust restart budgets. Without a circuit breaker, the governance layer itself degrades.

  • Alert not fired on missed heartbeats

    On-call is not paged. The dead bot may go unnoticed for hours, accumulating unmonitored risk exposure.

  • HealthHeartbeat itself is not monitored

    The watchdog is unwatched. A dead HealthHeartbeat means all 97 bots run without liveness supervision.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
None — all inputs are internal | internal | No | HealthHeartbeat does not consume any Polymarket API surface directly.

5. Required Internal Inputs

Input | Source | Required? | Use
Bot health endpoints — GET /internal/health/<slug> | All 97 production bots | Yes | Primary liveness signal. A 200 response within timeout_ms is a live heartbeat.
Bot registry — list of all bot slugs, layers, and restart configs | Config store | Yes | Defines the set of bots to monitor and their per-bot restart and alerting rules.
Restart executor — internal command bus topic for restart triggers | Process manager | No | When auto_restart=true, HealthHeartbeat publishes a restart command to the process manager after missed_heartbeats_to_alert consecutive misses.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
heartbeat_interval_s | 30 | 120 | 300 | How often (in seconds) HealthHeartbeat polls each bot's health endpoint.
missed_heartbeats_to_alert | 3 | 5 | 10 | Number of consecutive missed polls before an alert is fired.
auto_restart | true | None | None | When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget.
page_on_failure | true | None | None | When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold.

7. Detailed Parameter Instructions

heartbeat_interval_s

What it means

How often (in seconds) HealthHeartbeat polls each bot's health endpoint.

Default

{ "heartbeat_interval_s": 30 }

Why this default matters

30s gives a 90s detection window for a 3-miss threshold. Increasing beyond 120s delays alerting significantly.
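The detection window is just the poll interval times the miss threshold. A minimal sketch of that arithmetic; `detection_window_s` is an illustrative helper, not part of the codebase:

```python
def detection_window_s(interval_s: int, misses_to_alert: int) -> int:
    """Worst-case delay (seconds) between a bot dying and the alert firing."""
    return interval_s * misses_to_alert

# Defaults: 30 s interval x 3 consecutive misses -> 90 s detection window.
print(detection_window_s(30, 3))   # 90
# At the 120 s warning threshold the same miss count stretches to 360 s.
print(detection_window_s(120, 3))  # 360
```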

Threshold logic

Condition | Action
heartbeat_interval_s <= 30 | Normal monitoring
30–120 s | WARN — detection latency increased
> 300 s | Reject config change — PARAMETER_CHANGE_REQUIRES_APPROVAL

Developer check

if (p.heartbeat_interval_s > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')

User-facing English

The system checks that all components are running regularly.

missed_heartbeats_to_alert

What it means

Number of consecutive missed polls before an alert is fired.

Default

{ "missed_heartbeats_to_alert": 3 }

Why this default matters

3 consecutive misses (90s at default interval) is enough to distinguish a transient blip from a real crash.

Threshold logic

Condition | Action
missed <= 3 | Normal tolerance
4–10 | WARN — alert latency increased
> 10 | Reject — PARAMETER_CHANGE_REQUIRES_APPROVAL

Developer check

if (p.missed_heartbeats_to_alert > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')

User-facing English

A component is flagged as unhealthy only after multiple consecutive check failures, to avoid false alarms.

auto_restart

What it means

When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget.

Default

{ "auto_restart": true }

Why this default matters

Auto-restart recovers from transient crashes without manual intervention, minimising downtime for governance bots.

Threshold logic

Condition | Action
auto_restart=true AND misses >= threshold | Publish restart command; emit HEALTH_HEARTBEAT_AUTO_RESTART
restart_budget exhausted | Emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; page on-call without restarting

Developer check

if (p.auto_restart && misses >= p.missed_heartbeats_to_alert) triggerRestart(bot_slug)

User-facing English

If a component stops responding, the system will attempt to restart it automatically.

page_on_failure

What it means

When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold.

Default

{ "page_on_failure": true }

Why this default matters

Every bot that stops heartbeating is a potential live incident. Paging is mandatory.

Threshold logic

Condition | Action
page_on_failure=true AND misses >= threshold | Fire page-severity alert
page_on_failure=false | Not permitted — parameter is locked to true

Developer check

if (!p.page_on_failure) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')

User-facing English

Critical system components are monitored by an on-call team.

8. Default Configuration

{
  "bot_id": "gov.health_heartbeat",
  "version": "2.0.0",
  "mode": "general_live",
  "defaults": {
    "heartbeat_interval_s": 30,
    "missed_heartbeats_to_alert": 3,
    "auto_restart": true,
    "page_on_failure": true
  },
  "locked": {
    "page_on_failure": {
      "immutable": true
    },
    "heartbeat_interval_s": {
      "max": 300
    },
    "missed_heartbeats_to_alert": {
      "max": 10
    }
  }
}
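The locked block above implies a load-time validation pass. A sketch of that check, assuming one validator runs over all locked rules; `ConfigError`, `LOCKED`, and `validate` are illustrative names, not the production API:

```python
# Hypothetical config validator mirroring the locked-parameter rules above.
class ConfigError(Exception):
    pass

LOCKED = {
    "page_on_failure": {"immutable": True, "value": True},
    "heartbeat_interval_s": {"max": 300},
    "missed_heartbeats_to_alert": {"max": 10},
}

def validate(candidate: dict) -> None:
    """Raise if a candidate config violates any locked constraint."""
    for key, rule in LOCKED.items():
        value = candidate.get(key)
        if rule.get("immutable") and value != rule["value"]:
            raise ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL")
        if "max" in rule and value is not None and value > rule["max"]:
            raise ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL")

# The shipped defaults pass cleanly.
validate({"page_on_failure": True, "heartbeat_interval_s": 30,
          "missed_heartbeats_to_alert": 3})
```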

9. Implementation Flow

  1. On startup, load the bot registry from the config store; build a polling table keyed by bot_slug with miss_count=0.
  2. Every heartbeat_interval_s, iterate over all registered bots and issue GET /internal/health/<slug> with a timeout of heartbeat_interval_s/3.
  3. For each bot: if response is 200 within timeout, reset miss_count to 0 and emit INFO heartbeat.
  4. If response is non-200 or times out, increment miss_count.
  5. When miss_count >= missed_heartbeats_to_alert: emit page alert (HEALTH_HEARTBEAT_BOT_DOWN) and, if auto_restart=true, publish restart command to the process manager.
  6. Track restart budget per bot (default 3 restarts per 10 minutes). If budget is exhausted, emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED and stop auto-restarting.
  7. After each full sweep, emit an OperationsReport summarising: total_bots, healthy_count, unhealthy_count, restarted_count, sweep_duration_ms.
  8. HealthHeartbeat itself is monitored by a watchdog process (deadman timer) that pages if no OperationsReport is emitted within 2x heartbeat_interval_s.
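Step 6's restart budget is a fixed-window counter. A minimal sketch under those assumptions (3 restarts per 600 s window); `RestartBudget` is a hypothetical name, and the clock is injectable purely so the window logic can be exercised in a fixture:

```python
import time

class RestartBudget:
    """Per-bot restart budget: at most max_restarts per window_s seconds."""

    def __init__(self, max_restarts: int = 3, window_s: int = 600,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock
        self.count = 0
        self.window_start = clock()

    def try_consume(self) -> bool:
        """Spend one restart and return True, or False if exhausted."""
        now = self.clock()
        if now - self.window_start > self.window_s:
            self.count = 0            # window elapsed: reset the budget
            self.window_start = now
        if self.count < self.max_restarts:
            self.count += 1
            return True
        return False
```

With a fake clock, three restarts succeed, the fourth is blocked (the point at which HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED would fire), and the budget reopens once the 10-minute window passes.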

10. Reference Implementation

Polls all 97 registered bots' health endpoints every heartbeat_interval_s, tracks consecutive misses, fires alerts and auto-restarts at threshold, emits a sweep OperationsReport after each cycle.

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. Translate to TS/Python/Go/Rust.

// ---- STARTUP ----
FUNCTION init():
  registry = FETCH config_store.GET('/bot-registry')
  miss_counts = { slug: 0 FOR slug IN registry }
  restart_budgets = { slug: { count: 0, window_start: now() } FOR slug IN registry }
  setInterval(runSweep, config.heartbeat_interval_s * 1000)

// ---- SWEEP ----
FUNCTION runSweep():
  sweep_start = now()
  healthy = 0; unhealthy = 0; restarted = 0
  unhealthy_bots = []

  FOR bot IN registry:
    response = FETCH GET '/internal/health/' + bot.slug
      TIMEOUT config.heartbeat_interval_s / 3 * 1000

    IF response.status == 200:
      IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
        EMIT alert(HEALTH_HEARTBEAT_BOT_RECOVERED, bot.slug)
      miss_counts[bot.slug] = 0
      healthy += 1
    ELSE:
      miss_counts[bot.slug] += 1
      unhealthy += 1

      IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
        alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', {
          slug: bot.slug, miss_count: miss_counts[bot.slug] })

        IF config.auto_restart:
          budget = restart_budgets[bot.slug]
          IF (now() - budget.window_start) > 600_000:  // 10-min window
            budget.count = 0; budget.window_start = now()
          IF budget.count < 3:
            internal_bus.publish('process.restart', { slug: bot.slug })
            budget.count += 1; restarted += 1
            unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'restarted' })
          ELSE:
            alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', { slug: bot.slug })
            unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'budget_exhausted' })

  EMIT OperationsReport({
    report_id:         'ops_health_' + sweep_start,
    event_type:        'HEALTH_SWEEP_COMPLETE',
    total_bots:        len(registry),
    healthy_count:     healthy,
    unhealthy_count:   unhealthy,
    restarted_count:   restarted,
    sweep_duration_ms: now() - sweep_start,
    unhealthy_bots:    unhealthy_bots,
    fired_at_ms:       sweep_start
  })

SDK calls used

  • config_store.GET('/bot-registry')
  • FETCH GET '/internal/health/<slug>' TIMEOUT <ms>
  • internal_bus.publish('process.restart', { slug })
  • alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', metadata)
  • alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', metadata)

Complexity: O(N) per sweep where N = 97 registered bots

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Health endpoint poll response",
  "source": "internal /internal/health/<slug>",
  "payload": {
    "slug": "strat.some_strategy",
    "status": "ok",
    "last_decision_ms": 1746791970000,
    "uptime_s": 86400
  }
}

Output — what the bot emits

{
  "label": "OperationsReport — HEALTH_SWEEP_COMPLETE",
  "payload": {
    "report_id": "ops_health_1746792000000",
    "bot_id": "gov.health_heartbeat",
    "event_type": "HEALTH_SWEEP_COMPLETE",
    "total_bots": 97,
    "healthy_count": 96,
    "unhealthy_count": 1,
    "restarted_count": 1,
    "sweep_duration_ms": 840,
    "unhealthy_bots": [
      {
        "slug": "strat.some_strategy",
        "miss_count": 3,
        "action": "restarted"
      }
    ],
    "fired_at_ms": 1746792000000,
    "report_kind": "OperationsReport"
  }
}

12. Decision Logic

APPROVE

Not applicable — HealthHeartbeat does not approve or reject trading decisions.

RESHAPE_REQUIRED

Not applicable.

REJECT

Not applicable as a trading decision.

WARNING_ONLY

A single missed heartbeat increments the miss counter but does not fire an alert. Only consecutive misses at or above the threshold trigger an alert or restart.

13. Standard Decision Output

This bot returns an OperationsReport object. See OperationsReport schema.

{
  "report_id": "ops_health_20260509T120000Z",
  "bot_id": "gov.health_heartbeat",
  "event_type": "HEALTH_SWEEP_COMPLETE",
  "total_bots": 97,
  "healthy_count": 96,
  "unhealthy_count": 1,
  "restarted_count": 1,
  "sweep_duration_ms": 840,
  "unhealthy_bots": [
    {
      "slug": "strat.some_strategy",
      "miss_count": 3,
      "action": "restarted"
    }
  ],
  "fired_at_ms": 1746792000000,
  "report_kind": "OperationsReport"
}

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
HEALTH_HEARTBEAT_SWEEP_COMPLETE | INFO | Full sweep of all registered bots completed; OperationsReport emitted. | No action — routine heartbeat. | (none)
HEALTH_HEARTBEAT_BOT_DOWN | WARN | A bot has exceeded the missed_heartbeats_to_alert threshold of consecutive missed polls. | Fire page-severity alert; trigger auto-restart if enabled. | A system component is not responding. The on-call team has been notified.
HEALTH_HEARTBEAT_BOT_RECOVERED | INFO | A previously unhealthy bot returned a healthy response; miss_count reset to 0. | Emit recovery notification; no further action. | A component that was restarted is now healthy.
HEALTH_HEARTBEAT_AUTO_RESTART | WARN | HealthHeartbeat triggered an automatic restart for a bot that missed the heartbeat threshold. | Log restart; increment restart budget counter. | A component was automatically restarted.
HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED | WARN | A bot has been restarted the maximum number of times within the restart budget window without recovering. | Stop auto-restarting; escalate page to on-call. | Automatic restart attempts have been exhausted for a component. Manual intervention is required.
HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT | WARN | A bot's health endpoint did not respond within the configured timeout. | Treat as missed heartbeat; increment miss_count. | (none)
KILL_SWITCH_ACTIVE | WARN | KillSwitch is active; this is surfaced in the sweep report for context. | Continue monitoring all bots; do not suppress health checks. | (none)
HEALTH_HEARTBEAT_REGISTRY_STALE | WARN | The bot registry has not been refreshed from the config store within 5 minutes. | Retry registry fetch; alert if stale for > 10 minutes. | (none)

15. Metrics & Logs

Metrics emitted

Metric | Type | Unit | Labels | Meaning
polytraders_gov_healthheartbeat_bots_healthy | gauge | count | (none) | Number of bots currently in healthy state.
polytraders_gov_healthheartbeat_bots_unhealthy | gauge | count | (none) | Number of bots currently in unhealthy state (above miss threshold).
polytraders_gov_healthheartbeat_restarts_total | counter | count | slug | Total auto-restarts triggered per bot slug.
polytraders_gov_healthheartbeat_misses_total | counter | count | slug | Total missed heartbeat polls per bot slug.
polytraders_gov_healthheartbeat_sweep_duration_ms | histogram | ms | (none) | Wall-clock latency of a full 97-bot sweep cycle.
polytraders_gov_healthheartbeat_sweeps_total | counter | count | (none) | Total sweep cycles completed.

Alerts

Alert | Condition | Severity | Runbook
HealthHeartbeatBotDown | polytraders_gov_healthheartbeat_bots_unhealthy > 0 | page | #runbook-healthheartbeat-bot-down
HealthHeartbeatRestartBudgetExhausted | rate(polytraders_gov_healthheartbeat_restarts_total[10m]) > 3 | page | #runbook-healthheartbeat-restart-budget
HealthHeartbeatSweepMissing | rate(polytraders_gov_healthheartbeat_sweeps_total[5m]) == 0 | page | #runbook-healthheartbeat-missing
HealthHeartbeatSweepLatencyHigh | histogram_quantile(0.99, polytraders_gov_healthheartbeat_sweep_duration_ms) > 25000 | warn | #runbook-healthheartbeat-latency

Dashboards

  • Grafana — Governance / HealthHeartbeat liveness overview (all 97 bots)
  • Grafana — Governance / Auto-restart rate and budget consumption

16. Developer Reporting

{
  "bot_id": "gov.health_heartbeat",
  "event_type": "HEALTH_BOT_MISS",
  "slug": "strat.some_strategy",
  "miss_count": 2,
  "threshold": 3,
  "last_seen_ms": 1746791940000,
  "fired_at_ms": 1746791970000
}

17. Plain-English Reporting

Situation | User-facing explanation
All bots healthy | All system components passed their health checks. Everything is running normally.
A bot was auto-restarted | A component stopped responding and was automatically restarted. Trading and risk monitoring continued without interruption.
A bot is down and restart budget exhausted | A component is not responding and automatic restart attempts have been exhausted. The on-call team has been notified.

18. Failure-Mode Block

main_failure_mode | HealthHeartbeat itself crashes, silently leaving all 97 bots unmonitored. Requires an external deadman watchdog.
false_positive_risk | A healthy bot's health endpoint returns 503 transiently (e.g., during a rolling restart), triggering a spurious miss counter increment.
false_negative_risk | A bot crashes but its health endpoint continues to respond 200 from a zombie process that has stopped processing events — HealthHeartbeat sees it as healthy.
safe_fallback | If HealthHeartbeat cannot reach a bot's health endpoint due to a network partition, it increments miss_count normally and fires the alert after the threshold. The bot is never silently marked healthy on connectivity loss.
required_dependencies | Bot registry (config store); internal health endpoints on all 97 bots; process manager (for auto-restart commands); alerting / paging system; deadman watchdog for HealthHeartbeat itself

19. Failure-Injection Recipes

Scenario | How to inject | Expected behaviour | Recovery
BOT_CRASH | Kill a bot process so its health endpoint stops responding | miss_count increments each poll; after missed_heartbeats_to_alert misses, HEALTH_HEARTBEAT_BOT_DOWN alert fires and restart is triggered | Bot restarts; miss_count resets to 0; HEALTH_HEARTBEAT_BOT_RECOVERED emitted.
RESTART_BUDGET_EXHAUSTED | Repeatedly kill a bot faster than the restart_budget window allows (3 crashes in < 10 min) | Third restart fires; fourth missed threshold triggers HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; no further auto-restart | Manual intervention required; budget resets after 10-minute window.
HEALTH_HEARTBEAT_SELF_CRASH | Kill the HealthHeartbeat process | Deadman watchdog fires page after 2x heartbeat_interval_s without a sweep OperationsReport | HealthHeartbeat is restarted by the process manager; sweep resumes; miss counts reinitialised.
ENDPOINT_TIMEOUT | Set a mock health endpoint to respond after 30s (beyond timeout) | HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT logged; miss_count incremented | When endpoint responds within timeout, miss_count resets.
NETWORK_PARTITION | Block internal network between HealthHeartbeat and a subset of bots | Affected bots' miss counts increment; alert fires at threshold; restart attempted (a network partition means restart may not help) | Network restored; bots return to healthy; miss counts reset.

20. State & Persistence

Cold-start recovery

On restart, all miss_counts reset to 0. The first sweep re-establishes the health baseline.

21. Concurrency & Idempotency

Aspect | Specification
Execution model | thread-pool (one HTTP poll per bot in parallel)
Max in-flight | 97
Idempotency key | slug + sweep_start_ms
Per-call timeout (ms) | 10000
Backpressure strategy | cap parallel polls at max_in_flight=97; excess queued to next sweep
Locking / mutual exclusion | per-slug mutex on miss_counts and restart_budgets
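One way to express "parallel polls capped at max_in_flight with a per-call timeout" is a semaphore-guarded fan-out. A sketch under those assumptions, using asyncio rather than the thread-pool named above; `poll_all` and the stub `fake_poll` are illustrative, with the real call being a GET against /internal/health/<slug>:

```python
import asyncio

async def poll_all(slugs, poll_one, max_in_flight=97, timeout_s=10.0):
    """Poll every registered bot concurrently, capped at max_in_flight.
    A per-call timeout maps to None, which the sweep counts as a miss."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(slug):
        async with sem:
            try:
                return slug, await asyncio.wait_for(poll_one(slug), timeout_s)
            except asyncio.TimeoutError:
                return slug, None

    return dict(await asyncio.gather(*(guarded(s) for s in slugs)))

async def fake_poll(slug):
    # Illustrative stub: one bot hangs well beyond the timeout.
    if slug == "strat.slow_bot":
        await asyncio.sleep(1.0)
    return 200

results = asyncio.run(poll_all(["gov.kill_switch", "strat.slow_bot"],
                               fake_poll, timeout_s=0.1))
```

Here `results["gov.kill_switch"]` is 200 while the hung bot maps to None, exactly the signal the miss counter needs.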

22. Dependencies

Depends on (must run first)

Bot | Why | Contract
internal.config_store | Bot registry is loaded from the config store on startup. | (not specified)

Emits to (downstream consumers)

Bot | Why | Contract
internal.process_manager | Receives the restart commands HealthHeartbeat publishes when auto_restart=true. | (not specified)

Sibling bots (same OrderIntent)

Bot | Why | Contract
gov.cron_runner | CronRunner fires the hourly health sweep trigger. | (not specified)

External services

Service | Endpoint | SLA assumed | On failure
Alerting / paging system | (not specified) | 99.9% (internal SRE target) | (not specified)

23. Security Surfaces

Abuse vectors considered

  • A bot returns a fake 200 response from a zombie process to avoid restart
  • Raising missed_heartbeats_to_alert to a very high value to prevent alerts from firing
  • Disabling page_on_failure to suppress alerting

Mitigations

  • page_on_failure is locked immutable; cannot be disabled
  • heartbeat_interval_s and missed_heartbeats_to_alert have hard maximums enforced at config load
  • Health endpoint responses are checked for a valid JSON body, not just HTTP status
  • HealthHeartbeat itself is monitored by an external deadman watchdog
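The zombie-process mitigation ("valid JSON body, not just HTTP status") can be sketched in a few lines; `is_live` is a hypothetical helper, and the field names follow the wire example in section 11:

```python
import json

def is_live(status_code: int, body: str) -> bool:
    """A 200 alone is not trusted: the body must parse as JSON and
    report status 'ok' before the poll counts as a heartbeat."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "ok"

# A proxy error page behind a 200 is still treated as a miss.
print(is_live(200, "<html>bad gateway</html>"))  # False
```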

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | v2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | no
Aware of negative-risk markets | no
Multi-chain ready | no
SDK used | internal-only
Settlement contract | none
Notes | HealthHeartbeat monitors liveness of all bots, including V2-aware ones, but has no direct CLOB or on-chain interface itself.

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field | Value
spec | 2.0.0
implementation | 2.1.0
schema | 2
released | 2026-04-28

Migration history

Date | From | To | Reason | Action taken
2026-04-28 | v1 | v2 | CLOB V2 cutover | No direct CLOB changes required. Updated OperationsReport schema; removed stale USDC.e references from sweep report payloads. Added V2-aware bots to the monitoring registry.

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
miss_count increments on non-200 response | Mock health endpoint returns 503 | miss_count incremented; no alert below threshold
Alert fires at threshold | miss_count == missed_heartbeats_to_alert | HEALTH_HEARTBEAT_BOT_DOWN alert emitted; restart triggered if auto_restart=true
Restart budget enforced | 3 restarts in 10 minutes for same bot | 4th restart blocked; HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED emitted
miss_count resets on recovery | Bot returns to 200 after 2 misses | miss_count reset to 0; no HEALTH_HEARTBEAT_BOT_RECOVERED emitted below threshold
heartbeat_interval_s above hard maximum rejected | heartbeat_interval_s=400 | ConfigError PARAMETER_CHANGE_REQUIRES_APPROVAL
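The first rows of the unit-test table can be exercised against a tiny in-memory model of the miss-count state machine. A sketch only; `HeartbeatTracker` is a hypothetical name standing in for whatever class owns miss_counts in the real implementation:

```python
# Minimal model of the miss-count / alert logic described in sections 9-10.
class HeartbeatTracker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.miss_count = 0
        self.alerts = []

    def record(self, healthy: bool) -> None:
        if healthy:
            # Recovery alert only if the bot had actually tripped the threshold.
            if self.miss_count >= self.threshold:
                self.alerts.append("HEALTH_HEARTBEAT_BOT_RECOVERED")
            self.miss_count = 0
        else:
            self.miss_count += 1
            if self.miss_count == self.threshold:
                self.alerts.append("HEALTH_HEARTBEAT_BOT_DOWN")

t = HeartbeatTracker()
t.record(False); t.record(False)
assert t.alerts == []                       # below threshold: silent
t.record(False)
assert t.alerts == ["HEALTH_HEARTBEAT_BOT_DOWN"]
t.record(True)
assert t.alerts[-1] == "HEALTH_HEARTBEAT_BOT_RECOVERED"
assert t.miss_count == 0
```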

Integration Tests

Test | Expected result
Full sweep of all 97 bots completes within heartbeat_interval_s | OperationsReport emitted with total_bots=97 within configured interval
Auto-restart command delivered to process manager | Restart command published; bot restarts; miss_count resets on recovery

Property Tests

Property | Required behaviour
Every missed heartbeat increments miss_count; no miss is silently dropped | Always true
An OperationsReport is emitted after every sweep cycle | Always true

27. Operational Runbook

HealthHeartbeat incidents are either a bot going down (most common), the restart budget exhausting on a crash-looping bot, or HealthHeartbeat itself failing. All three require immediate response.

On-call actions

Alert | First step & diagnosis | Escalate to
HealthHeartbeatBotDown | Identify which bot(s) are unhealthy from the sweep OperationsReport. Check bot logs for crash details. | Layer pod lead for the affected bot
HealthHeartbeatRestartBudgetExhausted | Do NOT manually restart the bot without investigating crash logs. Check for crash-loop root cause. | Layer pod lead + SRE on-call immediately
HealthHeartbeatSweepMissing | Check HealthHeartbeat process status; verify deadman watchdog is running. | Governance pod lead immediately
HealthHeartbeatSweepLatencyHigh | Check internal network latency to bot health endpoints; reduce parallel poll count if overloaded. | SRE on-call after 30 minutes

Manual overrides

  • polytraders gov health pause-restart --slug <slug> — Stop auto-restart for a specific bot while investigating a crash-loop.

Healthcheck

Endpoint: /internal/health/health-heartbeat
Green: Last sweep completed within 2x heartbeat_interval_s; all bots polled; OperationsReport emitted.
Red: No sweep in 2x heartbeat_interval_s; registry load failed; process unresponsive.

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
Unit tests pass for miss counting, alert threshold, and restart budget | CI test run | 100% pass

Promote to Limited live

Gate | How measured | Threshold
Full 97-bot sweep completes within heartbeat_interval_s under normal load | polytraders_gov_healthheartbeat_sweep_duration_ms histogram | < 30s sweep for 97 bots

Promote to General live

Gate | How measured | Threshold
End-to-end: bot crash detected and auto-restarted within 3 sweep cycles | Failure injection test | Pass
Restart budget exhaustion alert fires and stops further restarts | Failure injection test | Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done