1. Bot Identity
| Layer | Governance |
|---|---|
| Bot class | Governance Service |
| Authority | Explain |
| Status | LIVE |
| Readiness | General live |
| Runs before | Every bot lifecycle decision — HealthHeartbeat must confirm liveness before strategy logic executes |
| Runs after | System startup; triggered on CronRunner schedule (every heartbeat_interval_s) |
| Applies to | All 97 production bots across all layers |
| Default mode | general_live |
| User-visible | Advanced details only |
| Developer owner | Polytraders core — Governance pod |
2. Purpose
HealthHeartbeat monitors the liveness of all 97 production bots by polling each bot's internal health endpoint at a configurable interval. If a bot misses missed_heartbeats_to_alert consecutive polls, HealthHeartbeat emits a page-severity alert and optionally triggers an auto-restart. It emits an OperationsReport after every sweep cycle summarising bot health across all layers. Internal-only — no external API surface.
3. Why This Bot Matters
| Scenario | Consequence |
|---|---|
| A bot crashes silently without HealthHeartbeat running | The dead bot's layer is unguarded. Risk votes, kill-switch checks, or execution guards may stop firing, allowing uncontrolled order flow. |
| Auto-restart fires for a bot in a crash-loop | Repeated restarts mask a systemic failure and exhaust restart budgets. Without a circuit breaker, the governance layer itself degrades. |
| Alert not fired on missed heartbeats | On-call is not paged. The dead bot may go unnoticed for hours, accumulating unmonitored risk exposure. |
| HealthHeartbeat itself is not monitored | The watchdog is unwatched. A dead HealthHeartbeat means all 97 bots run without liveness supervision. |
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|---|---|---|---|
| heartbeat_interval_s | 30 | 120 | 300 | How often (in seconds) HealthHeartbeat polls each bot's health endpoint. |
| missed_heartbeats_to_alert | 3 | 5 | 10 | Number of consecutive missed polls before an alert is fired. |
| auto_restart | True | None | None | When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget. |
| page_on_failure | True | None | None | When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold. |
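Taken together, these limits reduce to one validation pass at config load. Below is a minimal TypeScript sketch of that pass, assuming the warning and hard values from the table above; the `HealthConfig` shape and `validateHealthConfig` name are illustrative, not part of the SDK.

```ts
// Illustrative validator for the parameter guide above.
// Hard limits reject with PARAMETER_CHANGE_REQUIRES_APPROVAL; warning limits only flag.
interface HealthConfig {
  heartbeat_interval_s: number;
  missed_heartbeats_to_alert: number;
  auto_restart: boolean;
  page_on_failure: boolean;
}

class ConfigError extends Error {}

function validateHealthConfig(p: HealthConfig): string[] {
  // Hard maximums (locked in Section 8): reject outright.
  if (p.heartbeat_interval_s > 300 || p.missed_heartbeats_to_alert > 10) {
    throw new ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL");
  }
  // page_on_failure is locked to true and may never be disabled.
  if (!p.page_on_failure) {
    throw new ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL");
  }
  // Warning thresholds: allowed, but surfaced for review.
  const warnings: string[] = [];
  if (p.heartbeat_interval_s > 120) warnings.push("detection latency increased");
  if (p.missed_heartbeats_to_alert > 5) warnings.push("alert latency increased");
  return warnings;
}
```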
7. Detailed Parameter Instructions
heartbeat_interval_s
What it means
How often (in seconds) HealthHeartbeat polls each bot's health endpoint.
Default
{ "heartbeat_interval_s": 30 }
Why this default matters
30s gives a 90s detection window for a 3-miss threshold. Increasing beyond 120s delays alerting significantly.
Threshold logic
| Condition | Action |
|---|---|
| heartbeat_interval_s <= 30 | Normal monitoring |
| 30–300s | WARN — detection latency increased |
| > 300s | Reject config change — PARAMETER_CHANGE_REQUIRES_APPROVAL |
Developer check
if (p.heartbeat_interval_s > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
The system checks that all components are running regularly.
missed_heartbeats_to_alert
What it means
Number of consecutive missed polls before an alert is fired.
Default
{ "missed_heartbeats_to_alert": 3 }
Why this default matters
3 consecutive misses (90s at default interval) is enough to distinguish a transient blip from a real crash.
Threshold logic
| Condition | Action |
|---|---|
| missed <= 3 | Normal tolerance |
| 4–10 | WARN — alert latency increased |
| > 10 | Reject — PARAMETER_CHANGE_REQUIRES_APPROVAL |
Developer check
if (p.missed_heartbeats_to_alert > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
A component is flagged as unhealthy only after multiple consecutive check failures, to avoid false alarms.
auto_restart
What it means
When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget.
Default
{ "auto_restart": true }
Why this default matters
Auto-restart recovers from transient crashes without manual intervention, minimising downtime for governance bots.
Threshold logic
| Condition | Action |
|---|---|
| auto_restart=true AND misses >= threshold | Publish restart command; emit HEALTH_HEARTBEAT_AUTO_RESTART |
| restart_budget exhausted | Emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; page on-call without restarting |
Developer check
if (p.auto_restart && misses >= p.missed_heartbeats_to_alert) triggerRestart(bot_slug)
User-facing English
If a component stops responding, the system will attempt to restart it automatically.
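The restart budget referenced above is a fixed 10-minute window with a cap of 3 restarts (Section 9). Below is a minimal sketch of that check; `tryRestart` is a hypothetical helper name, and the actual restart is published via internal_bus.publish('process.restart', …) per the reference implementation.

```ts
// Fixed-window restart budget: at most MAX_RESTARTS per WINDOW_MS, per bot.
// Constants mirror the Section 9 defaults; tryRestart is an illustrative name.
interface RestartBudget { count: number; windowStartMs: number; }

const WINDOW_MS = 600_000; // 10-minute window
const MAX_RESTARTS = 3;

function tryRestart(budget: RestartBudget, nowMs: number): boolean {
  if (nowMs - budget.windowStartMs > WINDOW_MS) {
    budget.count = 0;            // window elapsed: reset the counter
    budget.windowStartMs = nowMs;
  }
  if (budget.count >= MAX_RESTARTS) {
    return false; // exhausted: caller emits HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED
  }
  budget.count += 1;
  return true;    // caller publishes the restart and emits HEALTH_HEARTBEAT_AUTO_RESTART
}
```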
page_on_failure
What it means
When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold.
Default
{ "page_on_failure": true }
Why this default matters
Every bot that stops heartbeating is a potential live incident. Paging is mandatory.
Threshold logic
| Condition | Action |
|---|---|
| page_on_failure=true AND misses >= threshold | Fire page-severity alert |
| page_on_failure=false | Not permitted — parameter is locked to true |
Developer check
if (!p.page_on_failure) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
Critical system components are monitored by an on-call team.
8. Default Configuration
{
"bot_id": "gov.health_heartbeat",
"version": "2.0.0",
"mode": "general_live",
"defaults": {
"heartbeat_interval_s": 30,
"missed_heartbeats_to_alert": 3,
"auto_restart": true,
"page_on_failure": true
},
"locked": {
"page_on_failure": {
"immutable": true
},
"heartbeat_interval_s": {
"max": 300
},
"missed_heartbeats_to_alert": {
"max": 10
}
}
}
9. Implementation Flow
- On startup, load the bot registry from the config store; build a polling table keyed by bot_slug with miss_count=0.
- Every heartbeat_interval_s, iterate over all registered bots and issue GET /internal/health/<slug> with a timeout of heartbeat_interval_s/3.
- For each bot: if response is 200 within timeout, reset miss_count to 0 and emit INFO heartbeat.
- If response is non-200 or times out, increment miss_count.
- When miss_count >= missed_heartbeats_to_alert: emit page alert (HEALTH_HEARTBEAT_BOT_DOWN) and, if auto_restart=true, publish restart command to the process manager.
- Track restart budget per bot (default 3 restarts per 10 minutes). If budget is exhausted, emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED and stop auto-restarting.
- After each full sweep, emit an OperationsReport summarising: total_bots, healthy_count, unhealthy_count, restarted_count, sweep_duration_ms.
- HealthHeartbeat itself is monitored by a watchdog process (deadman timer) that pages if no OperationsReport is emitted within 2x heartbeat_interval_s.
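The deadman watchdog in the final step can be sketched as a timer that pages when no OperationsReport arrives within 2x heartbeat_interval_s. The sketch below is a hedged illustration: `onOperationsReport` and `pageOnCall` are assumed integration hooks, not documented SDK calls.

```ts
// Deadman timer: pages if HealthHeartbeat itself stops emitting sweep reports.
// onOperationsReport / pageOnCall are illustrative integration points.
function startDeadmanWatchdog(
  heartbeatIntervalS: number,
  onOperationsReport: (cb: () => void) => void,
  pageOnCall: (msg: string) => void,
): void {
  let lastReportMs = Date.now();
  onOperationsReport(() => { lastReportMs = Date.now(); });

  setInterval(() => {
    const silenceMs = Date.now() - lastReportMs;
    if (silenceMs > 2 * heartbeatIntervalS * 1000) {
      pageOnCall(`HealthHeartbeat silent for ${silenceMs} ms — deadman watchdog firing`);
    }
  }, heartbeatIntervalS * 1000);
}
```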
10. Reference Implementation
Polls all 97 registered bots' health endpoints every heartbeat_interval_s, tracks consecutive misses, fires alerts and auto-restarts at threshold, emits a sweep OperationsReport after each cycle.
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. Translate to TS/Python/Go/Rust.
// ---- STARTUP ----
FUNCTION init():
registry = FETCH config_store.GET('/bot-registry')
miss_counts = { slug: 0 FOR slug IN registry }
restart_budgets = { slug: { count: 0, window_start: now() } FOR slug IN registry }
setInterval(runSweep, config.heartbeat_interval_s * 1000)
// ---- SWEEP ----
FUNCTION runSweep():
sweep_start = now()
healthy = 0; unhealthy = 0; restarted = 0
unhealthy_bots = []
FOR bot IN registry:
response = FETCH GET '/internal/health/' + bot.slug
TIMEOUT config.heartbeat_interval_s / 3 * 1000
IF response.status == 200:
IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
EMIT alert(HEALTH_HEARTBEAT_BOT_RECOVERED, bot.slug)
miss_counts[bot.slug] = 0
healthy += 1
ELSE:
miss_counts[bot.slug] += 1
unhealthy += 1
IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', {
slug: bot.slug, miss_count: miss_counts[bot.slug] })
IF config.auto_restart:
budget = restart_budgets[bot.slug]
IF (now() - budget.window_start) > 600_000: // 10-min window
budget.count = 0; budget.window_start = now()
IF budget.count < 3:
internal_bus.publish('process.restart', { slug: bot.slug })
budget.count += 1; restarted += 1
unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'restarted' })
ELSE:
alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', { slug: bot.slug })
unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'budget_exhausted' })
EMIT OperationsReport({
report_id: 'ops_health_' + sweep_start,
event_type: 'HEALTH_SWEEP_COMPLETE',
total_bots: len(registry),
healthy_count: healthy,
unhealthy_count: unhealthy,
restarted_count: restarted,
sweep_duration_ms: now() - sweep_start,
unhealthy_bots: unhealthy_bots,
fired_at_ms: sweep_start
})
SDK calls used
- config_store.GET('/bot-registry')
- FETCH GET '/internal/health/<slug>' TIMEOUT <ms>
- internal_bus.publish('process.restart', { slug })
- alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', metadata)
- alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', metadata)
Complexity: O(N) per sweep where N = 97 registered bots
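Since the pseudocode is meant to be translated, here is one possible TypeScript rendering of the per-bot miss-count transition at the core of the sweep loop. Alerting and restart plumbing are left to the caller; `onPollResult` is an illustrative name.

```ts
// Per-bot miss-count state machine from the sweep loop above.
// Returns the action the sweep should take for this bot.
type HealthAction = "none" | "recovered" | "alert";

function onPollResult(
  missCounts: Map<string, number>,
  slug: string,
  healthy: boolean,  // true iff the poll returned 200 within timeout
  threshold: number, // missed_heartbeats_to_alert
): HealthAction {
  const misses = missCounts.get(slug) ?? 0;
  if (healthy) {
    missCounts.set(slug, 0);
    // Only bots that had crossed the threshold get a BOT_RECOVERED alert.
    return misses >= threshold ? "recovered" : "none";
  }
  const next = misses + 1;
  missCounts.set(slug, next);
  return next >= threshold ? "alert" : "none";
}
```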
11. Wire Examples
Input — what arrives on the wire
{
"label": "Health endpoint poll response",
"source": "internal /internal/health/<slug>",
"payload": {
"slug": "strat.some_strategy",
"status": "ok",
"last_decision_ms": 1746791970000,
"uptime_s": 86400
}
}
Output — what the bot emits
{
"label": "OperationsReport — HEALTH_SWEEP_COMPLETE",
"payload": {
"report_id": "ops_health_1746792000000",
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_SWEEP_COMPLETE",
"total_bots": 97,
"healthy_count": 96,
"unhealthy_count": 1,
"restarted_count": 1,
"sweep_duration_ms": 840,
"unhealthy_bots": [
{
"slug": "strat.some_strategy",
"miss_count": 3,
"action": "restarted"
}
],
"fired_at_ms": 1746792000000,
"report_kind": "OperationsReport"
}
}
12. Decision Logic
APPROVE
Not applicable — HealthHeartbeat does not approve or reject trading decisions.
RESHAPE_REQUIRED
Not applicable.
REJECT
Not applicable as a trading decision.
WARNING_ONLY
A single missed heartbeat increments the miss counter but does not fire an alert. Only consecutive misses at or above the threshold trigger an alert or restart.
13. Standard Decision Output
This bot returns an OperationsReport object. See OperationsReport schema.
{
"report_id": "ops_health_20260509T120000Z",
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_SWEEP_COMPLETE",
"total_bots": 97,
"healthy_count": 96,
"unhealthy_count": 1,
"restarted_count": 1,
"sweep_duration_ms": 840,
"unhealthy_bots": [
{
"slug": "strat.some_strategy",
"miss_count": 3,
"action": "restarted"
}
],
"fired_at_ms": 1746792000000,
"report_kind": "OperationsReport"
}
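For downstream consumers, the report shape can be pinned down as a type. The interface below is inferred from the example payload above; it is a sketch, not a published schema.

```ts
// Shape of the HEALTH_SWEEP_COMPLETE OperationsReport, inferred from the example above.
interface UnhealthyBot {
  slug: string;
  miss_count: number;
  action: "restarted" | "budget_exhausted";
}

interface HealthSweepReport {
  report_id: string;
  bot_id: "gov.health_heartbeat";
  event_type: "HEALTH_SWEEP_COMPLETE";
  total_bots: number;
  healthy_count: number;
  unhealthy_count: number;
  restarted_count: number;
  sweep_duration_ms: number;
  unhealthy_bots: UnhealthyBot[];
  fired_at_ms: number;
  report_kind: "OperationsReport";
}
```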
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|---|---|---|---|
| HEALTH_HEARTBEAT_SWEEP_COMPLETE | INFO | Full sweep of all registered bots completed; OperationsReport emitted. | No action — routine heartbeat. | |
| HEALTH_HEARTBEAT_BOT_DOWN | WARN | A bot has exceeded the missed_heartbeats_to_alert threshold of consecutive missed polls. | Fire page-severity alert; trigger auto-restart if enabled. | A system component is not responding. The on-call team has been notified. |
| HEALTH_HEARTBEAT_BOT_RECOVERED | INFO | A previously unhealthy bot returned a healthy response; miss_count reset to 0. | Emit recovery notification; no further action. | A component that was restarted is now healthy. |
| HEALTH_HEARTBEAT_AUTO_RESTART | WARN | HealthHeartbeat triggered an automatic restart for a bot that missed the heartbeat threshold. | Log restart; increment restart budget counter. | A component was automatically restarted. |
| HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED | WARN | A bot has been restarted the maximum number of times within the restart budget window without recovering. | Stop auto-restarting; escalate page to on-call. | Automatic restart attempts have been exhausted for a component. Manual intervention is required. |
| HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT | WARN | A bot's health endpoint did not respond within the configured timeout. | Treat as missed heartbeat; increment miss_count. | |
| KILL_SWITCH_ACTIVE | WARN | KillSwitch is active; this is surfaced in the sweep report for context. | Continue monitoring all bots; do not suppress health checks. | |
| HEALTH_HEARTBEAT_REGISTRY_STALE | WARN | The bot registry has not been refreshed from the config store within 5 minutes. | Retry registry fetch; alert if stale for > 10 minutes. | |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|---|---|---|---|
| polytraders_gov_healthheartbeat_bots_healthy | gauge | count | | Number of bots currently in healthy state. |
| polytraders_gov_healthheartbeat_bots_unhealthy | gauge | count | | Number of bots currently in unhealthy state (above miss threshold). |
| polytraders_gov_healthheartbeat_restarts_total | counter | count | slug | Total auto-restarts triggered per bot slug. |
| polytraders_gov_healthheartbeat_misses_total | counter | count | slug | Total missed heartbeat polls per bot slug. |
| polytraders_gov_healthheartbeat_sweep_duration_ms | histogram | ms | | Wall-clock latency of a full 97-bot sweep cycle. |
| polytraders_gov_healthheartbeat_sweeps_total | counter | count | | Total sweep cycles completed. |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| HealthHeartbeatBotDown | polytraders_gov_healthheartbeat_bots_unhealthy > 0 | page | #runbook-healthheartbeat-bot-down |
| HealthHeartbeatRestartBudgetExhausted | increase(polytraders_gov_healthheartbeat_restarts_total[10m]) > 3 | page | #runbook-healthheartbeat-restart-budget |
| HealthHeartbeatSweepMissing | rate(polytraders_gov_healthheartbeat_sweeps_total[5m]) == 0 | page | #runbook-healthheartbeat-missing |
| HealthHeartbeatSweepLatencyHigh | histogram_quantile(0.99, polytraders_gov_healthheartbeat_sweep_duration_ms) > 25000 | warn | #runbook-healthheartbeat-latency |
Dashboards
- Grafana — Governance / HealthHeartbeat liveness overview (all 97 bots)
- Grafana — Governance / Auto-restart rate and budget consumption
16. Developer Reporting
{
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_BOT_MISS",
"slug": "strat.some_strategy",
"miss_count": 2,
"threshold": 3,
"last_seen_ms": 1746791940000,
"fired_at_ms": 1746791970000
}
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|---|
| All bots healthy | All system components passed their health checks. Everything is running normally. |
| A bot was auto-restarted | A component stopped responding and was automatically restarted. Trading and risk monitoring continued without interruption. |
| A bot is down and restart budget exhausted | A component is not responding and automatic restart attempts have been exhausted. The on-call team has been notified. |
18. Failure-Mode Block
| Aspect | Specification |
|---|---|
| main_failure_mode | HealthHeartbeat itself crashes, silently leaving all 97 bots unmonitored. Requires an external deadman watchdog. |
| false_positive_risk | A healthy bot's health endpoint returns 503 transiently (e.g., during a rolling restart), triggering a spurious miss-counter increment. |
| false_negative_risk | A bot crashes but its health endpoint continues to respond 200 from a zombie process that has stopped processing events — HealthHeartbeat sees it as healthy. |
| safe_fallback | If HealthHeartbeat cannot reach a bot's health endpoint due to a network partition, it increments miss_count normally and fires the alert after the threshold. The bot is never silently marked healthy on connectivity loss. |
| required_dependencies | Bot registry (config store); internal health endpoints on all 97 bots; process manager (for auto-restart commands); alerting / paging system; deadman watchdog for HealthHeartbeat itself |
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|---|---|---|
| BOT_CRASH | Kill a bot process so its health endpoint stops responding | miss_count increments each poll; after missed_heartbeats_to_alert misses, HEALTH_HEARTBEAT_BOT_DOWN alert fires and restart is triggered | Bot restarts; miss_count resets to 0; HEALTH_HEARTBEAT_BOT_RECOVERED emitted. |
| RESTART_BUDGET_EXHAUSTED | Repeatedly kill a bot faster than the restart-budget window allows (3 crashes in < 10 min) | Third restart fires; the fourth missed threshold triggers HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; no further auto-restart | Manual intervention required; budget resets after the 10-minute window. |
| HEALTH_HEARTBEAT_SELF_CRASH | Kill the HealthHeartbeat process | Deadman watchdog fires a page after 2x heartbeat_interval_s without a sweep OperationsReport | HealthHeartbeat is restarted by the process manager; sweep resumes; miss counts reinitialised. |
| ENDPOINT_TIMEOUT | Set a mock health endpoint to respond after 30s (beyond timeout) | HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT logged; miss_count incremented | When the endpoint responds within timeout, miss_count resets. |
| NETWORK_PARTITION | Block internal network between HealthHeartbeat and a subset of bots | Affected bots' miss counts increment; alert fires at threshold; restart attempted (a network partition means restart may not help) | Network restored; bots return to healthy; miss counts reset. |
20. State & Persistence
Cold-start recovery
On restart, all miss_counts reset to 0. The first sweep re-establishes the health baseline.
21. Concurrency & Idempotency
| Aspect | Specification |
|---|---|
| Execution model | thread-pool (one HTTP poll per bot in parallel) |
| Max in-flight | 97 |
| Idempotency key | slug + sweep_start_ms |
| Per-call timeout (ms) | 10000 |
| Backpressure strategy | cap parallel polls at max_in_flight=97; excess queued to next sweep |
| Locking / mutual exclusion | per-slug mutex on miss_counts and restart_budgets |
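With max_in_flight=97 the whole registry fits in a single batch today, but a cap-aware poller keeps the sweep well-behaved if the registry grows. Below is a sketch using simple batching; `pollHealth` stands in for the FETCH … TIMEOUT call from Section 10.

```ts
// Poll all bots in parallel, never exceeding maxInFlight concurrent requests.
// pollHealth is an assumed wrapper around GET /internal/health/<slug> with a timeout;
// a rejected promise (timeout, network error) counts as a missed heartbeat.
async function sweepAll(
  slugs: string[],
  maxInFlight: number,
  pollHealth: (slug: string) => Promise<boolean>,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  for (let i = 0; i < slugs.length; i += maxInFlight) {
    const batch = slugs.slice(i, i + maxInFlight);
    const settled = await Promise.allSettled(batch.map(pollHealth));
    settled.forEach((r, j) => {
      results.set(batch[j], r.status === "fulfilled" && r.value === true);
    });
  }
  return results;
}
```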
22. Dependencies
Depends on (must run first)
| Bot | Why | Contract |
|---|---|---|
| internal.config_store | Bot registry is loaded from config store on startup. | |
Emits to (downstream consumers)
| Bot | Why | Contract |
|---|---|---|
| internal.process_manager | Receives process.restart commands when auto-restart fires at the miss threshold. | |
Sibling bots (same OrderIntent)
| Bot | Why | Contract |
|---|---|---|
| gov.cron_runner | CronRunner fires the scheduled health sweep trigger (every heartbeat_interval_s). | |
External services
| Service | Endpoint | SLA assumed | On failure |
|---|---|---|---|
| Alerting / paging system | | 99.9% (internal SRE target) | |
23. Security Surfaces
Abuse vectors considered
- A bot returns a fake 200 response from a zombie process to avoid restart
- Raising missed_heartbeats_to_alert to a very high value to prevent alerts from firing
- Disabling page_on_failure to suppress alerting
Mitigations
- page_on_failure is locked immutable; cannot be disabled
- heartbeat_interval_s and missed_heartbeats_to_alert have hard maximums enforced at config load
- Health endpoint responses are checked for a valid JSON body, not just HTTP status
- HealthHeartbeat itself is monitored by an external deadman watchdog
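The body-validation mitigation might look like the following. A sketch only: the expected fields mirror the health-poll wire example in Section 11, and failing closed on a malformed body is what keeps a zombie process from passing as healthy.

```ts
// Treat a 200 with a malformed or mismatched body as a missed heartbeat.
// Expected fields follow the Section 11 wire example.
function isHealthyResponse(status: number, rawBody: string, expectedSlug: string): boolean {
  if (status !== 200) return false;
  try {
    const body = JSON.parse(rawBody);
    return body.slug === expectedSlug
      && body.status === "ok"
      && typeof body.last_decision_ms === "number";
  } catch {
    return false; // non-JSON body: fail closed
  }
}
```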
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|---|
| CLOB version | v2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | no |
| Aware of negative-risk markets | no |
| Multi-chain ready | no |
| SDK used | internal-only |
| Settlement contract | none |
| Notes | HealthHeartbeat monitors liveness of all bots including V2-aware ones but has no direct CLOB or on-chain interface itself. |
API surfaces declared
internal
Networks supported
polygon
25. Versioning & Migration
| Field | Value |
|---|---|
| spec | 2.0.0 |
| implementation | 2.1.0 |
| schema | 2 |
| released | 2026-04-28 |
Migration history
| Date | From | To | Reason | Action taken |
|---|---|---|---|---|
| 2026-04-28 | v1 | v2 | CLOB V2 cutover | No direct CLOB changes required. Updated OperationsReport schema; removed stale USDC.e references from sweep report payloads. Added V2-aware bots to the monitoring registry. |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|---|---|
| miss_count increments on non-200 response | Mock health endpoint returns 503 | miss_count incremented; no alert below threshold |
| Alert fires at threshold | miss_count == missed_heartbeats_to_alert | HEALTH_HEARTBEAT_BOT_DOWN alert emitted; restart triggered if auto_restart=true |
| Restart budget enforced | 3 restarts in 10 minutes for same bot | 4th restart blocked; HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED emitted |
| miss_count resets on recovery | Bot returns to 200 after 2 misses | miss_count reset to 0; HEALTH_HEARTBEAT_BOT_RECOVERED emitted |
| heartbeat_interval_s above hard maximum rejected | heartbeat_interval_s=400 | ConfigError PARAMETER_CHANGE_REQUIRES_APPROVAL |
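The first two unit tests reduce to assertions over the miss-count transition. Below is a hedged Jest-style sketch, reusing the `onPollResult` helper sketched at the end of Section 10 (the helper and test phrasing are illustrative, not the canonical test suite).

```ts
import { describe, it, expect } from "@jest/globals";

// onPollResult is the miss-count helper sketched in Section 10 (assumed in scope).
declare function onPollResult(
  missCounts: Map<string, number>,
  slug: string,
  healthy: boolean,
  threshold: number,
): "none" | "recovered" | "alert";

describe("miss-count transitions", () => {
  it("increments on non-200 response with no alert below threshold", () => {
    const misses = new Map([["strat.some_strategy", 0]]);
    expect(onPollResult(misses, "strat.some_strategy", false, 3)).toBe("none");
    expect(misses.get("strat.some_strategy")).toBe(1);
  });

  it("fires the down alert exactly at the threshold", () => {
    const misses = new Map([["strat.some_strategy", 2]]);
    expect(onPollResult(misses, "strat.some_strategy", false, 3)).toBe("alert");
  });
});
```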
Integration Tests
| Test | Expected result |
|---|---|
| Full sweep of all 97 bots completes within heartbeat_interval_s | OperationsReport emitted with total_bots=97 within configured interval |
| Auto-restart command delivered to process manager | Restart command published; bot restarts; miss_count resets on recovery |
Property Tests
| Property | Required behaviour |
|---|---|
| Every missed heartbeat increments miss_count; no miss is silently dropped | Always true |
| An OperationsReport is emitted after every sweep cycle | Always true |
27. Operational Runbook
HealthHeartbeat incidents are either a bot going down (most common), the restart budget exhausting on a crash-looping bot, or HealthHeartbeat itself failing. All three require immediate response.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|---|---|---|---|
| HealthHeartbeatBotDown | Identify which bot(s) are unhealthy from the sweep OperationsReport. Check bot logs for crash details. | | | Layer pod lead for the affected bot |
| HealthHeartbeatRestartBudgetExhausted | Do NOT manually restart the bot without investigating crash logs. Check for crash-loop root cause. | | | Layer pod lead + SRE on-call immediately |
| HealthHeartbeatSweepMissing | Check HealthHeartbeat process status; verify deadman watchdog is running. | | | Governance pod lead immediately |
| HealthHeartbeatSweepLatencyHigh | Check internal network latency to bot health endpoints; reduce parallel poll count if overloaded. | | | SRE on-call after 30 minutes |
Manual overrides
polytraders gov health pause-restart --slug <slug> — Stop auto-restart for a specific bot while investigating a crash-loop.
Healthcheck
Endpoint: /internal/health/health-heartbeat | Green: Last sweep completed within 2x heartbeat_interval_s; all bots polled; OperationsReport emitted. | Red: No sweep in 2x heartbeat_interval_s; registry load failed; process unresponsive.
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |