1. Bot Identity
| Layer | Governance |
|---|---|
| Bot class | Governance Service |
| Authority | Explain |
| Status | LIVE |
| Readiness | General live |
| Runs before | Every bot lifecycle decision — HealthHeartbeat must confirm liveness before strategy logic executes |
| Runs after | System startup; triggered on CronRunner schedule (every heartbeat_interval_s) |
| Applies to | All 97 production bots across all layers |
| Default mode | general_live |
| User-visible | Advanced details only |
| Developer owner | Polytraders core — Governance pod |
2. Purpose
HealthHeartbeat monitors the liveness of all 97 production bots by polling each bot's internal health endpoint at a configurable interval. If a bot misses missed_heartbeats_to_alert consecutive polls, HealthHeartbeat emits a page-severity alert and optionally triggers an auto-restart. It emits an OperationsReport after every sweep cycle summarising bot health across all layers. Internal-only — no external API surface.
3. Why This Bot Matters
| Scenario | Consequence |
|---|---|
| A bot crashes silently without HealthHeartbeat running | The dead bot's layer is unguarded. Risk votes, kill-switch checks, or execution guards may stop firing, allowing uncontrolled order flow. |
| Auto-restart fires for a bot in a crash-loop | Repeated restarts mask a systemic failure and exhaust restart budgets. Without a circuit breaker, the governance layer itself degrades. |
| Alert not fired on missed heartbeats | On-call is not paged. The dead bot may go unnoticed for hours, accumulating unmonitored risk exposure. |
| HealthHeartbeat itself is not monitored | The watchdog is unwatched. A dead HealthHeartbeat means all 97 bots run without liveness supervision. |
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|---|---|---|---|
| heartbeat_interval_s | 30 | 120 | 300 | How often (in seconds) HealthHeartbeat polls each bot's health endpoint. |
| missed_heartbeats_to_alert | 3 | 5 | 10 | Number of consecutive missed polls before an alert is fired. |
| auto_restart | True | None | None | When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget. |
| page_on_failure | True | None | None | When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold. |
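Taken together, these limits reduce to one validation pass at config load. Below is a minimal TypeScript sketch of that pass, assuming the warning and hard values from the table above; the `HealthConfig` shape and `validateHealthConfig` name are illustrative, not part of the SDK.

```ts
// Illustrative validator for the parameter guide above.
// Hard limits reject with PARAMETER_CHANGE_REQUIRES_APPROVAL; warning limits only flag.
interface HealthConfig {
  heartbeat_interval_s: number;
  missed_heartbeats_to_alert: number;
  auto_restart: boolean;
  page_on_failure: boolean;
}

class ConfigError extends Error {}

function validateHealthConfig(p: HealthConfig): string[] {
  // Hard maximums (locked in Section 8): reject outright.
  if (p.heartbeat_interval_s > 300 || p.missed_heartbeats_to_alert > 10) {
    throw new ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL");
  }
  // page_on_failure is locked to true and may never be disabled.
  if (!p.page_on_failure) {
    throw new ConfigError("PARAMETER_CHANGE_REQUIRES_APPROVAL");
  }
  // Warning thresholds: allowed, but surfaced for review.
  const warnings: string[] = [];
  if (p.heartbeat_interval_s > 120) warnings.push("detection latency increased");
  if (p.missed_heartbeats_to_alert > 5) warnings.push("alert latency increased");
  return warnings;
}
```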
7. Detailed Parameter Instructions
heartbeat_interval_s
What it means
How often (in seconds) HealthHeartbeat polls each bot's health endpoint.
Default
{ "heartbeat_interval_s": 30 }
Why this default matters
30s gives a 90s detection window for a 3-miss threshold. Increasing beyond 120s delays alerting significantly.
Threshold logic
| Condition | Action |
|---|---|
| heartbeat_interval_s <= 30 | Normal monitoring |
| 30–300s | WARN — detection latency increased |
| > 300s | Reject config change — PARAMETER_CHANGE_REQUIRES_APPROVAL |
Developer check
if (p.heartbeat_interval_s > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
The system checks that all components are running regularly.
missed_heartbeats_to_alert
What it means
Number of consecutive missed polls before an alert is fired.
Default
{ "missed_heartbeats_to_alert": 3 }
Why this default matters
3 consecutive misses (90s at default interval) is enough to distinguish a transient blip from a real crash.
Threshold logic
| Condition | Action |
|---|---|
| missed <= 3 | Normal tolerance |
| 4–10 | WARN — alert latency increased |
| > 10 | Reject — PARAMETER_CHANGE_REQUIRES_APPROVAL |
Developer check
if (p.missed_heartbeats_to_alert > p.hard) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
A component is flagged as unhealthy only after multiple consecutive check failures, to avoid false alarms.
auto_restart
What it means
When true, HealthHeartbeat triggers a restart command after missed_heartbeats_to_alert consecutive failures. Respects a per-bot restart budget.
Default
{ "auto_restart": true }
Why this default matters
Auto-restart recovers from transient crashes without manual intervention, minimising downtime for governance bots.
Threshold logic
| Condition | Action |
|---|---|
| auto_restart=true AND misses >= threshold | Publish restart command; emit HEALTH_HEARTBEAT_AUTO_RESTART |
| restart_budget exhausted | Emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; page on-call without restarting |
Developer check
if (p.auto_restart && misses >= p.missed_heartbeats_to_alert) triggerRestart(bot_slug)
User-facing English
If a component stops responding, the system will attempt to restart it automatically.
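The restart budget referenced above is a fixed 10-minute window with a cap of 3 restarts (Section 9). Below is a minimal sketch of that check; `tryRestart` is a hypothetical helper name, and the actual restart is published via internal_bus.publish('process.restart', …) per the reference implementation.

```ts
// Fixed-window restart budget: at most MAX_RESTARTS per WINDOW_MS, per bot.
// Constants mirror the Section 9 defaults; tryRestart is an illustrative name.
interface RestartBudget { count: number; windowStartMs: number; }

const WINDOW_MS = 600_000; // 10-minute window
const MAX_RESTARTS = 3;

function tryRestart(budget: RestartBudget, nowMs: number): boolean {
  if (nowMs - budget.windowStartMs > WINDOW_MS) {
    budget.count = 0;            // window elapsed: reset the counter
    budget.windowStartMs = nowMs;
  }
  if (budget.count >= MAX_RESTARTS) {
    return false; // exhausted: caller emits HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED
  }
  budget.count += 1;
  return true;    // caller publishes the restart and emits HEALTH_HEARTBEAT_AUTO_RESTART
}
```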
page_on_failure
What it means
When true (locked), a page-severity alert is fired for any bot that exceeds the missed heartbeat threshold.
Default
{ "page_on_failure": true }
Why this default matters
Every bot that stops heartbeating is a potential live incident. Paging is mandatory.
Threshold logic
| Condition | Action |
|---|---|
| page_on_failure=true AND misses >= threshold | Fire page-severity alert |
| page_on_failure=false | Not permitted — parameter is locked to true |
Developer check
if (!p.page_on_failure) throw ConfigError('PARAMETER_CHANGE_REQUIRES_APPROVAL')
User-facing English
Critical system components are monitored by an on-call team.
8. Default Configuration
{
"bot_id": "gov.health_heartbeat",
"version": "2.0.0",
"mode": "general_live",
"defaults": {
"heartbeat_interval_s": 30,
"missed_heartbeats_to_alert": 3,
"auto_restart": true,
"page_on_failure": true
},
"locked": {
"page_on_failure": {
"immutable": true
},
"heartbeat_interval_s": {
"max": 300
},
"missed_heartbeats_to_alert": {
"max": 10
}
}
}
9. Implementation Flow
- On startup, load the bot registry from the config store; build a polling table keyed by bot_slug with miss_count=0.
- Every heartbeat_interval_s, iterate over all registered bots and issue GET /internal/health/<slug> with a timeout of heartbeat_interval_s/3.
- For each bot: if response is 200 within timeout, reset miss_count to 0 and emit INFO heartbeat.
- If response is non-200 or times out, increment miss_count.
- When miss_count >= missed_heartbeats_to_alert: emit page alert (HEALTH_HEARTBEAT_BOT_DOWN) and, if auto_restart=true, publish restart command to the process manager.
- Track restart budget per bot (default 3 restarts per 10 minutes). If budget is exhausted, emit HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED and stop auto-restarting.
- After each full sweep, emit an OperationsReport summarising: total_bots, healthy_count, unhealthy_count, restarted_count, sweep_duration_ms.
- HealthHeartbeat itself is monitored by a watchdog process (deadman timer) that pages if no OperationsReport is emitted within 2x heartbeat_interval_s.
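The deadman watchdog in the final step can be sketched as a timer that pages when no OperationsReport arrives within 2x heartbeat_interval_s. The sketch below is a hedged illustration: `onOperationsReport` and `pageOnCall` are assumed integration hooks, not documented SDK calls.

```ts
// Deadman timer: pages if HealthHeartbeat itself stops emitting sweep reports.
// onOperationsReport / pageOnCall are illustrative integration points.
function startDeadmanWatchdog(
  heartbeatIntervalS: number,
  onOperationsReport: (cb: () => void) => void,
  pageOnCall: (msg: string) => void,
): void {
  let lastReportMs = Date.now();
  onOperationsReport(() => { lastReportMs = Date.now(); });

  setInterval(() => {
    const silenceMs = Date.now() - lastReportMs;
    if (silenceMs > 2 * heartbeatIntervalS * 1000) {
      pageOnCall(`HealthHeartbeat silent for ${silenceMs} ms — deadman watchdog firing`);
    }
  }, heartbeatIntervalS * 1000);
}
```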
10. Reference Implementation
Polls all 97 registered bots' health endpoints every heartbeat_interval_s, tracks consecutive misses, fires alerts and auto-restarts at threshold, emits a sweep OperationsReport after each cycle.
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. Translate to TS/Python/Go/Rust.
// ---- STARTUP ----
FUNCTION init():
registry = FETCH config_store.GET('/bot-registry')
miss_counts = { slug: 0 FOR slug IN registry }
restart_budgets = { slug: { count: 0, window_start: now() } FOR slug IN registry }
setInterval(runSweep, config.heartbeat_interval_s * 1000)
// ---- SWEEP ----
FUNCTION runSweep():
sweep_start = now()
healthy = 0; unhealthy = 0; restarted = 0
unhealthy_bots = []
FOR bot IN registry:
response = FETCH GET '/internal/health/' + bot.slug
TIMEOUT config.heartbeat_interval_s / 3 * 1000
IF response.status == 200:
IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
EMIT alert(HEALTH_HEARTBEAT_BOT_RECOVERED, bot.slug)
miss_counts[bot.slug] = 0
healthy += 1
ELSE:
miss_counts[bot.slug] += 1
unhealthy += 1
IF miss_counts[bot.slug] >= config.missed_heartbeats_to_alert:
alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', {
slug: bot.slug, miss_count: miss_counts[bot.slug] })
IF config.auto_restart:
budget = restart_budgets[bot.slug]
IF (now() - budget.window_start) > 600_000: // 10-min window
budget.count = 0; budget.window_start = now()
IF budget.count < 3:
internal_bus.publish('process.restart', { slug: bot.slug })
budget.count += 1; restarted += 1
unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'restarted' })
ELSE:
alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', { slug: bot.slug })
unhealthy_bots.append({ slug: bot.slug, miss_count: miss_counts[bot.slug], action: 'budget_exhausted' })
EMIT OperationsReport({
report_id: 'ops_health_' + sweep_start,
event_type: 'HEALTH_SWEEP_COMPLETE',
total_bots: len(registry),
healthy_count: healthy,
unhealthy_count: unhealthy,
restarted_count: restarted,
sweep_duration_ms: now() - sweep_start,
unhealthy_bots: unhealthy_bots,
fired_at_ms: sweep_start
})
SDK calls used
- config_store.GET('/bot-registry')
- FETCH GET '/internal/health/<slug>' TIMEOUT <ms>
- internal_bus.publish('process.restart', { slug })
- alerting.emit('HEALTH_HEARTBEAT_BOT_DOWN', metadata)
- alerting.emit('HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED', metadata)
Complexity: O(N) per sweep where N = 97 registered bots
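Since the pseudocode is meant to be translated, here is one possible TypeScript rendering of the per-bot miss-count transition at the core of the sweep loop. Alerting and restart plumbing are left to the caller; `onPollResult` is an illustrative name.

```ts
// Per-bot miss-count state machine from the sweep loop above.
// Returns the action the sweep should take for this bot.
type HealthAction = "none" | "recovered" | "alert";

function onPollResult(
  missCounts: Map<string, number>,
  slug: string,
  healthy: boolean,  // true iff the poll returned 200 within timeout
  threshold: number, // missed_heartbeats_to_alert
): HealthAction {
  const misses = missCounts.get(slug) ?? 0;
  if (healthy) {
    missCounts.set(slug, 0);
    // Only bots that had crossed the threshold get a BOT_RECOVERED alert.
    return misses >= threshold ? "recovered" : "none";
  }
  const next = misses + 1;
  missCounts.set(slug, next);
  return next >= threshold ? "alert" : "none";
}
```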
11. Wire Examples
Input — what arrives on the wire
{
"label": "Health endpoint poll response",
"source": "internal /internal/health/<slug>",
"payload": {
"slug": "strat.some_strategy",
"status": "ok",
"last_decision_ms": 1746791970000,
"uptime_s": 86400
}
}
Output — what the bot emits
{
"label": "OperationsReport — HEALTH_SWEEP_COMPLETE",
"payload": {
"report_id": "ops_health_1746792000000",
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_SWEEP_COMPLETE",
"total_bots": 97,
"healthy_count": 96,
"unhealthy_count": 1,
"restarted_count": 1,
"sweep_duration_ms": 840,
"unhealthy_bots": [
{
"slug": "strat.some_strategy",
"miss_count": 3,
"action": "restarted"
}
],
"fired_at_ms": 1746792000000,
"report_kind": "OperationsReport"
}
}
12. Decision Logic
APPROVE
Not applicable — HealthHeartbeat does not approve or reject trading decisions.
RESHAPE_REQUIRED
Not applicable.
REJECT
Not applicable as a trading decision.
WARNING_ONLY
A single missed heartbeat increments the miss counter but does not fire an alert. Only consecutive misses at or above the threshold trigger an alert or restart.
13. Standard Decision Output
This bot returns an OperationsReport object. See OperationsReport schema.
{
"report_id": "ops_health_20260509T120000Z",
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_SWEEP_COMPLETE",
"total_bots": 97,
"healthy_count": 96,
"unhealthy_count": 1,
"restarted_count": 1,
"sweep_duration_ms": 840,
"unhealthy_bots": [
{
"slug": "strat.some_strategy",
"miss_count": 3,
"action": "restarted"
}
],
"fired_at_ms": 1746792000000,
"report_kind": "OperationsReport"
}
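For downstream consumers, the report shape can be pinned down as a type. The interface below is inferred from the example payload above; it is a sketch, not a published schema.

```ts
// Shape of the HEALTH_SWEEP_COMPLETE OperationsReport, inferred from the example above.
interface UnhealthyBot {
  slug: string;
  miss_count: number;
  action: "restarted" | "budget_exhausted";
}

interface HealthSweepReport {
  report_id: string;
  bot_id: "gov.health_heartbeat";
  event_type: "HEALTH_SWEEP_COMPLETE";
  total_bots: number;
  healthy_count: number;
  unhealthy_count: number;
  restarted_count: number;
  sweep_duration_ms: number;
  unhealthy_bots: UnhealthyBot[];
  fired_at_ms: number;
  report_kind: "OperationsReport";
}
```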
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|---|---|---|---|
| HEALTH_HEARTBEAT_SWEEP_COMPLETE | INFO | Full sweep of all registered bots completed; OperationsReport emitted. | No action — routine heartbeat. | |
| HEALTH_HEARTBEAT_BOT_DOWN | WARN | A bot has exceeded the missed_heartbeats_to_alert threshold of consecutive missed polls. | Fire page-severity alert; trigger auto-restart if enabled. | A system component is not responding. The on-call team has been notified. |
| HEALTH_HEARTBEAT_BOT_RECOVERED | INFO | A previously unhealthy bot returned a healthy response; miss_count reset to 0. | Emit recovery notification; no further action. | A component that was restarted is now healthy. |
| HEALTH_HEARTBEAT_AUTO_RESTART | WARN | HealthHeartbeat triggered an automatic restart for a bot that missed the heartbeat threshold. | Log restart; increment restart budget counter. | A component was automatically restarted. |
| HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED | WARN | A bot has been restarted the maximum number of times within the restart budget window without recovering. | Stop auto-restarting; escalate page to on-call. | Automatic restart attempts have been exhausted for a component. Manual intervention is required. |
| HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT | WARN | A bot's health endpoint did not respond within the configured timeout. | Treat as missed heartbeat; increment miss_count. | |
| KILL_SWITCH_ACTIVE | WARN | KillSwitch is active; this is surfaced in the sweep report for context. | Continue monitoring all bots; do not suppress health checks. | |
| HEALTH_HEARTBEAT_REGISTRY_STALE | WARN | The bot registry has not been refreshed from the config store within 5 minutes. | Retry registry fetch; alert if stale for > 10 minutes. | |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|---|---|---|---|
| polytraders_gov_healthheartbeat_bots_healthy | gauge | count | | Number of bots currently in healthy state. |
| polytraders_gov_healthheartbeat_bots_unhealthy | gauge | count | | Number of bots currently in unhealthy state (above miss threshold). |
| polytraders_gov_healthheartbeat_restarts_total | counter | count | slug | Total auto-restarts triggered per bot slug. |
| polytraders_gov_healthheartbeat_misses_total | counter | count | slug | Total missed heartbeat polls per bot slug. |
| polytraders_gov_healthheartbeat_sweep_duration_ms | histogram | ms | | Wall-clock latency of a full 97-bot sweep cycle. |
| polytraders_gov_healthheartbeat_sweeps_total | counter | count | | Total sweep cycles completed. |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| HealthHeartbeatBotDown | polytraders_gov_healthheartbeat_bots_unhealthy > 0 | page | #runbook-healthheartbeat-bot-down |
| HealthHeartbeatRestartBudgetExhausted | increase(polytraders_gov_healthheartbeat_restarts_total[10m]) > 3 | page | #runbook-healthheartbeat-restart-budget |
| HealthHeartbeatSweepMissing | rate(polytraders_gov_healthheartbeat_sweeps_total[5m]) == 0 | page | #runbook-healthheartbeat-missing |
| HealthHeartbeatSweepLatencyHigh | histogram_quantile(0.99, polytraders_gov_healthheartbeat_sweep_duration_ms) > 25000 | warn | #runbook-healthheartbeat-latency |
Dashboards
- Grafana — Governance / HealthHeartbeat liveness overview (all 97 bots)
- Grafana — Governance / Auto-restart rate and budget consumption
16. Developer Reporting
{
"bot_id": "gov.health_heartbeat",
"event_type": "HEALTH_BOT_MISS",
"slug": "strat.some_strategy",
"miss_count": 2,
"threshold": 3,
"last_seen_ms": 1746791940000,
"fired_at_ms": 1746791970000
}
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|---|
| All bots healthy | All system components passed their health checks. Everything is running normally. |
| A bot was auto-restarted | A component stopped responding and was automatically restarted. Trading and risk monitoring continued without interruption. |
| A bot is down and restart budget exhausted | A component is not responding and automatic restart attempts have been exhausted. The on-call team has been notified. |
18. Failure-Mode Block
| Aspect | Specification |
|---|---|
| main_failure_mode | HealthHeartbeat itself crashes, silently leaving all 97 bots unmonitored. Requires an external deadman watchdog. |
| false_positive_risk | A healthy bot's health endpoint returns 503 transiently (e.g., during a rolling restart), triggering a spurious miss-counter increment. |
| false_negative_risk | A bot crashes but its health endpoint continues to respond 200 from a zombie process that has stopped processing events — HealthHeartbeat sees it as healthy. |
| safe_fallback | If HealthHeartbeat cannot reach a bot's health endpoint due to a network partition, it increments miss_count normally and fires the alert after the threshold. The bot is never silently marked healthy on connectivity loss. |
| required_dependencies | Bot registry (config store); internal health endpoints on all 97 bots; process manager (for auto-restart commands); alerting / paging system; deadman watchdog for HealthHeartbeat itself |
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|---|---|---|
| BOT_CRASH | Kill a bot process so its health endpoint stops responding | miss_count increments each poll; after missed_heartbeats_to_alert misses, HEALTH_HEARTBEAT_BOT_DOWN alert fires and restart is triggered | Bot restarts; miss_count resets to 0; HEALTH_HEARTBEAT_BOT_RECOVERED emitted. |
| RESTART_BUDGET_EXHAUSTED | Repeatedly kill a bot faster than the restart-budget window allows (3 crashes in < 10 min) | Third restart fires; the fourth missed threshold triggers HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED; no further auto-restart | Manual intervention required; budget resets after the 10-minute window. |
| HEALTH_HEARTBEAT_SELF_CRASH | Kill the HealthHeartbeat process | Deadman watchdog fires a page after 2x heartbeat_interval_s without a sweep OperationsReport | HealthHeartbeat is restarted by the process manager; sweep resumes; miss counts reinitialised. |
| ENDPOINT_TIMEOUT | Set a mock health endpoint to respond after 30s (beyond timeout) | HEALTH_HEARTBEAT_ENDPOINT_TIMEOUT logged; miss_count incremented | When the endpoint responds within timeout, miss_count resets. |
| NETWORK_PARTITION | Block internal network between HealthHeartbeat and a subset of bots | Affected bots' miss counts increment; alert fires at threshold; restart attempted (a network partition means restart may not help) | Network restored; bots return to healthy; miss counts reset. |
20. State & Persistence
Cold-start recovery
On restart, all miss_counts reset to 0. The first sweep re-establishes the health baseline.
21. Concurrency & Idempotency
| Aspect | Specification |
|---|---|
| Execution model | thread-pool (one HTTP poll per bot in parallel) |
| Max in-flight | 97 |
| Idempotency key | slug + sweep_start_ms |
| Per-call timeout (ms) | 10000 |
| Backpressure strategy | cap parallel polls at max_in_flight=97; excess queued to next sweep |
| Locking / mutual exclusion | per-slug mutex on miss_counts and restart_budgets |
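With max_in_flight=97 the whole registry fits in a single batch today, but a cap-aware poller keeps the sweep well-behaved if the registry grows. Below is a sketch using simple batching; `pollHealth` stands in for the FETCH … TIMEOUT call from Section 10.

```ts
// Poll all bots in parallel, never exceeding maxInFlight concurrent requests.
// pollHealth is an assumed wrapper around GET /internal/health/<slug> with a timeout;
// a rejected promise (timeout, network error) counts as a missed heartbeat.
async function sweepAll(
  slugs: string[],
  maxInFlight: number,
  pollHealth: (slug: string) => Promise<boolean>,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  for (let i = 0; i < slugs.length; i += maxInFlight) {
    const batch = slugs.slice(i, i + maxInFlight);
    const settled = await Promise.allSettled(batch.map(pollHealth));
    settled.forEach((r, j) => {
      results.set(batch[j], r.status === "fulfilled" && r.value === true);
    });
  }
  return results;
}
```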
22. Dependencies
Depends on (must run first)
| Bot | Why | Contract |
|---|---|---|
| internal.config_store | Bot registry is loaded from config store on startup. | |
Emits to (downstream consumers)
| Bot | Why | Contract |
|---|---|---|
| internal.process_manager | Receives process.restart commands when auto-restart fires at the miss threshold. | |
Sibling bots (same OrderIntent)
| Bot | Why | Contract |
|---|---|---|
| gov.cron_runner | CronRunner fires the scheduled health sweep trigger (every heartbeat_interval_s). | |
External services
| Service | Endpoint | SLA assumed | On failure |
|---|---|---|---|
| Alerting / paging system | | 99.9% (internal SRE target) | |
23. Security Surfaces
Abuse vectors considered
- A bot returns a fake 200 response from a zombie process to avoid restart
- Raising missed_heartbeats_to_alert to a very high value to prevent alerts from firing
- Disabling page_on_failure to suppress alerting
Mitigations
- page_on_failure is locked immutable; cannot be disabled
- heartbeat_interval_s and missed_heartbeats_to_alert have hard maximums enforced at config load
- Health endpoint responses are checked for a valid JSON body, not just HTTP status
- HealthHeartbeat itself is monitored by an external deadman watchdog
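The body-validation mitigation might look like the following. A sketch only: the expected fields mirror the health-poll wire example in Section 11, and failing closed on a malformed body is what keeps a zombie process from passing as healthy.

```ts
// Treat a 200 with a malformed or mismatched body as a missed heartbeat.
// Expected fields follow the Section 11 wire example.
function isHealthyResponse(status: number, rawBody: string, expectedSlug: string): boolean {
  if (status !== 200) return false;
  try {
    const body = JSON.parse(rawBody);
    return body.slug === expectedSlug
      && body.status === "ok"
      && typeof body.last_decision_ms === "number";
  } catch {
    return false; // non-JSON body: fail closed
  }
}
```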
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|---|
| CLOB version | v2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | no |
| Aware of negative-risk markets | no |
| Multi-chain ready | no |
| SDK used | internal-only |
| Settlement contract | none |
| Notes | HealthHeartbeat monitors liveness of all bots including V2-aware ones but has no direct CLOB or on-chain interface itself. |
API surfaces declared
internal
Networks supported
polygon
25. Versioning & Migration
| Field | Value |
|---|---|
| spec | 2.0.0 |
| implementation | 2.1.0 |
| schema | 2 |
| released | 2026-04-28 |
Migration history
| Date | From | To | Reason | Action taken |
|---|---|---|---|---|
| 2026-04-28 | v1 | v2 | CLOB V2 cutover | No direct CLOB changes required. Updated OperationsReport schema; removed stale USDC.e references from sweep report payloads. Added V2-aware bots to the monitoring registry. |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|---|---|
| miss_count increments on non-200 response | Mock health endpoint returns 503 | miss_count incremented; no alert below threshold |
| Alert fires at threshold | miss_count == missed_heartbeats_to_alert | HEALTH_HEARTBEAT_BOT_DOWN alert emitted; restart triggered if auto_restart=true |
| Restart budget enforced | 3 restarts in 10 minutes for same bot | 4th restart blocked; HEALTH_HEARTBEAT_RESTART_BUDGET_EXHAUSTED emitted |
| miss_count resets on recovery | Bot returns to 200 after 2 misses | miss_count reset to 0; HEALTH_HEARTBEAT_BOT_RECOVERED emitted |
| heartbeat_interval_s above hard maximum rejected | heartbeat_interval_s=400 | ConfigError PARAMETER_CHANGE_REQUIRES_APPROVAL |
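The first two unit tests reduce to assertions over the miss-count transition. Below is a hedged Jest-style sketch, reusing the `onPollResult` helper sketched at the end of Section 10 (the helper and test phrasing are illustrative, not the canonical test suite).

```ts
import { describe, it, expect } from "@jest/globals";

// onPollResult is the miss-count helper sketched in Section 10 (assumed in scope).
declare function onPollResult(
  missCounts: Map<string, number>,
  slug: string,
  healthy: boolean,
  threshold: number,
): "none" | "recovered" | "alert";

describe("miss-count transitions", () => {
  it("increments on non-200 response with no alert below threshold", () => {
    const misses = new Map([["strat.some_strategy", 0]]);
    expect(onPollResult(misses, "strat.some_strategy", false, 3)).toBe("none");
    expect(misses.get("strat.some_strategy")).toBe(1);
  });

  it("fires the down alert exactly at the threshold", () => {
    const misses = new Map([["strat.some_strategy", 2]]);
    expect(onPollResult(misses, "strat.some_strategy", false, 3)).toBe("alert");
  });
});
```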
Integration Tests
| Test | Expected result |
|---|---|
| Full sweep of all 97 bots completes within heartbeat_interval_s | OperationsReport emitted with total_bots=97 within configured interval |
| Auto-restart command delivered to process manager | Restart command published; bot restarts; miss_count resets on recovery |
Property Tests
| Property | Required behaviour |
|---|---|
| Every missed heartbeat increments miss_count; no miss is silently dropped | Always true |
| An OperationsReport is emitted after every sweep cycle | Always true |
27. Operational Runbook
HealthHeartbeat incidents are either a bot going down (most common), the restart budget exhausting on a crash-looping bot, or HealthHeartbeat itself failing. All three require immediate response.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|---|---|---|---|
| HealthHeartbeatBotDown | Identify which bot(s) are unhealthy from the sweep OperationsReport. Check bot logs for crash details. | | | Layer pod lead for the affected bot |
| HealthHeartbeatRestartBudgetExhausted | Do NOT manually restart the bot without investigating crash logs. Check for crash-loop root cause. | | | Layer pod lead + SRE on-call immediately |
| HealthHeartbeatSweepMissing | Check HealthHeartbeat process status; verify deadman watchdog is running. | | | Governance pod lead immediately |
| HealthHeartbeatSweepLatencyHigh | Check internal network latency to bot health endpoints; reduce parallel poll count if overloaded. | | | SRE on-call after 30 minutes |
Manual overrides
polytraders gov health pause-restart --slug <slug> — Stop auto-restart for a specific bot while investigating a crash-loop.
Healthcheck
Endpoint: /internal/health/health-heartbeat | Green: Last sweep completed within 2x heartbeat_interval_s; all bots polled; OperationsReport emitted. | Red: No sweep in 2x heartbeat_interval_s; registry load failed; process unresponsive.
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |