Polytraders Dev Guide
internal

6.11 IncidentCommander

Governance Governance Service Explain PLANNED Spec started capital · Critical P7 · Governance & replay pending stub

IncidentCommander coordinates halts, flattens, and post-mortems when a guard, monitor, or operator declares an incident. It records the incident timeline, dispatches auto-actions by severity, pages on-call, and tracks RCA completion.

v3 readiness

Docs | 27/27 | done
Impl | 0/15 | pending
Backtest | 0/4 | pending
Runtime | 0/8 | pending

A bot is done when all four scores are complete.

1. Bot Identity

Layer | Governance
Bot class | Governance Service
Authority | Explain
Status | PLANNED
Readiness | Spec started
Runs before | Nothing: IncidentCommander is triggered by guard or operator alerts
Runs after | A guard, monitor, or operator declares an incident
Applies to | Any declared incident affecting the Polytraders bot fleet
Default mode | shadow_only
User-visible | no
Developer owner | Polytraders core

2. Purpose

IncidentCommander coordinates halts, flattens, and post-mortems when a guard, monitor, or operator declares an incident. It records the incident timeline, dispatches auto-actions by severity, pages on-call, and tracks RCA completion.

3. Why This Bot Matters

  • No centralised incident coordinator

    Multiple bots may take conflicting halt/flatten actions; incident timeline is incoherent.

  • RCA not completed within SLA

    Repeat incidents occur because root cause is never addressed; compliance audit finds gaps.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
None: IncidentCommander is a pure internal governance orchestrator | internal | No | N/A

5. Required Internal Inputs

Input | Source | Required? | Use
Incident declaration event from any guard or operator | internal | Yes | Trigger incident workflow; dispatch auto-actions by severity.
KillSwitch active flag | gov.killswitch | No | Check if KillSwitch is already active before dispatching halt auto-action.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
auto_actions_by_severity | {'P0': ['halt_all', 'page_oncall'], 'P1': ['page_oncall'], 'P2': ['notify_slack']} | None | None | Map of severity level to list of auto-actions to dispatch.
require_rca_within_h | 24 | 12 | 48 | Hours after incident resolution within which an RCA document must be filed.

7. Detailed Parameter Instructions

auto_actions_by_severity

What it means

Map of severity level to list of auto-actions to dispatch.

Default

{ "auto_actions_by_severity": {"P0": ["halt_all", "page_oncall"], "P1": ["page_oncall"], "P2": ["notify_slack"]} }

Why this default matters

P0 incidents require immediate halt; P1 requires paging; P2 requires notification.

Threshold logic

Condition | Action
severity=P0 | Dispatch halt_all and page_oncall immediately
severity=P1 | Dispatch page_oncall
severity=P2 | Dispatch notify_slack

Developer check

actions = p.auto_actions_by_severity.get(incident.severity, [])
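
The lookup above can be sketched in Python as follows. This is a minimal illustration, assuming the parameter is held as a plain dict; the helper name `actions_for` is hypothetical and not part of any Polytraders SDK. A severity missing from the map yields an empty list, so nothing is dispatched.

```python
# Illustrative only: mirrors the documented default for auto_actions_by_severity.
AUTO_ACTIONS_BY_SEVERITY = {
    "P0": ["halt_all", "page_oncall"],
    "P1": ["page_oncall"],
    "P2": ["notify_slack"],
}


def actions_for(severity, mapping=AUTO_ACTIONS_BY_SEVERITY):
    """Return a copy of the auto-actions configured for a severity.

    Hypothetical helper. Unknown severities map to an empty list, and the
    copy keeps callers from mutating the shared configuration.
    """
    return list(mapping.get(severity, []))
```

Returning a copy is a design choice for the sketch: a caller that appends to the result cannot silently change the configured defaults.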

User-facing English

— not yet authored —

require_rca_within_h

What it means

Hours after incident resolution within which an RCA document must be filed.

Default

{ "require_rca_within_h": 24 }

Why this default matters

24-hour RCA deadline ensures timely learning while context is fresh.

Threshold logic

Condition | Action
rca not filed within require_rca_within_h | Emit RCA_OVERDUE alert

Developer check

if incident.rca_filed is None and now() - incident.resolved_at > hours(p.require_rca_within_h): emit('RCA_OVERDUE')
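
The deadline check can be sketched in Python as below. It is a minimal illustration, assuming timezone-aware timestamps; the function and argument names are hypothetical. It treats an incident whose RCA has already been filed as not overdue, in line with the `rca_filed IS NULL` test in the reference pseudocode.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rca_overdue(resolved_at: datetime,
                rca_filed_at: Optional[datetime],
                require_rca_within_h: int,
                now: datetime) -> bool:
    """Return True when no RCA is filed and the deadline has passed.

    Hypothetical helper: `now` is injected so the check is testable.
    """
    if rca_filed_at is not None:
        return False  # an RCA on file is never overdue in this sketch
    return now - resolved_at > timedelta(hours=require_rca_within_h)


resolved = datetime(2026, 5, 9, 10, 0, tzinfo=timezone.utc)
```

With the default of 24 hours, an incident resolved at `resolved` and checked 25 hours later with no RCA on file is overdue; the same check at 23 hours is not.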

User-facing English

— not yet authored —

8. Default Configuration

{
  "bot_id": "gov.incidentcommander",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "auto_actions_by_severity": {
      "P0": [
        "halt_all",
        "page_oncall"
      ],
      "P1": [
        "page_oncall"
      ],
      "P2": [
        "notify_slack"
      ]
    },
    "require_rca_within_h": 24,
    "page_on_severity": "P1",
    "publish_status_externally": false
  }
}

9. Implementation Flow

  1. On incident declaration, assign incident_id (ULID) and record severity, scope, and declaring bot.
  2. Dispatch auto-actions from auto_actions_by_severity map for the declared severity.
  3. Page on-call if the incident is at least as severe as page_on_severity (P0 is the most severe, so the default of P1 pages for P0 and P1 but not P2).
  4. Record incident timeline events: declaration, auto-actions taken, acknowledgement, resolution.
  5. After resolution, start require_rca_within_h countdown; emit RCA_OVERDUE if RCA is not filed in time.
  6. Emit OperationsReport(event_type=INCIDENT_DECLARED/RESOLVED/RCA_FILED) on each lifecycle event.
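
Step 3 compares severities, which are labels rather than numbers. The sketch below shows one way to make that comparison explicit. It assumes P0 is the most severe level and uses a hypothetical rank table; neither the table nor the function name comes from the spec.

```python
# Assumption: a lower rank means a more severe incident (P0 > P1 > P2).
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


def should_page(severity: str, page_on_severity: str) -> bool:
    """Return True when `severity` is at least as severe as the threshold.

    Raises KeyError for severities outside the rank table, so an unknown
    label fails loudly instead of being silently ignored.
    """
    return SEVERITY_RANK[severity] <= SEVERITY_RANK[page_on_severity]
```

With the default `page_on_severity` of "P1", this pages for P0 and P1 and stays quiet for P2.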

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

// ---- INCIDENT DECLARATION ----
FUNCTION declareIncident(declaration):
  // Field names follow the wire example in section 11 (declaring_bot, severity, scope).
  incident = {
    id: generateULID(), severity: declaration.severity,
    scope: declaration.scope, declared_by: declaration.declaring_bot,
    declared_at: now(), status: 'active', timeline: []
  }
  postgres.insert('incidents', incident)
  actions = config.auto_actions_by_severity.get(incident.severity, [])
  FOR action IN actions:
    dispatch(action, incident)
    incident.timeline.append({action: action, at: now()})
  // Persist the timeline entries appended after the initial insert.
  postgres.upsert('incidents', incident)
  EMIT OperationsReport(event_type='INCIDENT_DECLARED', incident_id=incident.id,
    severity=incident.severity, auto_actions_dispatched=actions)

// ---- RESOLUTION ----
FUNCTION resolveIncident(incidentId, resolvedBy):
  incident = postgres.get('incidents', incidentId)
  incident.status = 'resolved'
  incident.resolved_at = now()
  incident.resolved_by = resolvedBy
  postgres.upsert('incidents', incident)
  // Schedules checkRcaDeadline to run require_rca_within_h hours from now.
  scheduleRcaDeadline(incidentId, config.require_rca_within_h)
  EMIT OperationsReport(event_type='INCIDENT_RESOLVED', incident_id=incidentId)

// ---- RCA CHECK ----
// Invoked when the deadline scheduled in resolveIncident expires.
FUNCTION checkRcaDeadline(incidentId):
  incident = postgres.get('incidents', incidentId)
  IF incident.rca_filed IS NULL:
    EMIT OperationsReport(event_type='RCA_OVERDUE', incident_id=incidentId)
    alerting.emit('RCA_OVERDUE', {incident_id: incidentId})

SDK calls used

  • postgres.insert('incidents', incident)
  • postgres.get('incidents', incidentId)
  • postgres.upsert('incidents', incident)
  • alerting.emit('RCA_OVERDUE', metadata)

Complexity: O(1) per incident event; O(A) per declaration where A = auto-action count
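
As one possible translation of the pseudocode, the Python sketch below runs the declaration and resolution paths against in-memory stand-ins. Everything here is an assumption made for illustration: the class name, the dict that replaces Postgres, the lists that replace `dispatch` and `EMIT`, and the counter that replaces ULID generation. It is not the production implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

DEFAULT_ACTIONS = {
    "P0": ["halt_all", "page_oncall"],
    "P1": ["page_oncall"],
    "P2": ["notify_slack"],
}


@dataclass
class Incident:
    id: str
    severity: str
    scope: list
    declared_by: str
    declared_at: datetime
    status: str = "active"
    resolved_at: Optional[datetime] = None
    resolved_by: Optional[str] = None
    timeline: list = field(default_factory=list)


class IncidentCommanderSketch:
    """In-memory stand-in for the Postgres-backed service (illustration only)."""

    def __init__(self, actions_by_severity=None):
        self.actions_by_severity = actions_by_severity or DEFAULT_ACTIONS
        self.incidents = {}    # replaces the 'incidents' table
        self.dispatched = []   # replaces dispatch(action, incident)
        self.reports = []      # replaces EMIT OperationsReport(...)
        self._seq = 0

    def _new_id(self) -> str:
        # Placeholder: the spec calls for a ULID; a counter keeps this stdlib-only.
        self._seq += 1
        return f"inc_{self._seq:06d}"

    def declare(self, severity, scope, declaring_bot) -> Incident:
        now = datetime.now(timezone.utc)
        inc = Incident(self._new_id(), severity, list(scope), declaring_bot, now)
        self.incidents[inc.id] = inc
        actions = list(self.actions_by_severity.get(severity, []))
        for action in actions:
            self.dispatched.append((inc.id, action))
            inc.timeline.append({"action": action, "at": datetime.now(timezone.utc)})
        self.reports.append({
            "event_type": "INCIDENT_DECLARED",
            "incident_id": inc.id,
            "severity": severity,
            "auto_actions_dispatched": actions,
        })
        return inc

    def resolve(self, incident_id, resolved_by) -> Incident:
        inc = self.incidents[incident_id]
        inc.status = "resolved"
        inc.resolved_at = datetime.now(timezone.utc)
        inc.resolved_by = resolved_by
        self.reports.append({"event_type": "INCIDENT_RESOLVED", "incident_id": incident_id})
        return inc
```

Keeping the side effects in plain lists makes the dispatch order and the emitted reports easy to assert on in a unit test before any real paging or database client is wired in.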

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Incident declaration",
  "source": "risk.liquidityguard",
  "payload": {
    "declaring_bot": "risk.liquidityguard",
    "severity": "P1",
    "scope": [
      "exec.smartrouter"
    ],
    "declared_at": "2026-05-09T10:00:00Z"
  }
}

Output — what the bot emits

{
  "label": "OperationsReport — INCIDENT_DECLARED",
  "payload": {
    "report_id": "ops_incident_01HX9Z",
    "event_type": "INCIDENT_DECLARED",
    "incident_id": "inc_01HX9Z",
    "severity": "P1",
    "report_kind": "OperationsReport",
    "topic": "polytraders.reports.operations"
  }
}
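
A consumer of the input message above might validate it before starting the incident workflow. The sketch below is one way to do that with the standard library only. The function name and the rule that only P0, P1, and P2 are accepted are assumptions for illustration; the spec does not define a validation contract.

```python
import json
from datetime import datetime

# Assumption: only the severities present in the default auto-action map are valid.
KNOWN_SEVERITIES = ("P0", "P1", "P2")


def parse_declaration(raw: str) -> dict:
    """Validate an incident declaration message and return its payload fields."""
    payload = json.loads(raw)["payload"]
    severity = payload["severity"]
    if severity not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    scope = payload["scope"]
    if not isinstance(scope, list) or not scope:
        raise ValueError("scope must be a non-empty list of bot ids")
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    declared_at = datetime.fromisoformat(payload["declared_at"].replace("Z", "+00:00"))
    return {
        "declaring_bot": payload["declaring_bot"],
        "severity": severity,
        "scope": scope,
        "declared_at": declared_at,
    }
```

Rejecting malformed declarations at the boundary keeps an unknown severity from silently mapping to an empty auto-action list later in the flow.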

12. Decision Logic

APPROVE

Not applicable — IncidentCommander does not approve trading orders.

RESHAPE_REQUIRED

Not applicable.

REJECT

Not applicable as a trading decision.

WARNING_ONLY

Emits an RCA_OVERDUE warning if the RCA is not filed within require_rca_within_h hours of resolution.

13. Standard Decision Output

This bot returns an OperationsReport object. See OperationsReport schema.

{
  "report_id": "ops_incidentcommander_01HX9Z",
  "bot_id": "gov.incidentcommander",
  "event_type": "INCIDENT_DECLARED",
  "incident_id": "inc_01HX9Z",
  "severity": "P1",
  "scope": [
    "risk.liquidityguard",
    "exec.smartrouter"
  ],
  "auto_actions_dispatched": [
    "page_oncall"
  ],
  "declared_at": "2026-05-09T10:00:00Z",
  "report_kind": "OperationsReport",
  "topic": "polytraders.reports.operations"
}

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
INCIDENT_DECLARED | INFO | An incident was declared and auto-actions dispatched. | Log; emit OperationsReport. |
INCIDENT_RESOLVED | INFO | An incident was resolved. | Log; start RCA countdown. |
RCA_OVERDUE | WARN | RCA not filed within require_rca_within_h. | Emit WARN alert. |
PAGING_SYSTEM_UNAVAILABLE | WARN | On-call paging system is unreachable. | Fallback to Slack; emit WARN. |
KILL_SWITCH_ACTIVE | WARN | KillSwitch already active when halt_all dispatched. | Log; no duplicate halt needed. |

15. Metrics & Logs

Metrics emitted

Metric | Type | Unit | Labels | Meaning
polytraders_gov_incidentcommander_incidents_total | counter | count | severity, status | Total incidents by severity and status.
polytraders_gov_incidentcommander_rca_overdue_total | counter | count |  | Total RCA overdue events.
polytraders_gov_incidentcommander_auto_actions_total | counter | count | action | Total auto-actions dispatched by type.
polytraders_gov_incidentcommander_active_incidents | gauge | count | severity | Currently active incidents by severity.

Alerts

Alert | Condition | Severity | Runbook
IncidentCommanderRcaOverdue | rate(polytraders_gov_incidentcommander_rca_overdue_total[1h]) > 0 | P2 | #runbook-incidentcommander-rca
IncidentCommanderActiveP0 | polytraders_gov_incidentcommander_active_incidents{severity='P0'} > 0 | P0 | #runbook-incidentcommander-p0
IncidentCommanderPagingUnavailable | absent(polytraders_gov_incidentcommander_auto_actions_total{action='page_oncall'}) | P1 | #runbook-incidentcommander-paging

16. Developer Reporting

{
  "bot_id": "gov.incidentcommander",
  "event_type": "AUTO_ACTION_DISPATCHED",
  "incident_id": "inc_01HX9Z",
  "action": "page_oncall",
  "dispatched_at_ms": 1778320860000
}

17. Plain-English Reporting

Situation | User-facing explanation
Incident declared | A system incident has been declared. Automated responses have been triggered based on severity.
RCA overdue | The root cause analysis for a recent incident has not been filed within the required timeframe.

18. Failure-Mode Block

main_failure_mode | On-call paging system is unavailable; P0/P1 incidents do not generate pages.
false_positive_risk | A transient alert triggers a P0 incident and halt-all auto-action unnecessarily.
false_negative_risk | An incident is declared at P2 when it should be P0; critical auto-actions are not dispatched.
safe_fallback | If paging system is unavailable, log PAGING_SYSTEM_UNAVAILABLE and attempt fallback notification via Slack.
required_dependencies | On-call paging system, Internal audit log store, gov.killswitch
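
The safe fallback can be sketched as a small wrapper around the paging call. This is an illustration under stated assumptions: `page` and `notify_slack` are hypothetical callables supplied by the caller, and a failed page is assumed to surface as an `OSError` (which covers `ConnectionError` and socket timeouts in Python). A real paging client may raise its own exception types.

```python
def page_with_fallback(page, notify_slack, incident_id):
    """Try to page on-call; fall back to Slack if the paging call fails.

    Returns the list of reason codes to log. `page` and `notify_slack`
    are hypothetical callables taking the incident id.
    """
    try:
        page(incident_id)
        return []
    except OSError:
        # Assumption: transport failures surface as OSError subclasses.
        notify_slack(incident_id)
        return ["PAGING_SYSTEM_UNAVAILABLE"]
```

Returning the reason codes rather than logging inside the helper keeps the decision about where to record PAGING_SYSTEM_UNAVAILABLE with the caller.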

19. Failure-Injection Recipes

Scenario | How to inject | Expected behaviour | Recovery
PAGING_SYSTEM_DOWN | Block TCP to paging.internal during P1 incident declaration | Fallback to Slack notification; PAGING_SYSTEM_UNAVAILABLE emitted. | Automatic when paging system recovers.
RCA_DEADLINE_EXCEEDED | Resolve incident; do not file RCA; wait 25h | RCA_OVERDUE emitted. | File RCA; mark incident RCA-complete.
SPURIOUS_P0_DECLARATION | Send P0 incident declaration from a test bot | halt_all and page_oncall dispatched. | Cancel halt via gov.killswitch; resolve incident.

20. State & Persistence

Cold-start recovery

On restart, reload active incidents from Postgres; re-schedule RCA deadlines.

21. Concurrency & Idempotency

Aspect | Specification
Execution model | event-driven; one goroutine per active incident
Max in-flight | 10
Idempotency key | incident_id
Per-call timeout (ms) | 5000
Backpressure strategy | queue
Locking / mutual exclusion | Postgres row-level lock per incident_id

22. Dependencies

Depends on (must run first)

Bot | Why | Contract
gov.killswitch | IncidentCommander checks KillSwitch state before dispatching halt_all. | KillSwitch state is queryable in < 100ms.
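
The KillSwitch check can be sketched as a filter applied to the auto-action list before dispatch. This is a minimal illustration: the function name is hypothetical, and the KillSwitch state is passed in as a boolean because the spec does not define the query API.

```python
def plan_actions(actions, killswitch_active):
    """Split auto-actions into those to dispatch and reason codes to log.

    When KillSwitch is already active, halt_all is dropped and
    KILL_SWITCH_ACTIVE is recorded instead of issuing a duplicate halt.
    """
    to_dispatch = []
    reason_codes = []
    for action in actions:
        if action == "halt_all" and killswitch_active:
            reason_codes.append("KILL_SWITCH_ACTIVE")
            continue
        to_dispatch.append(action)
    return to_dispatch, reason_codes
```

For a P0 incident declared while KillSwitch is active, this yields only `page_oncall` for dispatch plus the KILL_SWITCH_ACTIVE reason code.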

Emits to (downstream consumers)

Bot | Why | Contract
internal.governance_audit

Sibling bots (same OrderIntent)

Bot | Why | Contract
gov.parameterchangeauditor | ParameterChangeAuditor provides recent config changes to support RCA. | Changes queryable by audited_bot and changed_at.

External services

Service | Endpoint | SLA assumed | On failure
On-call paging system | https://paging.internal | 99.9% | Fallback to Slack notification; emit PAGING_SYSTEM_UNAVAILABLE.

23. Security Surfaces

Abuse vectors considered

  • Declaring a spurious P0 incident to trigger halt_all and disrupt trading

Mitigations

  • Incident declarations require an authenticated internal bot or operator identity
  • Incident timeline is immutably logged; false declarations are auditable

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | v2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | no
Aware of negative-risk markets | no
Multi-chain ready | no
SDK used | py-clob-client-v2
Settlement contract | CTFExchangeV2
Notes | IncidentCommander is a pure governance orchestration service; interacts with no CLOB surfaces directly.

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field | Value
spec | 2.0.0
implementation | 0.1.0
schema | 2
released | None
planned_release | Q3-2026

Migration history

Date | From | To | Reason | Action taken
2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
P0 incident triggers halt_all and page_oncall | severity=P0 | halt_all and page_oncall dispatched; INCIDENT_DECLARED OperationsReport emitted
RCA_OVERDUE emitted when RCA not filed in time | incident resolved 25h ago, require_rca_within_h=24 | RCA_OVERDUE emitted

Integration Tests

Test | Expected result
Full incident lifecycle: declaration → auto-actions → resolution → RCA filed | 4 OperationsReport records with correct event_types

Property Tests

Property | Required behaviour
Every P0 incident triggers halt_all auto-action within 5 seconds | Always true: auto-actions are dispatched synchronously on declaration

27. Operational Runbook

IncidentCommander incidents require immediate triage. P0 auto-actions halt trading; confirm the incident is genuine before clearing.

On-call actions

Alert | First step | Diagnosis | Mitigation | Escalate to
IncidentCommanderActiveP0
IncidentCommanderRcaOverdue

Manual overrides

Healthcheck

/internal/health/incidentcommander → green if there are no active P0 incidents, the paging system is reachable, and no RCA is overdue; red if a P0 incident is active or the paging system is unreachable.
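
The healthcheck rule can be sketched as a pure function over three inputs. The green and red conditions follow the line above. The spec does not say what to report when the only problem is an overdue RCA, so the "degraded" value below is an assumption made for this sketch, as are the function and argument names.

```python
def health_status(active_p0: int, paging_reachable: bool, rca_overdue: int) -> str:
    """Map IncidentCommander state to a healthcheck colour.

    red:   an active P0 incident or an unreachable paging system.
    green: no active P0, paging reachable, and no overdue RCA.
    """
    if active_p0 > 0 or not paging_reachable:
        return "red"
    if rca_overdue > 0:
        # ASSUMPTION: the spec leaves this case undefined; "degraded" is a placeholder.
        return "degraded"
    return "green"
```

Checking the red conditions first means a P0 incident always wins over any other state.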

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
P0 auto-action dispatch unit test passes | CI | Pass

Promote to Limited live

Gate | How measured | Threshold
End-to-end incident lifecycle test completes in staging | Integration test | Pass

Promote to General live

Gate | How measured | Threshold
One production incident handled with full timeline and RCA filed | Governance review | Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done