Polytraders Dev Guide
internal

6.15 SLAMonitor

Governance Governance Service Explain PLANNED Spec started capital · Indirect P3 · Reporting & event store pending stub

SLAMonitor tracks the service-level objectives (SLOs) that Polytraders commits to internally and to users, measures the error-budget burn rate, and emits alerts when the burn rate approaches the SLO budget limit. Its SLO status reports are retained for 7 years as a compliance-grade availability record.

v3 readiness

Docs: 27/27 (done)
Impl: 0/15 (pending)
Backtest: 0/4 (pending)
Runtime: 0/8 (pending)

A bot is done when all four scores are complete.

1. Bot Identity

Layer: Governance
Bot class: Governance Service
Authority: Explain
Status: PLANNED
Readiness: Spec started
Runs before: Nothing; SLAMonitor is a passive observer that runs continuously on metrics
Runs after: Metrics are emitted by all bots in the fleet
Applies to: All service-level objectives defined for the Polytraders fleet
Default mode: shadow_only
User-visible: summary-only
Developer owner: Polytraders core

2. Purpose

SLAMonitor tracks the service-level objectives (SLOs) that Polytraders commits to internally and to users, measures the error-budget burn rate, and emits alerts when the burn rate approaches the SLO budget limit. Its SLO status reports are retained for 7 years as a compliance-grade availability record.

3. Why This Bot Matters

  • No SLO tracking

    Availability and latency regressions go undetected until users complain; SLA breach evidence is unavailable for compliance.

  • Error budget burn not tracked

    The team consumes the entire error budget without realising it, leaving no budget for planned maintenance.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
None (SLAMonitor consumes only internal metrics) | internal | No | N/A

5. Required Internal Inputs

Input | Source | Required? | Use
Prometheus/OpenMetrics scrape from all fleet bots | internal.metrics_store | Yes | Compute SLO compliance and error-budget burn rate.
ExecutionReport stream | internal.report_bus | Yes | Track fill-quality SLOs (latency, fill rate) per strategy.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
slo_definitions | {'fill_latency_ms_p99': 500, 'fill_success_rate_pct': 99.5, 'uptime_pct': 99.9} | None | None | Map of SLO name to target value.
burn_rate_alert_pct | 5.0 | 10 | 20 | Alert when hourly error-budget burn rate exceeds this percentage of the monthly budget.

7. Detailed Parameter Instructions

slo_definitions

What it means

Map of SLO name to target value.

Default

{ "slo_definitions": {"fill_latency_ms_p99": 500, "fill_success_rate_pct": 99.5, "uptime_pct": 99.9} }

Why this default matters

Default SLOs reflect the commitments in the Polytraders service agreement.

Threshold logic

Condition | Action
metric_value violates slo_target | Increment error budget consumption; emit SLO_BREACH_DETECTED if budget exhausted

Developer check

violated = metric_value > slo.target if slo.type == 'max' else metric_value < slo.target
if violated: budgetConsumer.record(slo.name)
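
The default SLOs mix an upper bound (fill latency) with lower bounds (fill success rate, uptime), so a single `>` comparison cannot cover all three. The sketch below shows one direction-aware check in Python. The `SLO_DIRECTION` table and the function names are hypothetical: the spec's `slo_definitions` map stores only targets, and the reference implementation assumes a `slo.type` field without saying where it is configured.

```python
# Hypothetical direction table: "max" means the metric must stay at or below
# the target, "min" means it must stay at or above it. Not part of the spec.
SLO_DIRECTION = {
    "fill_latency_ms_p99": "max",
    "fill_success_rate_pct": "min",
    "uptime_pct": "min",
}


def is_compliant(name: str, actual: float, target: float) -> bool:
    """Return True when `actual` meets `target` for the named SLO."""
    if SLO_DIRECTION[name] == "max":
        return actual <= target
    return actual >= target


def evaluate(slo_definitions: dict, metrics: dict) -> dict:
    """Build the per-SLO compliance map used in the SLO_STATUS report."""
    result = {}
    for name, target in slo_definitions.items():
        actual = metrics.get(name)
        # A metric missing from the scrape is reported as unknown (None).
        compliant = None if actual is None else is_compliant(name, actual, target)
        result[name] = {"target": target, "actual": actual, "compliant": compliant}
    return result
```

With the default targets and the scrape payload from the wire example (312 ms, 99.8%, 100.0%), every entry comes back compliant; an uptime of 99.5% against the 99.9% target does not.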

User-facing English

The system maintains targets for response speed and availability.

burn_rate_alert_pct

What it means

Alert when hourly error-budget burn rate exceeds this percentage of the monthly budget.

Default

{ "burn_rate_alert_pct": 5.0 }

Why this default matters

A sustained 5% hourly burn would exhaust the monthly budget in 20 hours (100% / 5% per hour).
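
The 20-hour figure is the remaining budget divided by the hourly burn rate. A minimal sketch of that arithmetic follows; the function name and the `consumed_pct` argument are illustrative and not part of the spec.

```python
def hours_to_exhaustion(hourly_burn_pct: float, consumed_pct: float = 0.0) -> float:
    """Hours until the monthly error budget reaches 100% at a constant burn rate."""
    if hourly_burn_pct <= 0:
        return float("inf")  # no burn, so the budget is never exhausted
    remaining_pct = max(0.0, 100.0 - consumed_pct)
    return remaining_pct / hourly_burn_pct


print(hours_to_exhaustion(5.0))  # 20.0
```

The same function can be fed the `error_budget_consumed_pct` and `hourly_burn_rate_pct` fields from an SLO_STATUS report to estimate how much headroom is left.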

Threshold logic

Condition | Action
hourly_burn_rate > burn_rate_alert_pct | Emit SLO_BURN_RATE_EXCEEDED alert

Developer check

if hourly_burn > p.burn_rate_alert_pct: emit('SLO_BURN_RATE_EXCEEDED')

User-facing English

You'll be notified if service quality degrades significantly.

8. Default Configuration

{
  "bot_id": "gov.slamonitor",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "slo_definitions": {
      "fill_latency_ms_p99": 500,
      "fill_success_rate_pct": 99.5,
      "uptime_pct": 99.9
    },
    "burn_rate_alert_pct": 5.0,
    "publish_to_user": true,
    "auto_freeze_on_breach": false
  }
}
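
A loader would likely check this configuration before the scrape loop starts. The sketch below embeds the default configuration and validates it. The specific rules (percentage targets within 0..100, burn-rate threshold within (0, 100], boolean flags) are assumptions for illustration and are not stated in the spec.

```python
import json

# Default configuration copied from this section.
DEFAULT_CONFIG = json.loads("""
{
  "bot_id": "gov.slamonitor",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "slo_definitions": {
      "fill_latency_ms_p99": 500,
      "fill_success_rate_pct": 99.5,
      "uptime_pct": 99.9
    },
    "burn_rate_alert_pct": 5.0,
    "publish_to_user": true,
    "auto_freeze_on_breach": false
  }
}
""")


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    defaults = config.get("defaults", {})
    slos = defaults.get("slo_definitions", {})
    if not slos:
        problems.append("slo_definitions must define at least one SLO")
    for name, target in slos.items():
        if not _is_number(target):
            problems.append(f"{name}: target must be numeric")
        elif name.endswith("_pct") and not 0 <= target <= 100:
            problems.append(f"{name}: percentage target must be within 0..100")
    burn = defaults.get("burn_rate_alert_pct")
    if not _is_number(burn) or not 0 < burn <= 100:
        problems.append("burn_rate_alert_pct must be a number in (0, 100]")
    for flag in ("publish_to_user", "auto_freeze_on_breach"):
        if not isinstance(defaults.get(flag), bool):
            problems.append(f"{flag} must be a boolean")
    return problems
```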

9. Implementation Flow

  1. Scrape Prometheus metrics from all fleet bots every 60 seconds.
  2. For each SLO definition, compute current compliance and error-budget consumption.
  3. Compute hourly burn rate as (errors_in_last_hour / monthly_budget * 100).
  4. If burn_rate > burn_rate_alert_pct, emit SLO_BURN_RATE_EXCEEDED alert.
  5. If error budget is exhausted, emit SLO_BREACH_DETECTED and, if auto_freeze_on_breach is true, freeze deployments.
  6. Emit SettlementReport(event_type=SLO_STATUS) every hour with all SLO compliance metrics.
  7. Retain SettlementReport records for 7 years as compliance-grade availability evidence.
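
Steps 3 to 5 can be sketched as a small in-memory tracker. This is an illustration only: the `ErrorBudget` class, its method names, and the idea of expressing the monthly budget as a count of allowed violations are assumptions, since the spec does not define how the budget is denominated.

```python
from collections import deque


class ErrorBudget:
    """Tracks budget-consuming violations for one SLO within the current month."""

    def __init__(self, monthly_budget: int):
        self.monthly_budget = monthly_budget  # allowed violations per month (assumed unit)
        self.consumed = 0
        self._recent = deque()  # violation timestamps in seconds, oldest first

    def record(self, ts: float) -> None:
        """Record one violation; timestamps are expected in non-decreasing order."""
        self.consumed += 1
        self._recent.append(ts)

    def hourly_burn_rate_pct(self, now: float) -> float:
        """errors_in_last_hour / monthly_budget * 100, as in step 3."""
        while self._recent and self._recent[0] <= now - 3600:
            self._recent.popleft()
        return len(self._recent) * 100 / self.monthly_budget

    def consumed_pct(self) -> float:
        return self.consumed * 100 / self.monthly_budget

    def exhausted(self) -> bool:
        return self.consumed >= self.monthly_budget


def evaluate_alerts(budget: ErrorBudget, now: float, burn_rate_alert_pct: float = 5.0) -> list:
    """Return the alert codes that steps 4 and 5 would emit."""
    alerts = []
    if budget.hourly_burn_rate_pct(now) > burn_rate_alert_pct:
        alerts.append("SLO_BURN_RATE_EXCEEDED")
    if budget.exhausted():
        alerts.append("SLO_BREACH_DETECTED")
    return alerts
```

With a budget of 1000 violations per month, 60 violations in the last hour give a 6.0% hourly burn, which crosses the 5.0 default and yields `SLO_BURN_RATE_EXCEEDED` without yet exhausting the budget.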

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

// ---- SHARED STATE ----
// Latest per-SLO compliance snapshot: written by the scrape loop, read by the hourly report.
latestCompliance = {}

// ---- SCRAPE LOOP (every 60s) ----
FUNCTION scrapeAndEvaluate():
  metrics = FETCH internal.metricsStore.GET({bots: 'all', window: '1m'})
  IF metrics IS NULL:
    // Metrics store unavailable: record the gap rather than guess compliance.
    EMIT SettlementReport(event_type='SLO_DATA_GAP')
    alerting.emit('SLO_METRICS_UNAVAILABLE')
    RETURN

  // config.slo_definitions maps name -> target. slo.type is the bound direction:
  // 'max' for upper bounds (latency), anything else for lower bounds (rates, uptime).
  FOR slo IN config.slo_definitions:
    actual = metrics.get(slo.name)
    compliant = (actual <= slo.target) IF slo.type == 'max' ELSE (actual >= slo.target)
    latestCompliance[slo.name] = {target: slo.target, actual: actual, compliant: compliant}
    IF NOT compliant:
      errorBudget.record(slo.name, violation=True)

// ---- HOURLY REPORT ----
FUNCTION emitHourlyReport(windowStart, windowEnd):
  burnRate = errorBudget.hourlyBurnRate()
  IF burnRate > config.burn_rate_alert_pct:
    alerting.emit('SLO_BURN_RATE_EXCEEDED', {burn_rate: burnRate})
  IF errorBudget.exhausted():
    alerting.emit('SLO_BREACH_DETECTED')
    IF config.auto_freeze_on_breach:
      deploymentManager.freeze()
  // 2555 days = 7 x 365, the 7-year retention period.
  EMIT SettlementReport(event_type='SLO_STATUS',
    window_start=windowStart, window_end=windowEnd,
    slo_compliance=latestCompliance,
    error_budget_consumed_pct=errorBudget.consumedPct(),
    hourly_burn_rate_pct=burnRate,
    retained_until=now() + days(2555))

SDK calls used

  • internal.metricsStore.GET({bots, window})
  • errorBudget.hourlyBurnRate()
  • alerting.emit('SLO_BURN_RATE_EXCEEDED', metadata)

Complexity: O(S) per scrape cycle where S = SLO count; O(1) for hourly report

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Prometheus metrics scrape",
  "source": "internal.metrics_store",
  "payload": {
    "fill_latency_ms_p99": 312,
    "fill_success_rate_pct": 99.8,
    "uptime_pct": 100.0,
    "scraped_at_ms": 1746792060000
  }
}

Output — what the bot emits

{
  "label": "SettlementReport — SLO_STATUS",
  "payload": {
    "report_id": "stl_sla_01HX9Z",
    "event_type": "SLO_STATUS",
    "error_budget_consumed_pct": 1.2,
    "hourly_burn_rate_pct": 0.8,
    "report_kind": "SettlementReport",
    "topic": "polytraders.reports.settlement",
    "retained_until": "2033-05-09"
  }
}

12. Decision Logic

APPROVE

Not applicable — SLAMonitor does not approve trading orders.

RESHAPE_REQUIRED

Not applicable.

REJECT

If auto_freeze_on_breach=true, freezes new deployments on SLO breach.

WARNING_ONLY

Emits SLO_BURN_RATE_EXCEEDED when burn rate threshold is crossed.

13. Standard Decision Output

This bot returns a SettlementReport object. See SettlementReport schema.

{
  "report_id": "stl_slamonitor_01HX9Z",
  "bot_id": "gov.slamonitor",
  "event_type": "SLO_STATUS",
  "window_start": "2026-05-09T09:00:00Z",
  "window_end": "2026-05-09T10:00:00Z",
  "slo_compliance": {
    "fill_latency_ms_p99": {
      "target": 500,
      "actual": 312,
      "compliant": true
    },
    "fill_success_rate_pct": {
      "target": 99.5,
      "actual": 99.8,
      "compliant": true
    },
    "uptime_pct": {
      "target": 99.9,
      "actual": 100.0,
      "compliant": true
    }
  },
  "error_budget_consumed_pct": 1.2,
  "hourly_burn_rate_pct": 0.8,
  "report_kind": "SettlementReport",
  "topic": "polytraders.reports.settlement",
  "retained_until": "2033-05-09"
}
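
The example report carries `retained_until` of 2033-05-09, which is exactly seven calendar years after the report window. The reference pseudocode instead adds a fixed 2555 days (7 x 365), which lands two days earlier because 2028 and 2032 are leap years. The sketch below compares the two; both helper names are hypothetical, and which rule the implementation should follow is an open question for the spec.

```python
from datetime import date, timedelta


def retained_until_days(window_end: date, days: int = 2555) -> date:
    """Fixed-day retention, as written in the reference pseudocode."""
    return window_end + timedelta(days=days)


def retained_until_years(window_end: date, years: int = 7) -> date:
    """Calendar-year retention, matching the example report."""
    try:
        return window_end.replace(year=window_end.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return window_end.replace(year=window_end.year + years, day=28)


end = date(2026, 5, 9)
print(retained_until_days(end))   # 2033-05-07
print(retained_until_years(end))  # 2033-05-09
```

Both results satisfy the property test of at least 2555 days of retention; only the calendar-year rule reproduces the date shown in the example.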

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
SLO_STATUS | INFO | Hourly SLO compliance report emitted. | Log and store. | Service quality is within committed targets.
SLO_BURN_RATE_EXCEEDED | WARN | Hourly error-budget burn rate exceeds burn_rate_alert_pct. | Emit alert; include in SLO_STATUS report. | Service quality has degraded; the team has been notified.
SLO_BREACH_DETECTED | HARD_REJECT | Error budget exhausted for the month. | Emit alert; optionally freeze deployments. | (not specified)
SLO_METRICS_UNAVAILABLE | WARN | Metrics store unavailable; SLO compliance unknown. | Emit SLO_DATA_GAP SettlementReport; alert. | (not specified)
KILL_SWITCH_ACTIVE | WARN | KillSwitch active; noted in SLO report as planned downtime. | Exclude kill-switch period from error budget consumption. | (not specified)
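
The KILL_SWITCH_ACTIVE action excludes kill-switch periods from error budget consumption. One way to sketch that is to filter violation timestamps against the recorded kill-switch windows before counting them. The function name and the half-open `[start, end)` window convention are assumptions.

```python
def billable_violations(violation_ts: list, kill_switch_windows: list) -> list:
    """Drop violations whose timestamp falls inside any [start, end) kill-switch window."""
    return [
        ts
        for ts in violation_ts
        if not any(start <= ts < end for start, end in kill_switch_windows)
    ]


# Violations at t=20 and t=30 fall inside the kill-switch window and are excluded.
print(billable_violations([10, 20, 30, 40], [(15, 35)]))  # [10, 40]
```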

15. Metrics & Logs

Metrics emitted

Metric | Type | Unit | Labels | Meaning
polytraders_gov_slamonitor_slo_compliance | gauge | bool | slo_name | Current compliance status per SLO (1=compliant, 0=breaching).
polytraders_gov_slamonitor_error_budget_consumed_pct | gauge | percent | slo_name | Percentage of monthly error budget consumed per SLO.
polytraders_gov_slamonitor_burn_rate_hourly_pct | gauge | percent | (none) | Current hourly burn rate as percentage of monthly budget.
polytraders_gov_slamonitor_status_reports_total | counter | count | status | Total SLO status reports emitted by status.

Alerts

Alert | Condition | Severity | Runbook
SLAMonitorBurnRateHigh | polytraders_gov_slamonitor_burn_rate_hourly_pct > 5 | P2 | #runbook-slamonitor-burnrate
SLAMonitorBreach | polytraders_gov_slamonitor_error_budget_consumed_pct > 100 | P1 | #runbook-slamonitor-breach
SLAMonitorMetricsUnavailable | absent(polytraders_gov_slamonitor_slo_compliance) | P2 | #runbook-slamonitor-metrics

16. Developer Reporting

{
  "bot_id": "gov.slamonitor",
  "event_type": "METRICS_SCRAPED",
  "slo_name": "fill_latency_ms_p99",
  "actual_value": 312,
  "target_value": 500,
  "compliant": true,
  "scraped_at_ms": 1746792060000
}

17. Plain-English Reporting

Situation | User-facing explanation
SLO status report published | Service quality is within the committed targets. All systems are operating normally.
SLO burn rate alert | Service quality has degraded and is consuming the error budget at a high rate. The team has been notified.

18. Failure-Mode Block

main_failure_mode: Metrics store is unavailable; SLO compliance cannot be computed; error budget calculation stalls.
false_positive_risk: A transient spike in fill latency causes a burn-rate alert that resolves in < 5 minutes.
false_negative_risk: A sustained SLO degradation below the burn-rate threshold goes unalerted.
safe_fallback: If the metrics store is unavailable, emit a SettlementReport with event_type=SLO_DATA_GAP (SLO compliance unknown) and raise the SLO_METRICS_UNAVAILABLE alert.
required_dependencies: internal.metrics_store (Prometheus), internal.report_bus (ExecutionReport), Postgres SLO store

19. Failure-Injection Recipes

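Scenario | How to inject | Expected behaviour | Recovery
METRICS_STORE_UNAVAILABLE | Block reads from internal.metrics_store | SLO_DATA_GAP SettlementReport and SLO_METRICS_UNAVAILABLE alert emitted. | Automatic resume when metrics store recovers.
HIGH_BURN_RATE | Inject 200 fill failures to exhaust fill_success_rate SLO budget | SLO_BURN_RATE_EXCEEDED alert once hourly burn exceeds burn_rate_alert_pct; SLO_BREACH_DETECTED when the budget is exhausted. | Investigate and resolve fill failures; error budget resets monthly.
AUTO_FREEZE_ON_BREACH | Set auto_freeze_on_breach=true; exhaust error budget | SLO_BREACH_DETECTED emitted and deployments frozen. | Manual unfreeze after SLO remediation.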

20. State & Persistence

Cold-start recovery

On restart, reload error budget state from last committed SettlementReport.
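
A cold-start restore could be sketched as below, reading the last committed report in the shape shown in section 13. The function name, the returned dictionary, and the month-rollover rule are assumptions; the rollover follows from the note elsewhere in this guide that the error budget resets monthly.

```python
from datetime import datetime, timezone
from typing import Optional


def restore_budget_state(last_report: Optional[dict], now: datetime) -> dict:
    """Rebuild error-budget consumption from the last committed SettlementReport."""
    if last_report is None:
        # Nothing committed yet: start from an empty budget.
        return {"consumed_pct": 0.0, "source": "empty"}
    window_end = datetime.fromisoformat(last_report["window_end"].replace("Z", "+00:00"))
    if (window_end.year, window_end.month) != (now.year, now.month):
        # The last report belongs to a previous month, so the budget has reset.
        return {"consumed_pct": 0.0, "source": "month_rollover"}
    return {
        "consumed_pct": float(last_report["error_budget_consumed_pct"]),
        "source": "last_report",
    }


report = {"window_end": "2026-05-09T10:00:00Z", "error_budget_consumed_pct": 1.2}
restart = datetime(2026, 5, 9, 10, 5, tzinfo=timezone.utc)
print(restore_budget_state(report, restart))
# {'consumed_pct': 1.2, 'source': 'last_report'}
```

Violations recorded between the last committed report and the restart are lost in this sketch; whether that gap is acceptable is not addressed by the spec.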

21. Concurrency & Idempotency

Aspect | Specification
Execution model | single-threaded scrape loop + hourly report goroutine
Max in-flight | 5
Idempotency key | window_start
Per-call timeout (ms) | 10000
Backpressure strategy | skip scrape if previous not complete
Locking / mutual exclusion | Postgres unique constraint on window_start for hourly reports
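
The idempotency rule can be illustrated with a uniqueness constraint on `window_start`: a second attempt to commit the same hourly window is ignored rather than duplicated. The sketch uses Python's built-in `sqlite3` purely as a runnable stand-in for the Postgres store; the table and function names are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the SLO report store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE slo_reports ("
        " window_start TEXT PRIMARY KEY,"  # uniqueness enforces one report per window
        " payload TEXT NOT NULL)"
    )
    return conn


def commit_report(conn: sqlite3.Connection, window_start: str, payload: str) -> bool:
    """Insert the report once; return False if this window was already committed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO slo_reports (window_start, payload) VALUES (?, ?)",
        (window_start, payload),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_store()
print(commit_report(conn, "2026-05-09T09:00:00Z", "{}"))  # True
print(commit_report(conn, "2026-05-09T09:00:00Z", "{}"))  # False
```

In Postgres the equivalent statement would typically be `INSERT ... ON CONFLICT (window_start) DO NOTHING`, with the same check on the affected row count.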

22. Dependencies

Depends on (must run first)

Bot | Why | Contract
internal.metrics_store | All SLO compliance data is sourced from Prometheus metrics. | Metrics available with < 60s staleness.

Emits to (downstream consumers)

Bot | Why | Contract
internal.post_trade_archive | (not specified) | (not specified)

Sibling bots (same OrderIntent)

Bot | Why | Contract
gov.incidentcommander | SLAMonitor SLO breach events may trigger IncidentCommander declarations. | SLO_BREACH_DETECTED event includes scope for IncidentCommander.

External services

Service | Endpoint | SLA assumed | On failure
Internal metrics store (Prometheus) | https://metrics.internal | 99.9% | Emit SLO_DATA_GAP SettlementReport; alert; resume on recovery.

23. Security Surfaces

Abuse vectors considered

  • Manipulating metrics to suppress SLO breach detection

Mitigations

  • Metrics store is read-only for SLAMonitor; write access is restricted to fleet bots only

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | v2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | no
Aware of negative-risk markets | no
Multi-chain ready | no
SDK used | py-clob-client-v2
Settlement contract | CTFExchangeV2
Notes | SLAMonitor tracks service-level objectives across the Polytraders fleet; no CLOB calls. All latency and budget metrics are pUSD-free.

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field | Value
spec | 2.0.0
implementation | 0.1.0
schema | 2
released | None
planned_release | Q4-2026

Migration history

Date | From | To | Reason | Action taken
2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
Burn rate alert fires when hourly burn exceeds threshold | hourly_burn=6.0, burn_rate_alert_pct=5.0 | SLO_BURN_RATE_EXCEEDED alert emitted
SLO status reports all compliant when all metrics within targets | fill_latency=312, fill_success=99.8, uptime=100.0 | SettlementReport with all slo_compliance.compliant=true

Integration Tests

Test | Expected result
Hourly SettlementReport emitted with correct SLO compliance metrics | SettlementReport on polytraders.reports.settlement every 60 minutes

Property Tests

Property | Required behaviour
Every SLO status report is retained for >= 2555 days | Always true

27. Operational Runbook

SLAMonitor incidents require rapid response when error budget burns faster than planned. P1 if budget is exhausted; P2 for high burn rate.

On-call actions

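Alert | First step | Diagnosis | Mitigation | Escalate to
SLAMonitorBreach | (not specified) | (not specified) | (not specified) | (not specified)
SLAMonitorBurnRateHigh | (not specified) | (not specified) | (not specified) | (not specified)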

Manual overrides

Healthcheck

/internal/health/slamonitor → green if the metrics store is reachable, all SLOs are compliant, and burn rate < burn_rate_alert_pct; red if the metrics store is unreachable or any SLO budget is exhausted.
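
The healthcheck rule can be sketched as a pure function. The spec defines only green and red, so the state in between (store reachable, no budget exhausted, but an SLO breaching or burn rate at or above the threshold) is returned here as "amber". That third state and all names in the sketch are assumptions.

```python
def health(
    store_reachable: bool,
    slo_compliant: dict,
    budget_exhausted: dict,
    burn_rate_pct: float,
    burn_rate_alert_pct: float = 5.0,
) -> str:
    """Map SLAMonitor state to a healthcheck colour."""
    # Red conditions from the spec: store unreachable or any budget exhausted.
    if not store_reachable or any(budget_exhausted.values()):
        return "red"
    # Green conditions from the spec: all SLOs compliant and burn rate below threshold.
    if all(slo_compliant.values()) and burn_rate_pct < burn_rate_alert_pct:
        return "green"
    # Not covered by the spec; "amber" is an assumed intermediate state.
    return "amber"


print(health(True, {"uptime_pct": True}, {"uptime_pct": False}, 0.8))  # green
```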

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
Burn rate calculation unit tests pass | CI | Pass

Promote to Limited live

Gate | How measured | Threshold
Hourly SLO status report emitted correctly in staging with 3 SLO definitions | Integration test | Pass

Promote to General live

Gate | How measured | Threshold
30-day SLO report history retained in Postgres; compliance team sign-off | Compliance review | Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done