
6.15 SLAMonitor

Governance Governance Service Explain PLANNED Spec started capital · Indirect P3 · Reporting & event store ○ pending stub

SLAMonitor tracks the service-level objectives (SLOs) committed to internally and to users, measures error-budget burn rate, and emits alerts when the burn rate approaches the budget limit. Its hourly SLO status reports are retained for 7 years as a compliance-grade availability record.

v3 readiness

Docs	27/27	done
Impl	0/15	pending
Backtest	0/4	pending
Runtime	0/8	pending

A bot is done when all four scores are.


1. Bot Identity

Layer	Governance
Bot class	Governance Service
Authority	Explain
Status	PLANNED
Readiness	Spec started
Runs before	Nothing — SLAMonitor is a passive observer; runs continuously on metrics
Runs after	Metrics are emitted by all bots in the fleet
Applies to	All service-level objectives defined for the Polytraders fleet
Default mode	`shadow_only`
User-visible	summary-only
Developer owner	Polytraders core

2. Purpose

3. Why This Bot Matters

No SLO tracking
Availability and latency regressions go undetected until users complain; SLA breach evidence is unavailable for compliance.
Error budget burn not tracked
The team consumes the entire error budget without realising it; no time left for planned maintenance.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input	Source	Required?	Use
None — SLAMonitor consumes only internal metrics	`internal`	No	N/A

5. Required Internal Inputs

Input	Source	Required?	Use
Prometheus/OpenMetrics scrape from all fleet bots	`internal.metrics_store`	Yes	Compute SLO compliance and error-budget burn rate.
ExecutionReport stream	`internal.report_bus`	Yes	Track fill-quality SLOs (latency, fill rate) per strategy.

6. Parameter Guide

Parameter	Default	Warning	Hard	What it controls
slo_definitions	`{'fill_latency_ms_p99': 500, 'fill_success_rate_pct': 99.5, 'uptime_pct': 99.9}`	`None`	`None`	Map of SLO name to target value.
burn_rate_alert_pct	`5.0`	`10`	`20`	Alert when hourly error-budget burn rate exceeds this percentage of the monthly budget.

7. Detailed Parameter Instructions

slo_definitions

What it means

Map of SLO name to target value.

Default

{ "slo_definitions": {"fill_latency_ms_p99": 500, "fill_success_rate_pct": 99.5, "uptime_pct": 99.9} }

Why this default matters

Default SLOs reflect the commitments in the Polytraders service agreement.

Threshold logic

Condition	Action
metric_value violates slo_target	Increment error budget consumption; emit SLO_BREACH_DETECTED if budget exhausted

Developer check

if (metric_value > slo.target if slo.type == 'max' else metric_value < slo.target): budgetConsumer.record(slo.name)
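The check depends on the direction of each SLO: a latency target is a ceiling, while success-rate and uptime targets are floors. A minimal Python sketch, assuming a hypothetical `SloDefinition` type whose `kind` field mirrors `slo.type` in the reference pseudocode (the default `slo_definitions` map itself only carries name and target):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    """Hypothetical shape; the spec's default map only stores name -> target."""
    name: str
    target: float
    kind: str  # "max": value must stay at or below target; "min": at or above


def is_compliant(slo: SloDefinition, actual: float) -> bool:
    """Return True when the observed value satisfies the SLO target."""
    if slo.kind == "max":
        return actual <= slo.target
    if slo.kind == "min":
        return actual >= slo.target
    raise ValueError(f"unknown SLO kind: {slo.kind!r}")


# Default targets from this spec; the max/min direction is an assumption.
DEFAULT_SLOS = [
    SloDefinition("fill_latency_ms_p99", 500, "max"),
    SloDefinition("fill_success_rate_pct", 99.5, "min"),
    SloDefinition("uptime_pct", 99.9, "min"),
]
```

With these definitions, a p99 fill latency of 312 ms is compliant, and a fill success rate of 99.0% is a violation even though 99.0 is numerically below its target.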

User-facing English

The system maintains targets for response speed and availability.

burn_rate_alert_pct

What it means

Alert when hourly error-budget burn rate exceeds this percentage of the monthly budget.

Default

{ "burn_rate_alert_pct": 5.0 }

Why this default matters

5% hourly burn means the monthly budget would be exhausted in 20 hours.
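The 20-hour figure follows from dividing the full budget (100%) by the hourly burn. A small sketch of that arithmetic, with a hypothetical helper name and an optional argument for budget already consumed:

```python
def hours_to_exhaustion(hourly_burn_pct: float, consumed_pct: float = 0.0) -> float:
    """Hours until the monthly error budget reaches 100% at a constant burn rate.

    Both arguments are percentages of the monthly budget.
    """
    if hourly_burn_pct <= 0:
        return float("inf")  # no burn, so the budget is never exhausted
    remaining_pct = max(0.0, 100.0 - consumed_pct)
    return remaining_pct / hourly_burn_pct


print(hours_to_exhaustion(5.0))  # 20.0
```

At the Warning and Hard values listed in the Parameter Guide (10 and 20), the same calculation gives 10 and 5 hours respectively.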

Threshold logic

Condition	Action
hourly_burn_rate > burn_rate_alert_pct	Emit SLO_BURN_RATE_EXCEEDED alert

Developer check

if hourly_burn > p.burn_rate_alert_pct: emit('SLO_BURN_RATE_EXCEEDED')

User-facing English

You'll be notified if service quality degrades significantly.

8. Default Configuration

{
  "bot_id": "gov.slamonitor",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "slo_definitions": {
      "fill_latency_ms_p99": 500,
      "fill_success_rate_pct": 99.5,
      "uptime_pct": 99.9
    },
    "burn_rate_alert_pct": 5.0,
    "publish_to_user": true,
    "auto_freeze_on_breach": false
  }
}
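A configuration like this can be checked at load time. The sketch below is illustrative only: it assumes the Warning and Hard columns of the Parameter Guide (10 and 20) bound the configured value of `burn_rate_alert_pct`, an interpretation this spec does not state explicitly, and the function name is hypothetical.

```python
import json

RAW_CONFIG = """
{
  "bot_id": "gov.slamonitor",
  "mode": "shadow_only",
  "defaults": {
    "slo_definitions": {
      "fill_latency_ms_p99": 500,
      "fill_success_rate_pct": 99.5,
      "uptime_pct": 99.9
    },
    "burn_rate_alert_pct": 5.0,
    "auto_freeze_on_breach": false
  }
}
"""

# Assumed meaning of the Parameter Guide's Warning / Hard columns.
WARN_LIMIT = 10.0
HARD_LIMIT = 20.0


def validate_defaults(defaults: dict) -> list[str]:
    """Return warnings; raise ValueError when a value is unusable."""
    warnings: list[str] = []
    slos = defaults.get("slo_definitions") or {}
    if not slos:
        raise ValueError("slo_definitions must contain at least one SLO")
    for name, target in slos.items():
        if isinstance(target, bool) or not isinstance(target, (int, float)):
            raise ValueError(f"{name}: target must be numeric")
        if name.endswith("_pct") and not 0 < target <= 100:
            raise ValueError(f"{name}: percentage target out of range")
    burn = defaults["burn_rate_alert_pct"]
    if burn <= 0 or burn > HARD_LIMIT:
        raise ValueError("burn_rate_alert_pct outside (0, 20]")
    if burn > WARN_LIMIT:
        warnings.append("burn_rate_alert_pct above warning level 10")
    return warnings


config = json.loads(RAW_CONFIG)
print(validate_defaults(config["defaults"]))  # []
```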

9. Implementation Flow

Scrape Prometheus metrics from all fleet bots every 60 seconds.
For each SLO definition, compute current compliance and error-budget consumption.
Compute hourly burn rate as (errors_in_last_hour / monthly_budget * 100).
If burn_rate > burn_rate_alert_pct, emit SLO_BURN_RATE_EXCEEDED alert.
If the error budget is exhausted, emit SLO_BREACH_DETECTED and, if auto_freeze_on_breach is true, freeze deployments.
Emit SettlementReport(event_type=SLO_STATUS) every hour with all SLO compliance metrics.
Retain SettlementReport records for 7 years as compliance-grade availability evidence.
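The burn-rate and exhaustion steps above can be sketched in Python as follows. This is a simplified illustration, not the planned implementation: the `ErrorBudget` class, its method names, and the choice to express the monthly budget as a count of violating samples are all assumptions.

```python
from collections import deque


class ErrorBudget:
    """Tracks violations against a monthly budget of allowed violating samples."""

    def __init__(self, monthly_budget: int) -> None:
        if monthly_budget <= 0:
            raise ValueError("monthly_budget must be positive")
        self.monthly_budget = monthly_budget
        self.total_violations = 0
        self._recent: deque = deque()  # violation timestamps (epoch seconds)

    def record(self, ts: float) -> None:
        self.total_violations += 1
        self._recent.append(ts)

    def hourly_burn_rate_pct(self, now: float) -> float:
        # Drop violations older than one hour, then apply the spec formula:
        # errors_in_last_hour / monthly_budget * 100.
        while self._recent and self._recent[0] <= now - 3600:
            self._recent.popleft()
        return len(self._recent) / self.monthly_budget * 100

    def consumed_pct(self) -> float:
        return self.total_violations / self.monthly_budget * 100

    def exhausted(self) -> bool:
        return self.total_violations >= self.monthly_budget


def alerts_for(budget: ErrorBudget, now: float, burn_rate_alert_pct: float = 5.0) -> list[str]:
    """Return the reason codes that the hourly evaluation would raise."""
    alerts = []
    if budget.hourly_burn_rate_pct(now) > burn_rate_alert_pct:
        alerts.append("SLO_BURN_RATE_EXCEEDED")
    if budget.exhausted():
        alerts.append("SLO_BREACH_DETECTED")
    return alerts
```

For example, six violations within one hour against a budget of 100 give a 6% hourly burn, which crosses the default 5% threshold without exhausting the budget.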

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

// ---- SHARED STATE ----
// Latest per-SLO compliance snapshot: written by the scrape loop, read by the hourly report.
compliance = {}

// ---- SCRAPE LOOP (every 60s) ----
FUNCTION scrapeAndEvaluate():
  metrics = FETCH internal.metricsStore.GET({bots: 'all', window: '1m'})
  IF metrics IS NULL:
    EMIT SettlementReport(event_type='SLO_DATA_GAP')
    alerting.emit('SLO_METRICS_UNAVAILABLE')
    RETURN

  FOR slo IN config.slo_definitions:
    actual = metrics.get(slo.name)
    // 'max' SLOs (e.g. latency) are ceilings; all others (success rate, uptime) are floors.
    compliant = (actual <= slo.target) IF slo.type == 'max' ELSE (actual >= slo.target)
    compliance[slo.name] = {target: slo.target, actual: actual, compliant: compliant}
    IF NOT compliant:
      errorBudget.record(slo.name, violation=True)

// ---- HOURLY REPORT ----
FUNCTION emitHourlyReport(windowStart, windowEnd):
  burnRate = errorBudget.hourlyBurnRate()
  IF burnRate > config.burn_rate_alert_pct:
    alerting.emit('SLO_BURN_RATE_EXCEEDED', {burn_rate: burnRate})
  IF errorBudget.exhausted():
    alerting.emit('SLO_BREACH_DETECTED')
    IF config.auto_freeze_on_breach:
      deploymentManager.freeze()
  EMIT SettlementReport(event_type='SLO_STATUS',
    window_start=windowStart, window_end=windowEnd,
    slo_compliance=compliance,
    error_budget_consumed_pct=errorBudget.consumedPct(),
    hourly_burn_rate_pct=burnRate,
    retained_until=now() + days(2555))

SDK calls used

internal.metricsStore.GET({bots, window})
errorBudget.hourlyBurnRate()
alerting.emit('SLO_BURN_RATE_EXCEEDED', metadata)

Complexity: O(S) per scrape cycle where S = SLO count; O(1) for hourly report

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Prometheus metrics scrape",
  "source": "internal.metrics_store",
  "payload": {
    "fill_latency_ms_p99": 312,
    "fill_success_rate_pct": 99.8,
    "uptime_pct": 100.0,
    "scraped_at_ms": 1746792060000
  }
}

Output — what the bot emits

{
  "label": "SettlementReport — SLO_STATUS",
  "payload": {
    "report_id": "stl_sla_01HX9Z",
    "event_type": "SLO_STATUS",
    "error_budget_consumed_pct": 1.2,
    "hourly_burn_rate_pct": 0.8,
    "report_kind": "SettlementReport",
    "topic": "polytraders.reports.settlement",
    "retained_until": "2033-05-09"
  }
}

12. Decision Logic

APPROVE

Not applicable — SLAMonitor does not approve trading orders.

RESHAPE_REQUIRED

Not applicable.

REJECT

If auto_freeze_on_breach=true, freezes new deployments on SLO breach.

WARNING_ONLY

Emits SLO_BURN_RATE_EXCEEDED when burn rate threshold is crossed.

13. Standard Decision Output

This bot returns a SettlementReport object. See SettlementReport schema.

{
  "report_id": "stl_slamonitor_01HX9Z",
  "bot_id": "gov.slamonitor",
  "event_type": "SLO_STATUS",
  "window_start": "2026-05-09T09:00:00Z",
  "window_end": "2026-05-09T10:00:00Z",
  "slo_compliance": {
    "fill_latency_ms_p99": {
      "target": 500,
      "actual": 312,
      "compliant": true
    },
    "fill_success_rate_pct": {
      "target": 99.5,
      "actual": 99.8,
      "compliant": true
    },
    "uptime_pct": {
      "target": 99.9,
      "actual": 100.0,
      "compliant": true
    }
  },
  "error_budget_consumed_pct": 1.2,
  "hourly_burn_rate_pct": 0.8,
  "report_kind": "SettlementReport",
  "topic": "polytraders.reports.settlement",
  "retained_until": "2033-05-09"
}
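The reference pseudocode computes `retained_until` as `now() + days(2555)`, while the example above shows a date exactly seven calendar years after the window. The two differ by the leap days in the interval. The sketch below shows both computations; which one the compliance requirement intends is not stated in this spec, and the helper names are hypothetical.

```python
from datetime import date, timedelta


def retained_until_days(start: date, days: int = 2555) -> date:
    """Fixed-length retention: 2555 days = 7 x 365, ignoring leap days."""
    return start + timedelta(days=days)


def retained_until_years(start: date, years: int = 7) -> date:
    """Calendar retention: same month and day, `years` later."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is 29 February and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


window = date(2026, 5, 9)
print(retained_until_days(window))   # 2033-05-07
print(retained_until_years(window))  # 2033-05-09
```

Both variants satisfy the property test requiring at least 2555 days of retention, since the calendar-year date is never earlier than the fixed-length one.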

14. Reason Codes

Code	Severity	Meaning	Action	User-facing message
`SLO_STATUS`	INFO	Hourly SLO compliance report emitted.	Log and store.	Service quality is within committed targets.
`SLO_BURN_RATE_EXCEEDED`	WARN	Hourly error-budget burn rate exceeds burn_rate_alert_pct.	Emit alert; include in SLO_STATUS report.	Service quality has degraded; the team has been notified.
`SLO_BREACH_DETECTED`	HARD_REJECT	Error budget exhausted for the month.	Emit alert; optionally freeze deployments.
`SLO_METRICS_UNAVAILABLE`	WARN	Metrics store unavailable; SLO compliance unknown.	Emit SLO_DATA_GAP SettlementReport; alert.
`KILL_SWITCH_ACTIVE`	WARN	KillSwitch active; noted in SLO report as planned downtime.	Exclude kill-switch period from error budget consumption.
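The `KILL_SWITCH_ACTIVE` action excludes kill-switch periods from error-budget consumption. One way to sketch that filter, assuming violations are recorded as epoch-second timestamps and kill-switch periods as half-open `[start, end)` intervals (both representations are assumptions, as is the function name):

```python
from typing import Iterable

Interval = tuple[float, float]  # [start, end) in epoch seconds


def billable_violations(
    violations: Iterable[float],
    kill_switch_windows: Iterable[Interval],
) -> list[float]:
    """Keep only violations that fall outside every kill-switch window."""
    windows = list(kill_switch_windows)
    return [
        ts
        for ts in violations
        if not any(start <= ts < end for start, end in windows)
    ]


# The violation at t=20 falls inside the kill-switch window and is dropped.
print(billable_violations([10, 20, 30], [(15, 25)]))  # [10, 30]
```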

15. Metrics & Logs

Metrics emitted

Metric	Type	Unit	Labels	Meaning
`polytraders_gov_slamonitor_slo_compliance`	gauge	bool	slo_name	Current compliance status per SLO (1=compliant, 0=breaching).
`polytraders_gov_slamonitor_error_budget_consumed_pct`	gauge	percent	slo_name	Percentage of monthly error budget consumed per SLO.
`polytraders_gov_slamonitor_burn_rate_hourly_pct`	gauge	percent		Current hourly burn rate as percentage of monthly budget.
`polytraders_gov_slamonitor_status_reports_total`	counter	count	status	Total SLO status reports emitted by status.

Alerts

Alert	Condition	Severity	Runbook
`SLAMonitorBurnRateHigh`	`polytraders_gov_slamonitor_burn_rate_hourly_pct > 5`	P2	#runbook-slamonitor-burnrate
`SLAMonitorBreach`	`polytraders_gov_slamonitor_error_budget_consumed_pct >= 100`	P1	#runbook-slamonitor-breach
`SLAMonitorMetricsUnavailable`	`absent(polytraders_gov_slamonitor_slo_compliance)`	P2	#runbook-slamonitor-metrics

16. Developer Reporting

{
  "bot_id": "gov.slamonitor",
  "event_type": "METRICS_SCRAPED",
  "slo_name": "fill_latency_ms_p99",
  "actual_value": 312,
  "target_value": 500,
  "compliant": true,
  "scraped_at_ms": 1746792060000
}

17. Plain-English Reporting

Situation	User-facing explanation
SLO status report published	Service quality is within the committed targets. All systems are operating normally.
SLO burn rate alert	Service quality has degraded and is consuming the error budget at a high rate. The team has been notified.

18. Failure-Mode Block

main_failure_mode	Metrics store is unavailable; SLO compliance cannot be computed; error budget calculation stalls.
false_positive_risk	A transient spike in fill latency causes a burn-rate alert that resolves in < 5 minutes.
false_negative_risk	A sustained SLO degradation below the burn-rate threshold goes unalerted.
safe_fallback	If the metrics store is unavailable, emit a SettlementReport with event_type=SLO_DATA_GAP (SLO compliance unknown) and raise the SLO_METRICS_UNAVAILABLE alert.
required_dependencies	internal.metrics_store (Prometheus), internal.report_bus (ExecutionReport), Postgres SLO store

19. Failure-Injection Recipes

Scenario	How to inject	Recovery
`METRICS_STORE_UNAVAILABLE`	Block reads from internal.metrics_store	Automatic resume when metrics store recovers.
`HIGH_BURN_RATE`	Inject 200 fill failures to exhaust fill_success_rate SLO budget	Investigate and resolve fill failures; error budget resets monthly.
`AUTO_FREEZE_ON_BREACH`	Set auto_freeze_on_breach=true; exhaust error budget	Manual unfreeze after SLO remediation.

20. State & Persistence

Cold-start recovery

On restart, reload error budget state from last committed SettlementReport.
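A possible shape for that reload, sketched under two assumptions that this spec does not confirm: the budget resets on calendar-month boundaries in UTC, and the last committed report is available as a dict with the `window_end` and `error_budget_consumed_pct` fields shown in Standard Decision Output. The function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def restore_consumed_pct(last_report: Optional[dict], now: datetime) -> float:
    """Return the error-budget consumption to resume from after a restart."""
    if last_report is None:
        return 0.0  # no history: start with a full budget
    window_end = datetime.fromisoformat(
        last_report["window_end"].replace("Z", "+00:00")
    )
    # Assumed reset rule: a report from an earlier month no longer applies.
    if (window_end.year, window_end.month) != (now.year, now.month):
        return 0.0
    return float(last_report["error_budget_consumed_pct"])


report = {"window_end": "2026-05-09T10:00:00Z", "error_budget_consumed_pct": 1.2}
now = datetime(2026, 5, 9, 10, 30, tzinfo=timezone.utc)
print(restore_consumed_pct(report, now))  # 1.2
```

Violations that occurred between the last committed report and the restart are not recoverable from the report alone, so a real implementation would likely need to backfill that gap from the metrics store.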

21. Concurrency & Idempotency

Aspect	Specification
Execution model	`single-threaded scrape loop + hourly report goroutine`
Max in-flight	`5`
Idempotency key	`window_start`
Per-call timeout (ms)	`10000`
Backpressure strategy	`skip scrape if previous not complete`
Locking / mutual exclusion	`Postgres unique constraint on window_start for hourly reports`
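The idempotency key and unique constraint in the table above can be illustrated as follows. The spec names Postgres; this sketch uses the standard-library `sqlite3` module as a stand-in so it runs without a database server, and the table and function names are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one row allowed per hourly window."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE slo_status_reports ("
        " window_start TEXT PRIMARY KEY,"  # idempotency key
        " payload TEXT NOT NULL)"
    )
    return conn


def commit_report(conn: sqlite3.Connection, window_start: str, payload: str) -> bool:
    """Insert the report; return False if this window was already committed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO slo_status_reports (window_start, payload) VALUES (?, ?)",
                (window_start, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_store()
print(commit_report(conn, "2026-05-09T09:00:00Z", "{}"))  # True
print(commit_report(conn, "2026-05-09T09:00:00Z", "{}"))  # False (duplicate window)
```

Letting the database reject the duplicate means a retried or concurrently running report job cannot publish two reports for the same hour, without needing an application-level lock.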

22. Dependencies

Depends on (must run first)

Bot	Why	Contract
`internal.metrics_store`	All SLO compliance data is sourced from Prometheus metrics.	Metrics available with < 60s staleness.

Emits to (downstream consumers)

Bot	Why	Contract
`internal.post_trade_archive`

Sibling bots (same OrderIntent)

Bot	Why	Contract
gov.incidentcommander	SLAMonitor SLO breach events may trigger IncidentCommander declarations.	SLO_BREACH_DETECTED event includes scope for IncidentCommander.

External services

Service	Endpoint	SLA assumed	On failure
Internal metrics store (Prometheus)	https://metrics.internal	99.9%	Emit SLO_DATA_GAP SettlementReport; alert; resume on recovery.

23. Security Surfaces

Abuse vectors considered

Manipulating metrics to suppress SLO breach detection

Mitigations

Metrics store is read-only for SLAMonitor; write access is restricted to fleet bots only

24. Polymarket V2 Compatibility

Aspect	Value
CLOB version	`v2`
Collateral asset	`pUSD`
EIP-712 Exchange domain version	`2`
Aware of builderCode field	no
Aware of negative-risk markets	no
Multi-chain ready	no
SDK used	`py-clob-client-v2`
Settlement contract	`CTFExchangeV2`
Notes	`SLAMonitor tracks service-level objectives across the Polytraders fleet; no CLOB calls. All latency and budget metrics are pUSD-free.`

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field	Value
spec	`2.0.0`
implementation	`0.1.0`
schema	`2`
released	`None`
planned_release	`Q4-2026`

Migration history

Date	From	To	Reason	Action taken
2026-04-28	n/a	v2-spec	Spec drafted post-CLOB-V2 cutover; bot not yet implemented	Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test	Setup	Expected result
Burn rate alert fires when hourly burn exceeds threshold	hourly_burn=6.0, burn_rate_alert_pct=5.0	SLO_BURN_RATE_EXCEEDED alert emitted
SLO status reports all compliant when all metrics within targets	fill_latency=312, fill_success=99.8, uptime=100.0	SettlementReport with all slo_compliance.compliant=true

Integration Tests

Test	Expected result
Hourly SettlementReport emitted with correct SLO compliance metrics	SettlementReport on polytraders.reports.settlement every 60 minutes

Property Tests

Property	Required behaviour
Every SLO status report is retained for >= 2555 days	Always true

27. Operational Runbook

SLAMonitor incidents require rapid response when error budget burns faster than planned. P1 if budget is exhausted; P2 for high burn rate.

On-call actions

Alert	First step	Diagnosis	Mitigation	Escalate to
`SLAMonitorBreach`
`SLAMonitorBurnRateHigh`

Manual overrides

Healthcheck

/internal/health/slamonitor → green if the metrics store is reachable, all SLOs are compliant, and the burn rate is below burn_rate_alert_pct; red if the metrics store is unreachable or any SLO budget is exhausted
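The green and red conditions can be expressed as a small decision function. The spec does not say what the endpoint reports when neither condition holds (for example, a non-compliant SLO with budget remaining), so the `"degraded"` value below is a hypothetical placeholder, as are the function and parameter names.

```python
def health_status(
    metrics_reachable: bool,
    all_slos_compliant: bool,
    burn_rate_pct: float,
    any_budget_exhausted: bool,
    burn_rate_alert_pct: float = 5.0,
) -> str:
    """Map SLAMonitor state to a healthcheck colour."""
    # Red conditions from the spec take priority.
    if not metrics_reachable or any_budget_exhausted:
        return "red"
    # Green requires all three conditions from the spec.
    if all_slos_compliant and burn_rate_pct < burn_rate_alert_pct:
        return "green"
    # Not defined by the spec; placeholder for the in-between state.
    return "degraded"


print(health_status(True, True, 0.8, False))  # green
```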

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate	How measured	Threshold
Burn rate calculation unit tests pass	CI	Pass

Promote to Limited live

Gate	How measured	Threshold
Hourly SLO status report emitted correctly in staging with 3 SLO definitions	Integration test	Pass

Promote to General live

Gate	How measured	Threshold
30-day SLO report history retained in Postgres; compliance team sign-off	Compliance review	Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement	Status
Purpose defined	✓ done
Required inputs listed	✓ done
Parameters defined	✓ done
Defaults defined	✓ done
Warning thresholds defined	✓ done
Hard thresholds defined	✓ done
Safe fallback defined	✓ done
Structured output defined	✓ done
Developer log defined	✓ done
Plain-English explanation	✓ done
Unit tests defined	✓ done
Integration tests defined	✓ done
Property tests defined	✓ done
Failure-mode block complete	✓ done
Reference implementation pseudocode	✓ done
Wire examples (input + output)	✓ done
Reason codes listed	✓ done
Metrics & logs defined	✓ done
State & persistence defined	✓ done
Concurrency & idempotency defined	✓ done
Dependencies declared	✓ done
Security surfaces declared	✓ done
Polymarket V2 compatibility declared	✓ done
Version & migration history declared	✓ done
Operational runbook defined	✓ done
Promotion gates defined	✓ done
Failure-injection recipes defined	✓ done
