6.18 ReplaySimulator
Re-runs any historical pipeline trace against the current bot revisions to verify that the outcome would be the same (or to surface changes). Used for regression testing, incident post-mortems, and 'what would have happened' reviews. Runs only on recorded ReportEnvelope streams — never against live state.
1. Bot Identity
| Layer | Governance |
|---|---|
| Bot class | Governance |
| Authority | Simulate |
| Status | PLANNED |
| Readiness | Spec ready |
| Runs before | — |
| Runs after | — |
| Applies to | Continuous |
| Default mode | shadow |
| User-visible | Yes |
| Developer owner | Governance pod |
Operational profile
| Ownership | Governance pod · on-call gov-oncall · #polytraders-gov · escalates to Head of Governance · P3 |
|---|---|
| Latency budget | 600000ms |
| Modes supported | off · shadow · advisory · enforced |
| Data freshness | max_market_data_age_ms=0 · max_orderbook_age_ms=0 · on stale → Replay reads only recorded data; live freshness does not apply. |
| Human override | yes · by Governance on-call · logs GOV_REPLAY_OVERRIDE · time-bound: Single job · scope: Single replay window · single approver |
2. Purpose
Re-runs any historical pipeline trace against the current bot revisions to verify that the outcome would be the same (or to surface changes). Used for regression testing, incident post-mortems, and 'what would have happened' reviews. Runs only on recorded ReportEnvelope streams — never against live state.
3. Why This Bot Matters
Regression detection
When a single bot is bumped, the simplest correctness check is to replay yesterday's traffic and diff the outcomes.
Postmortem reproducibility
An incident review that cannot reproduce the exact decision is just speculation.
Promotion gate
Templates require a passing replay against the canonical fixture set before promoting from shadow to advisory.
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
4. Required Polymarket Inputs
— not yet authored —
5. Required Internal Inputs
| Input | Source | Required? | Use |
|---|---|---|---|
| Recorded ReportEnvelope stream | ReportEnvelope archive | Yes | Source of inputs to replay. |
| Current bot revisions | Bot registry | Yes | Target system to replay against. |
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|---|---|---|---|
| max_replay_minutes | 60 | 60 | 180 | Maximum replay duration in a single run. |
| tolerance_bps | 5 | — | — | Tolerance in basis points for numeric outputs (slippage, cost) before a diff is flagged as REGRESSION. |
7. Detailed Parameter Instructions
max_replay_minutes
What it means
Maximum replay duration in a single run.
Default
{ "max_replay_minutes": 60 }
Why this default matters
60 minutes is enough for a single incident window without overwhelming the simulator.
Threshold logic
| Condition | Action |
|---|---|
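| Replay window ≤ 60 minutes | Run as a single replay (default) |
| Replay window > 60 minutes | Split into chunks of at most max_replay_minutes |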
Developer check
if (replay_window > p.max_replay_minutes) chunk();
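The developer check above does not say how `chunk()` splits a window. One possible reading, sketched in Python: cut the window into contiguous half-open sub-windows no longer than `max_replay_minutes`. The function name `chunk_window` and the half-open convention are assumptions for illustration, not part of the spec.

```python
from typing import Iterator, Tuple

MS_PER_MINUTE = 60_000


def chunk_window(
    start_ms: int, end_ms: int, max_replay_minutes: int = 60
) -> Iterator[Tuple[int, int]]:
    """Yield contiguous [chunk_start, chunk_end) sub-windows.

    Hypothetical helper: each chunk is at most max_replay_minutes long,
    and the final chunk is shortened to end exactly at end_ms.
    """
    if end_ms < start_ms:
        raise ValueError("end_ms must not precede start_ms")
    if max_replay_minutes <= 0:
        raise ValueError("max_replay_minutes must be positive")
    step_ms = max_replay_minutes * MS_PER_MINUTE
    cursor = start_ms
    while cursor < end_ms:
        chunk_end = min(cursor + step_ms, end_ms)
        yield cursor, chunk_end
        cursor = chunk_end


# A 150-minute window with the default 60-minute cap.
window_start = 1715260000000
window_end = window_start + 150 * MS_PER_MINUTE
chunks = list(chunk_window(window_start, window_end))
```

With the default cap, a window that is exactly 60 minutes long (as in the wire example) yields a single chunk, so the common incident-review case runs unchanged.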
User-facing English
(Internal.)
tolerance_bps
What it means
Tolerance in basis points for numeric outputs (slippage, cost) before a diff is flagged as REGRESSION.
Default
{ "tolerance_bps": 5 }
Why this default matters
5 bps sits above the normal noise floor on Polymarket but is tight enough to catch real changes.
Threshold logic
| Condition | Action |
|---|---|
| ≤ 5 bps | MATCH |
| > 5 bps | REGRESSION |
Developer check
if (abs(now - then) > p.tolerance_bps) flag('REGRESSION');
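As a minimal Python sketch of the check above, assuming both values are already expressed in basis points (as with `cost_estimate.slippage_bps`); the function name `classify_bps` is hypothetical. How non-bps outputs such as cost are converted is not specified here and is left out.

```python
def classify_bps(then_bps: float, now_bps: float, tolerance_bps: float = 5) -> str:
    """Return 'REGRESSION' when the absolute drift exceeds the tolerance.

    A drift exactly equal to the tolerance still counts as a MATCH,
    mirroring the '<= 5 bps' row in the threshold table.
    """
    drift = abs(now_bps - then_bps)
    return "REGRESSION" if drift > tolerance_bps else "MATCH"


# Values taken from the sample digest: recorded 35 bps, replayed 41 bps.
sample_result = classify_bps(then_bps=35, now_bps=41)
boundary_result = classify_bps(then_bps=35, now_bps=40)
```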
User-facing English
(Internal.)
8. Default Configuration
{
"max_replay_minutes": 60,
"tolerance_bps": 5
}
9. Implementation Flow
— not yet authored —
10. Reference Implementation
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.
matches = 0
regressions = []
for env in archive.window(job.start_ms, job.end_ms):
    out = sandbox.run(current_bots, env.input)
    if differs(out, env.recorded_output, p.tolerance_bps):
        regressions.append(diff_record(env, out))
    else:
        matches += 1
emit('ReplayDigest', job, matches, regressions[:10])
11. Wire Examples
Input — what arrives on the wire
{
"job_id": "replay_001",
"window_start_ms": 1715260000000,
"window_end_ms": 1715263600000
}
Output — what the bot emits
{
"kind": "ReplayDigest",
"matches": 1042,
"regressions": 3
}
12. Decision Logic
APPROVE
Strict input match (same intent_id, same payload). Tolerance comparison on numeric outputs. Exact match on enum/categorical outputs.
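The comparison rules above could be sketched in Python as follows. This assumes outputs are flat dictionaries keyed by dotted field names (as in the `field` value of the sample digest) and that numeric fields are already in basis points; `inputs_match` and `diff_outputs` are hypothetical names.

```python
from numbers import Real
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it from numeric comparison.
    return isinstance(value, Real) and not isinstance(value, bool)


def inputs_match(recorded: Dict[str, Any], replayed: Dict[str, Any]) -> bool:
    """Strict input match: same intent_id and same payload."""
    return (
        recorded.get("intent_id") == replayed.get("intent_id")
        and recorded.get("payload") == replayed.get("payload")
    )


def diff_outputs(
    recorded: Dict[str, Any], replayed: Dict[str, Any], tolerance_bps: float = 5
) -> List[Dict[str, Any]]:
    """Return one record per field that counts as changed.

    Numeric fields use the tolerance; everything else (enums, strings,
    missing fields) must match exactly.
    """
    diffs = []
    for field in sorted(set(recorded) | set(replayed)):
        then, now = recorded.get(field), replayed.get(field)
        if _is_number(then) and _is_number(now):
            changed = abs(now - then) > tolerance_bps
        else:
            changed = then != now
        if changed:
            diffs.append({"field": field, "then": then, "now": now})
    return diffs


recorded_out = {"cost_estimate.slippage_bps": 35, "decision": "APPROVE"}
replayed_out = {"cost_estimate.slippage_bps": 41, "decision": "APPROVE"}
found = diff_outputs(recorded_out, replayed_out)
```

An empty list from `diff_outputs` corresponds to a match; any entry would make the envelope a regression.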
RESHAPE_REQUIRED
This bot does not reshape orders.
REJECT
No reject path defined for this bot — it is observe-only.
WARNING_ONLY
No warn-only path defined.
13. Standard Decision Output
This bot emits an OperationsReport with kind=ReplayDigest, not a RiskVote.
{
"kind": "ReplayDigest",
"window_start_ms": 1715260000000,
"window_end_ms": 1715263600000,
"matches": 1042,
"regressions": 3,
"first_regressions": [
{
"intent_id": "intent_001",
"field": "cost_estimate.slippage_bps",
"then": 35,
"now": 41
}
]
}
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|---|---|---|---|
| GOV_REPLAY_MATCH | P3 | Replayed output matched the recorded output within tolerance. | See decision output and developer log for context. | Replays past activity through the current system to confirm nothing important changed. |
| GOV_REPLAY_REGRESSION | P3 | Replayed output differed from the recorded output beyond tolerance. | See decision output and developer log for context. | Replays past activity through the current system to confirm nothing important changed. |
| GOV_REPLAY_ABORTED | P3 | Replay was aborted before completion (for example, the sandbox could not guarantee no-network mode). | See decision output and developer log for context. | Replays past activity through the current system to confirm nothing important changed. |
| GOV_REPLAY_NO_NETWORK_VIOLATION | P3 | A bot attempted an external call inside the replay sandbox. | See decision output and developer log for context. | Replays past activity through the current system to confirm nothing important changed. |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|---|---|---|---|
| replay_jobs_total | counter | event | bot_id | Replay jobs total. |
| replay_matches_total | counter | event | bot_id | Replay matches total. |
| replay_regressions_total | counter | event | bot_id | Replay regressions total. |
| replay_aborts_total | counter | event | bot_id | Replay aborts total. |
Dashboards
- 6.18 overview dashboard
16. Developer Reporting
"Per replay: job_id, window, total_inputs, matches, regressions, runtime_ms."
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|---|
| When this bot acts | Replays past activity through the current system to confirm nothing important changed. |
18. Failure-Mode Block
| main_failure_mode | Replay sandbox accidentally calls a live network endpoint. |
|---|---|
| false_positive_risk | Time-dependent outputs (anything reading now_ms()) flagged as regressions; mitigation: the replay runtime injects a frozen clock. |
| false_negative_risk | Bot uses external state not captured in the recording; mitigation: bots that read external state must declare it in `data_freshness.max_external_feed_age_ms` and recordings include it. |
| safe_fallback | If the sandbox cannot guarantee no-network mode, abort the replay and emit ReplayDigest with status=ABORTED. |
| required_dependencies | — |
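The frozen-clock mitigation in the table above can be illustrated with a small Python sketch. It assumes bots receive their clock as an injected callable instead of reading wall time directly; `FrozenClock` and `quote_age_ms` are hypothetical names, and the real replay runtime's injection mechanism is not described in this spec.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FrozenClock:
    """Clock pinned to the timestamp captured in the recording."""

    recorded_ms: int

    def now_ms(self) -> int:
        return self.recorded_ms


def quote_age_ms(quote_ts_ms: int, now_ms: Callable[[], int]) -> int:
    """Example of a time-dependent output: age of a quote at decision time."""
    return now_ms() - quote_ts_ms


# Replaying twice with the same frozen clock yields the same age, so the
# time-dependent field cannot drift between the recording and the replay.
clock = FrozenClock(recorded_ms=1715260000500)
first_run = quote_age_ms(1715260000000, clock.now_ms)
second_run = quote_age_ms(1715260000000, clock.now_ms)
```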
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|---|---|---|
| Inject a deliberately wrong output and assert the regression is surfaced | Inject a deliberately wrong output and assert the regression is surfaced. | Bot detects within its latency budget and emits the corresponding reason code. | Remove the injected fault; bot returns to healthy state within one debounce window. |
| Block the archive read and assert ABORTED status | Block the archive read and assert ABORTED status. | Bot detects within its latency budget and emits the corresponding reason code. | Remove the injected fault; bot returns to healthy state within one debounce window. |
20. State & Persistence
Replay archive index. Job history. No live state.
State stores
| Name | Kind | Key | Value shape | TTL | Durability |
|---|---|---|---|---|---|
| replay_simulator_state | in-memory + fast KV mirror | bot_id | Replay archive index. Job history. No live state. | 24h | crash-safe via KV mirror |
Cold-start recovery
Cold-start hydrates from fast KV; missing keys default to safe fallback.
On restart
All in-flight decisions are re-evaluated; no bot decision is trusted across restart without re-emit.
21. Concurrency & Idempotency
| Aspect | Specification |
|---|---|
| Execution model | Runs in a sandbox process pool. Concurrent jobs allowed; each is isolated. |
| Max in-flight | 32 |
| Idempotency key | order_intent_id |
| Replay-safe | True |
| Deduplication | By idempotency_key within a 60s window. |
| Ordering guarantees | Per-market_id FIFO; cross-market unordered. |
| Per-call timeout (ms) | 250 |
| Backpressure strategy | Bounded queue; oldest-dropped with metric increment when full. |
| Locking / mutual exclusion | Per-market_id mutex; no global locks. |
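The deduplication and backpressure rows above could be combined in a queue like the following Python sketch. The class name `BoundedJobQueue`, the `capacity` parameter, and the `dropped_total` counter are assumptions for illustration; the spec does not state the queue capacity or the metric name.

```python
from collections import deque
from typing import Deque, Dict, Optional


class BoundedJobQueue:
    """Bounded FIFO with oldest-dropped backpressure and keyed deduplication."""

    def __init__(self, capacity: int, dedup_window_ms: int = 60_000) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._dedup_window_ms = dedup_window_ms
        self._queue: Deque[str] = deque()
        # Last accepted timestamp per idempotency key. A production version
        # would also evict stale keys; that is omitted in this sketch.
        self._last_seen_ms: Dict[str, int] = {}
        self.dropped_total = 0  # stand-in for the metric increment

    def offer(self, idempotency_key: str, now_ms: int) -> bool:
        """Enqueue unless the key was accepted within the dedup window."""
        last = self._last_seen_ms.get(idempotency_key)
        if last is not None and now_ms - last < self._dedup_window_ms:
            return False
        self._last_seen_ms[idempotency_key] = now_ms
        if len(self._queue) >= self._capacity:
            self._queue.popleft()  # drop the oldest entry to make room
            self.dropped_total += 1
        self._queue.append(idempotency_key)
        return True

    def poll(self) -> Optional[str]:
        """Return the oldest queued key, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

For example, with `capacity=2`, offering keys `a`, `b`, and `c` in order drops `a` and increments `dropped_total`, while re-offering `b` within 60 seconds is rejected as a duplicate.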
22. Dependencies
| Consumes | ReportEnvelopeArchive |
|---|---|
| Emits | OperationsReport(kind=ReplayDigest) |
| Blocks orders | no |
23. Security Surfaces
Sandbox network is fully blocked. Only replay archive read access.
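As an in-process illustration only, the following Python sketch makes any socket creation fail while a replay runs. It is not a substitute for OS-level or container-level isolation, which is presumably what "fully blocked" refers to; the names `no_network` and `NetworkViolation` are hypothetical.

```python
import socket
from contextlib import contextmanager
from typing import Iterator


class NetworkViolation(RuntimeError):
    """Raised when replayed code tries to open a socket."""


@contextmanager
def no_network() -> Iterator[None]:
    """Temporarily replace socket.socket so every creation attempt raises.

    Code that already holds an open socket, or that bypasses the socket
    module, is not covered by this guard.
    """
    original = socket.socket

    def _blocked(*args, **kwargs):
        raise NetworkViolation(
            "GOV_REPLAY_NO_NETWORK_VIOLATION: socket creation attempted during replay"
        )

    socket.socket = _blocked
    try:
        yield
    finally:
        # Always restore the real constructor, even if the replay failed.
        socket.socket = original
```

A replay runner could wrap each job in `with no_network():` and translate a `NetworkViolation` into an aborted digest, in line with the safe fallback in the failure-mode block.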
Signing surface
None — bot does not sign or submit.
Mitigations
- Rate-limit per source
- Audit-log every override
- Require role-based authz on admin paths
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|---|
| CLOB version | V2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | yes |
| Aware of negative-risk markets | yes |
| Multi-chain ready | yes |
| SDK used | Polymarket CLOB V2 SDK |
| Settlement contract | CTFExchangeV2 |
| Notes | Replays against V2 bot revisions only. |
25. Versioning & Migration
| Field | Value |
|---|---|
| current | 0.1.0 |
| contract_version | 1.0.0 |
| last_breaking_change | none |
| deprecation_window_days | 30 |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|---|---|
| A replay against an unchanged bot version reports zero regressions on its golden traces. | Synthetic fixture per template. | Behaviour matches the rule described in the test name. |
Integration Tests
| Test | Expected result |
|---|---|
| Replay 1 hour of recorded traffic through a deliberately changed bot and assert regressions are surfaced. | End-to-end behaviour matches the spec without manual intervention. |
Property Tests
| Property | Required behaviour |
|---|---|
| Match count + regression count equals total input count for any non-aborted run. | Always true across all generated inputs. |
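The property above can be exercised with randomly generated inputs. The Python sketch below uses only the standard library and a simplified replay loop in which each output is a single numeric value in basis points; `run_replay`, `check_count_invariant`, and the envelope field names other than `intent_id` are assumptions for illustration.

```python
import random
from typing import Any, Callable, Dict, List


def run_replay(
    envelopes: List[Dict[str, Any]],
    run_bot: Callable[[float], float],
    tolerance_bps: float = 5,
) -> Dict[str, Any]:
    """Replay every envelope and summarise the result as a digest."""
    matches = 0
    regressions: List[Dict[str, Any]] = []
    for env in envelopes:
        then = env["recorded_output"]
        now = run_bot(env["input"])
        if abs(now - then) > tolerance_bps:
            regressions.append({"intent_id": env["intent_id"], "then": then, "now": now})
        else:
            matches += 1
    return {
        "kind": "ReplayDigest",
        "matches": matches,
        "regressions": len(regressions),
        "first_regressions": regressions[:10],  # capped as in the pseudocode
    }


def check_count_invariant(trials: int = 200, seed: int = 0) -> None:
    """Assert matches + regressions == total inputs across random runs."""
    rng = random.Random(seed)
    for _ in range(trials):
        total = rng.randint(0, 50)
        envelopes = [
            {
                "intent_id": f"intent_{i:03d}",
                "input": rng.randint(0, 100),
                "recorded_output": rng.randint(0, 100),
            }
            for i in range(total)
        ]
        digest = run_replay(envelopes, run_bot=lambda value: value)
        assert digest["matches"] + digest["regressions"] == total
        assert len(digest["first_regressions"]) <= 10


check_count_invariant()
```

A property-testing library could replace the hand-rolled generator; the fixed seed here just keeps the check reproducible.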
27. Operational Runbook
If replays fail with NO_NETWORK_VIOLATION, the offending bot leaks an external call — file a P2 issue.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|---|---|---|---|
| 6.18_anomaly | Open the bot's reporting page and confirm the alert is real (not a metric hiccup). | Inspect developer log entries for the affected market_id over the last 30 minutes. | Force-clear via Admin UI if the rule is clearly stale; otherwise leave engaged and notify owner. | Governance pod |
Manual overrides
`polytraders bot pause 6.18`: Disables the bot's enforcement layer; downstream consumers fall back to safe defaults.
Healthcheck
GET /healthz/replay_simulator → 200 if last successful evaluation < 60s ago.
28. Promotion Gates
A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.
Promote to Shadow
| Gate | How measured | Threshold |
|---|---|---|
| Stub | golden replay reports zero regressions. | Documented threshold met for the full window. |
Promote to Limited live
| Gate | How measured | Threshold |
|---|---|---|
| Shadow | 14 days running on a daily window. | Documented threshold met for the full window. |
| Advisory | 7 days. | Documented threshold met for the full window. |
Promote to General live
| Gate | How measured | Threshold |
|---|---|---|
| Enforced | every promotion through the modes ladder requires a passing replay digest. | Documented threshold met for the full window. |
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |