1. Bot Identity
| Field | Value |
|---|---|
| Layer | Governance |
| Bot class | Governance Service |
| Authority | Explain |
| Status | PLANNED |
| Readiness | Spec started |
| Runs before | StrategyRegistry promotion decision |
| Runs after | Shadow or limited-live deployment of a strategy variant |
| Applies to | All strategies in shadow or limited-live experiment mode |
| Default mode | shadow_only |
| User-visible | no |
| Developer owner | Polytraders core |
2. Purpose
ExperimentTracker manages shadow and limited-live A/B experiments, records matched-pair samples, computes confidence intervals, and emits a drift signal to StrategyRegistry when a variant underperforms.
3. Why This Bot Matters
No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.
6. Parameter Guide
| Parameter | Default | Warning | Hard | What it controls |
|---|---|---|---|---|
| min_samples_for_decision | 100 | None | None | Minimum matched-pair samples before a winner can be declared. |
| traffic_split_pct | 10 | 50 | 100 | Percentage of live traffic routed to the variant. |
7. Detailed Parameter Instructions
min_samples_for_decision
What it means
Minimum matched-pair samples before a winner can be declared.
Default
{ "min_samples_for_decision": 100 }
Why this default matters
100 samples gives a reasonable confidence interval for most strategies.
Threshold logic
| Condition | Action |
|---|---|
| samples < min_samples_for_decision | Do not declare winner; emit EXPERIMENT_INSUFFICIENT_SAMPLES |
Developer check
if samples < p.min_samples_for_decision: emit('EXPERIMENT_INSUFFICIENT_SAMPLES')
User-facing English
The experiment needs enough data before a conclusion can be drawn.
traffic_split_pct
What it means
Percentage of live traffic routed to the variant.
Default
{ "traffic_split_pct": 10 }
Why this default matters
10% limits exposure during shadow phase.
Threshold logic
| Condition | Action |
|---|---|
| traffic_split_pct > 50 | WARN; require human sign-off |
Developer check
if p.traffic_split_pct > 50: emit('EXPERIMENT_LARGE_SPLIT_WARN')
User-facing English
A small portion of traffic is used for the experiment.
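The two developer checks above can be combined into a single function. The following is a minimal Python sketch, not the service's actual code: the `Params` dataclass and the `check_thresholds` name are hypothetical, and rejecting a split outside 0..100 with a `ValueError` is an assumption, since the spec lists a Hard limit of 100 without saying what it triggers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Params:
    # Defaults taken from the Parameter Guide table.
    min_samples_for_decision: int = 100
    traffic_split_pct: float = 10.0


def check_thresholds(samples: int, p: Params) -> List[str]:
    """Return the reason codes triggered by the current state."""
    # Assumption: a split outside 0..100 is a configuration error.
    if not 0 <= p.traffic_split_pct <= 100:
        raise ValueError("traffic_split_pct must be between 0 and 100")
    codes: List[str] = []
    if samples < p.min_samples_for_decision:
        codes.append("EXPERIMENT_INSUFFICIENT_SAMPLES")
    if p.traffic_split_pct > 50:
        codes.append("EXPERIMENT_LARGE_SPLIT_WARN")
    return codes


print(check_thresholds(50, Params()))
print(check_thresholds(150, Params(traffic_split_pct=60)))
```

Returning a list keeps both checks independent, so an experiment with too few samples and a large split reports both codes.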
8. Default Configuration
{
"bot_id": "gov.experimenttracker",
"version": "0.1.0",
"mode": "shadow_only",
"defaults": {
"min_samples_for_decision": 100,
"traffic_split_pct": 10,
"auto_promote_on_winning": false,
"require_human_signoff": true
}
}
9. Implementation Flow
- On experiment start, assign variant_id and record traffic_split_pct and baseline strategy slug.
- For each matched pair (shadow fill vs live fill), record edge, slippage, and fill quality in pUSD.
- Compute running confidence intervals on edge delta between variant and control.
- When samples >= min_samples_for_decision and CI is significant, emit EXPERIMENT_RESULT report.
- If variant underperforms control by > 2 sigma, emit drift signal to StrategyRegistry.
- If require_human_signoff=true, block auto-promote even when variant wins.
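The running statistics and the 2-sigma drift check in the steps above could be maintained incrementally rather than recomputed from all samples. The sketch below uses Welford's online algorithm. The `RunningDelta` class is hypothetical, and reading "2 sigma" as two standard errors of the mean edge delta is an assumption that the spec does not confirm.

```python
import math


class RunningDelta:
    """Online mean/variance of edge_delta_bps (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, delta_bps: float) -> None:
        self.n += 1
        d = delta_bps - self.mean
        self.mean += d / self.n
        self._m2 += d * (delta_bps - self.mean)

    def std_error(self) -> float:
        if self.n < 2:
            return math.inf  # not enough data to estimate spread
        variance = self._m2 / (self.n - 1)
        return math.sqrt(variance / self.n)

    def underperforms(self, sigmas: float = 2.0) -> bool:
        # Assumption: "2 sigma" means mean delta below -2 standard errors.
        return self.n >= 2 and self.mean < -sigmas * self.std_error()


tracker = RunningDelta()
for delta in (-5.0, -6.0, -4.0, -5.0, -7.0):
    tracker.add(delta)
print(tracker.mean, tracker.underperforms())
```

Each update is O(1), which matters if the check runs on every matched pair instead of only at evaluation time.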
10. Reference Implementation
Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.
// ---- EXPERIMENT START ----
FUNCTION startExperiment(config):
  exp = {id: generateULID(), variant: config.variant_slug,
         control: config.control_slug, samples: [], started_at: now()}
  postgres.insert('experiments', exp)
  EMIT OperationsReport(event_type='EXPERIMENT_STARTED', experiment_id=exp.id)

// ---- SAMPLE RECORDING ----
FUNCTION recordSample(variantFill, controlFill, experimentId):
  // Edge difference expressed in basis points of the control notional.
  delta_bps = (variantFill.edge_pusd - controlFill.edge_pusd) / controlFill.notional * 10000
  postgres.insert('experiment_samples', {experiment_id: experimentId,
      variant_fill_pusd: variantFill.size_pusd,
      control_fill_pusd: controlFill.size_pusd,
      edge_delta_bps: delta_bps, recorded_at: now()})

// ---- RESULT EVALUATION ----
FUNCTION evaluateExperiment(experimentId):
  samples = postgres.select('experiment_samples', WHERE experiment_id=experimentId)
  IF len(samples) < config.min_samples_for_decision:
    EMIT OperationsReport(event_type='EXPERIMENT_INSUFFICIENT_SAMPLES', experiment_id=experimentId)
    RETURN
  ci = computeCI95(samples)
  IF ci.low > 0: verdict = 'variant_wins'
  ELSE IF ci.high < 0: verdict = 'control_wins'
  ELSE: verdict = 'inconclusive'
  EMIT OperationsReport(event_type='EXPERIMENT_RESULT', experiment_id=experimentId,
      verdict=verdict, ci_95_low=ci.low, ci_95_high=ci.high)
  IF verdict == 'control_wins':
    // experimentId is only an ID; load the record to get the variant slug.
    exp = postgres.select('experiments', WHERE id=experimentId)
    strategyRegistry.sendDriftSignal(exp.variant)
SDK calls used
- postgres.insert('experiments', exp)
- postgres.select('experiment_samples', ...)
- strategyRegistry.sendDriftSignal(slug)
Complexity: O(S) per evaluation where S = sample count
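The pseudocode leaves `computeCI95` undefined. One possible reading, sketched in Python below, is a normal-approximation interval for the mean edge delta (z = 1.96). That choice is an assumption, and a t-interval may be more appropriate near the minimum sample count. The names `compute_ci95`, `classify`, and `CI` are hypothetical.

```python
import math
import statistics
from typing import NamedTuple, Sequence


class CI(NamedTuple):
    low: float
    high: float


def compute_ci95(deltas_bps: Sequence[float]) -> CI:
    """95% CI for the mean edge delta, normal approximation (z = 1.96)."""
    if len(deltas_bps) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(deltas_bps)
    se = statistics.stdev(deltas_bps) / math.sqrt(len(deltas_bps))
    return CI(mean - 1.96 * se, mean + 1.96 * se)


def classify(ci: CI) -> str:
    """Map an interval to the verdict strings used in EXPERIMENT_RESULT."""
    if ci.low > 0:
        return "variant_wins"
    if ci.high < 0:
        return "control_wins"
    return "inconclusive"


samples = [3.0, 4.0, 2.0, 5.0, 1.0] * 20  # illustrative data, not real fills
ci = compute_ci95(samples)
print(classify(ci), round(ci.low, 2), round(ci.high, 2))
```

This pass is O(S) in the sample count, consistent with the complexity noted above.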
11. Wire Examples
Input — what arrives on the wire
{
"label": "Matched-pair sample",
"source": "internal.report_bus",
"payload": {
"experiment_id": "exp_sports_v2",
"variant_fill_pusd": 430.0,
"control_fill_pusd": 415.0,
"recorded_at_ms": 1746792060000
}
}
Output — what the bot emits
{
"label": "OperationsReport — EXPERIMENT_RESULT",
"payload": {
"report_id": "ops_exp_01HX9Z",
"event_type": "EXPERIMENT_RESULT",
"verdict": "variant_wins",
"ci_95_low": 1.1,
"ci_95_high": 5.3,
"report_kind": "OperationsReport",
"topic": "polytraders.reports.operations"
}
}
12. Decision Logic
APPROVE
Not applicable — ExperimentTracker records statistical outcomes; it does not approve promotions.
RESHAPE_REQUIRED
Not applicable.
REJECT
Emits drift signal if variant underperforms; StrategyRegistry handles demotion.
WARNING_ONLY
EXPERIMENT_LARGE_SPLIT_WARN when traffic_split_pct > 50.
13. Standard Decision Output
This bot returns an OperationsReport object. See OperationsReport schema.
{
"report_id": "ops_experimenttracker_01HX9Z",
"bot_id": "gov.experimenttracker",
"event_type": "EXPERIMENT_RESULT",
"experiment_id": "exp_sports_v2",
"variant_slug": "sports-model-v2",
"control_slug": "sports-model",
"samples": 150,
"edge_delta_bps": 3.2,
"ci_95_low": 1.1,
"ci_95_high": 5.3,
"verdict": "variant_wins",
"report_kind": "OperationsReport",
"topic": "polytraders.reports.operations"
}
14. Reason Codes
| Code | Severity | Meaning | Action | User-facing message |
|---|---|---|---|---|
| EXPERIMENT_STARTED | INFO | A new experiment was registered. | Log and emit OperationsReport. | |
| EXPERIMENT_RESULT | INFO | Experiment concluded with a statistical verdict. | Emit OperationsReport; optionally trigger promotion flow. | |
| EXPERIMENT_INSUFFICIENT_SAMPLES | WARN | Insufficient samples to declare a winner. | Continue sampling. | |
| EXPERIMENT_LARGE_SPLIT_WARN | WARN | traffic_split_pct > 50%; high exposure to variant. | Emit WARN; require human sign-off. | |
| EXPERIMENT_STALLED | WARN | Report bus unavailable; sampling paused. | Pause experiment; emit alert. | |
15. Metrics & Logs
Metrics emitted
| Metric | Type | Unit | Labels | Meaning |
|---|---|---|---|---|
| polytraders_gov_experimenttracker_experiments_total | counter | count | verdict | Total experiments completed by verdict. |
| polytraders_gov_experimenttracker_samples_total | counter | count | experiment_id | Total matched-pair samples recorded. |
| polytraders_gov_experimenttracker_edge_delta_bps | gauge | bps | experiment_id | Running edge delta between variant and control. |
| polytraders_gov_experimenttracker_drift_signals_total | counter | count | slug | Total drift signals sent to StrategyRegistry. |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| ExperimentTrackerStalled | rate(polytraders_gov_experimenttracker_samples_total[30m]) == 0 | P2 | #runbook-experimenttracker-stalled |
| ExperimentTrackerDriftSignal | rate(polytraders_gov_experimenttracker_drift_signals_total[10m]) > 0 | P2 | #runbook-experimenttracker-drift |
16. Developer Reporting
{
"bot_id": "gov.experimenttracker",
"event_type": "SAMPLE_RECORDED",
"experiment_id": "exp_sports_v2",
"sample_n": 47,
"variant_fill_pusd": 430.0,
"control_fill_pusd": 415.0,
"edge_delta_bps": 3.6
}
17. Plain-English Reporting
| Situation | User-facing explanation |
|---|---|
| Experiment concluded with winning variant | The new strategy version performed better in testing and has been flagged for promotion review. |
| Insufficient samples | The experiment is still collecting data. No conclusion yet. |
18. Failure-Mode Block
| Field | Value |
|---|---|
| main_failure_mode | Report bus is unavailable; matched-pair samples cannot be collected, stalling the experiment. |
| false_positive_risk | Small sample size produces a false winner due to variance. |
| false_negative_risk | A genuinely better variant fails to reach significance within the experiment window. |
| safe_fallback | If report bus is unavailable, pause sample collection and emit EXPERIMENT_STALLED warn. |
| required_dependencies | internal.report_bus, gov.strategyregistry |
19. Failure-Injection Recipes
| Scenario | How to inject | Expected behaviour | Recovery |
|---|---|---|---|
| REPORT_BUS_UNAVAILABLE | Block reads from internal.report_bus | Sample collection pauses; EXPERIMENT_STALLED is emitted. | Automatic resume when bus is reachable. |
| INSUFFICIENT_SAMPLES | Set min_samples_for_decision=1000 with only 50 samples collected | No winner is declared; EXPERIMENT_INSUFFICIENT_SAMPLES is emitted. | Continue sampling until threshold reached. |
| DRIFT_SIGNAL | Inject 50 samples where variant edge_delta < -5 bps | Drift signal is sent to StrategyRegistry. | StrategyRegistry demotes variant if configured. |
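The REPORT_BUS_UNAVAILABLE recipe can be exercised in a fixture without a real bus. The following Python sketch is illustrative only: `FlakyBus` and `poll_once` are hypothetical stand-ins, and modelling an unreachable bus as a `ConnectionError` is an assumption about how the real client fails.

```python
from typing import List


class FlakyBus:
    """Hypothetical stand-in for internal.report_bus."""

    def __init__(self) -> None:
        self.available = True
        self.pending = [{"sample_n": 1}, {"sample_n": 2}]

    def read(self) -> dict:
        if not self.available:
            raise ConnectionError("report bus unreachable")
        return self.pending.pop(0)


def poll_once(bus: FlakyBus, recorded: List[dict], events: List[str]) -> None:
    """Read one sample; on bus failure, skip the read and emit EXPERIMENT_STALLED."""
    try:
        recorded.append(bus.read())
    except ConnectionError:
        events.append("EXPERIMENT_STALLED")


bus = FlakyBus()
recorded: List[dict] = []
events: List[str] = []

bus.available = False  # inject REPORT_BUS_UNAVAILABLE
poll_once(bus, recorded, events)

bus.available = True   # bus recovers; sampling resumes on the next poll
poll_once(bus, recorded, events)

print(events, recorded)
```

No sample is lost during the outage in this sketch because the failed read leaves the bus queue untouched.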
20. State & Persistence
Cold-start recovery
On restart, reload active experiments from Postgres; resume sampling from last recorded sample.
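A rough illustration of the cold-start query follows, using Python's built-in `sqlite3` as a stand-in for Postgres so the sketch is self-contained. The `status` column and the `resume_points` helper are assumptions; the spec names the `experiments` and `experiment_samples` tables but does not give their full schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the internal Postgres
conn.executescript(
    """
    CREATE TABLE experiments (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE experiment_samples (experiment_id TEXT, sample_n INTEGER);
    INSERT INTO experiments VALUES ('exp_a', 'active');
    INSERT INTO experiments VALUES ('exp_b', 'done');
    INSERT INTO experiments VALUES ('exp_c', 'active');
    INSERT INTO experiment_samples VALUES ('exp_a', 1);
    INSERT INTO experiment_samples VALUES ('exp_a', 2);
    INSERT INTO experiment_samples VALUES ('exp_b', 1);
    """
)


def resume_points(db: sqlite3.Connection) -> dict:
    """Map each active experiment to the next sample_n to record."""
    rows = db.execute(
        """
        SELECT e.id, COALESCE(MAX(s.sample_n), 0) + 1
        FROM experiments e
        LEFT JOIN experiment_samples s ON s.experiment_id = e.id
        WHERE e.status = 'active'
        GROUP BY e.id
        """
    ).fetchall()
    return dict(rows)


print(resume_points(conn))
```

The `LEFT JOIN` keeps active experiments that have no samples yet, so they restart at `sample_n` 1 instead of being dropped.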
21. Concurrency & Idempotency
| Aspect | Specification |
|---|---|
| Execution model | event-driven; one goroutine per active experiment |
| Max in-flight | 20 |
| Idempotency key | experiment_id + sample_n |
| Per-call timeout (ms) | 5000 |
| Backpressure strategy | queue |
| Locking / mutual exclusion | Postgres unique constraint on (experiment_id, sample_n) |
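The unique constraint on (experiment_id, sample_n) is what makes redelivered samples harmless. The sketch below demonstrates the idea with Python's built-in `sqlite3` in place of Postgres; the column list and the `record_sample` helper are assumptions. On Postgres the equivalent statement would typically use `INSERT ... ON CONFLICT DO NOTHING` rather than SQLite's `INSERT OR IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the internal Postgres
conn.execute(
    """
    CREATE TABLE experiment_samples (
        experiment_id TEXT NOT NULL,
        sample_n INTEGER NOT NULL,
        edge_delta_bps REAL NOT NULL,
        UNIQUE (experiment_id, sample_n)
    )
    """
)


def record_sample(experiment_id: str, sample_n: int, edge_delta_bps: float) -> bool:
    """Insert a sample; return False if this idempotency key already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO experiment_samples VALUES (?, ?, ?)",
        (experiment_id, sample_n, edge_delta_bps),
    )
    conn.commit()
    return cur.rowcount == 1


first = record_sample("exp_sports_v2", 47, 3.6)
replay = record_sample("exp_sports_v2", 47, 3.6)  # duplicate delivery
print(first, replay)
```

Because the database enforces uniqueness, concurrent workers do not need an application-level lock to avoid double counting.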
22. Dependencies
Depends on (must run first)
| Bot | Why | Contract |
|---|---|---|
| internal.report_bus | Matched-pair samples are derived from OperationsReport records on the report bus. | OperationsReport must carry fill metadata. |
Emits to (downstream consumers)
Sibling bots (same OrderIntent)
| Bot | Why | Contract |
|---|---|---|
| gov.backtester | Backtester provides replay-mode baseline data for shadow experiments. | Replay reports carry mode=replay. |
External services
| Service | Endpoint | SLA assumed | On failure |
|---|---|---|---|
| Internal Postgres | postgres://internal | 99.9% | Pause sampling; queue samples in memory; flush on reconnect. |
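The "queue samples in memory; flush on reconnect" behaviour could look roughly like the Python sketch below. `BufferedWriter` is a hypothetical helper, the write callable stands in for the real Postgres insert, and bounding the buffer (dropping the oldest samples when full) is an assumption the spec does not state.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedWriter:
    """Queue samples in memory while the database is down; flush on reconnect."""

    def __init__(self, write: Callable[[dict], None], max_buffer: int = 10_000) -> None:
        self._write = write
        # Assumption: bounded buffer; the oldest samples are dropped when full.
        self._buffer: Deque[dict] = deque(maxlen=max_buffer)

    def submit(self, sample: dict) -> None:
        self._buffer.append(sample)
        self.flush()

    def flush(self) -> int:
        """Write buffered samples in order; stop at the first failure."""
        written = 0
        while self._buffer:
            try:
                self._write(self._buffer[0])
            except ConnectionError:
                break  # still down; keep the sample for the next flush
            self._buffer.popleft()
            written += 1
        return written

    @property
    def pending(self) -> int:
        return len(self._buffer)


db_up = False
stored: List[dict] = []


def fake_insert(sample: dict) -> None:
    """Stand-in for the Postgres insert; fails while db_up is False."""
    if not db_up:
        raise ConnectionError("postgres unreachable")
    stored.append(sample)


writer = BufferedWriter(fake_insert)
writer.submit({"sample_n": 1})
writer.submit({"sample_n": 2})
pending_during_outage = writer.pending

db_up = True
flushed = writer.flush()
print(pending_during_outage, flushed, stored)
```

A sample is removed from the buffer only after its write succeeds, so a failure mid-flush preserves ordering for the next attempt.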
23. Security Surfaces
Abuse vectors considered
- Manipulating sample data to bias experiment toward a preferred variant
Mitigations
- Samples are immutably written to Postgres; no update path exists on experiment_samples
24. Polymarket V2 Compatibility
| Aspect | Value |
|---|---|
| CLOB version | v2 |
| Collateral asset | pUSD |
| EIP-712 Exchange domain version | 2 |
| Aware of builderCode field | no |
| Aware of negative-risk markets | no |
| Multi-chain ready | no |
| SDK used | py-clob-client-v2 |
| Settlement contract | CTFExchangeV2 |
| Notes | ExperimentTracker is an internal analytics service; uses pUSD for all simulated P&L comparisons. |
API surfaces declared
internal
Networks supported
polygon
25. Versioning & Migration
| Field | Value |
|---|---|
| spec | 2.0.0 |
| implementation | 0.1.0 |
| schema | 2 |
| released | None |
| planned_release | Q3-2026 |
Migration history
| Date | From | To | Reason | Action taken |
|---|---|---|---|---|
| 2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain) |
26. Acceptance Tests
Unit Tests
| Test | Setup | Expected result |
|---|---|---|
| Winner not declared before min_samples reached | samples=50, min_samples=100 | EXPERIMENT_INSUFFICIENT_SAMPLES |
| Drift signal emitted when variant underperforms by >2 sigma | edge_delta=-5, sigma=2 | Drift signal sent to StrategyRegistry |
Integration Tests
| Test | Expected result |
|---|---|
| Full experiment lifecycle: start → sample collection → result report → drift signal | OperationsReport with event_type=EXPERIMENT_RESULT emitted |
Property Tests
| Property | Required behaviour |
|---|---|
| auto_promote_on_winning is gated by require_human_signoff | When require_human_signoff=true, auto-promote never fires regardless of verdict |
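Because the inputs to this property are few and discrete, it can be checked by exhaustive enumeration instead of random generation. The Python sketch below assumes a hypothetical `should_auto_promote` gate; the spec describes the rule but not the function that implements it.

```python
from itertools import product

VERDICTS = ("variant_wins", "control_wins", "inconclusive")


def should_auto_promote(verdict: str, auto_promote_on_winning: bool,
                        require_human_signoff: bool) -> bool:
    """Auto-promote only on a win, when enabled, and when no sign-off is required."""
    return (
        verdict == "variant_wins"
        and auto_promote_on_winning
        and not require_human_signoff
    )


def test_signoff_blocks_auto_promote() -> None:
    # Every verdict and every auto_promote setting, with sign-off required.
    for verdict, auto in product(VERDICTS, (True, False)):
        assert not should_auto_promote(verdict, auto, require_human_signoff=True)


test_signoff_blocks_auto_promote()
print("property holds for all", len(VERDICTS) * 2, "combinations")
```

With the default configuration (`auto_promote_on_winning=false`, `require_human_signoff=true`) the gate is closed on both conditions.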
27. Operational Runbook
ExperimentTracker incidents involve stalled sampling (bus unavailable) or drift signals blocking a planned promotion.
On-call actions
| Alert | First step | Diagnosis | Mitigation | Escalate to |
|---|---|---|---|---|
| ExperimentTrackerStalled | | | | |
| ExperimentTrackerDriftSignal | | | | |
Manual overrides
Healthcheck
/internal/health/experimenttracker → green if Postgres is reachable and at least one active experiment has received samples in the last hour; red if no samples have been recorded in 2h for any active experiment.
29. Developer Checklist
Ready-to-ship score: 27/27 sections complete · 100%
| Requirement | Status |
|---|
| Purpose defined | ✓ done |
| Required inputs listed | ✓ done |
| Parameters defined | ✓ done |
| Defaults defined | ✓ done |
| Warning thresholds defined | ✓ done |
| Hard thresholds defined | ✓ done |
| Safe fallback defined | ✓ done |
| Structured output defined | ✓ done |
| Developer log defined | ✓ done |
| Plain-English explanation | ✓ done |
| Unit tests defined | ✓ done |
| Integration tests defined | ✓ done |
| Property tests defined | ✓ done |
| Failure-mode block complete | ✓ done |
| Reference implementation pseudocode | ✓ done |
| Wire examples (input + output) | ✓ done |
| Reason codes listed | ✓ done |
| Metrics & logs defined | ✓ done |
| State & persistence defined | ✓ done |
| Concurrency & idempotency defined | ✓ done |
| Dependencies declared | ✓ done |
| Security surfaces declared | ✓ done |
| Polymarket V2 compatibility declared | ✓ done |
| Version & migration history declared | ✓ done |
| Operational runbook defined | ✓ done |
| Promotion gates defined | ✓ done |
| Failure-injection recipes defined | ✓ done |