Polytraders Dev Guide
internal
v3 spine Phase 1 · Shared contracts 9 demo-wired · 0 shadow-ready · 0 production-live · 100 pending · 109 total 15/33 infra tasks the plan status board

6.9 ExperimentTracker

Governance Governance Service Explain PLANNED Spec started capital · Indirect P7 · Governance & replay pending stub

ExperimentTracker manages shadow and limited-live A/B experiments, records matched-pair samples, computes confidence intervals, and emits a drift signal to StrategyRegistry when a variant underperforms.

v3 readiness

Docs: 27/27 (done)
Impl: 0/15 (pending)
Backtest: 0/4 (pending)
Runtime: 0/8 (pending)

A bot is done when all four scores are complete.

1. Bot Identity

Layer: Governance
Bot class: Governance Service
Authority: Explain
Status: PLANNED
Readiness: Spec started
Runs before: StrategyRegistry promotion decision
Runs after: Shadow or limited-live deployment of a strategy variant
Applies to: All strategies in shadow or limited-live experiment mode
Default mode: shadow_only
User-visible: no
Developer owner: Polytraders core

2. Purpose

ExperimentTracker manages shadow and limited-live A/B experiments, records matched-pair samples, computes confidence intervals, and emits a drift signal to StrategyRegistry when a variant underperforms.

3. Why This Bot Matters

  • No experiment tracking

    Promotions are made without statistical evidence; regressions go undetected.

  • Auto-promote without human sign-off

    A variant with a transient winning streak is promoted before significance is established.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input | Source | Required? | Use
None (ExperimentTracker consumes internal report bus data only) | internal | No | N/A

5. Required Internal Inputs

Input | Source | Required? | Use
Replay-tagged OperationsReport from shadow variant | gov.backtester | Yes | Populate matched-pair samples for the variant.
Live OperationsReport from control strategy | internal.report_bus | Yes | Baseline comparison for edge and fill quality.

6. Parameter Guide

Parameter | Default | Warning | Hard | What it controls
min_samples_for_decision | 100 | None | None | Minimum matched-pair samples before a winner can be declared.
traffic_split_pct | 10 | 50 | 100 | Percentage of live traffic routed to the variant.

7. Detailed Parameter Instructions

min_samples_for_decision

What it means

Minimum matched-pair samples before a winner can be declared.

Default

{ "min_samples_for_decision": 100 }

Why this default matters

100 matched-pair samples give a reasonable confidence interval for most strategies; with fewer, variance alone can produce a false winner.

Threshold logic

Condition | Action
samples < min_samples_for_decision | Do not declare winner; emit EXPERIMENT_INSUFFICIENT_SAMPLES

Developer check

if samples < p.min_samples_for_decision: emit('EXPERIMENT_INSUFFICIENT_SAMPLES')

User-facing English

The experiment needs enough data before a conclusion can be drawn.

traffic_split_pct

What it means

Percentage of live traffic routed to the variant.

Default

{ "traffic_split_pct": 10 }

Why this default matters

A 10% split limits exposure to the variant during the shadow phase.

Threshold logic

Condition | Action
traffic_split_pct > 50 | WARN; require human sign-off

Developer check

if p.traffic_split_pct > 50: emit('EXPERIMENT_LARGE_SPLIT_WARN')

User-facing English

A small portion of traffic is used for the experiment.
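
The two developer checks above can be combined into a single function. The following is a minimal Python sketch, not the shipped implementation: `ExperimentParams`, `check_params`, and the decision to raise `ValueError` outside the 0 to 100 range are illustrative assumptions, while the reason codes and thresholds are taken from the Parameter Guide.

```python
from dataclasses import dataclass

TRAFFIC_SPLIT_WARN_PCT = 50   # warning threshold from the Parameter Guide
TRAFFIC_SPLIT_HARD_PCT = 100  # hard threshold from the Parameter Guide


@dataclass(frozen=True)
class ExperimentParams:
    """Hypothetical container for the two documented parameters."""
    min_samples_for_decision: int = 100
    traffic_split_pct: float = 10.0


def check_params(samples: int, p: ExperimentParams) -> list[str]:
    """Return the reason codes the two developer checks would emit."""
    # Assumption: a split outside 0..100 is a configuration error, not a warning.
    if not 0 <= p.traffic_split_pct <= TRAFFIC_SPLIT_HARD_PCT:
        raise ValueError(f"traffic_split_pct out of range: {p.traffic_split_pct}")

    codes: list[str] = []
    if samples < p.min_samples_for_decision:
        codes.append("EXPERIMENT_INSUFFICIENT_SAMPLES")
    if p.traffic_split_pct > TRAFFIC_SPLIT_WARN_PCT:
        codes.append("EXPERIMENT_LARGE_SPLIT_WARN")
    return codes
```

Returning a list rather than emitting directly keeps the check pure, so it can be verified in a fixture without a report bus.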

8. Default Configuration

{
  "bot_id": "gov.experimenttracker",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "min_samples_for_decision": 100,
    "traffic_split_pct": 10,
    "auto_promote_on_winning": false,
    "require_human_signoff": true
  }
}
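
One way to apply per-experiment overrides on top of this default configuration is sketched below. `resolve_config` is a hypothetical helper, and rejecting unknown keys is an assumption made here to catch typos; the spec does not say how overrides are handled.

```python
import json

# The default configuration exactly as documented above.
DEFAULT_CONFIG = json.loads("""
{
  "bot_id": "gov.experimenttracker",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "min_samples_for_decision": 100,
    "traffic_split_pct": 10,
    "auto_promote_on_winning": false,
    "require_human_signoff": true
  }
}
""")


def resolve_config(overrides: dict) -> dict:
    """Merge overrides into the documented defaults (hypothetical helper)."""
    defaults = DEFAULT_CONFIG["defaults"]
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Assumption: unknown parameters are rejected rather than ignored.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}
```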

9. Implementation Flow

  1. On experiment start, assign variant_id and record traffic_split_pct and baseline strategy slug.
  2. For each matched pair (shadow fill vs live fill), record edge, slippage, and fill quality in pUSD.
  3. Compute running confidence intervals on edge delta between variant and control.
  4. When samples >= min_samples_for_decision and CI is significant, emit EXPERIMENT_RESULT report.
  5. If variant underperforms control by > 2 sigma, emit drift signal to StrategyRegistry.
  6. If require_human_signoff=true, block auto-promote even when variant wins.
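
Step 3 asks for running confidence intervals. The spec does not say how they are computed, so the sketch below is one possible approach: Welford's online algorithm for the mean and variance of edge_delta_bps, with a normal-approximation 95% interval (z = 1.96). The class name `RunningCI` and the choice of z are assumptions.

```python
import math


class RunningCI:
    """Online mean/variance (Welford) with a normal-approximation 95% CI."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, edge_delta_bps: float) -> None:
        """Fold one matched-pair sample into the running statistics."""
        self.n += 1
        delta = edge_delta_bps - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (edge_delta_bps - self.mean)

    def ci95(self) -> tuple[float, float]:
        """Return (low, high); needs at least two samples."""
        if self.n < 2:
            raise ValueError("need at least two samples for a variance estimate")
        variance = self._m2 / (self.n - 1)          # unbiased sample variance
        half_width = 1.96 * math.sqrt(variance / self.n)
        return (self.mean - half_width, self.mean + half_width)
```

Because each sample updates the statistics in constant time, the interval can be refreshed after every matched pair without rereading the sample table.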

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

// ---- EXPERIMENT START ----
FUNCTION startExperiment(config):
  exp = {id: generateULID(), variant: config.variant_slug,
         control: config.control_slug, samples: [], started_at: now()}
  postgres.insert('experiments', exp)
  EMIT OperationsReport(event_type='EXPERIMENT_STARTED', experiment_id=exp.id)

// ---- SAMPLE RECORDING ----
FUNCTION recordSample(variantFill, controlFill, experimentId):
  delta_bps = (variantFill.edge_pusd - controlFill.edge_pusd) / controlFill.notional * 10000
  postgres.insert('experiment_samples', {experiment_id: experimentId,
    variant_fill_pusd: variantFill.size_pusd,
    control_fill_pusd: controlFill.size_pusd,
    edge_delta_bps: delta_bps, recorded_at: now()})

// ---- RESULT EVALUATION ----
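FUNCTION evaluateExperiment(experimentId):
  // Load the experiment row so the variant slug is available for the drift signal.
  exp = postgres.select('experiments', WHERE id=experimentId)
  samples = postgres.select('experiment_samples', WHERE experiment_id=experimentId)
  IF len(samples) < config.min_samples_for_decision:
    EMIT OperationsReport(event_type='EXPERIMENT_INSUFFICIENT_SAMPLES', experiment_id=experimentId)
    RETURN
  ci = computeCI95(samples)   // 95% CI on edge_delta_bps
  IF ci.low > 0 THEN verdict = 'variant_wins'
  ELSE IF ci.high < 0 THEN verdict = 'control_wins'
  ELSE verdict = 'inconclusive'
  EMIT OperationsReport(event_type='EXPERIMENT_RESULT', experiment_id=experimentId,
    verdict=verdict, ci_95_low=ci.low, ci_95_high=ci.high)
  IF verdict == 'control_wins':
    // experimentId is only an ID; the slug lives on the experiment row.
    strategyRegistry.sendDriftSignal(exp.variant)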

SDK calls used

  • postgres.insert('experiments', exp)
  • postgres.select('experiment_samples', ...)
  • strategyRegistry.sendDriftSignal(slug)

Complexity: O(S) per evaluation where S = sample count
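
The edge-delta formula and the verdict rule from the pseudocode translate to Python roughly as follows. The function names are illustrative, and the guard against a non-positive notional is an assumption; the pseudocode does not say what happens when `controlFill.notional` is zero.

```python
def edge_delta_bps(variant_edge_pusd: float,
                   control_edge_pusd: float,
                   control_notional_pusd: float) -> float:
    """Edge difference between variant and control, in basis points of control notional."""
    # Assumption: a zero or negative notional is rejected instead of dividing by it.
    if control_notional_pusd <= 0:
        raise ValueError("control notional must be positive")
    return (variant_edge_pusd - control_edge_pusd) / control_notional_pusd * 10_000


def verdict_from_ci(ci_low: float, ci_high: float) -> str:
    """Classify a 95% CI on edge_delta_bps the way evaluateExperiment does."""
    if ci_low > 0:
        return "variant_wins"      # whole interval above zero
    if ci_high < 0:
        return "control_wins"      # whole interval below zero
    return "inconclusive"          # interval straddles zero
```

For example, a variant edge of 1.5 pUSD against a control edge of 1.0 pUSD on 1000 pUSD of notional is a delta of about 5 bps, and the interval (1.1, 5.3) from the wire example classifies as `variant_wins`.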

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Matched-pair sample",
  "source": "internal.report_bus",
  "payload": {
    "experiment_id": "exp_sports_v2",
    "variant_fill_pusd": 430.0,
    "control_fill_pusd": 415.0,
    "recorded_at_ms": 1746792060000
  }
}

Output — what the bot emits

{
  "label": "OperationsReport — EXPERIMENT_RESULT",
  "payload": {
    "report_id": "ops_exp_01HX9Z",
    "event_type": "EXPERIMENT_RESULT",
    "verdict": "variant_wins",
    "ci_95_low": 1.1,
    "ci_95_high": 5.3,
    "report_kind": "OperationsReport",
    "topic": "polytraders.reports.operations"
  }
}
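
A consumer of the input message might validate it before recording a sample. This is a hedged sketch: `MatchedPairSample` and `parse_sample` are hypothetical names, and the non-negativity check is an assumption; only the field names come from the wire example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MatchedPairSample:
    """Typed view of the matched-pair payload shown above."""
    experiment_id: str
    variant_fill_pusd: float
    control_fill_pusd: float
    recorded_at: datetime


def parse_sample(message: dict) -> MatchedPairSample:
    """Parse one wire message; raises KeyError if a documented field is missing."""
    payload = message["payload"]
    variant = float(payload["variant_fill_pusd"])
    control = float(payload["control_fill_pusd"])
    # Assumption: negative fill sizes are invalid.
    if variant < 0 or control < 0:
        raise ValueError("fill sizes must be non-negative")
    # recorded_at_ms is epoch milliseconds; convert to an aware UTC datetime.
    recorded_at = datetime.fromtimestamp(payload["recorded_at_ms"] / 1000, tz=timezone.utc)
    return MatchedPairSample(str(payload["experiment_id"]), variant, control, recorded_at)
```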

12. Decision Logic

APPROVE

Not applicable — ExperimentTracker records statistical outcomes; it does not approve promotions.

RESHAPE_REQUIRED

Not applicable.

REJECT

Emits drift signal if variant underperforms; StrategyRegistry handles demotion.

WARNING_ONLY

EXPERIMENT_LARGE_SPLIT_WARN when traffic_split_pct > 50.

13. Standard Decision Output

This bot returns an OperationsReport object. See OperationsReport schema.

{
  "report_id": "ops_experimenttracker_01HX9Z",
  "bot_id": "gov.experimenttracker",
  "event_type": "EXPERIMENT_RESULT",
  "experiment_id": "exp_sports_v2",
  "variant_slug": "sports-model-v2",
  "control_slug": "sports-model",
  "samples": 150,
  "edge_delta_bps": 3.2,
  "ci_95_low": 1.1,
  "ci_95_high": 5.3,
  "verdict": "variant_wins",
  "report_kind": "OperationsReport",
  "topic": "polytraders.reports.operations"
}

14. Reason Codes

Code | Severity | Meaning | Action | User-facing message
EXPERIMENT_STARTED | INFO | A new experiment was registered. | Log and emit OperationsReport.
EXPERIMENT_RESULT | INFO | Experiment concluded with a statistical verdict. | Emit OperationsReport; optionally trigger promotion flow.
EXPERIMENT_INSUFFICIENT_SAMPLES | WARN | Insufficient samples to declare a winner. | Continue sampling.
EXPERIMENT_LARGE_SPLIT_WARN | WARN | traffic_split_pct > 50%; high exposure to variant. | Emit WARN; require human sign-off.
EXPERIMENT_STALLED | WARN | Report bus unavailable; sampling paused. | Pause experiment; emit alert.

15. Metrics & Logs

Metrics emitted

Metric | Type | Unit | Labels | Meaning
polytraders_gov_experimenttracker_experiments_total | counter | count | verdict | Total experiments completed by verdict.
polytraders_gov_experimenttracker_samples_total | counter | count | experiment_id | Total matched-pair samples recorded.
polytraders_gov_experimenttracker_edge_delta_bps | gauge | bps | experiment_id | Running edge delta between variant and control.
polytraders_gov_experimenttracker_drift_signals_total | counter | count | slug | Total drift signals sent to StrategyRegistry.

Alerts

Alert | Condition | Severity | Runbook
ExperimentTrackerStalled | rate(polytraders_gov_experimenttracker_samples_total[30m]) == 0 | P2 | #runbook-experimenttracker-stalled
ExperimentTrackerDriftSignal | rate(polytraders_gov_experimenttracker_drift_signals_total[10m]) > 0 | P2 | #runbook-experimenttracker-drift

16. Developer Reporting

{
  "bot_id": "gov.experimenttracker",
  "event_type": "SAMPLE_RECORDED",
  "experiment_id": "exp_sports_v2",
  "sample_n": 47,
  "variant_fill_pusd": 430.0,
  "control_fill_pusd": 415.0,
  "edge_delta_bps": 3.6
}

17. Plain-English Reporting

Situation | User-facing explanation
Experiment concluded with winning variant | The new strategy version performed better in testing and has been flagged for promotion review.
Insufficient samples | The experiment is still collecting data. No conclusion yet.

18. Failure-Mode Block

main_failure_mode: Report bus is unavailable; matched-pair samples cannot be collected, stalling the experiment.
false_positive_risk: Small sample size produces a false winner due to variance.
false_negative_risk: A genuinely better variant fails to reach significance within the experiment window.
safe_fallback: If the report bus is unavailable, pause sample collection and emit an EXPERIMENT_STALLED warning.
required_dependencies: internal.report_bus, gov.strategyregistry

19. Failure-Injection Recipes

Scenario | How to inject | Expected behaviour | Recovery
REPORT_BUS_UNAVAILABLE | Block reads from internal.report_bus | (not specified) | Automatic resume when bus is reachable.
INSUFFICIENT_SAMPLES | Set min_samples_for_decision=1000 with only 50 samples collected | (not specified) | Continue sampling until threshold reached.
DRIFT_SIGNAL | Inject 50 samples where variant edge_delta < -5 bps | (not specified) | StrategyRegistry demotes variant if configured.

20. State & Persistence

Cold-start recovery

On restart, reload active experiments from Postgres; resume sampling from last recorded sample.

21. Concurrency & Idempotency

Aspect | Specification
Execution model | event-driven; one goroutine per active experiment
Max in-flight | 20
Idempotency key | experiment_id + sample_n
Per-call timeout (ms) | 5000
Backpressure strategy | queue
Locking / mutual exclusion | Postgres unique constraint on (experiment_id, sample_n)
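
The idempotency key and unique constraint can be exercised in isolation. The sketch below uses SQLite from the Python standard library purely as a stand-in for Postgres so that it runs without a server; the table layout and the helper names `open_store` and `record_sample` are assumptions. In Postgres the equivalent of `INSERT OR IGNORE` is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with the documented unique constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE experiment_samples (
            experiment_id  TEXT    NOT NULL,
            sample_n       INTEGER NOT NULL,
            edge_delta_bps REAL    NOT NULL,
            UNIQUE (experiment_id, sample_n)
        )
        """
    )
    return conn


def record_sample(conn: sqlite3.Connection, experiment_id: str,
                  sample_n: int, edge_delta_bps: float) -> bool:
    """Insert a sample once; a redelivered (experiment_id, sample_n) is a no-op."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO experiment_samples VALUES (?, ?, ?)",
        (experiment_id, sample_n, edge_delta_bps),
    )
    conn.commit()
    return cur.rowcount == 1  # True only when a new row was written
```

Letting the database enforce uniqueness means a redelivered report cannot double-count a sample even when two workers race on the same key.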

22. Dependencies

Depends on (must run first)

Bot | Why | Contract
internal.report_bus | Matched-pair samples are derived from OperationsReport records on the report bus. | OperationsReport must carry fill metadata.

Emits to (downstream consumers)

Bot | Why | Contract
gov.strategyregistry

Sibling bots (same OrderIntent)

Bot | Why | Contract
gov.backtester | Backtester provides replay-mode baseline data for shadow experiments. | Replay reports carry mode=replay.

External services

Service | Endpoint | SLA assumed | On failure
Internal Postgres | postgres://internal | 99.9% | Pause sampling; queue samples in memory; flush on reconnect.
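
The on-failure behaviour for Postgres (queue in memory, flush on reconnect) could look like the sketch below. `SampleBuffer` is a hypothetical class, the writer is any callable that raises `ConnectionError` while the database is unreachable, and the bounded queue with `OverflowError` is an assumption added so memory use stays capped.

```python
from collections import deque
from typing import Any, Callable


class SampleBuffer:
    """Hold samples in memory while the store is down and flush them in order."""

    def __init__(self, write: Callable[[Any], None], max_pending: int = 10_000) -> None:
        self._write = write
        self._pending: deque = deque()
        self._max_pending = max_pending  # assumption: the queue is bounded

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, sample: Any) -> int:
        """Queue a sample, then try to flush; returns how many were written."""
        if len(self._pending) >= self._max_pending:
            raise OverflowError("sample buffer full")
        self._pending.append(sample)
        return self.flush()

    def flush(self) -> int:
        """Write queued samples oldest-first; stop at the first connection error."""
        written = 0
        while self._pending:
            try:
                self._write(self._pending[0])
            except ConnectionError:
                break  # store still unreachable; keep the sample queued
            self._pending.popleft()
            written += 1
        return written
```

A sample is removed from the queue only after its write succeeds, so an outage in the middle of a flush does not drop or reorder samples.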

23. Security Surfaces

Abuse vectors considered

  • Manipulating sample data to bias experiment toward a preferred variant

Mitigations

  • Samples are immutably written to Postgres; no update path exists on experiment_samples

24. Polymarket V2 Compatibility

Aspect | Value
CLOB version | v2
Collateral asset | pUSD
EIP-712 Exchange domain version | 2
Aware of builderCode field | no
Aware of negative-risk markets | no
Multi-chain ready | no
SDK used | py-clob-client-v2
Settlement contract | CTFExchangeV2
Notes | ExperimentTracker is an internal analytics service; uses pUSD for all simulated P&L comparisons.

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field | Value
spec | 2.0.0
implementation | 0.1.0
schema | 2
released | None
planned_release | Q3-2026

Migration history

Date | From | To | Reason | Action taken
2026-04-28 | n/a | v2-spec | Spec drafted post-CLOB-V2 cutover; bot not yet implemented | Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test | Setup | Expected result
Winner not declared before min_samples_for_decision reached | samples=50, min_samples_for_decision=100 | EXPERIMENT_INSUFFICIENT_SAMPLES
Drift signal emitted when variant underperforms by >2 sigma | edge_delta=-5, sigma=2 | Drift signal sent to StrategyRegistry

Integration Tests

Test | Expected result
Full experiment lifecycle: start → sample collection → result report → drift signal | OperationsReport with event_type=EXPERIMENT_RESULT emitted

Property Tests

Property | Required behaviour
auto_promote_on_winning is gated by require_human_signoff | When require_human_signoff=true, auto-promote never fires regardless of verdict
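
Because the input space for this property is tiny, it can be checked exhaustively rather than by random sampling. In the sketch below, `should_auto_promote` is a hypothetical stand-in for the real gate, and the three verdict strings are the ones used in the reference pseudocode.

```python
from itertools import product

VERDICTS = ("variant_wins", "control_wins", "inconclusive")


def should_auto_promote(verdict: str,
                        auto_promote_on_winning: bool,
                        require_human_signoff: bool) -> bool:
    """Hypothetical gate: promote only a winning variant, and never past sign-off."""
    return (verdict == "variant_wins"
            and auto_promote_on_winning
            and not require_human_signoff)


def check_signoff_gate() -> int:
    """Assert the property for every verdict/flag combination; return cases checked."""
    cases = 0
    for verdict, auto in product(VERDICTS, (True, False)):
        assert not should_auto_promote(verdict, auto, require_human_signoff=True)
        cases += 1
    return cases
```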

27. Operational Runbook

ExperimentTracker incidents involve stalled sampling (bus unavailable) or drift signals blocking a planned promotion.

On-call actions

Alert | First step | Diagnosis | Mitigation | Escalate to
ExperimentTrackerStalled
ExperimentTrackerDriftSignal

Manual overrides

Healthcheck

/internal/health/experimenttracker → green if Postgres is reachable and at least one active experiment has received samples in the last hour; red if no samples have been recorded in 2h for any active experiment.
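
The healthcheck rule leaves a few cases open, so the sketch below is one reading of it rather than the defined behaviour. It is driven by the freshest sample across all active experiments. Three things are assumptions: the `amber` state for ages between one and two hours, `red` when Postgres is unreachable, and `amber` when no experiment is active.

```python
from typing import Sequence

GREEN_MAX_AGE_S = 60 * 60      # "in the last hour"
RED_MIN_AGE_S = 2 * 60 * 60    # "no samples recorded in 2h"


def health_status(postgres_reachable: bool,
                  newest_sample_ages_s: Sequence[float]) -> str:
    """Map store reachability and per-experiment sample ages (seconds) to a status."""
    if not postgres_reachable:
        return "red"            # assumption: an unreachable store is unhealthy
    if not newest_sample_ages_s:
        return "amber"          # assumption: no active experiments is neither state
    freshest = min(newest_sample_ages_s)
    if freshest <= GREEN_MAX_AGE_S:
        return "green"          # at least one experiment sampled within the hour
    if freshest >= RED_MIN_AGE_S:
        return "red"            # every active experiment silent for 2h or more
    return "amber"              # assumption: the gap between 1h and 2h
```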

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate | How measured | Threshold
CI computation unit tests pass | CI | 100% pass

Promote to Limited live

Gate | How measured | Threshold
End-to-end experiment with synthetic data produces correct verdict | Integration test | Pass

Promote to General live

Gate | How measured | Threshold
One production experiment completed with governance pod review | Governance review | Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement | Status
Purpose defined | ✓ done
Required inputs listed | ✓ done
Parameters defined | ✓ done
Defaults defined | ✓ done
Warning thresholds defined | ✓ done
Hard thresholds defined | ✓ done
Safe fallback defined | ✓ done
Structured output defined | ✓ done
Developer log defined | ✓ done
Plain-English explanation | ✓ done
Unit tests defined | ✓ done
Integration tests defined | ✓ done
Property tests defined | ✓ done
Failure-mode block complete | ✓ done
Reference implementation pseudocode | ✓ done
Wire examples (input + output) | ✓ done
Reason codes listed | ✓ done
Metrics & logs defined | ✓ done
State & persistence defined | ✓ done
Concurrency & idempotency defined | ✓ done
Dependencies declared | ✓ done
Security surfaces declared | ✓ done
Polymarket V2 compatibility declared | ✓ done
Version & migration history declared | ✓ done
Operational runbook defined | ✓ done
Promotion gates defined | ✓ done
Failure-injection recipes defined | ✓ done