6.9 ExperimentTracker

Governance Governance Service Explain PLANNED Spec started capital · Indirect P7 · Governance & replay ○ pending stub

ExperimentTracker manages shadow and limited-live A/B experiments, records matched-pair samples, computes confidence intervals, and emits a drift signal to StrategyRegistry when a variant underperforms.

v3 readiness

Track	Score	Status
Docs	27/27	done
Impl	0/15	pending
Backtest	0/4	pending
Runtime	0/8	pending

A bot is done when all four scores are done.

1. Bot Identity

Layer	Governance
Bot class	Governance Service
Authority	Explain
Status	PLANNED
Readiness	Spec started
Runs before	StrategyRegistry promotion decision
Runs after	Shadow or limited-live deployment of a strategy variant
Applies to	All strategies in shadow or limited-live experiment mode
Default mode	`shadow_only`
User-visible	no
Developer owner	Polytraders core

2. Purpose

3. Why This Bot Matters

Failure mode	Consequence
No experiment tracking	Promotions are made without statistical evidence; regressions go undetected.
Auto-promote without human sign-off	A variant with a transient winning streak is promoted before significance is established.

No worked examples on this bot yet. Worked examples are optional but strongly recommended — they turn an abstract failure mode into something a developer can verify in a fixture.

4. Required Polymarket Inputs

Input	Source	Required?	Use
None — ExperimentTracker consumes internal report bus data only	`internal`	No	N/A

5. Required Internal Inputs

Input	Source	Required?	Use
Replay-tagged OperationsReport from shadow variant	`gov.backtester`	Yes	Populate matched-pair samples for the variant.
Live OperationsReport from control strategy	`internal.report_bus`	Yes	Baseline comparison for edge and fill quality.

6. Parameter Guide

Parameter	Default	Warning	Hard	What it controls
min_samples_for_decision	`100`	`None`	`None`	Minimum matched-pair samples before a winner can be declared.
traffic_split_pct	`10`	`50`	`100`	Percentage of live traffic routed to the variant.

7. Detailed Parameter Instructions

min_samples_for_decision

What it means

Minimum matched-pair samples before a winner can be declared.

Default

{ "min_samples_for_decision": 100 }

Why this default matters

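With 100 matched pairs, the standard error of the mean edge delta is one tenth of the per-pair standard deviation, which gives a reasonably tight confidence interval for most strategies.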

Threshold logic

Condition	Action
samples < min_samples_for_decision	Do not declare winner; emit EXPERIMENT_INSUFFICIENT_SAMPLES

Developer check

if samples < p.min_samples_for_decision: emit('EXPERIMENT_INSUFFICIENT_SAMPLES')

User-facing English

The experiment needs enough data before a conclusion can be drawn.

traffic_split_pct

What it means

Percentage of live traffic routed to the variant.

Default

{ "traffic_split_pct": 10 }

Why this default matters

10% limits exposure during shadow phase.

Threshold logic

Condition	Action
traffic_split_pct > 50	WARN; require human sign-off

Developer check

if p.traffic_split_pct > 50: emit('EXPERIMENT_LARGE_SPLIT_WARN')

User-facing English

A small portion of traffic is used for the experiment.
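The two developer checks in this section can be combined into a single validation pass. The following is a minimal Python sketch, not the shipped implementation: the function name `check_params` is hypothetical, and rejecting a split outside 0..100 with a `ValueError` is an assumption about how the hard threshold is enforced. The reason codes and the 50/100 thresholds come from the Parameter Guide.

```python
from typing import List

WARN_SPLIT_PCT = 50   # warning threshold from the Parameter Guide
HARD_SPLIT_PCT = 100  # hard threshold from the Parameter Guide


def check_params(samples: int, min_samples_for_decision: int,
                 traffic_split_pct: float) -> List[str]:
    """Return the reason codes that the current state should emit."""
    # Assumption: a split outside 0..100 is rejected outright.
    if not 0 <= traffic_split_pct <= HARD_SPLIT_PCT:
        raise ValueError(f"traffic_split_pct out of range: {traffic_split_pct}")

    codes: List[str] = []
    if samples < min_samples_for_decision:
        # Not enough matched pairs to declare a winner yet.
        codes.append("EXPERIMENT_INSUFFICIENT_SAMPLES")
    if traffic_split_pct > WARN_SPLIT_PCT:
        # Large exposure to the variant; human sign-off is required.
        codes.append("EXPERIMENT_LARGE_SPLIT_WARN")
    return codes
```

Returning a list rather than emitting directly keeps the check free of side effects, so the caller decides how each code is published.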

8. Default Configuration

{
  "bot_id": "gov.experimenttracker",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "min_samples_for_decision": 100,
    "traffic_split_pct": 10,
    "auto_promote_on_winning": false,
    "require_human_signoff": true
  }
}
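One way to apply per-experiment overrides on top of the defaults above is sketched below in Python. This is an illustration under stated assumptions: `resolve_config` is a hypothetical helper, and rejecting unknown keys is a design choice made for the example, not something the spec requires.

```python
import json

# The default configuration from this section, embedded verbatim.
DEFAULT_CONFIG = json.loads("""
{
  "bot_id": "gov.experimenttracker",
  "version": "0.1.0",
  "mode": "shadow_only",
  "defaults": {
    "min_samples_for_decision": 100,
    "traffic_split_pct": 10,
    "auto_promote_on_winning": false,
    "require_human_signoff": true
  }
}
""")


def resolve_config(overrides: dict) -> dict:
    """Merge overrides onto the defaults; unknown keys are rejected (assumption)."""
    unknown = set(overrides) - set(DEFAULT_CONFIG["defaults"])
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULT_CONFIG["defaults"], **overrides}
```

For example, `resolve_config({"traffic_split_pct": 25})` keeps every other default unchanged.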

9. Implementation Flow

On experiment start, assign variant_id and record traffic_split_pct and baseline strategy slug.
For each matched pair (shadow fill vs live fill), record edge, slippage, and fill quality in pUSD.
Compute running confidence intervals on edge delta between variant and control.
When samples >= min_samples_for_decision and CI is significant, emit EXPERIMENT_RESULT report.
If variant underperforms control by > 2 sigma, emit drift signal to StrategyRegistry.
If require_human_signoff=true, block auto-promote even when variant wins.
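The "running confidence intervals" step can be maintained incrementally, without re-reading every sample on each update. The sketch below uses Welford's online algorithm for the mean and variance and a normal approximation (z = 1.96) for the 95% interval. Both choices are assumptions made for illustration; the spec does not say which estimator `computeCI95` uses, and `RunningCI` is a hypothetical name.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RunningCI:
    """Welford accumulator for the mean and variance of edge_delta_bps."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, delta_bps: float) -> None:
        """Fold one matched-pair delta into the running statistics."""
        self.n += 1
        d = delta_bps - self.mean
        self.mean += d / self.n
        self.m2 += d * (delta_bps - self.mean)

    def ci95(self) -> Tuple[float, float]:
        """Normal-approximation 95% interval for the mean delta."""
        if self.n < 2:
            raise ValueError("need at least two samples")
        sample_var = self.m2 / (self.n - 1)
        std_err = math.sqrt(sample_var / self.n)
        half_width = 1.96 * std_err
        return (self.mean - half_width, self.mean + half_width)
```

Each `add` call is O(1), so the interval can be refreshed after every sample while the full evaluation remains O(S).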

10. Reference Implementation

Pseudocode is language-agnostic. FETCH = read input. EMIT = produce output. IF/THEN/ELSE = decision. Translate directly to TypeScript, Python, Go, or Rust.

// ---- EXPERIMENT START ----
FUNCTION startExperiment(config):
  exp = {id: generateULID(), variant: config.variant_slug,
         control: config.control_slug, samples: [], started_at: now()}
  postgres.insert('experiments', exp)
  EMIT OperationsReport(event_type='EXPERIMENT_STARTED', experiment_id=exp.id)

// ---- SAMPLE RECORDING ----
FUNCTION recordSample(variantFill, controlFill, experimentId):
  delta_bps = (variantFill.edge_pusd - controlFill.edge_pusd) / controlFill.notional * 10000
  postgres.insert('experiment_samples', {experiment_id: experimentId,
    variant_fill_pusd: variantFill.size_pusd,
    control_fill_pusd: controlFill.size_pusd,
    edge_delta_bps: delta_bps, recorded_at: now()})

// ---- RESULT EVALUATION ----
FUNCTION evaluateExperiment(experimentId):
  exp = postgres.select('experiments', WHERE id=experimentId)
  samples = postgres.select('experiment_samples', WHERE experiment_id=experimentId)
  IF len(samples) < config.min_samples_for_decision:
    EMIT OperationsReport(event_type='EXPERIMENT_INSUFFICIENT_SAMPLES', experiment_id=experimentId)
    RETURN
  ci = computeCI95(samples)  // 95% CI on the mean edge_delta_bps
  verdict = 'variant_wins' IF ci.low > 0 ELSE 'control_wins' IF ci.high < 0 ELSE 'inconclusive'
  EMIT OperationsReport(event_type='EXPERIMENT_RESULT', experiment_id=experimentId,
    verdict=verdict, ci_95_low=ci.low, ci_95_high=ci.high)
  IF verdict == 'control_wins':
    // experimentId is a plain ID; the variant slug lives on the experiment record
    strategyRegistry.sendDriftSignal(exp.variant)

SDK calls used

postgres.insert('experiments', exp)
postgres.select('experiments', ...)
postgres.select('experiment_samples', ...)
strategyRegistry.sendDriftSignal(slug)

Complexity: O(S) per evaluation where S = sample count
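The sample and verdict rules in the pseudocode can be written as small pure functions, which makes them easy to test without a database. The Python sketch below is illustrative rather than the reference implementation: `edge_delta_bps`, `ci95`, and `evaluate` are hypothetical names, the zero-notional guard is an added assumption, and the interval again uses a normal approximation (z = 1.96).

```python
import math
import statistics
from typing import Optional, Sequence, Tuple


def edge_delta_bps(variant_edge_pusd: float, control_edge_pusd: float,
                   control_notional_pusd: float) -> float:
    """Edge difference, normalised by control notional, in basis points."""
    if control_notional_pusd <= 0:
        # Assumption: a non-positive notional is invalid input.
        raise ValueError("control notional must be positive")
    return (variant_edge_pusd - control_edge_pusd) / control_notional_pusd * 10_000


def ci95(deltas: Sequence[float]) -> Tuple[float, float]:
    """Normal-approximation 95% interval for the mean of deltas (bps)."""
    mean = statistics.fmean(deltas)
    std_err = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return (mean - 1.96 * std_err, mean + 1.96 * std_err)


def evaluate(deltas: Sequence[float],
             min_samples_for_decision: int = 100
             ) -> Tuple[str, Optional[str], bool]:
    """Return (event_type, verdict, send_drift_signal)."""
    if len(deltas) < min_samples_for_decision:
        return ("EXPERIMENT_INSUFFICIENT_SAMPLES", None, False)
    low, high = ci95(deltas)
    if low > 0:
        verdict = "variant_wins"      # whole interval above zero
    elif high < 0:
        verdict = "control_wins"      # whole interval below zero
    else:
        verdict = "inconclusive"      # interval straddles zero
    # A drift signal is sent only when the control wins.
    return ("EXPERIMENT_RESULT", verdict, verdict == "control_wins")
```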

11. Wire Examples

Input — what arrives on the wire

{
  "label": "Matched-pair sample",
  "source": "internal.report_bus",
  "payload": {
    "experiment_id": "exp_sports_v2",
    "variant_fill_pusd": 430.0,
    "control_fill_pusd": 415.0,
    "recorded_at_ms": 1746792060000
  }
}
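A consumer might decode the input message above into a typed record before storing it. The sketch below assumes the message arrives as a JSON string in exactly the shape shown; `MatchedPairSample` and `parse_sample` are hypothetical names, and rejecting any other `source` value is an assumption.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MatchedPairSample:
    experiment_id: str
    variant_fill_pusd: float
    control_fill_pusd: float
    recorded_at: datetime  # timezone-aware, UTC


def parse_sample(raw: str) -> MatchedPairSample:
    """Decode one matched-pair message taken from the report bus."""
    msg = json.loads(raw)
    if msg.get("source") != "internal.report_bus":
        # Assumption: samples are accepted from the report bus only.
        raise ValueError(f"unexpected source: {msg.get('source')!r}")
    payload = msg["payload"]
    return MatchedPairSample(
        experiment_id=str(payload["experiment_id"]),
        variant_fill_pusd=float(payload["variant_fill_pusd"]),
        control_fill_pusd=float(payload["control_fill_pusd"]),
        # recorded_at_ms is epoch milliseconds.
        recorded_at=datetime.fromtimestamp(
            payload["recorded_at_ms"] / 1000, tz=timezone.utc),
    )
```

The frozen dataclass mirrors the append-only treatment of samples described under Security Surfaces: once parsed, a record cannot be mutated in place.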

Output — what the bot emits

{
  "label": "OperationsReport — EXPERIMENT_RESULT",
  "payload": {
    "report_id": "ops_exp_01HX9Z",
    "event_type": "EXPERIMENT_RESULT",
    "verdict": "variant_wins",
    "ci_95_low": 1.1,
    "ci_95_high": 5.3,
    "report_kind": "OperationsReport",
    "topic": "polytraders.reports.operations"
  }
}

12. Decision Logic

APPROVE

Not applicable — ExperimentTracker records statistical outcomes; it does not approve promotions.

RESHAPE_REQUIRED

Not applicable.

REJECT

Emits drift signal if variant underperforms; StrategyRegistry handles demotion.

WARNING_ONLY

EXPERIMENT_LARGE_SPLIT_WARN when traffic_split_pct > 50.

13. Standard Decision Output

This bot returns an OperationsReport object. See OperationsReport schema.

{
  "report_id": "ops_experimenttracker_01HX9Z",
  "bot_id": "gov.experimenttracker",
  "event_type": "EXPERIMENT_RESULT",
  "experiment_id": "exp_sports_v2",
  "variant_slug": "sports-model-v2",
  "control_slug": "sports-model",
  "samples": 150,
  "edge_delta_bps": 3.2,
  "ci_95_low": 1.1,
  "ci_95_high": 5.3,
  "verdict": "variant_wins",
  "report_kind": "OperationsReport",
  "topic": "polytraders.reports.operations"
}

14. Reason Codes

Code	Severity	Meaning	Action
`EXPERIMENT_STARTED`	INFO	A new experiment was registered.	Log and emit OperationsReport.
`EXPERIMENT_RESULT`	INFO	Experiment concluded with a statistical verdict.	Emit OperationsReport; optionally trigger promotion flow.
`EXPERIMENT_INSUFFICIENT_SAMPLES`	WARN	Insufficient samples to declare a winner.	Continue sampling.
`EXPERIMENT_LARGE_SPLIT_WARN`	WARN	traffic_split_pct > 50%; high exposure to variant.	Emit WARN; require human sign-off.
`EXPERIMENT_STALLED`	WARN	Report bus unavailable; sampling paused.	Pause experiment; emit alert.

15. Metrics & Logs

Metrics emitted

Metric	Type	Unit	Labels	Meaning
`polytraders_gov_experimenttracker_experiments_total`	counter	count	verdict	Total experiments completed by verdict.
`polytraders_gov_experimenttracker_samples_total`	counter	count	experiment_id	Total matched-pair samples recorded.
`polytraders_gov_experimenttracker_edge_delta_bps`	gauge	bps	experiment_id	Running edge delta between variant and control.
`polytraders_gov_experimenttracker_drift_signals_total`	counter	count	slug	Total drift signals sent to StrategyRegistry.

Alerts

Alert	Condition	Severity	Runbook
`ExperimentTrackerStalled`	`rate(polytraders_gov_experimenttracker_samples_total[30m]) == 0`	P2	#runbook-experimenttracker-stalled
`ExperimentTrackerDriftSignal`	`rate(polytraders_gov_experimenttracker_drift_signals_total[10m]) > 0`	P2	#runbook-experimenttracker-drift

16. Developer Reporting

{
  "bot_id": "gov.experimenttracker",
  "event_type": "SAMPLE_RECORDED",
  "experiment_id": "exp_sports_v2",
  "sample_n": 47,
  "variant_fill_pusd": 430.0,
  "control_fill_pusd": 415.0,
  "edge_delta_bps": 3.6
}

17. Plain-English Reporting

Situation	User-facing explanation
Experiment concluded with winning variant	The new strategy version performed better in testing and has been flagged for promotion review.
Insufficient samples	The experiment is still collecting data. No conclusion yet.

18. Failure-Mode Block

main_failure_mode	Report bus is unavailable; matched-pair samples cannot be collected, stalling the experiment.
false_positive_risk	Small sample size produces a false winner due to variance.
false_negative_risk	A genuinely better variant fails to reach significance within the experiment window.
safe_fallback	If the report bus is unavailable, pause sample collection and emit an EXPERIMENT_STALLED warning.
required_dependencies	internal.report_bus, gov.strategyregistry

19. Failure-Injection Recipes

Scenario	How to inject	Recovery
`REPORT_BUS_UNAVAILABLE`	Block reads from internal.report_bus	Automatic resume when bus is reachable.
`INSUFFICIENT_SAMPLES`	Set min_samples_for_decision=1000 with only 50 samples collected	Continue sampling until threshold reached.
`DRIFT_SIGNAL`	Inject 50 samples where variant edge_delta < -5 bps	StrategyRegistry demotes variant if configured.

20. State & Persistence

Cold-start recovery

On restart, reload active experiments from Postgres; resume sampling from last recorded sample.

21. Concurrency & Idempotency

Aspect	Specification
Execution model	`event-driven; one goroutine per active experiment`
Max in-flight	`20`
Idempotency key	`experiment_id + sample_n`
Per-call timeout (ms)	`5000`
Backpressure strategy	`queue`
Locking / mutual exclusion	`Postgres unique constraint on (experiment_id, sample_n)`
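The idempotency rule above (one row per `experiment_id + sample_n`, enforced by a unique constraint) can be demonstrated with an in-memory database. The sketch below uses SQLite from the Python standard library purely as a stand-in for Postgres so that it runs without a server; the table layout and function names are assumptions. On Postgres the equivalent statement would use `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with the unique key from the spec."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE experiment_samples (
               experiment_id  TEXT    NOT NULL,
               sample_n       INTEGER NOT NULL,
               edge_delta_bps REAL    NOT NULL,
               UNIQUE (experiment_id, sample_n)
           )"""
    )
    return conn


def record_sample(conn: sqlite3.Connection, experiment_id: str,
                  sample_n: int, edge_delta_bps: float) -> bool:
    """Insert once per (experiment_id, sample_n); True if a row was written."""
    cur = conn.execute(
        # SQLite spelling; a duplicate key is skipped instead of raising.
        "INSERT OR IGNORE INTO experiment_samples VALUES (?, ?, ?)",
        (experiment_id, sample_n, edge_delta_bps),
    )
    conn.commit()
    return cur.rowcount == 1
```

A redelivered sample therefore leaves the stored value untouched, which also fits the rule that `experiment_samples` has no update path.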

22. Dependencies

Depends on (must run first)

Bot	Why	Contract
`internal.report_bus`	Matched-pair samples are derived from OperationsReport records on the report bus.	OperationsReport must carry fill metadata.

Emits to (downstream consumers)

Bot	Why	Contract
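gov.strategyregistry	Receives the drift signal when a variant underperforms control.	Drift signal identifies the variant by slug.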

Sibling bots (same OrderIntent)

Bot	Why	Contract
gov.backtester	Backtester provides replay-mode baseline data for shadow experiments.	Replay reports carry mode=replay.

External services

Service	Endpoint	SLA assumed	On failure
Internal Postgres	postgres://internal	99.9%	Pause sampling; queue samples in memory; flush on reconnect.

23. Security Surfaces

Abuse vectors considered

Manipulating sample data to bias experiment toward a preferred variant

Mitigations

Samples are immutably written to Postgres; no update path exists on experiment_samples

24. Polymarket V2 Compatibility

Aspect	Value
CLOB version	`v2`
Collateral asset	`pUSD`
EIP-712 Exchange domain version	`2`
Aware of builderCode field	no
Aware of negative-risk markets	no
Multi-chain ready	no
SDK used	`py-clob-client-v2`
Settlement contract	`CTFExchangeV2`
Notes	`ExperimentTracker is an internal analytics service; uses pUSD for all simulated P&L comparisons.`

API surfaces declared

internal

Networks supported

polygon

25. Versioning & Migration

Field	Value
spec	`2.0.0`
implementation	`0.1.0`
schema	`2`
released	`None`
planned_release	`Q3-2026`

Migration history

Date	From	To	Reason	Action taken
2026-04-28	n/a	v2-spec	Spec drafted post-CLOB-V2 cutover; bot not yet implemented	Designed against V2 schema (pUSD, builder codes, V2 EIP-712 domain)

26. Acceptance Tests

Unit Tests

Test	Setup	Expected result
Winner not declared before min_samples_for_decision reached	samples=50, min_samples_for_decision=100	EXPERIMENT_INSUFFICIENT_SAMPLES
Drift signal emitted when variant underperforms by >2 sigma	edge_delta=-5, sigma=2	Drift signal sent to StrategyRegistry

Integration Tests

Test	Expected result
Full experiment lifecycle: start → sample collection → result report → drift signal	OperationsReport with event_type=EXPERIMENT_RESULT emitted

Property Tests

Property	Required behaviour
auto_promote_on_winning is gated by require_human_signoff	When require_human_signoff=true, auto-promote never fires regardless of verdict
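Because the input space for this property is tiny, it can be checked exhaustively rather than sampled. The Python sketch below is one possible reading of the gate, with hypothetical names (`should_auto_promote`, `check_signoff_gate`); the behaviour when `require_human_signoff=false` is an assumption, since the spec only defines the blocked case.

```python
from itertools import product

VERDICTS = ("variant_wins", "control_wins", "inconclusive")


def should_auto_promote(verdict: str, auto_promote_on_winning: bool,
                        require_human_signoff: bool) -> bool:
    """Decide whether auto-promotion may fire for this verdict."""
    if require_human_signoff:
        return False  # the sign-off gate always wins
    # Assumption: without the gate, only a winning variant is promoted.
    return auto_promote_on_winning and verdict == "variant_wins"


def check_signoff_gate() -> int:
    """Check every verdict/flag combination; return how many were checked."""
    checked = 0
    for verdict, auto in product(VERDICTS, (True, False)):
        assert should_auto_promote(verdict, auto, True) is False
        checked += 1
    return checked
```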

27. Operational Runbook

ExperimentTracker incidents involve stalled sampling (bus unavailable) or drift signals blocking a planned promotion.

On-call actions

Alert	First step	Diagnosis	Mitigation	Escalate to
`ExperimentTrackerStalled`
`ExperimentTrackerDriftSignal`

Manual overrides

Healthcheck

/internal/health/experimenttracker → green if Postgres is reachable and at least one active experiment has received samples in the last hour; red if no samples have been recorded for any active experiment in the last 2 h.

28. Promotion Gates

A bot does not advance to the next readiness state until every gate below is green. Gates are observable from production data — no subjective sign-off.

Promote to Shadow

Gate	How measured	Threshold
CI computation unit tests pass	CI	100% pass

Promote to Limited live

Gate	How measured	Threshold
End-to-end experiment with synthetic data produces correct verdict	Integration test	Pass

Promote to General live

Gate	How measured	Threshold
One production experiment completed with governance pod review	Governance review	Pass

29. Developer Checklist

Ready-to-ship score: 27/27 sections complete · 100%

Requirement	Status
Purpose defined	✓ done
Required inputs listed	✓ done
Parameters defined	✓ done
Defaults defined	✓ done
Warning thresholds defined	✓ done
Hard thresholds defined	✓ done
Safe fallback defined	✓ done
Structured output defined	✓ done
Developer log defined	✓ done
Plain-English explanation	✓ done
Unit tests defined	✓ done
Integration tests defined	✓ done
Property tests defined	✓ done
Failure-mode block complete	✓ done
Reference implementation pseudocode	✓ done
Wire examples (input + output)	✓ done
Reason codes listed	✓ done
Metrics & logs defined	✓ done
State & persistence defined	✓ done
Concurrency & idempotency defined	✓ done
Dependencies declared	✓ done
Security surfaces declared	✓ done
Polymarket V2 compatibility declared	✓ done
Version & migration history declared	✓ done
Operational runbook defined	✓ done
Promotion gates defined	✓ done
Failure-injection recipes defined	✓ done
