Stage 10

Replay, backtesting, and validation

The platform becomes trustworthy when the same raw inputs recreate the same state, the same decisions, and the same simulated fills.

Challenge we are solving

Most backtests tell comforting stories because they skip real state, real execution, or version changes. We need validation under the same rules as production.

What this stage does

Replays the event log, rebuilds canonical state, regenerates signals and scores, compares predicted behaviour against realised behaviour, and reports parity, calibration error, slippage error, Brier score, Sharpe ratio, and maximum drawdown.

Why this stage exists

This is how we separate genuine edge from wishful backtests. No promotion to runtime-live without parity.

Flow

Raw logs: stage 3 event log

Canonical: rebuild state

Signals: regenerate

p_hat: score again

q_exec: estimate again

Decision: same input → same output
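
The flow above only yields "same input → same output" if every step is a pure function of the event log. The sketch below illustrates that property with a toy pipeline. The `LogEvent`, `State`, and `Decision` shapes and the threshold rule are hypothetical stand-ins, not the shapes used in packages/polytraders-runner/.

```typescript
// Hypothetical shapes, for illustration only.
interface LogEvent { event_id: number; market: string; price: number }
interface State { lastPrice: Record<string, number> }
interface Decision { event_id: number; market: string; action: "buy" | "hold" }

// Pure state transition: no clock reads, no randomness, no I/O.
function applyEvent(state: State, ev: LogEvent): State {
  return { lastPrice: { ...state.lastPrice, [ev.market]: ev.price } };
}

// Toy rule standing in for signals -> p_hat -> q_exec -> decision.
function decide(state: State, ev: LogEvent, threshold: number): Decision {
  const action = state.lastPrice[ev.market] < threshold ? "buy" : "hold";
  return { event_id: ev.event_id, market: ev.market, action };
}

// Fold the event log through the pipeline, collecting one decision per event.
function replay(events: LogEvent[], threshold: number): Decision[] {
  let state: State = { lastPrice: {} };
  const decisions: Decision[] = [];
  for (const ev of events) {
    state = applyEvent(state, ev);
    decisions.push(decide(state, ev, threshold));
  }
  return decisions;
}
```

Because nothing in `replay` depends on wall-clock time or hidden state, running it twice over the same log produces identical decisions, which is the property the parity check relies on.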

What the backend should expose

replay_job_id, time window
model_version / signal_version pinned
hash_parity % (replay vs live)
fill_error, slippage_error (replay vs realised)
Brier score, calibration plot
Sharpe, max drawdown (synthetic, not forecast)
verdict (parity-clean · parity-diverged · model-drift)
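
One possible shape for that payload is sketched below. Field names follow the list above, while the types, the `window` layout, and the `hashParityPct` helper are assumptions made for illustration rather than the backend's actual contract.

```typescript
// Hypothetical report shape; only the field names come from the list above.
type Verdict = "parity-clean" | "parity-diverged" | "model-drift";

interface ReplayReport {
  replay_job_id: string;
  window: { start: string; end: string }; // ISO-8601 timestamps (assumed)
  model_version: string;                  // pinned
  signal_version: string;                 // pinned
  hash_parity_pct: number;                // 0..100, replay vs live
  fill_error: number;                     // replay vs realised
  slippage_error: number;                 // replay vs realised
  brier: number;
  sharpe_synthetic: number;               // synthetic, not a forecast
  max_drawdown_usd: number;               // synthetic, not a forecast
  verdict: Verdict;
}

// Percentage of checkpoints whose replay hash equals the live hash.
// Assumes both arrays are aligned on the same sequence of event_ids.
function hashParityPct(replayHashes: string[], liveHashes: string[]): number {
  if (replayHashes.length !== liveHashes.length) {
    throw new Error("checkpoint count mismatch");
  }
  if (replayHashes.length === 0) throw new Error("no checkpoints to compare");
  let matches = 0;
  for (let i = 0; i < replayHashes.length; i++) {
    if (replayHashes[i] === liveHashes[i]) matches++;
  }
  return (100 * matches) / replayHashes.length;
}
```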

Maths we expect here

Every formula below is implemented in packages/polytraders-bots/ or packages/polytraders-runner/. Treat the worked example as the unit-test sanity check you should be able to reproduce locally.

Brier score (calibration quality)

\[\mathrm{Brier} = \tfrac{1}{N}\sum_{i=1}^{N}(\hat p_i - y_i)^2\]

Symbol	Meaning	Units / range
$\hat p_i$	Predicted probability for resolved market i	0..1
$y_i$	Realised outcome	{0, 1}
$N$	Number of resolved markets in sample	count

Worked example: \[N=500,\; \mathrm{Brier}=0.181\]

Lower is better. A constant forecast of 0.5 scores 0.25. We require Brier ≤ 0.20 over a rolling 500-market sample for promotion.
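
A minimal sketch of the Brier formula follows. The function name `brierScore` is illustrative and is not taken from the reference packages.

```typescript
// Mean squared difference between predicted probabilities and realised outcomes.
function brierScore(pHat: number[], y: number[]): number {
  if (pHat.length !== y.length) throw new Error("length mismatch");
  if (pHat.length === 0) throw new Error("empty sample");
  let sum = 0;
  for (let i = 0; i < pHat.length; i++) {
    if (pHat[i] < 0 || pHat[i] > 1) throw new Error("p_hat must be in 0..1");
    if (y[i] !== 0 && y[i] !== 1) throw new Error("outcome must be 0 or 1");
    const d = pHat[i] - y[i];
    sum += d * d;
  }
  return sum / pHat.length;
}
```

Calling `brierScore([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0])` returns 0.25, matching the constant-0.5 baseline.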

Sharpe ratio (annualised, synthetic)

\[\mathrm{Sharpe} = \frac{\mathrm{mean}(r) - r_f}{\mathrm{std}(r)} \cdot \sqrt{periods\_per\_year}\]

Symbol	Meaning	Units / range
$r$	Per-period return series (daily PnL / capital)	fraction
$r_f$	Risk-free rate per period	fraction
$periods\_per\_year$	Annualisation factor (252 daily, 12 monthly)	periods/year

Worked example: \[\mathrm{mean}(r)=0.0009/\text{day},\; \mathrm{std}(r)=0.012,\; r_f=0,\; \sqrt{252}\approx 15.87 \;\Rightarrow\; \mathrm{Sharpe} = \tfrac{0.0009}{0.012}\cdot 15.87 \approx 1.19\]

Labelled as synthetic everywhere — backtest Sharpe is not a forecast. Promotion criteria use Sharpe alongside calibration and drawdown, never alone.
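
The Sharpe formula can be sketched as below. The function names are illustrative, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not say which estimator the reference code uses.

```typescript
// Annualised Sharpe from pre-computed per-period moments.
function sharpeFromMoments(
  meanR: number, stdR: number, rf: number, periodsPerYear: number
): number {
  if (stdR <= 0) throw new Error("std(r) must be positive");
  return ((meanR - rf) / stdR) * Math.sqrt(periodsPerYear);
}

// Annualised Sharpe from a per-period return series.
// Assumption: sample standard deviation (n - 1 in the denominator).
function sharpe(returns: number[], rf: number, periodsPerYear: number): number {
  const n = returns.length;
  if (n < 2) throw new Error("need at least two returns");
  const mean = returns.reduce((a, b) => a + b, 0) / n;
  const variance =
    returns.reduce((a, b) => a + (b - mean) * (b - mean), 0) / (n - 1);
  return sharpeFromMoments(mean, Math.sqrt(variance), rf, periodsPerYear);
}
```

With a mean of 0.0009, a standard deviation of 0.012, a zero risk-free rate, and 252 periods, `sharpeFromMoments` gives roughly 1.19.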

Maximum drawdown

\[\mathrm{MaxDD} = \max_{t}\big(\;\max_{s \le t} equity_s - equity_t\;\big)\]

Symbol	Meaning	Units / range
$equity_t$	Equity curve value at time t	USD

Worked example: \[\text{peak}=\$210{,}000,\; \text{trough}=\$184{,}800 \;\Rightarrow\; \mathrm{MaxDD} = \$25{,}200\;(12.0\%)\]
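
The drawdown formula reduces to a single pass that tracks the running peak. The sketch below uses illustrative names and assumes a strictly positive equity curve. The percentage it reports is the largest dollar drawdown divided by the peak it fell from, which is how the 12.0% figure is derived.

```typescript
interface Drawdown {
  abs: number;  // largest peak-to-trough drop, in the equity curve's units (USD)
  peak: number; // the running peak from which that drop is measured
  pct: number;  // abs / peak, as a fraction
}

// Single pass: keep the highest equity seen so far and the worst drop from it.
function maxDrawdown(equity: number[]): Drawdown {
  if (equity.length === 0) throw new Error("empty equity curve");
  let runningPeak = equity[0];
  let worst: Drawdown = { abs: 0, peak: runningPeak, pct: 0 };
  for (const e of equity) {
    if (e > runningPeak) runningPeak = e;
    const dd = runningPeak - e;
    if (dd > worst.abs) {
      worst = { abs: dd, peak: runningPeak, pct: dd / runningPeak };
    }
  }
  return worst;
}
```

For the curve `[200000, 210000, 190000, 184800, 205000]` the result is a 25,200 drop from the 210,000 peak, a fraction of 0.12.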

Parity hash check

\[\text{parity} \iff \mathrm{SHA256}(replay\_state) = \mathrm{SHA256}(live\_state)\]

Symbol	Meaning	Units / range
$replay\_state$	Canonical state recomputed from the event log	struct
$live\_state$	Canonical state observed live at the same event_id	struct

Worked example: \[\mathrm{SHA256}(replay\_state)=\mathtt{0x9a1d\ldots},\;\mathrm{SHA256}(live\_state)=\mathtt{0x9a1d\ldots} \;\Rightarrow\; \text{parity clean}\]

Any divergence blocks promotion. The most common cause is a wall-clock dependency or an un-pinned model_version.
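
Hashing a struct requires a byte-stable serialisation first, otherwise two equal states can hash differently. The sketch below assumes the canonical state is a plain JSON value (no `undefined`, `NaN`, or class instances) and sorts object keys before hashing. The helper names are hypothetical, and the serialisation used by the reference engine may differ.

```typescript
import { createHash } from "node:crypto";

// Serialise with object keys sorted so key insertion order cannot change the hash.
// Assumes plain JSON values only.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonicalJson).join(",") + "]";
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return (
    "{" +
    keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") +
    "}"
  );
}

// SHA-256 of the canonical serialisation, as a lowercase hex string.
function stateHash(state: unknown): string {
  return createHash("sha256").update(canonicalJson(state)).digest("hex");
}

// Parity holds when both states hash to the same digest.
function isParity(replayState: unknown, liveState: unknown): boolean {
  return stateHash(replayState) === stateHash(liveState);
}
```

Sorting keys is one design choice among several. Any scheme works as long as the replay and live paths use exactly the same one.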

Slippage prediction error

\[\text{slip\_err}_i = slippage^{realised}_i - slippage^{predicted}_i,\quad RMSE = \sqrt{\tfrac{1}{N}\sum_i \text{slip\_err}_i^2}\]

Symbol	Meaning	Units / range
$slippage^{predicted}$	From stage 6 ExecPlan	0..1
$slippage^{realised}$	Computed from the actual fills	0..1

Worked example: \[N=120,\; RMSE = 0.0011\;(\approx 11\,\text{bps})\]

Stage 11 alerts if slip RMSE exceeds the promotion budget. This is how undetected adverse selection surfaces.
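
A minimal sketch of the slippage RMSE, with an illustrative function name. Inputs are fractions in 0..1, paired per fill.

```typescript
// Root-mean-square of (realised - predicted) slippage across fills.
function slippageRmse(realised: number[], predicted: number[]): number {
  if (realised.length !== predicted.length) throw new Error("length mismatch");
  if (realised.length === 0) throw new Error("no fills");
  let sumSq = 0;
  for (let i = 0; i < realised.length; i++) {
    const err = realised[i] - predicted[i]; // slip_err_i
    sumSq += err * err;
  }
  return Math.sqrt(sumSq / realised.length);
}
```

Multiplying the result by 10,000 converts it to basis points, so 0.0011 reads as 11 bps.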

How a developer codes this stage

Reference TypeScript implementation lives in packages/polytraders-* at the repository root. Stage owners maintain these files — read them before writing new code.

packages/polytraders-backtest/src/engine.js: Replay engine with pinned versions, deterministic seeds, and parity assertions.
packages/polytraders-synthdata/src/regimes.js: Named synthetic regimes (calm, news shock, depth crash) for stress replay.
packages/polytraders-runner/src/pipeline.js: The same pipeline used live, re-driven from the event log.

See it in the platform mock

The platform mock is the source of truth for what each stage's UI exposes. Open these alongside the code references.

↗ Parity dashboard: /mock-app/replay-parity.html
↗ Replay any window: /mock-app/data-replay.html
↗ Stage 10, pipeline view: /pipelines/shadow-mode.html

Reason codes emitted at this stage

GOV_*: governance, accounting, replay, monitoring
STRAT_*: strategy, model, fair-value

Hover or tap any reason code on this page (or anywhere on the site) to see its canonical short description. Full registry: /standards/reason-codes.
