Polytraders Dev Guide
internal

Stage 10

Replay, backtesting, and validation

The platform becomes trustworthy when the same raw inputs recreate the same state, the same decisions, and the same simulated fills.

Challenge we are solving

Most backtests tell comforting stories because they skip real state, real execution, or version changes. We need validation under the same rules as production.

What this stage does

Replays the event log, rebuilds canonical state, regenerates signals and scores, compares predicted behaviour against realised behaviour, and reports parity, calibration error, slippage error, Brier score, Sharpe ratio, and maximum drawdown.

Why this stage exists

This is how we separate genuine edge from wishful backtests. No promotion to runtime-live without parity.

Flow

Raw logs: stage 3 event log
Canonical: rebuild state
Signals: regenerate
p_hat: score again
q_exec: estimate again
Decision: same input → same output
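
The flow above depends on replay being a pure function of the event log. As a minimal sketch of that property, assuming a hypothetical `LoggedEvent` shape and a toy `CanonicalState` (neither is taken from the reference packages), state is rebuilt by folding events in `eventId` order, with no wall-clock reads and no randomness:

```typescript
// Hypothetical event and state shapes, for illustration only.
interface LoggedEvent {
  eventId: number;   // monotonically increasing id from the event log
  tsMs: number;      // event timestamp from the log, never Date.now()
  marketId: string;
  price: number;     // 0..1
}

interface CanonicalState {
  lastEventId: number;
  lastPrice: Record<string, number>;
}

// Pure reducer: the next state depends only on the previous state and the event.
function applyEvent(state: CanonicalState, ev: LoggedEvent): CanonicalState {
  return {
    lastEventId: ev.eventId,
    lastPrice: { ...state.lastPrice, [ev.marketId]: ev.price },
  };
}

// Replay sorts by eventId first so the input array order cannot change the result.
function replay(events: readonly LoggedEvent[]): CanonicalState {
  const ordered = events.slice().sort((a, b) => a.eventId - b.eventId);
  let state: CanonicalState = { lastEventId: -1, lastPrice: {} };
  for (let i = 0; i < ordered.length; i++) {
    state = applyEvent(state, ordered[i]);
  }
  return state;
}
```

Because `replay` reads nothing outside its argument, running it twice on the same log yields the same state, which is the precondition for the parity check later in this stage.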

What the backend should expose

  • replay_job_id, time window
  • model_version / signal_version pinned
  • hash_parity % (replay vs live)
  • fill_error, slippage_error (replay vs realised)
  • Brier score, calibration plot
  • Sharpe, max drawdown (synthetic, not forecast)
  • verdict (parity-clean · parity-diverged · model-drift)
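
One way to picture the fields above is as a single report record. The sketch below is an assumption about shape only: the field names are adapted from the list, the types are guesses, and `classifyVerdict` is a hypothetical helper whose rule for `model-drift` (Brier above the promotion limit) is illustrative and not confirmed by this guide.

```typescript
// The three verdict values come from the list above.
type Verdict = "parity-clean" | "parity-diverged" | "model-drift";

// Hypothetical report shape; types and casing are assumptions.
interface ReplayReport {
  replayJobId: string;
  window: { from: string; to: string }; // ISO-8601 timestamps
  modelVersion: string;                 // pinned
  signalVersion: string;                // pinned
  hashParityPct: number;                // 0..100, replay vs live
  fillError: number;                    // replay vs realised
  slippageError: number;                // replay vs realised
  brier: number;
  sharpeSynthetic: number;              // synthetic, not a forecast
  maxDrawdownUsd: number;               // synthetic, not a forecast
  verdict: Verdict;
}

// Illustrative rule: any hash divergence wins; otherwise a Brier score
// above the limit is treated as drift. The drift rule is an assumption.
function classifyVerdict(
  hashParityPct: number,
  brier: number,
  brierLimit: number = 0.2,
): Verdict {
  if (hashParityPct < 100) return "parity-diverged";
  if (brier > brierLimit) return "model-drift";
  return "parity-clean";
}
```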

Maths we expect here

Every formula below is implemented in packages/polytraders-bots/ or packages/polytraders-runner/. Treat the worked example as the unit-test sanity check you should be able to reproduce locally.

1. Brier score (calibration quality)

\[\mathrm{Brier} = \tfrac{1}{N}\sum_{i=1}^{N}(\hat p_i - y_i)^2\]

Symbol | Meaning | Units / range
\(\hat p_i\) | Predicted probability for resolved market i | 0..1
\(y_i\) | Realised outcome | {0, 1}
\(N\) | Number of resolved markets in sample | count

Worked example: \[N=500,\; \mathrm{Brier}=0.181\]

Lower is better. A constant forecast of 0.5 for every market scores 0.25. Promotion requires a Brier score ≤ 0.20 over a rolling 500-market sample.
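
The Brier formula translates directly into a few lines of code. The sketch below is not the implementation in the reference packages; `brierScore` and its argument names are hypothetical.

```typescript
// Mean squared difference between predicted probabilities and 0/1 outcomes.
// predicted[i] is p_hat for resolved market i; outcomes[i] is 0 or 1.
function brierScore(
  predicted: readonly number[],
  outcomes: readonly number[],
): number {
  const n = predicted.length;
  if (n === 0 || n !== outcomes.length) {
    throw new Error("predicted and outcomes must be non-empty and equal length");
  }
  let sum = 0;
  for (let i = 0; i < n; i++) {
    const diff = predicted[i] - outcomes[i];
    sum += diff * diff;
  }
  return sum / n;
}
```

As a sanity check, forecasting 0.5 on every market gives exactly 0.25 regardless of outcomes, and a perfect forecaster gives 0.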

2. Sharpe ratio (annualised, synthetic)

\[\mathrm{Sharpe} = \frac{\mathrm{mean}(r) - r_f}{\mathrm{std}(r)} \cdot \sqrt{periods\_per\_year}\]

Symbol | Meaning | Units / range
\(r\) | Per-period return series (daily PnL / capital) | fraction
\(r_f\) | Risk-free rate per period | fraction
\(periods\_per\_year\) | Annualisation factor (252 daily, 12 monthly) | periods per year

Worked example (with \(r_f = 0\)): \[\mathrm{mean}(r)=0.0009/\text{day},\; \mathrm{std}(r)=0.012,\; \sqrt{252}\approx 15.87 \;\Rightarrow\; \mathrm{Sharpe} \approx 1.19\]

Labelled as synthetic everywhere — backtest Sharpe is not a forecast. Promotion criteria use Sharpe alongside calibration and drawdown, never alone.
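
A minimal sketch of the annualised Sharpe calculation follows. The function names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the formula above does not say which estimator the reference code uses.

```typescript
// Sharpe from already-computed moments; mirrors the formula term by term.
function sharpeFromMoments(
  meanReturn: number,
  stdReturn: number,
  riskFreePerPeriod: number,
  periodsPerYear: number,
): number {
  if (!(stdReturn > 0)) throw new Error("std(r) must be positive");
  return ((meanReturn - riskFreePerPeriod) / stdReturn) * Math.sqrt(periodsPerYear);
}

// Sharpe from a per-period return series.
function annualisedSharpe(
  returns: readonly number[],
  riskFreePerPeriod: number = 0,
  periodsPerYear: number = 252,
): number {
  const n = returns.length;
  if (n < 2) throw new Error("need at least two returns");
  let sum = 0;
  for (let i = 0; i < n; i++) sum += returns[i];
  const mean = sum / n;
  let squares = 0;
  for (let i = 0; i < n; i++) {
    const d = returns[i] - mean;
    squares += d * d;
  }
  const std = Math.sqrt(squares / (n - 1)); // sample std: an assumption
  return sharpeFromMoments(mean, std, riskFreePerPeriod, periodsPerYear);
}
```

Calling `sharpeFromMoments(0.0009, 0.012, 0, 252)` should land close to the 1.19 in the worked example, which assumes a zero risk-free rate.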

3. Maximum drawdown

\[\mathrm{MaxDD} = \max_{t}\big(\;\max_{s \le t} equity_s - equity_t\;\big)\]

Symbol | Meaning | Units / range
\(equity_t\) | Equity curve value at time t | USD

Worked example: \[peak=\$210{,}000,\; trough=\$184{,}800 \;\Rightarrow\; \mathrm{MaxDD} = \$25{,}200\;(12.0\%)\]
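
The maximum drawdown formula can be evaluated in a single pass by tracking the running peak. This is a sketch with a hypothetical function name; the percentage is taken relative to the peak that preceded the largest drop, which matches the worked example (25,200 / 210,000).

```typescript
// Single pass over the equity curve: track the running peak and the
// largest peak-to-current drop seen so far.
function maxDrawdown(equity: readonly number[]): { absolute: number; fraction: number } {
  let peak = -Infinity;
  let worst = 0;
  let peakAtWorst = 0;
  for (let i = 0; i < equity.length; i++) {
    const value = equity[i];
    if (value > peak) peak = value;
    const drop = peak - value;
    if (drop > worst) {
      worst = drop;
      peakAtWorst = peak;
    }
  }
  // fraction is the drop relative to the peak it fell from
  return { absolute: worst, fraction: peakAtWorst > 0 ? worst / peakAtWorst : 0 };
}
```

For an equity curve that peaks at 210,000 and later troughs at 184,800, this returns an absolute drawdown of 25,200 and a fraction of 0.12.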

4. Parity hash check

\[\text{parity} \iff \mathrm{SHA256}(replay\_state) = \mathrm{SHA256}(live\_state)\]

Symbol | Meaning | Units / range
\(replay\_state\) | Canonical state recomputed from the event log | struct
\(live\_state\) | Canonical state observed live at the same event_id | struct

Worked example: \[hash(replay)=\mathtt{0x9a1d\ldots},\;hash(live)=\mathtt{0x9a1d\ldots} \;\Rightarrow\; \text{parity clean}\]

Any divergence blocks promotion. The most common cause is a wall-clock dependency or an un-pinned model_version.
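
A hash comparison is only meaningful if both sides serialise state identically. The sketch below assumes the canonical state is JSON-like and uses a sorted-key serialisation before hashing; the serialisation scheme and the function names are assumptions, not the reference engine's method. `createHash` comes from Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted so that key insertion order
// cannot change the resulting bytes.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const parts = keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k]));
  return "{" + parts.join(",") + "}";
}

// SHA-256 of the canonical serialisation, as lowercase hex.
function stateHash(state: Json): string {
  return createHash("sha256").update(canonicalJson(state), "utf8").digest("hex");
}

// Parity holds exactly when the two hashes match.
function parityClean(replayState: Json, liveState: Json): boolean {
  return stateHash(replayState) === stateHash(liveState);
}
```

With this approach, two states that differ only in key order hash identically, while any value difference (for example a timestamp taken from the wall clock) produces a different hash and a diverged verdict.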

5. Slippage prediction error

\[\text{slip\_err}_i = slippage^{realised}_i - slippage^{predicted}_i,\quad RMSE = \sqrt{\tfrac{1}{N}\sum_i \text{slip\_err}_i^2}\]

Symbol | Meaning | Units / range
\(slippage^{predicted}\) | From stage 6 ExecPlan | 0..1
\(slippage^{realised}\) | Computed from the actual fills | 0..1

Worked example: \[N=120,\; RMSE = 0.0011\;(\approx 11\,bps)\]

Stage 11 alerts if slip RMSE exceeds the promotion budget. This is how undetected adverse selection surfaces.
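
The slippage RMSE can be sketched as follows. The function names are hypothetical, and both inputs are assumed to be fractions in 0..1 as in the symbol table, so multiplying by 10,000 converts to basis points.

```typescript
// Root-mean-square of (realised - predicted) slippage across N fills.
function slippageRmse(
  realised: readonly number[],
  predicted: readonly number[],
): number {
  const n = realised.length;
  if (n === 0 || n !== predicted.length) {
    throw new Error("realised and predicted must be non-empty and equal length");
  }
  let squares = 0;
  for (let i = 0; i < n; i++) {
    const err = realised[i] - predicted[i];
    squares += err * err;
  }
  return Math.sqrt(squares / n);
}

// Convert a fractional slippage value to basis points (1 bp = 0.0001).
function toBps(fraction: number): number {
  return fraction * 10000;
}
```

An RMSE of 0.0011 corresponds to roughly 11 bps, as in the worked example.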

How a developer codes this stage

Reference TypeScript implementation lives in packages/polytraders-* at the repository root. Stage owners maintain these files — read them before writing new code.

  • packages/polytraders-backtest/src/engine.js: Replay engine with pinned versions, deterministic seeds, and parity assertions.
  • packages/polytraders-synthdata/src/regimes.js: Named synthetic regimes (calm, news shock, depth crash) for stress replay.
  • packages/polytraders-runner/src/pipeline.js: The same pipeline used live, re-driven from the event log.

See it in the platform mock

The platform mock is the source of truth for what each stage's UI exposes. Open these alongside the code references.

Reason codes emitted at this stage

  • GOV_*: governance, accounting, replay, monitoring
  • STRAT_*: strategy, model, fair-value

Each reason code has a canonical short description. Full registry: /standards/reason-codes.