Milestone 1: paper scaffold (structure, LaTeX revtex4-2, CI, dual licenses)
Some checks failed
noise-harness / run-noise-harness (push) Failing after 1m54s

Sentinel Research 2026-06-24 07:05:37 +02:00
parent c1dab78cc7
commit af990122d1
17 changed files with 361 additions and 1 deletions

31
.github/workflows/noise-harness.yml vendored Normal file

@ -0,0 +1,31 @@
name: noise-harness
on:
  pull_request:
  push:
    branches: [main]
jobs:
  run-noise-harness:
    runs-on: docker
    container:
      image: node:20-bookworm
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        run: |
          apt-get update -qq && apt-get install -y -qq python3 python3-pip python3-venv
          python3 -m venv .venv
          . .venv/bin/activate
          pip install -q -r experiments/requirements.txt
        if: hashFiles('experiments/requirements.txt') != ''
      - name: Run noise harness
        run: |
          if [ -f experiments/05_noise_harness.py ]; then
            . .venv/bin/activate
            python3 experiments/05_noise_harness.py
          else
            echo "noise harness script not present yet; placeholder pass"
          fi

29
.gitignore vendored Normal file

@ -0,0 +1,29 @@
# Python
__pycache__/
*.pyc
.venv/
venv/
*.egg-info/
# LaTeX
*.aux
*.bbl
*.bbl-SAVE-ERROR
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.synctex.gz
*.toc
*.run.xml
paper/main.pdf
# Large data — never commit raw market data
data/*.parquet
data/*.csv
data/raw/
# Local experiment scratch
results/raw/local-*
.DS_Store

8
AUTHORS.md Normal file

@ -0,0 +1,8 @@
# Authors
**Sentinel Research** — institutional author.
Corresponding author: Daniel Cruces (danielcruces71@gmail.com)
For questions about reproduction, data, or the audit trail, open an issue
on this repository or contact the corresponding author directly.


@ -1,2 +1,40 @@
# Lookahead Bias in Vectorized Backtesting: A Noise Harness Diagnostic
Repository for the paper documenting a lookahead-bias defect found in a
vectorized backtester (the "K12" kernel), and a noise-harness methodology
to detect this class of bug using pure Geometric Brownian Motion data.
## Structure
```
paper/         LaTeX source (revtex4-2)
experiments/   Description and specs of the 5 planned experiments
data/          Instructions to obtain the BTCUSDT 1m dataset (no large binaries in git)
results/       Raw experiment outputs (json/csv), git-tracked once produced
audit/input/   Forensic copy of the original K12 code/data/results for reproduction
```
## How to reproduce
1. Read `experiments/README.md` for the experiment list and what each one tests.
2. Read `data/README.md` to obtain the dataset (or regenerate synthetic GBM data).
3. Run the scripts under `experiments/` (CI runs the noise harness automatically
on every pull request and on every push to `main`; see `.github/workflows/noise-harness.yml`).
4. Compare your output against the reference files in `results/`.
## License — read this before reusing anything
This repository carries **two separate licenses** for two separate kinds of content:
| Content | License | File |
|---|---|---|
| Code: experiment scripts, harness, CI workflows, anything under `experiments/`, `data/`, `.github/` | **MIT** | [`LICENSE`](LICENSE) |
| Paper text and figures, anything under `paper/` | **CC-BY 4.0** | [`LICENSE-TEXT-CC-BY-4.0.md`](LICENSE-TEXT-CC-BY-4.0.md) |
Use the code freely, commercially or not, provided the copyright and permission notice is retained (MIT terms).
Reuse the paper text/figures freely, commercially or not, with attribution (CC-BY 4.0 terms).
These are independent grants — reusing the code does not require complying with CC-BY, and vice versa.
## Authorship
See [`AUTHORS.md`](AUTHORS.md).

30
data/README.md Normal file

@ -0,0 +1,30 @@
# Data
This repository does not track raw market data (see `.gitignore`). Large
binaries don't belong in a public git repo, and Binance's public API makes
the data trivially reconstructible.
## BTCUSDT 1-minute OHLCV
Used in experiments 2 and 3 (baseline and honest replication against real
data). To obtain it:
1. If `download_data.py` exists in this directory, run it — it pulls the
exact date range used in the original experiment from the public Binance
API and writes `BTCUSDT_1m.parquet`.
2. Verify the SHA-256 hash of the resulting file matches the one recorded in
`audit/input/MANIFEST.md` (forensic record of the original dataset used
when the bug was found).
```bash
sha256sum BTCUSDT_1m.parquet
```
If the hash doesn't match, the date range or Binance API response has
drifted — do not proceed with replication until it's reconciled.
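
The same check can be scripted so that it runs identically on every platform. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and because the layout of `audit/input/MANIFEST.md` is not specified here, the expected digest is assumed to be copied in by hand rather than parsed from that file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter(callable, sentinel) keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_hex):
    """Compare against a hand-copied digest, tolerating case and stray whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A replication script could call `matches_manifest("BTCUSDT_1m.parquet", expected)` and stop before running any experiment if it returns `False`.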
## Synthetic GBM data (noise harness)
Generated on the fly by `experiments/01_generate_gbm.py`. No download
needed — this is the point of using synthetic null data: it requires zero
external dependency and is perfectly reproducible from a seed.
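
As an illustration of what "reproducible from a seed" means in practice, here is a minimal sketch of a seeded GBM generator. It is not the content of `01_generate_gbm.py` (which does not exist yet): the function name, the default parameters, and the choice of the standard-library `random` module instead of numpy are all assumptions made to keep the example dependency-free.

```python
import math
import random


def generate_gbm(seed, n_steps=1000, start_price=100.0, mu=0.0, sigma=0.5,
                 dt=1.0 / 525_600):
    """Simulate one GBM price path with the exact log-normal update.

    mu and sigma are annualized; dt is one minute as a fraction of a
    365-day year. All defaults are illustrative placeholders.
    """
    rng = random.Random(seed)  # private generator, so no global state leaks in
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    prices = [start_price]
    for _ in range(n_steps):
        prices.append(prices[-1] * math.exp(drift + vol * rng.gauss(0.0, 1.0)))
    return prices
```

Calling `generate_gbm(seed=42)` twice returns the same list both times, which is the property the noise harness relies on.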

27
experiments/README.md Normal file

@ -0,0 +1,27 @@
# Experiments
Five experiments, run in order. Each script is named `0N_<name>.py`. Scripts that
don't exist yet are listed here as a spec so the CI workflow and the paper's
Section 6 stay in sync with what's actually implemented.
| # | Script | Purpose | Status |
|---|--------|---------|--------|
| 1 | `01_generate_gbm.py` | Generate pure-noise GBM price series (fixed seed, documented params) | pending |
| 2 | `02_baseline_replication.py` | Run K12 golden hyperparameters on real BTCUSDT 1m, buggy backtester → expect Sharpe ≈ 14.49 | pending — needs `audit/input/code` |
| 3 | `03_honest_replication.py` | Same hyperparameters/data, `time_machine.py` engine → expect Sharpe ≈ -0.25 | pending — needs `audit/input/code` |
| 4 | `04_noise_control.py` | Run both engines across ≥30 independent GBM seeds, compare Sharpe distributions | pending |
| 5 | `05_noise_harness.py` | CI-gating version of experiment 4: fails the build if mean Sharpe on noise falls outside a pre-registered null band | pending |
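
The gating logic of experiments 4 and 5 can be sketched with a toy example. This is not the K12 kernel or either real engine: the two "engines" below are hypothetical stand-ins (a sign-following rule with and without a one-bar leak), the input is zero-mean Gaussian log-returns standing in for GBM increments, log-returns are used directly as per-bar PnL, and the band of -0.3 to 0.3 is assumed to apply to the per-bar, non-annualized Sharpe ratio.

```python
import math
import random
import statistics


def noise_returns(seed, n_bars=2000, sigma=0.001):
    """Zero-mean Gaussian per-bar log-returns (pure noise, no signal)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_bars)]


def sharpe(pnl):
    """Per-bar (non-annualized) Sharpe ratio."""
    return statistics.fmean(pnl) / statistics.stdev(pnl)


def honest_pnl(returns):
    # The position held over bar t is decided from bar t-1 only.
    return [math.copysign(1.0, returns[t - 1]) * returns[t]
            for t in range(1, len(returns))]


def leaky_pnl(returns):
    # Toy one-bar leak: the position held over bar t peeks at bar t's own return.
    return [math.copysign(1.0, r) * r for r in returns]


def gate(engine, seeds=range(30), band=(-0.3, 0.3)):
    """Return (passed, mean_sharpe) for one engine across independent seeds."""
    per_seed = [sharpe(engine(noise_returns(seed))) for seed in seeds]
    mean_sharpe = statistics.fmean(per_seed)
    return band[0] <= mean_sharpe <= band[1], mean_sharpe
```

Under these assumptions `gate(honest_pnl)` passes and `gate(leaky_pnl)` fails, because the leaky rule earns the absolute value of every return regardless of the seed. A CI script would exit with a non-zero status when the gate fails.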
## Reproducibility rules
- Every script must take `--seed` and print it in its output.
- Every output JSON must include: seed, kernel version/hash, library versions
(numpy/pandas), and a UTC timestamp.
- No script reads from `audit/input/` directly in a way that would couple the
public reproduction path to the forensic copy — `audit/input/` is for our
own verification, not for the published reproduction instructions.
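
A minimal sketch of a script skeleton that satisfies the first two rules is shown below. The field names, the `kernel_hash` placeholder, and the helper names are assumptions for illustration; the rules above fix only which pieces of information must appear, not their JSON keys.

```python
import argparse
import json
from datetime import datetime, timezone
from importlib import metadata


def library_versions(names=("numpy", "pandas")):
    """Look up installed versions; None marks a package that is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_record(seed, kernel_hash, metrics):
    """Assemble the metadata every output JSON must carry."""
    return {
        "seed": seed,
        "kernel_hash": kernel_hash,
        "library_versions": library_versions(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }


def main(argv=None):
    parser = argparse.ArgumentParser(description="Experiment skeleton")
    parser.add_argument("--seed", type=int, required=True)
    args = parser.parse_args(argv)
    # kernel_hash and metrics are placeholders until a real experiment fills them in.
    record = build_record(args.seed, kernel_hash="unknown", metrics={})
    print(json.dumps(record, indent=2, sort_keys=True))
    return record
```

Calling `main(["--seed", "7"])` prints a JSON document whose `seed` field is 7; a real script would call `main()` with no arguments so that the seed comes from the command line.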
## Environment
Pin dependencies in `requirements.txt` (to be added alongside the first
script). CI installs from that file — see `.github/workflows/noise-harness.yml`.

39
paper/main.tex Normal file

@ -0,0 +1,39 @@
\documentclass[aps,onecolumn,nofootinbib,floatfix]{revtex4-2}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{hyperref}
\begin{document}
\title{Lookahead Bias in Vectorized Backtesting: A Noise Harness Diagnostic}
\author{Sentinel Research}
\affiliation{Sentinel Research}
\email{danielcruces71@gmail.com}
\date{\today}
\begin{abstract}
% TODO: 150-250 words. Must state: (1) the defect found (lookahead bias in a
% vectorized backtester), (2) the diagnostic method (pure-noise GBM harness),
% (3) the headline numbers (Sharpe 14.49 under the bug vs Sharpe -0.25 fixed),
% (4) why this matters for anyone running vectorized backtests at scale.
\end{abstract}
\maketitle
\input{sections/01_introduction}
\input{sections/02_related_work}
\input{sections/03_problem_formalization}
\input{sections/04_the_lookahead_bug}
\input{sections/05_noise_harness_methodology}
\input{sections/06_experimental_setup}
\input{sections/07_results}
\input{sections/08_discussion_and_conclusion}
\bibliography{references}
\end{document}

7
paper/references.bib Normal file

@ -0,0 +1,7 @@
% Bibliography for "Lookahead Bias in Vectorized Backtesting"
% Populate in Milestone 2. Suggested starting points to look up and add:
% - Bailey, D.H. & Lopez de Prado, M., "The Deflated Sharpe Ratio"
% - Bailey, D.H. et al., "Pseudo-Mathematics and Financial Charlatanism"
% - Bailey, D.H. & Lopez de Prado, M., "The Probability of Backtest Overfitting"
% - Harvey, C.R., Liu, Y., Zhu, H., "...and the Cross-Section of Expected Returns"
% - White, H., "A Reality Check for Data Snooping"


@ -0,0 +1,15 @@
\section{Introduction}
\label{sec:introduction}
% TODO content notes:
% - Motivate why backtest correctness matters: a single off-by-one index in a
% vectorized backtester can silently fabricate alpha.
% - State the concrete finding up front: a 15-hyperparameter "golden kernel"
% (K12 / Iter12) produced via genetic-algorithm search showed Sharpe 14.49
% on BTCUSDT 1m data — and the same kernel, run honestly (bar-by-bar, no
% future information), collapses to Sharpe -0.25.
% - Frame the contribution: not just "we found a bug," but a reusable
% noise-harness methodology (Section 5) that any quant team can run against
% their own backtester to detect this class of defect using pure
% Geometric Brownian Motion data with zero real signal.
% - End with a roadmap of the paper (one sentence per remaining section).


@ -0,0 +1,17 @@
\section{Related Work}
\label{sec:related-work}
% TODO content notes:
% - Lookahead bias / data leakage in backtesting: cite the standard
% references (e.g. Bailey & Lopez de Prado on backtest overfitting,
% "pseudo-mathematics" critiques, Probability of Backtest Overfitting).
% - Vectorized vs event-driven backtesting engines: tradeoffs in speed vs
% correctness; vectorized engines are more prone to index-alignment bugs
% because there is no explicit "current bar" boundary enforced by the loop.
% - Synthetic-data / null-model testing in finance: permutation tests,
% Monte Carlo null models, white-noise sanity checks as a general technique
% to detect overfit or leaky strategies before risking capital.
% - Position this paper: distinct from prior work in that it (a) documents a
% live, reproducible incident with full forensic trail, and (b) packages
% the diagnostic as a minimal, CI-runnable harness (Section 5) rather than
% a one-off statistical test.


@ -0,0 +1,18 @@
\section{Problem Formalization}
\label{sec:formalization}
% TODO content notes:
% - Define the backtest setting formally: price series P_t, signal S_t,
% position p_t, and the honesty constraint p_t = f(P_{<=t}, S_{<=t}) only
% (no access to P_{>t}).
% - Define lookahead bias precisely: any computation where p_t depends,
% directly or via a vectorized operation (e.g. shift(-1), rolling window
% misaligned by one bar, future-looking groupby), on P_{>t}.
% - Show the general shape of the bug class in vectorized code: a single
% missing .shift(1) or an inclusive/exclusive boundary error in a rolling
% window. Use abstract pseudocode here; the concrete K12 diff goes in
% Section 4.
% - State the falsifiability criterion that motivates Section 5: if a
% strategy is profitable on data with zero true signal (pure GBM noise),
% the profit must be an artifact of the backtest mechanics, not of the
% strategy logic.
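% Draft formalization of the notes above (a sketch, not final text). The
% per-bar return r_t, the leak horizon k, and the function g are notation
% introduced here as assumptions; the notes only fix P_t, S_t, and p_t.
A backtest is \emph{honest} if every position depends only on information available up to and including bar $t$,
\begin{equation}
  p_t = f\left(P_{\le t},\, S_{\le t}\right),
  \label{eq:honesty}
\end{equation}
and the position $p_t$ earns the return of the following bar, $r_{t+1} = P_{t+1}/P_t - 1$, so that the strategy return over that bar is $p_t\, r_{t+1}$. A backtest exhibits lookahead bias with horizon $k \ge 1$ if some position instead has the form
\begin{equation}
  p_t = g\left(P_{\le t+k},\, S_{\le t+k}\right),
  \label{eq:lookahead}
\end{equation}
where $g$ depends non-trivially on at least one of $P_{t+1}, \dots, P_{t+k}$.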


@ -0,0 +1,17 @@
\section{The K12 Lookahead Bug}
\label{sec:the-bug}
% TODO content notes (fill in once audit/input/ is populated and audited):
% - Identify the exact line(s) in backtester.py responsible for the leak.
% - Show a minimal before/after diff.
% - Explain mechanically why the genetic algorithm search (Iter12, 15
% hyperparameters) was able to find and exploit this leak. A GA optimizes
% whatever signal is available, including backtest-mechanics artifacts. If
% the fitness function rewards future-peeking, the search converges on
% parameters that maximize the exploit, not real predictive skill.
% - Quantify the leak's effect size on the original (real BTCUSDT) data:
% Sharpe under the bug vs Sharpe under time_machine.py (the honest engine),
% same hyperparameters, same data.
% - This section depends on the forensic files under audit/input/code/ —
% do not fill in specifics until that audit is complete (see MANIFEST.md
% for the exact commit hash and dataset hash being audited).


@ -0,0 +1,20 @@
\section{The Noise Harness Methodology}
\label{sec:noise-harness}
% TODO content notes:
% - Describe the GBM null-data generator: dS = mu*S*dt + sigma*S*dW, fixed
% seed, parameters (mu, sigma, N steps, start price) documented in
% experiments/README.md and reproduced in 01_generate_gbm.py.
% - Key property to state explicitly: apart from the known drift mu, this
% series has no exploitable structure by construction (independent
% increments, no autocorrelation edge, no regime), so with mu = 0 there is
% nothing a real strategy could legitimately learn.
% - Define the test: run the same kernel/backtester pipeline against N
% independent GBM seeds. A backtester free of lookahead bias should
% produce a Sharpe distribution centered at ~0 across seeds. A backtester
% with a leak will produce a systematically positive Sharpe regardless of
% seed, because the "edge" comes from the mechanics, not the data.
% - State this as a pass/fail CI gate: mean Sharpe over >= 30 seeds must
% fall within a pre-registered null band (e.g. -0.3 to 0.3); anything
% outside that band fails the build. This is what
% .github/workflows/noise-harness.yml is wired to enforce once
% experiments/05_noise_harness.py exists.
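% Draft equations for the notes above (a sketch under stated assumptions).
% The step size \Delta t, the standard normal draws Z_t, the per-seed Sharpe
% \mathrm{SR}_i, and the band half-width b are notation introduced here; the
% example band in the notes corresponds to b = 0.3.
The null series follows geometric Brownian motion, $dS = \mu S\,dt + \sigma S\,dW$, simulated with the exact log-normal update
\begin{equation}
  S_{t+\Delta t} = S_t \exp\!\left[\left(\mu - \tfrac{1}{2}\sigma^2\right)\Delta t + \sigma\sqrt{\Delta t}\, Z_t\right],
  \qquad Z_t \sim \mathcal{N}(0,1)\ \text{i.i.d.}
  \label{eq:gbm-update}
\end{equation}
With $\mathrm{SR}_i$ the Sharpe ratio obtained on seed $i$ of $N \ge 30$ independent seeds, the gate passes if and only if
\begin{equation}
  \left|\frac{1}{N}\sum_{i=1}^{N} \mathrm{SR}_i\right| \le b .
  \label{eq:null-band}
\end{equation}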


@ -0,0 +1,23 @@
\section{Experimental Setup}
\label{sec:experimental-setup}
% TODO content notes:
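% - Enumerate the 5 experiments. The script-level spec lives in
% experiments/README.md; the numbering below does not yet match that table
% (which starts with GBM generation and ends with the CI harness) and must
% be reconciled before this section is written: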
% 1. Baseline replication: run K12 golden hyperparameters on real BTCUSDT
% 1m data, buggy backtester, reproduce Sharpe 14.49.
% 2. Honest replication: same hyperparameters, same data, time_machine.py
% (bar-by-bar, no lookahead), reproduce Sharpe -0.25.
% 3. Noise harness on buggy backtester: same hyperparameters, >=30 GBM
% seeds, buggy backtester. Expect systematically positive Sharpe.
% 4. Noise harness on honest backtester: same setup, time_machine.py.
% Expect Sharpe distribution centered at 0.
% 5. Sensitivity check: vary the lookahead window size synthetically
% (1-bar through N-bar leak) to show Sharpe scales with leak size, not
% coincidence.
% - State software/hardware environment: pinned in env/requirements.txt,
% env/python_version.txt, env/os_info.txt under audit/input/ (forensic)
% and experiments/requirements.txt (reproduction environment going
% forward, which may differ in version but not in semantics).
% - State exactly which files are authoritative for each experiment number
% once experiments/0N_*.py scripts exist.


@ -0,0 +1,19 @@
\section{Results}
\label{sec:results}
% TODO content notes:
% - Table 1: side-by-side Sharpe (and other metrics: max drawdown, win rate,
% total return) for buggy vs honest backtester on real data. This is the
% "leak signature" headline result.
% - Figure 1 (leak_signature.png): equity curve comparison, buggy vs honest,
% same hyperparameters, real data.
% - Figure 2 (noise_control.png): histogram/distribution of Sharpe across
% >=30 GBM seeds, buggy backtester overlaid with honest backtester. The
% buggy distribution should be visibly shifted positive; the honest one
% centered near 0.
% - Figure 3 (fix_comparison.png): before/after of the actual code diff that
% fixed the leak, annotated with the Sharpe delta it caused.
% - Do not write actual numbers into this section until results/raw/*.json
% exist and have been validated against the audit/input/results/ originals
% (see MANIFEST.md hashes). Every number in this section must be traceable
% to a specific file + seed + commit.


@ -0,0 +1,22 @@
\section{Discussion and Conclusion}
\label{sec:discussion}
% TODO content notes:
% - Generalize beyond this one kernel: any vectorized backtester using
% pandas/numpy rolling/shift operations is at risk of this exact class of
% bug; it is not specific to genetic-algorithm-discovered strategies.
% - Practical recommendation: every backtesting pipeline should run the
% noise harness (Section 5) as a standing CI gate, the same way unit tests
% gate merges — not as a one-off audit.
% - Limitations: the noise harness detects backtest-mechanics leaks; it does
% NOT detect overfitting to real historical data (that is a distinct
% failure mode requiring out-of-sample / walk-forward validation,
% out of scope here).
% - Disclosure note: state plainly that this defect was found in an
% internal/proprietary research pipeline (Sentinel Research), and that
% this paper publishes the diagnostic methodology and a minimal
% reproduction, not the proprietary strategy code itself.
% - One-paragraph conclusion restating the core claim: a Sharpe of 14.49 on
% real data that collapses to -0.25 under an honest, bar-by-bar engine is
% not skill, it is a bug signature; a backtester that stays profitable on
% pure noise is exactly what a noise harness exists to catch before capital
% is at risk.

0
results/.gitkeep Normal file