Milestone 1: paper scaffold (structure, LaTeX revtex4-2, CI, dual licenses)

2026-06-24 07:05:37 +02:00 · 2026-06-24 07:05:37 +02:00 · af990122d1
commit af990122d1
parent c1dab78cc7
17 changed files with 361 additions and 1 deletions
--- a/.github/workflows/noise-harness.yml
+++ b/.github/workflows/noise-harness.yml
@ -0,0 +1,31 @@
+name: noise-harness
+
+on:
+  pull_request:
+  push:
+    branches: [main]
+
+jobs:
+  run-noise-harness:
+    runs-on: docker
+    container:
+      image: node:20-bookworm
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        run: |
+          apt-get update -qq && apt-get install -y -qq python3 python3-pip python3-venv
+          python3 -m venv .venv
+          . .venv/bin/activate
+          pip install -q -r experiments/requirements.txt
+        if: hashFiles('experiments/requirements.txt') != ''
+
+      - name: Run noise harness
+        run: |
+          if [ -f experiments/05_noise_harness.py ]; then
+            . .venv/bin/activate
+            python3 experiments/05_noise_harness.py
+          else
+            echo "noise harness script not present yet — placeholder pass"
+          fi
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,29 @@
+# Python
+__pycache__/
+*.pyc
+.venv/
+venv/
+*.egg-info/
+
+# LaTeX
+*.aux
+*.bbl
+*.bbl-SAVE-ERROR
+*.blg
+*.fdb_latexmk
+*.fls
+*.log
+*.out
+*.synctex.gz
+*.toc
+*.run.xml
+paper/main.pdf
+
+# Large data — never commit raw market data
+data/*.parquet
+data/*.csv
+data/raw/
+
+# Local experiment scratch
+results/raw/local-*
+.DS_Store
--- a/AUTHORS.md
+++ b/AUTHORS.md
@ -0,0 +1,8 @@
+# Authors
+
+**Sentinel Research** — institutional author.
+
+Corresponding author: Daniel Cruces (danielcruces71@gmail.com)
+
+For questions about reproduction, data, or the audit trail, open an issue
+on this repository or contact the corresponding author directly.
--- a/README.md
+++ b/README.md
@ -1,2 +1,40 @@
-# lookahead-bias-paper
+# Lookahead Bias in Vectorized Backtesting: A Noise Harness Diagnostic

+Repository for the paper documenting a lookahead-bias defect found in a
+vectorized backtester (the "K12" kernel), and a noise-harness methodology
+to detect this class of bug using pure Geometric Brownian Motion data.
+
+## Structure
+
+```
+paper/        LaTeX source (revtex4-2)
+experiments/  Description and specs of the 5 planned experiments
+data/         Instructions to obtain the BTCUSDT 1m dataset (no large binaries in git)
+results/      Raw experiment outputs (json/csv), git-tracked once produced
+audit/input/  Forensic copy of the original K12 code/data/results for reproduction
+```
+
+## How to reproduce
+
+1. Read `experiments/README.md` for the experiment list and what each one tests.
+2. Read `data/README.md` to obtain the dataset (or regenerate synthetic GBM data).
+3. Run the scripts under `experiments/` (CI runs the noise harness automatically
+   on every PR, see `.github/workflows/noise-harness.yml`).
+4. Compare your output against the reference files in `results/`.
+
+## License — read this before reusing anything
+
+This repository carries **two separate licenses** for two separate kinds of content:
+
+| Content | License | File |
+|---|---|---|
+| Code: experiment scripts, harness, CI workflows, anything under `experiments/`, `data/`, `.github/` | **MIT** | [`LICENSE`](LICENSE) |
+| Paper text and figures, anything under `paper/` | **CC-BY 4.0** | [`LICENSE-TEXT-CC-BY-4.0.md`](LICENSE-TEXT-CC-BY-4.0.md) |
+
+Use the code freely, commercially or not, with attribution (MIT terms).
+Reuse the paper text/figures freely, commercially or not, with attribution (CC-BY 4.0 terms).
+These are independent grants — reusing the code does not require complying with CC-BY, and vice versa.
+
+## Authorship
+
+See [`AUTHORS.md`](AUTHORS.md).
--- a/data/README.md
+++ b/data/README.md
@ -0,0 +1,30 @@
+# Data
+
+This repository does not track raw market data (see `.gitignore`). Large
+binaries don't belong in a public git repo, and Binance's public API makes
+the data trivially reconstructible.
+
+## BTCUSDT 1-minute OHLCV
+
+Used in experiments 2 and 3 (baseline and honest replication against real
+data). To obtain it:
+
+1. If `download_data.py` exists in this directory, run it — it pulls the
+   exact date range used in the original experiment from the public Binance
+   API and writes `BTCUSDT_1m.parquet`.
+2. Verify the SHA-256 hash of the resulting file matches the one recorded in
+   `audit/input/MANIFEST.md` (forensic record of the original dataset used
+   when the bug was found).
+
+```bash
+sha256sum BTCUSDT_1m.parquet
+```
+
+If the hash doesn't match, the date range or Binance API response has
+drifted — do not proceed with replication until it's reconciled.
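The hash check in step 2 can also be scripted, which is convenient when `sha256sum` is not available. A minimal sketch, assuming the expected digest has been copied by hand from `audit/input/MANIFEST.md`; the function names `sha256_of` and `matches_manifest` are illustrative and not part of this repository:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_hex):
    """Compare a file's digest with an expected value (assumed taken from MANIFEST.md)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks matters here because a multi-year 1-minute parquet file may be too large to load into memory in one call.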
+
+## Synthetic GBM data (noise harness)
+
+Generated on the fly by `experiments/01_generate_gbm.py`. No download
+needed — this is the point of using synthetic null data: it requires zero
+external dependency and is perfectly reproducible from a seed.
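The seeded generation described above could look like the following. This is only a sketch under stated assumptions, not the pending `experiments/01_generate_gbm.py`: the function name `generate_gbm` and every default parameter value are placeholders, and it uses the standard library's `random` module rather than whatever the real script will use.

```python
import math
import random


def generate_gbm(seed, n_steps=1000, s0=100.0, mu=0.0, sigma=0.2, dt=1.0 / 525600):
    """Simulate one GBM price path with the exact log-normal update.

    All defaults are placeholder values; dt is one minute expressed in years.
    """
    rng = random.Random(seed)  # private RNG, so the path depends only on the seed
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    prices = [s0]
    for _ in range(n_steps):
        # S_{t+dt} = S_t * exp((mu - sigma^2 / 2) dt + sigma sqrt(dt) Z), Z ~ N(0, 1)
        prices.append(prices[-1] * math.exp(drift + vol * rng.gauss(0.0, 1.0)))
    return prices
```

Calling `generate_gbm(seed=7)` twice returns the same list, which is the property that makes the null data reproducible without any download.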
--- a/experiments/README.md
+++ b/experiments/README.md
@ -0,0 +1,27 @@
+# Experiments
+
+Five experiments, run in order. Each script is `0N_<name>.py`. Scripts that
+don't exist yet are listed here as a spec so the CI workflow and the paper's
+Section 6 stay in sync with what's actually implemented.
+
+| # | Script | Purpose | Status |
+|---|--------|---------|--------|
+| 1 | `01_generate_gbm.py` | Generate pure-noise GBM price series (fixed seed, documented params) | pending |
+| 2 | `02_baseline_replication.py` | Run K12 golden hyperparameters on real BTCUSDT 1m, buggy backtester → expect Sharpe ≈ 14.49 | pending — needs `audit/input/code` |
+| 3 | `03_honest_replication.py` | Same hyperparameters/data, `time_machine.py` engine → expect Sharpe ≈ -0.25 | pending — needs `audit/input/code` |
+| 4 | `04_noise_control.py` | Run both engines across ≥30 independent GBM seeds, compare Sharpe distributions | pending |
+| 5 | `05_noise_harness.py` | CI-gating version of experiment 4: fails the build if mean Sharpe on noise falls outside a pre-registered null band | pending |
+
+## Reproducibility rules
+
+- Every script must take `--seed` and print it in its output.
+- Every output JSON must include: seed, kernel version/hash, library versions
+  (numpy/pandas), and a UTC timestamp.
+- No script reads from `audit/input/` directly in a way that would couple the
+  public reproduction path to the forensic copy — `audit/input/` is for our
+  own verification, not for the published reproduction instructions.
+
+## Environment
+
+Pin dependencies in `requirements.txt` (to be added alongside the first
+script). CI installs from that file — see `.github/workflows/noise-harness.yml`.
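The reproducibility rules above could be met with a small shared helper. This is a hedged sketch: `run_metadata`, `library_version`, the JSON key names, and the `--kernel-hash` flag are all assumptions made for illustration, since no script exists yet.

```python
import argparse
import json
from datetime import datetime, timezone
from importlib import metadata


def library_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def run_metadata(seed, kernel_hash):
    """Provenance fields that every output JSON is required to carry."""
    return {
        "seed": seed,
        "kernel_hash": kernel_hash,
        "numpy": library_version("numpy"),
        "pandas": library_version("pandas"),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--kernel-hash", default="unknown")  # hypothetical flag
    args = parser.parse_args(argv)
    record = run_metadata(args.seed, args.kernel_hash)
    print(json.dumps(record, indent=2))  # the seed is printed with the output
    return record
```

For example, `main(["--seed", "42"])` prints a record whose `seed` field is 42. Making `--seed` required, with no default, means a run cannot silently fall back to an unrecorded seed.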
--- a/paper/main.tex
+++ b/paper/main.tex
@ -0,0 +1,39 @@
+\documentclass[aps,onecolumn,nofootinbib,floatfix]{revtex4-2}
+
+\usepackage{graphicx}
+\usepackage{amsmath}
+\usepackage{amssymb}
+\usepackage{hyperref}
+\usepackage{booktabs}
+
+\begin{document}
+
+\title{Lookahead Bias in Vectorized Backtesting: A Noise Harness Diagnostic}
+
+\author{Sentinel Research}
+\affiliation{Sentinel Research}
+\email{danielcruces71@gmail.com}
+
+\date{\today}
+
+\begin{abstract}
+% TODO: 150-250 words. Must state: (1) the defect found (lookahead bias in a
+% vectorized backtester), (2) the diagnostic method (pure-noise GBM harness),
+% (3) the headline numbers (Sharpe 14.49 under the bug vs Sharpe -0.25 fixed),
+% (4) why this matters for anyone running vectorized backtests at scale.
+\end{abstract}
+
+\maketitle
+
+\input{sections/01_introduction}
+\input{sections/02_related_work}
+\input{sections/03_problem_formalization}
+\input{sections/04_the_lookahead_bug}
+\input{sections/05_noise_harness_methodology}
+\input{sections/06_experimental_setup}
+\input{sections/07_results}
+\input{sections/08_discussion_and_conclusion}
+
+\bibliography{references}
+
+\end{document}
--- a/paper/references.bib
+++ b/paper/references.bib
@ -0,0 +1,7 @@
+% Bibliography for "Lookahead Bias in Vectorized Backtesting"
+% Populate in Milestone 2. Suggested starting points to look up and add:
+% - Bailey, D.H. & Lopez de Prado, M., "The Deflated Sharpe Ratio"
+% - Bailey, D.H. et al., "Pseudo-Mathematics and Financial Charlatanism"
+% - Bailey, D.H. & Lopez de Prado, M., "The Probability of Backtest Overfitting"
+% - Harvey, C.R., Liu, Y., Zhu, H., "...and the Cross-Section of Expected Returns"
+% - White, H., "A Reality Check for Data Snooping"
--- a/paper/sections/01_introduction.tex
+++ b/paper/sections/01_introduction.tex
@ -0,0 +1,15 @@
+\section{Introduction}
+\label{sec:introduction}
+
+% TODO content notes:
+% - Motivate why backtest correctness matters: a single off-by-one index in a
+%   vectorized backtester can silently fabricate alpha.
+% - State the concrete finding up front: a 15-hyperparameter "golden kernel"
+%   (K12 / Iter12) produced via genetic-algorithm search showed Sharpe 14.49
+%   on BTCUSDT 1m data — and the same kernel, run honestly (bar-by-bar, no
+%   future information), collapses to Sharpe -0.25.
+% - Frame the contribution: not just "we found a bug," but a reusable
+%   noise-harness methodology (Section 5) that any quant team can run against
+%   their own backtester to detect this class of defect using pure
+%   Geometric Brownian Motion data with zero real signal.
+% - End with a roadmap of the paper (one sentence per remaining section).
--- a/paper/sections/02_related_work.tex
+++ b/paper/sections/02_related_work.tex
@ -0,0 +1,17 @@
+\section{Related Work}
+\label{sec:related-work}
+
+% TODO content notes:
+% - Lookahead bias / data leakage in backtesting: cite the standard
+%   references (e.g. Bailey & Lopez de Prado on backtest overfitting,
+%   "pseudo-mathematics" critiques, Probability of Backtest Overfitting).
+% - Vectorized vs event-driven backtesting engines: tradeoffs in speed vs
+%   correctness; vectorized engines are more prone to index-alignment bugs
+%   because there is no explicit "current bar" boundary enforced by the loop.
+% - Synthetic-data / null-model testing in finance: permutation tests,
+%   Monte Carlo null models, white-noise sanity checks as a general technique
+%   to detect overfit or leaky strategies before risking capital.
+% - Position this paper: distinct from prior work in that it (a) documents a
+%   live, reproducible incident with full forensic trail, and (b) packages
+%   the diagnostic as a minimal, CI-runnable harness (Section 5) rather than
+%   a one-off statistical test.
--- a/paper/sections/03_problem_formalization.tex
+++ b/paper/sections/03_problem_formalization.tex
@ -0,0 +1,18 @@
+\section{Problem Formalization}
+\label{sec:formalization}
+
+% TODO content notes:
+% - Define the backtest setting formally: price series P_t, signal S_t,
+%   position p_t, and the honesty constraint p_t = f(P_{<=t}, S_{<=t}) only
+%   (no access to P_{>t}).
+% - Define lookahead bias precisely: any computation where p_t depends,
+%   directly or via a vectorized operation (e.g. shift(-1), rolling window
+%   misaligned by one bar, future-looking groupby), on P_{>t}.
+% - Show the general shape of the bug class in vectorized code: a single
+%   missing .shift(1) or an inclusive/exclusive boundary error in a rolling
+%   window. Use abstract pseudocode here; the concrete K12 diff goes in
+%   Section 4.
+% - State the falsifiability criterion that motivates Section 5: if a
+%   strategy is profitable on data with zero true signal (pure GBM noise),
+%   the profit must be an artifact of the backtest mechanics, not of the
+%   strategy logic.
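The bug class outlined in these notes can be illustrated with a toy example. This sketch is not the K12 code: the sign-of-return strategy, the `lag` parameter, and the noise parameters are all assumptions chosen to make the misalignment visible, with plain lists standing in for a pandas `.shift(1)`.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


def strategy_returns(returns, lag):
    """Per-bar P&L of a sign-of-return signal applied with the given lag.

    lag=1 is honest: the signal formed on bar t-1 trades bar t.
    lag=0 is the leak: the signal formed on bar t trades bar t itself,
    which is what a missing one-bar shift amounts to.
    """
    signals = [sign(r) for r in returns]
    return [signals[t - lag] * returns[t] for t in range(lag, len(returns))]


rng = random.Random(0)
noise = [rng.gauss(0.0, 0.001) for _ in range(10_000)]  # i.i.d. returns, no signal

leaky = sum(strategy_returns(noise, lag=0))   # equals sum(|r_t|), never negative
honest = sum(strategy_returns(noise, lag=1))  # sum of mean-zero terms
print(f"leaky={leaky:.4f} honest={honest:.4f}")
```

With `lag=0` every term is `sign(r_t) * r_t = |r_t|`, so the strategy is profitable on data that contains no signal at all. That is the falsifiability criterion in miniature.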
--- a/paper/sections/04_the_lookahead_bug.tex
+++ b/paper/sections/04_the_lookahead_bug.tex
@ -0,0 +1,17 @@
+\section{The K12 Lookahead Bug}
+\label{sec:the-bug}
+
+% TODO content notes (fill in once audit/input/ is populated and audited):
+% - Identify the exact line(s) in backtester.py responsible for the leak.
+% - Show a minimal before/after diff.
+% - Explain mechanically why the genetic algorithm search (Iter12, 15
+%   hyperparameters) was able to find and exploit this leak. A GA optimizes
+%   whatever signal is available, including backtest-mechanics artifacts; if
+%   the fitness function rewards future-peeking, the search converges on
+%   parameters that maximize the exploit, not real predictive skill.
+% - Quantify the leak's effect size on the original (real BTCUSDT) data:
+%   Sharpe under the bug vs Sharpe under time_machine.py (the honest engine),
+%   same hyperparameters, same data.
+% - This section depends on the forensic files under audit/input/code/ —
+%   do not fill in specifics until that audit is complete (see MANIFEST.md
+%   for the exact commit hash and dataset hash being audited).
--- a/paper/sections/05_noise_harness_methodology.tex
+++ b/paper/sections/05_noise_harness_methodology.tex
@ -0,0 +1,20 @@
+\section{The Noise Harness Methodology}
+\label{sec:noise-harness}
+
+% TODO content notes:
+% - Describe the GBM null-data generator: dS = mu*S*dt + sigma*S*dW, fixed
+%   seed, parameters (mu, sigma, N steps, start price) documented in
+%   experiments/README.md and reproduced in 01_generate_gbm.py.
+% - Key property to state explicitly: with mu = 0 this series has zero
+%   exploitable structure by construction: independent increments, no
+%   autocorrelation edge, no regime, nothing a strategy could legitimately learn.
+% - Define the test: run the same kernel/backtester pipeline against N
+%   independent GBM seeds. A backtester free of lookahead bias should
+%   produce a Sharpe distribution centered at ~0 across seeds. A backtester
+%   with a leak will produce a systematically positive Sharpe regardless of
+%   seed, because the "edge" comes from the mechanics, not the data.
+% - State this as a pass/fail CI gate: mean Sharpe over >= 30 seeds must
+%   fall within a pre-registered null band (e.g. -0.3 to 0.3); anything
+%   outside that band fails the build. This is what
+%   .github/workflows/noise-harness.yml is wired to enforce once
+%   experiments/05_noise_harness.py exists.
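The pass/fail gate described in these notes might be sketched as follows. Everything here is an assumption for illustration and not the pending `05_noise_harness.py`: the names `gbm_returns`, `sharpe`, `noise_gate`, and `toy_backtest`, the GBM parameters, and the use of a per-bar (not annualized) Sharpe ratio. The null band has to be calibrated to whichever Sharpe convention and sample length the real harness uses.

```python
import math
import random
import statistics


def gbm_returns(seed, n_steps=2000, mu=0.0, sigma=0.2, dt=1.0 / 525600):
    """Per-step simple returns of a GBM path (placeholder parameters)."""
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    return [math.exp(drift + vol * rng.gauss(0.0, 1.0)) - 1.0 for _ in range(n_steps)]


def sharpe(pnl):
    """Per-bar Sharpe ratio (mean over standard deviation), not annualized."""
    sd = statistics.stdev(pnl)
    return 0.0 if sd == 0 else statistics.mean(pnl) / sd


def noise_gate(backtest, seeds, band=(-0.3, 0.3)):
    """Return (passed, mean_sharpe) for a backtest run on pure-noise inputs."""
    sharpes = [sharpe(backtest(gbm_returns(seed))) for seed in seeds]
    mean_sharpe = statistics.mean(sharpes)
    return band[0] <= mean_sharpe <= band[1], mean_sharpe


def toy_backtest(lag):
    """Sign-of-return toy strategy; lag=0 leaks the traded bar, lag=1 does not."""
    def run(returns):
        signs = [(r > 0) - (r < 0) for r in returns]
        return [signs[t - lag] * returns[t] for t in range(lag, len(returns))]
    return run


seeds = range(30)
print("honest:", noise_gate(toy_backtest(lag=1), seeds))
print("leaky: ", noise_gate(toy_backtest(lag=0), seeds))
```

A CI script would exit with a non-zero status when the first element of the returned tuple is `False`. The honest toy engine stays inside the band and the leaky one lands far above it, independent of the seed.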
--- a/paper/sections/06_experimental_setup.tex
+++ b/paper/sections/06_experimental_setup.tex
@ -0,0 +1,23 @@
+\section{Experimental Setup}
+\label{sec:experimental-setup}
+
+% TODO content notes:
+% - Enumerate the 5 experiments (full spec lives in experiments/README.md,
+%   this section is the paper-facing summary):
+%   1. Baseline replication: run K12 golden hyperparameters on real BTCUSDT
+%      1m data, buggy backtester, reproduce Sharpe 14.49.
+%   2. Honest replication: same hyperparameters, same data, time_machine.py
+%      (bar-by-bar, no lookahead), reproduce Sharpe -0.25.
+%   3. Noise harness on buggy backtester: same hyperparameters, >=30 GBM
+%      seeds, buggy backtester. Expect systematically positive Sharpe.
+%   4. Noise harness on honest backtester: same setup, time_machine.py.
+%      Expect Sharpe distribution centered at 0.
+%   5. Sensitivity check: vary the lookahead window size synthetically
+%      (1-bar through N-bar leak) to show that Sharpe scales with leak size
+%      rather than arising by coincidence.
+% - State software/hardware environment: pinned in env/requirements.txt,
+%   env/python_version.txt, env/os_info.txt under audit/input/ (forensic)
+%   and experiments/requirements.txt (reproduction environment going
+%   forward, which may differ in version but not in semantics).
+% - State exactly which files are authoritative for each experiment number
+%   once experiments/0N_*.py scripts exist.
--- a/paper/sections/07_results.tex
+++ b/paper/sections/07_results.tex
@ -0,0 +1,19 @@
+\section{Results}
+\label{sec:results}
+
+% TODO content notes:
+% - Table 1: side-by-side Sharpe (and other metrics: max drawdown, win rate,
+%   total return) for buggy vs honest backtester on real data. This is the
+%   "leak signature" headline result.
+% - Figure 1 (leak_signature.png): equity curve comparison, buggy vs honest,
+%   same hyperparameters, real data.
+% - Figure 2 (noise_control.png): histogram/distribution of Sharpe across
+%   >=30 GBM seeds, buggy backtester overlaid with honest backtester. The
+%   buggy distribution should be visibly shifted positive; the honest one
+%   centered near 0.
+% - Figure 3 (fix_comparison.png): before/after of the actual code diff that
+%   fixed the leak, annotated with the Sharpe delta it caused.
+% - Do not write actual numbers into this section until results/raw/*.json
+%   exist and have been validated against the audit/input/results/ originals
+%   (see MANIFEST.md hashes). Every number in this section must be traceable
+%   to a specific file + seed + commit.
--- a/paper/sections/08_discussion_and_conclusion.tex
+++ b/paper/sections/08_discussion_and_conclusion.tex
@ -0,0 +1,22 @@
+\section{Discussion and Conclusion}
+\label{sec:discussion}
+
+% TODO content notes:
+% - Generalize beyond this one kernel: any vectorized backtester using
+%   pandas/numpy rolling/shift operations is at risk of this exact class of
+%   bug; it is not specific to genetic-algorithm-discovered strategies.
+% - Practical recommendation: every backtesting pipeline should run the
+%   noise harness (Section 5) as a standing CI gate, the same way unit tests
+%   gate merges — not as a one-off audit.
+% - Limitations: the noise harness detects backtest-mechanics leaks; it does
+%   NOT detect overfitting to real historical data (that is a distinct
+%   failure mode requiring out-of-sample / walk-forward validation,
+%   out of scope here).
+% - Disclosure note: state plainly that this defect was found in an
+%   internal/proprietary research pipeline (Sentinel Research), and that
+%   this paper publishes the diagnostic methodology and a minimal
+%   reproduction, not the proprietary strategy code itself.
+% - One-paragraph conclusion restating the core claim: Sharpe 14.49 on real
+%   BTCUSDT data that collapses to Sharpe -0.25 under an honest engine is
+%   not skill, it is a bug signature: exactly the kind of result a noise
+%   harness exists to catch before capital is at risk.
--- a/results/.gitkeep
+++ b/results/.gitkeep