This repository contains a Python research pipeline for an independent replication and extension of Kim, Korajczyk, and Neuhierl (2021), Arbitrage Portfolios. The project is designed to show the full research workflow behind a cross-sectional asset-pricing strategy: data organization, point-in-time panel construction, feature engineering, rolling portfolio formation, and backtest evaluation.
This repository includes a synthetic-data demo so reviewers can run the core pipeline without proprietary Sharadar / Nasdaq Data Link data.
This is research code, not investment advice. The synthetic demo is for software and workflow review only; its numeric output is not evidence of an investable trading strategy.
For the empirical write-up and results summary, see the project page: https://danielpellatt.github.io/project.html. For the full real-data reproduction workflow, see docs/README_extended.md.
The demo starts from the committed synthetic Layer-B-style inputs under:
synthetic_demo/data/
It runs:
Layer C: feature construction
Layer D: rolling Kim et al. portfolio-weight estimation
Layer E: backtest summaries, plots, turnover/exposure diagnostics, and factor regressions
It intentionally skips:
Layer A: proprietary Sharadar data ingestion
Layer B: proprietary-data panel construction
D0 / cost tuning: research orchestration used in the full empirical workflow
Run these commands from the repository root. Python 3.11 through 3.14 is supported by the project metadata.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
On Windows PowerShell:
py -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
After installing the development dependencies:
python -m pytest -q
python scripts/run_synthetic_demo.py
The runner sets ALPHA_PORTFOLIO_PROJECT_DATA_DIR=synthetic_demo/data for the current process, computes the demo feature subset, estimates kim_etal weights, and writes evaluation outputs under the synthetic demo data folder.
Key generated files are written under:
synthetic_demo/data/processed/
Useful files to inspect:
panel_combined_M.parquet/ # Layer C processed feature panel
panel_combined_M_feature_diagnostics.parquet # feature coverage diagnostics
weights_M.parquet # Layer D portfolio weights
rebalance_diagnostics_M.parquet # Layer D eligibility / universe diagnostics
layer_e_M/summary_by_window.csv # return summaries
layer_e_M/factor_regression_summary.csv # factor-regression alpha summary
layer_e_M/factor_regression_coefficients.csv # regression coefficients
layer_e_M/plots/ # Layer E plot bundle
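After a demo run, a quick check that the listed files exist can save some digging. The sketch below is not part of the repository: the helper names `processed_dir` and `missing_outputs` are hypothetical, and the fallback to `synthetic_demo/data` when the environment variable is unset is an assumption made for illustration.

```python
import os
from pathlib import Path

# Environment variable named in this README; the fallback default is an assumption.
ENV_VAR = "ALPHA_PORTFOLIO_PROJECT_DATA_DIR"

# Relative paths copied from the "Useful files to inspect" list above.
EXPECTED_OUTPUTS = [
    "panel_combined_M.parquet",
    "panel_combined_M_feature_diagnostics.parquet",
    "weights_M.parquet",
    "rebalance_diagnostics_M.parquet",
    "layer_e_M/summary_by_window.csv",
    "layer_e_M/factor_regression_summary.csv",
    "layer_e_M/factor_regression_coefficients.csv",
    "layer_e_M/plots",
]


def processed_dir(default_data_dir="synthetic_demo/data"):
    """Return <data dir>/processed, reading the data dir from the environment if set."""
    data_dir = Path(os.environ.get(ENV_VAR, default_data_dir))
    return data_dir / "processed"


def missing_outputs(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]


if __name__ == "__main__":
    root = processed_dir()
    missing = missing_outputs(root)
    print(f"checked {root}: {len(missing)} of {len(EXPECTED_OUTPUTS)} outputs missing")
    for rel in missing:
        print("  missing:", rel)
```

Run from the repository root, this prints which outputs are absent. It only checks existence, not file contents or schemas.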
The public demo uses a smaller explicit feature set so it runs quickly and remains easy to inspect:
LME, BEME, A2ME, S2P, Debt2P, ROA, Investment,
r2_1, r6_2, r12_2, r12_7, r36_13,
LTurnover, Total_vol, Ret_max, Rel_to_High, Spread, SUV
The synthetic data are internally connected: each synthetic stock has one stable permaticker, one synthetic ticker, daily prices, monthly panel rows, fundamental-style fields, risk-free fields, and parser-compatible synthetic FRED / Ken French files. The data are artificial and should only be used to verify the repository structure and pipeline execution.
This project is meant to demonstrate practical quantitative research engineering:
- point-in-time data handling and leakage controls;
- modular pipeline layers rather than one-off notebooks;
- implementation of an academic asset-pricing model from scratch;
- feature construction for a cross-sectional return-prediction panel;
- rolling-window portfolio estimation;
- transaction-cost-aware backtest diagnostics;
- factor regressions against CAPM, Fama-French 3-factor, Fama-French 5-factor, and momentum-augmented models;
- a separated model-selection / validation / holdout workflow for the nonlinear extensions.
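As a rough illustration of the factor-regression item in the list above, the single-factor CAPM case reduces to a closed-form ordinary least squares fit of portfolio excess returns on market excess returns. The function below is a minimal standard-library sketch with a hypothetical name and toy inputs. It is not the repository's Layer E code, and the multi-factor models would need a matrix least-squares solve and standard errors that this sketch omits.

```python
def capm_alpha_beta(portfolio_excess, market_excess):
    """Closed-form OLS of portfolio excess returns on market excess returns.

    Returns (alpha, beta), where alpha is the per-period intercept.
    """
    if len(portfolio_excess) != len(market_excess) or len(market_excess) < 2:
        raise ValueError("need two equal-length series with at least 2 observations")
    n = len(market_excess)
    mean_x = sum(market_excess) / n
    mean_y = sum(portfolio_excess) / n
    sxx = sum((x - mean_x) ** 2 for x in market_excess)
    if sxx == 0.0:
        raise ValueError("market excess returns have zero variance")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(market_excess, portfolio_excess)
    )
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Toy data: the portfolio is constructed with alpha 0.002 and beta 0.5.
market = [0.01, -0.02, 0.03, 0.00, 0.015]
portfolio = [0.002 + 0.5 * m for m in market]
alpha, beta = capm_alpha_beta(portfolio, market)
print(f"alpha={alpha:.6f} beta={beta:.6f}")
```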
The full real-data workflow uses proprietary Sharadar / Nasdaq Data Link data, local FRED CSVs, and local Ken French factor CSVs. The synthetic demo replaces those proprietary inputs with synthetic Layer-B-style inputs.
src/alpha_portfolio_project/
data/ # Layer A helpers, vendor ZIP ingest, FRED / risk-free utilities
panels/ # Layer B panel construction and point-in-time utilities
features/ # Layer C feature registry and feature engine
models/ # Layer D estimator implementations
pipelines/ # runnable pipeline layers A/B/C/D0/D/E
utils/ # shared helpers
scripts/ # terminal entry points
configs/ # YAML configs for demo, full runs, D0, cost tuning, and final reports
docs/ # extended documentation, feature catalog, synthetic schema, model note
data/ # local-only real-data cache; gitignored
synthetic_demo/ # committed synthetic data bundle for the public demo
| Layer | Purpose | Public demo? | Full real-data run? |
|---|---|---|---|
| A | Ingest Sharadar bulk ZIPs into local Parquet | skipped | yes |
| B | Build point-in-time panel inputs, returns, risk-free fields, fundamentals, delisting adjustments | skipped; synthetic replacement provided | yes |
| C | Compute stock characteristics / features | yes | yes |
| D | Estimate rolling portfolio weights | yes, kim_etal only | yes |
| D0 | Signal-selection orchestration for D00 / D01 | not part of quick demo | optional but used in the full research workflow |
| D cost tuning | Post-D0 implementation sweep over volatility targets, no-trade bands, and gross caps | not part of quick demo | optional but used in the full research workflow |
| E | Backtest returns, exposures, turnover, costs, plots, and factor regressions | yes | yes |
The full workflow requires data that are not included in this repository:
- Sharadar / Nasdaq Data Link bulk ZIPs:
SHARADAR/SEP
SHARADAR/DAILY
SHARADAR/ACTIONS
SHARADAR/SF1
SHARADAR/TICKERS
- FRED CSV:
DGS3MO.csv
- Ken French monthly factor CSVs:
F-F_Research_Data_Factors.csv
F-F_Research_Data_5_Factors_2x3.csv
F-F_Momentum_Factor.csv
Place local inputs here:
data/external/nasdaq_bulk/
data/external/fred/
data/external/ken_french/
Copy the environment template, add your own Nasdaq Data Link key, and keep .env local:
cp .env.example .env
Then run the main A → E pipeline:
python scripts/run_layer_a_download.py --config configs/bulk_full_template.yaml
python scripts/run_layer_b_build_panel.py --config configs/bulk_full_template.yaml
python scripts/run_layer_c_features.py --config configs/bulk_full_template.yaml
python scripts/run_layer_d_model_est.py --config configs/bulk_full_template.yaml
python scripts/run_layer_e_backtest.py --config configs/bulk_full_template.yaml
For the full empirical workflow, including the D00/D01 signal screen, post-D0 cost tuning, frictionless reporting runs, and final 15-year holdout comparisons, see docs/README_extended.md.
Sharadar SF1 fundamentals are aligned using datekey as the information-availability date, with a one-trading-day safety shift by default. Forward returns are computed separately from contemporaneous inputs. Layer D forms portfolios using information available at the rebalance date, and Layer E evaluates the realized next-period holding return.
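The alignment rule above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's Layer B code. The helper names are hypothetical, the trading calendar is a toy list, and the shift is interpreted as "first trading day on or after datekey, plus one trading day", which is one plausible reading of the safety shift.

```python
from bisect import bisect_left
from datetime import date


def available_on(datekey, trading_days, shift=1):
    """First trading day on or after datekey, moved forward by `shift` trading days.

    trading_days must be sorted ascending and should cover datekey.
    Returns None if the shifted date falls beyond the calendar.
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    idx = bisect_left(trading_days, datekey) + shift
    return trading_days[idx] if idx < len(trading_days) else None


def latest_known(records, asof, trading_days, shift=1):
    """Return the most recent (datekey, value) whose availability date is <= asof."""
    best = None
    for datekey, value in records:
        avail = available_on(datekey, trading_days, shift)
        if avail is not None and avail <= asof:
            if best is None or datekey > best[0]:
                best = (datekey, value)
    return best


# Toy calendar: Monday 2024-01-29 through Friday 2024-02-02.
calendar = [date(2024, 1, 29), date(2024, 1, 30), date(2024, 1, 31),
            date(2024, 2, 1), date(2024, 2, 2)]
# Hypothetical fundamentals records keyed by datekey.
filings = [(date(2024, 1, 29), 1.0), (date(2024, 1, 31), 2.0)]

for rebalance in calendar:
    print(rebalance, latest_known(filings, rebalance, calendar))
```

With the one-day shift, a record filed on a given trading day is first visible at the following trading day's rebalance, so a portfolio formed on the filing date itself never sees it.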
Kim et al. use CRSP / Compustat and the standard June/July accounting convention. This project uses Sharadar as-reported data and documented Sharadar-compatible proxies where the vendor schemas differ. The goal is an independent, point-in-time implementation rather than a literal database match.
Layer E reports both a frictionless target-weight path and a cost-aware executed path. The cost-aware path uses drift-aware executed holdings, an optional no-trade band, optional gross-exposure caps, and a spread-cost proxy based on stock-level cs_spread.
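One rebalance step of a cost-aware path of this kind might look like the sketch below. It is a simplified illustration, not the repository's Layer E implementation. The function name, the drift normalization by one plus the portfolio return, the order of the band and cap steps, and charging half the quoted spread per unit of weight traded are all assumptions.

```python
def execute_rebalance(prev_w, ret, target_w, spread, band=0.0, gross_cap=None):
    """One cost-aware rebalance step (illustrative assumptions throughout).

    prev_w, target_w: dicts of name -> weight.
    ret: dict of name -> realized return over the holding period.
    spread: dict of name -> proportional bid-ask spread proxy.
    Returns (executed weights, turnover, spread cost).
    """
    names = sorted(set(prev_w) | set(target_w))

    # Drift: let last period's holdings move with realized returns.
    port_ret = sum(prev_w.get(n, 0.0) * ret.get(n, 0.0) for n in names)
    drift = {
        n: prev_w.get(n, 0.0) * (1.0 + ret.get(n, 0.0)) / (1.0 + port_ret)
        for n in names
    }

    # No-trade band: keep the drifted weight when the gap to target is small.
    executed = {}
    for n in names:
        tgt = target_w.get(n, 0.0)
        executed[n] = drift[n] if abs(tgt - drift[n]) < band else tgt

    # Gross-exposure cap: scale all weights down proportionally if exceeded.
    if gross_cap is not None:
        gross = sum(abs(w) for w in executed.values())
        if gross > gross_cap:
            scale = gross_cap / gross
            executed = {n: w * scale for n, w in executed.items()}

    # Turnover and cost are measured against the drifted holdings.
    trades = {n: abs(executed[n] - drift[n]) for n in names}
    turnover = sum(trades.values())
    cost = sum(trades[n] * 0.5 * spread.get(n, 0.0) for n in names)
    return executed, turnover, cost


prev = {"A": 0.5, "B": -0.5}
rets = {"A": 0.10, "B": 0.0}
target = {"A": 0.5, "B": -0.5}
spreads = {"A": 0.002, "B": 0.004}
print(execute_rebalance(prev, rets, target, spreads, band=0.01))
```

Measuring trades against drifted rather than previous target weights is what makes the path drift-aware: a position that has moved toward its new target on its own incurs less turnover.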
The synthetic demo mimics the Layer B output contract closely enough to run Layers C, D, and E. It does not reproduce the empirical results in the project write-up.
- Kim, Soohun, Robert A. Korajczyk, and Andreas Neuhierl. 2021. “Arbitrage Portfolios.” Review of Financial Studies.
- Freyberger, Joachim, Andreas Neuhierl, and Michael Weber. 2020. “Dissecting Characteristics Nonparametrically.” Review of Financial Studies.
- Corwin, Shane A., and Paul Schultz. 2012. “A Simple Way to Estimate Bid-Ask Spreads from Daily High and Low Prices.” Journal of Finance.
- Sharadar / Nasdaq Data Link data are proprietary and are not redistributed.
- The public synthetic demo output is not economically meaningful.
- Sharadar exchange and security metadata are not a perfect historical reconstruction of CRSP listing and share-code screens.
- The delisting adjustment is an explicit Sharadar-only approximation, not a CRSP delisting-return series.
- The trading-cost model is intentionally simple and does not include borrow fees, financing, locate risk, forced buy-ins, market impact, or institutional execution constraints.
- This repository is for research review and reproducibility, not production trading.