danielpellatt/Arbitrage-Portfolios-Project

Arbitrage Portfolios — Python Research Pipeline and Synthetic Demo

This repository contains a Python research pipeline for an independent replication and extension of Kim, Korajczyk, and Neuhierl (2021), Arbitrage Portfolios. The project is designed to show the full research workflow behind a cross-sectional asset-pricing strategy: data organization, point-in-time panel construction, feature engineering, rolling portfolio formation, and backtest evaluation.

This repository includes a synthetic-data demo so reviewers can run the core pipeline without proprietary Sharadar / Nasdaq Data Link data.

This is research code, not investment advice. The synthetic demo is for software and workflow review only; its numeric output is not evidence of an investable trading strategy.

For the empirical write-up and results summary, see the project page: https://danielpellatt.github.io/project.html. For the full real-data reproduction workflow, see docs/README_extended.md.


Quick synthetic demo — no Sharadar account required

The demo starts from the committed synthetic Layer-B-style inputs under:

synthetic_demo/data/

It runs:

Layer C: feature construction
Layer D: rolling Kim et al. portfolio-weight estimation
Layer E: backtest summaries, plots, turnover/exposure diagnostics, and factor regressions

It intentionally skips:

Layer A: proprietary Sharadar data ingestion
Layer B: proprietary-data panel construction
D0 / cost tuning: research orchestration used in the full empirical workflow

1. Set up Python

Run these commands from the repository root. The project metadata declares support for Python 3.11 through 3.14.

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

On Windows PowerShell:

py -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

Optional: run tests

After installing the development dependencies:

python -m pytest -q

2. Run the demo

python scripts/run_synthetic_demo.py

The runner sets ALPHA_PORTFOLIO_PROJECT_DATA_DIR=synthetic_demo/data for the current process only, computes the demo feature subset, estimates kim_etal weights, and writes evaluation outputs under synthetic_demo/data/processed/.
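Setting the variable for the current process means your shell environment is not modified. The sketch below illustrates that idea with a small context manager. It is not the runner's actual code, and the helper name temporary_env is hypothetical; only the variable name and value come from this README.

```python
import os
from contextlib import contextmanager

ENV_VAR = "ALPHA_PORTFOLIO_PROJECT_DATA_DIR"  # variable name taken from this README


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for this process, then restore the old state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable did not exist before, so remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


if __name__ == "__main__":
    with temporary_env(ENV_VAR, "synthetic_demo/data"):
        # Code in this block (and any child process it starts) sees the demo path.
        print(os.environ[ENV_VAR])
    # Outside the block the previous value, if any, is back in place.
    print(os.environ.get(ENV_VAR))
```

Any change made through os.environ is visible only to the Python process and its children, so the parent shell is unaffected either way.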

3. Inspect outputs

Key generated files are written under:

synthetic_demo/data/processed/

Useful files to inspect:

panel_combined_M.parquet/                      # Layer C processed feature panel
panel_combined_M_feature_diagnostics.parquet   # feature coverage diagnostics
weights_M.parquet                              # Layer D portfolio weights
rebalance_diagnostics_M.parquet                # Layer D eligibility / universe diagnostics
layer_e_M/summary_by_window.csv                # return summaries
layer_e_M/factor_regression_summary.csv        # factor-regression alpha summary
layer_e_M/factor_regression_coefficients.csv   # regression coefficients
layer_e_M/plots/                               # Layer E plot bundle
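To confirm that a demo run produced the files listed above, a small standard-library check can help. This is a sketch: the helper names missing_outputs and preview_csv are hypothetical, and no column names are assumed because the CSV schemas are not described here. Reading the Parquet outputs would need a Parquet-capable library, which this sketch deliberately avoids.

```python
import csv
from pathlib import Path

PROCESSED = Path("synthetic_demo/data/processed")

# Relative paths copied from the list above.
EXPECTED = [
    "panel_combined_M.parquet",
    "panel_combined_M_feature_diagnostics.parquet",
    "weights_M.parquet",
    "rebalance_diagnostics_M.parquet",
    "layer_e_M/summary_by_window.csv",
    "layer_e_M/factor_regression_summary.csv",
    "layer_e_M/factor_regression_coefficients.csv",
    "layer_e_M/plots",
]


def missing_outputs(root=PROCESSED):
    """Return the expected outputs that do not exist under root."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]


def preview_csv(path, n_rows=5):
    """Return up to n_rows rows of a CSV file as dictionaries keyed by header."""
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        return [row for _, row in zip(range(n_rows), reader)]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        for row in preview_csv(PROCESSED / "layer_e_M" / "summary_by_window.csv"):
            print(row)
```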

The public demo uses a smaller explicit feature set so it runs quickly and remains easy to inspect:

LME, BEME, A2ME, S2P, Debt2P, ROA, Investment,
r2_1, r6_2, r12_2, r12_7, r36_13,
LTurnover, Total_vol, Ret_max, Rel_to_High, Spread, SUV

The synthetic data are internally connected: each synthetic stock has one stable permaticker, one synthetic ticker, daily prices, monthly panel rows, fundamental-style fields, risk-free fields, and parser-compatible synthetic FRED / Ken French files. The data are artificial and should only be used to verify the repository structure and pipeline execution.


What the project demonstrates

This project is meant to demonstrate practical quantitative research engineering:

  • point-in-time data handling and leakage controls;
  • modular pipeline layers rather than one-off notebooks;
  • implementation of an academic asset-pricing model from scratch;
  • feature construction for a cross-sectional return-prediction panel;
  • rolling-window portfolio estimation;
  • transaction-cost-aware backtest diagnostics;
  • factor regressions against CAPM, Fama-French 3-factor, Fama-French 5-factor, and momentum-augmented models;
  • a separated model-selection / validation / holdout workflow for the nonlinear extensions.

The full real-data workflow uses proprietary Sharadar / Nasdaq Data Link data, local FRED CSVs, and local Ken French factor CSVs. The synthetic demo replaces those proprietary inputs with synthetic Layer-B-style inputs.


Repository layout

src/alpha_portfolio_project/
  data/          # Layer A helpers, vendor ZIP ingest, FRED / risk-free utilities
  panels/        # Layer B panel construction and point-in-time utilities
  features/      # Layer C feature registry and feature engine
  models/        # Layer D estimator implementations
  pipelines/     # runnable pipeline layers A/B/C/D0/D/E
  utils/         # shared helpers

scripts/         # terminal entry points
configs/         # YAML configs for demo, full runs, D0, cost tuning, and final reports
docs/            # extended documentation, feature catalog, synthetic schema, model note
data/            # local-only real-data cache; gitignored
synthetic_demo/  # committed synthetic data bundle for the public demo

Pipeline overview

| Layer | Purpose | Public demo? | Full real-data run? |
|---|---|---|---|
| A | Ingest Sharadar bulk ZIPs into local Parquet | skipped | yes |
| B | Build point-in-time panel inputs, returns, risk-free fields, fundamentals, delisting adjustments | skipped; synthetic replacement provided | yes |
| C | Compute stock characteristics / features | yes | yes |
| D | Estimate rolling portfolio weights | yes, kim_etal only | yes |
| D0 | Signal-selection orchestration for D00 / D01 | not part of quick demo | optional but used in the full research workflow |
| D cost tuning | Post-D0 implementation sweep over volatility targets, no-trade bands, and gross caps | not part of quick demo | optional but used in the full research workflow |
| E | Backtest returns, exposures, turnover, costs, plots, and factor regressions | yes | yes |

Full run with real Sharadar / FRED / Ken French data

The full workflow requires data that are not included in this repository:

  1. Sharadar / Nasdaq Data Link bulk ZIPs:
    • SHARADAR/SEP
    • SHARADAR/DAILY
    • SHARADAR/ACTIONS
    • SHARADAR/SF1
    • SHARADAR/TICKERS
  2. FRED CSV:
    • DGS3MO.csv
  3. Ken French monthly factor CSVs:
    • F-F_Research_Data_Factors.csv
    • F-F_Research_Data_5_Factors_2x3.csv
    • F-F_Momentum_Factor.csv

Place local inputs here:

data/external/nasdaq_bulk/
data/external/fred/
data/external/ken_french/
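Before starting a full run, it can help to check that the local inputs are in place. The sketch below assumes the FRED CSV sits directly in data/external/fred/ and the Ken French CSVs directly in data/external/ken_french/. Because the Sharadar archive file names are not listed here, it only checks that nasdaq_bulk/ contains at least one ZIP. The function name preflight is hypothetical.

```python
from pathlib import Path

# File names copied from the list above; the folder mapping is an assumption.
REQUIRED_FILES = {
    "fred": ["DGS3MO.csv"],
    "ken_french": [
        "F-F_Research_Data_Factors.csv",
        "F-F_Research_Data_5_Factors_2x3.csv",
        "F-F_Momentum_Factor.csv",
    ],
}


def preflight(external_root):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    external_root = Path(external_root)
    problems = []
    for folder, names in REQUIRED_FILES.items():
        for name in names:
            if not (external_root / folder / name).is_file():
                problems.append(f"missing {folder}/{name}")
    bulk = external_root / "nasdaq_bulk"
    # Archive names are vendor-specific, so only the presence of a ZIP is checked.
    if not bulk.is_dir() or not any(bulk.glob("*.zip")):
        problems.append("no ZIP archives found in nasdaq_bulk/")
    return problems


if __name__ == "__main__":
    for problem in preflight("data/external"):
        print(problem)
```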

Copy the environment template, add your own Nasdaq Data Link key, and do not commit .env:

cp .env.example .env

Then run the main A → E pipeline:

python scripts/run_layer_a_download.py --config configs/bulk_full_template.yaml
python scripts/run_layer_b_build_panel.py --config configs/bulk_full_template.yaml
python scripts/run_layer_c_features.py --config configs/bulk_full_template.yaml
python scripts/run_layer_d_model_est.py --config configs/bulk_full_template.yaml
python scripts/run_layer_e_backtest.py --config configs/bulk_full_template.yaml
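The five commands depend on each other in order, so a wrapper that stops at the first failure can save time. This is a minimal sketch, not a script shipped with the repository. The names build_commands and run_pipeline are hypothetical; the script paths and the --config flag are copied from the commands above.

```python
import subprocess
import sys

# Script paths copied from the commands above, in A -> E order.
LAYER_SCRIPTS = [
    "scripts/run_layer_a_download.py",
    "scripts/run_layer_b_build_panel.py",
    "scripts/run_layer_c_features.py",
    "scripts/run_layer_d_model_est.py",
    "scripts/run_layer_e_backtest.py",
]


def build_commands(config, python=sys.executable):
    """Build one command line per layer, all sharing the same config file."""
    return [[python, script, "--config", config] for script in LAYER_SCRIPTS]


def run_pipeline(config, runner=subprocess.run):
    """Run the layers in order; check=True raises on the first non-zero exit code."""
    for command in build_commands(config):
        print("running:", " ".join(command))
        runner(command, check=True)


if __name__ == "__main__":
    run_pipeline("configs/bulk_full_template.yaml")
```

Passing the runner as a parameter keeps the ordering logic testable without launching the real layers.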

For the full empirical workflow, including the D00/D01 signal screen, post-D0 cost tuning, frictionless reporting runs, and final 15-year holdout comparisons, see docs/README_extended.md.


Key design choices

Point-in-time timing

Sharadar SF1 fundamentals are aligned using datekey as the information-availability date, with a one-trading-day safety shift by default. Forward returns are computed separately from contemporaneous inputs. Layer D forms portfolios using information available at the rebalance date, and Layer E evaluates the realized next-period holding return.
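The availability rule can be illustrated with a small as-of lookup. This is a simplified sketch rather than the project's implementation. It treats weekdays as trading days (exchange holidays are ignored), and the function names next_business_day and as_of_value are hypothetical.

```python
from bisect import bisect_right
from datetime import date, timedelta


def next_business_day(d):
    """Shift a date forward by one weekday (holidays ignored in this sketch)."""
    d += timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


def as_of_value(records, rebalance_date):
    """Return the latest value whose shifted datekey is on or before rebalance_date.

    records is a list of (datekey, value) pairs sorted by datekey.
    """
    available_dates = [next_business_day(datekey) for datekey, _ in records]
    index = bisect_right(available_dates, rebalance_date)
    return records[index - 1][1] if index else None


if __name__ == "__main__":
    filings = [(date(2023, 11, 2), 1.0), (date(2024, 2, 29), 2.0)]
    # On the filing date itself the new value is not yet usable.
    print(as_of_value(filings, date(2024, 2, 29)))
    # One weekday later it becomes visible.
    print(as_of_value(filings, date(2024, 3, 1)))
```

With the safety shift, a value filed on a given datekey never enters a portfolio formed on that same date, which is the leakage control described above.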

Sharadar implementation rather than exact CRSP/Compustat reproduction

Kim et al. use CRSP / Compustat and the standard June/July accounting convention. This project uses Sharadar as-reported data and documented Sharadar-compatible proxies where the vendor schemas differ. The goal is an independent, point-in-time implementation rather than a literal database match.

Transaction-cost diagnostics

Layer E reports both a frictionless target-weight path and a cost-aware executed path. The cost-aware path uses drift-aware executed holdings, an optional no-trade band, optional gross-exposure caps, and a spread-cost proxy based on stock-level cs_spread.
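A toy version of one cost-aware rebalance step may make these pieces concrete. The sketch below is illustrative only and makes several assumptions that Layer E may not share: the band is applied per name in weight units, the gross cap rescales the target proportionally, each trade pays half the quoted spread, and cash earns nothing. The function names are hypothetical.

```python
def drift_weights(weights, returns):
    """Weights after one period of returns, as a fraction of end-of-period equity."""
    portfolio_return = sum(w * returns.get(name, 0.0) for name, w in weights.items())
    scale = 1.0 + portfolio_return  # assumes zero return on cash and margin
    return {name: w * (1.0 + returns.get(name, 0.0)) / scale for name, w in weights.items()}


def rebalance(drifted, target, spread, band=0.0, gross_cap=None):
    """Return (executed_weights, cost) for one rebalance from drifted toward target."""
    gross = sum(abs(w) for w in target.values())
    if gross_cap is not None and gross > gross_cap:
        # Scale every target weight down so gross exposure equals the cap.
        target = {name: w * gross_cap / gross for name, w in target.items()}
    executed, cost = {}, 0.0
    for name in sorted(set(drifted) | set(target)):
        old = drifted.get(name, 0.0)
        new = target.get(name, 0.0)
        trade = new - old
        if abs(trade) <= band:
            # Inside the no-trade band: keep the drifted holding and pay nothing.
            new, trade = old, 0.0
        cost += abs(trade) * 0.5 * spread.get(name, 0.0)
        if new != 0.0:
            executed[name] = new
    return executed, cost


if __name__ == "__main__":
    start = {"A": 0.5, "B": -0.5}
    drifted = drift_weights(start, {"A": 0.10, "B": 0.0})
    executed, cost = rebalance(drifted, start, {"A": 0.002, "B": 0.004}, band=0.01)
    print(executed, cost)
```

Widening the band trades tracking error against cost: with a band larger than the drift, no trades occur and the executed holdings stay at their drifted values.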

Synthetic demo scope

The synthetic demo mimics the Layer B output contract closely enough to run Layers C, D, and E. It does not reproduce the empirical results in the project write-up.


References

  • Kim, Soohun, Robert A. Korajczyk, and Andreas Neuhierl. 2021. “Arbitrage Portfolios.” Review of Financial Studies.
  • Freyberger, Joachim, Andreas Neuhierl, and Michael Weber. 2020. “Dissecting Characteristics Nonparametrically.” Review of Financial Studies.
  • Corwin, Shane A., and Paul Schultz. 2012. “A Simple Way to Estimate Bid-Ask Spreads from Daily High and Low Prices.” Journal of Finance.

Notes and limitations

  • Sharadar / Nasdaq Data Link data are proprietary and are not redistributed.
  • The public synthetic demo output is not economically meaningful.
  • Sharadar exchange and security metadata are not a perfect historical reconstruction of CRSP listing and share-code screens.
  • The delisting adjustment is an explicit Sharadar-only approximation, not a CRSP delisting-return series.
  • The trading-cost model is intentionally simple and does not include borrow fees, financing, locate risk, forced buy-ins, market impact, or institutional execution constraints.
  • This repository is for research review and reproducibility, not production trading.

About

Point-in-time Python equity research pipeline for replicating and extending Kim et al. (2021) Arbitrage Portfolios, with a runnable synthetic-data demo.
