Using DNA sequence models to compute Variant Effect Predictions (VEPs) across biobank-scale populations.
This repository provides a pipeline and analysis toolkit for running DNA-based variant effect predictors (e.g., SpliceAI, Flashzoi/Borzoi, DECIMA, Evo2) on population-scale genomic data. It integrates with GenVarLoader (GVL) to load haplotypes and reference data, runs VEP models on clinical variant sites (e.g., ClinVar), and supports downstream analysis such as variant attribution, epistasis testing, and population-specific effects.
Key capabilities:
- Multi-model VEP pipeline: Run SpliceAI, Flashzoi, DECIMA, Evo2, and related models via a unified interface (src/vep_pipeline.py).
- Population-scale data: Use 1000 Genomes and similar cohorts via GVL; scripts and notebooks cover data download and metadata.
- Variant attribution & epistasis: Relate wild-type (WT) variants to clinical VEP scores (e.g., Ridge regression, epistasis tests); methods are documented in docs/.
- Splicing & UTR resources: SpliceVarDB integration, splicing region definitions, and ClinVar UTR variant workflows.
- Visualization & analysis: UMAP, Datashader, and custom plotting for VEP outputs and population structure.
Each model typically has its own conda environment (see Environment creation); the pipeline is designed so you activate the relevant env and run the corresponding notebooks or scripts.
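To make the epistasis capability listed above concrete, here is a minimal sketch of one common definition of pairwise epistasis: the deviation of a double-variant effect from the sum of the two single-variant effects. It is not necessarily the test described in docs/methods_epistasis.md. The names `apply_variants`, `pairwise_epistasis`, and `toy_score` are hypothetical, and `toy_score` merely stands in for a real model's scoring call.

```python
from typing import Callable, Dict, Tuple

Variant = Tuple[int, str]  # (0-based position, alternate base); SNVs only


def apply_variants(ref_seq: str, variants: Tuple[Variant, ...]) -> str:
    """Return a copy of ref_seq with each SNV substituted in."""
    seq = list(ref_seq)
    for pos, alt in variants:
        seq[pos] = alt
    return "".join(seq)


def pairwise_epistasis(
    ref_seq: str,
    variant_a: Variant,
    variant_b: Variant,
    score_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Compare the joint effect of two variants with the sum of their single effects."""
    base = score_fn(ref_seq)
    effect_a = score_fn(apply_variants(ref_seq, (variant_a,))) - base
    effect_b = score_fn(apply_variants(ref_seq, (variant_b,))) - base
    effect_ab = score_fn(apply_variants(ref_seq, (variant_a, variant_b))) - base
    return {
        "effect_a": effect_a,
        "effect_b": effect_b,
        "effect_ab": effect_ab,
        # Zero means the two variants act additively under this score.
        "epistasis": effect_ab - (effect_a + effect_b),
    }


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model: counts G bases, plus a bonus if 'GG' occurs."""
    return seq.count("G") + (2.0 if "GG" in seq else 0.0)


result = pairwise_epistasis("ACATACA", (2, "G"), (3, "G"), toy_score)
print(result)
```

In practice `score_fn` would wrap whichever model is active in the current environment; the additive baseline is a modeling choice, and a different null model would change what counts as an interaction.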
VEP_DNA/
├── conda/ # Conda environment definitions (.yml)
├── data/ # Reference data, ClinVar, SpliceVarDB, 1KG metadata
│ ├── 1KG/
│ ├── splicing/
│ └── UTR/
├── docs/ # Method descriptions (epistasis, etc.)
├── metadata/ # IGSR population/sample metadata
├── manuscript/ # Manuscript PDFs
├── notebooks/ # Jupyter notebooks (see Notebooks section)
├── scripts/ # Standalone scripts (SLURM, visualization, pipelines)
├── src/ # Python package: pipeline, models, analysis, utils
│ ├── analysis/ # Attribution, matrices
│ ├── benchmark/ # Benchmarking (e.g. ClinVar)
│ ├── SpliceAI/ # SpliceAI-related code and data
│ └── ... # clinvar, onekg, vep_pipeline, flashzoi, etc.
├── example_usage.py # Example for genomic_image_cnn
├── genomic_image_cnn.py # CNN on genomic embeddings (see README_genomic_cnn.md)
├── plot_splice_regions.py # Splicing region diagrams
└── README.md
- conda/: One YAML per environment (e.g. spliceai.yml, flashzoi.yml, gvl.yml). See conda/README.md for creation and optional HPC/CUDA notes.
- data/: Preprocessed or downloaded inputs (e.g. ClinVar VCF, SpliceVarDB×ClinVar tables, 1KG metadata). Some assets are generated by notebooks; see notebook docs for download links.
- docs/: Methods (e.g. epistasis, linear model for joint effects). See docs/methods_epistasis.md and docs/epistasis_assumptions_comparison.md.
- notebooks/: Step-by-step examples for each model, data prep, and analysis (see Notebooks).
- scripts/: SLURM submission, haplotype analysis, UMAP/datashader demos, splicing region generation.
- src/: Core Python code; run from repo root so that import src.* works (notebooks set os.chdir to repo root).
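The src/ entry above depends on the working directory being the repository root. The following is a minimal sketch of how a script or notebook could locate that root and switch to it; it is not the code the notebooks actually use. The function names are hypothetical, and the choice of src/ and conda/ as marker directories is an assumption based on the layout shown above.

```python
import os
import sys
from pathlib import Path
from typing import Optional, Tuple


def find_repo_root(
    start: Optional[Path] = None,
    markers: Tuple[str, ...] = ("src", "conda"),
) -> Path:
    """Walk upward from `start` until a directory containing every marker is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no directory containing {markers} at or above {current}")


def enter_repo_root(start: Optional[Path] = None) -> Path:
    """Change into the repository root and make `import src...` resolvable."""
    root = find_repo_root(start)
    os.chdir(root)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Calling `enter_repo_root()` at the top of a script launched from notebooks/ or scripts/ would then allow imports such as `import src.vep_pipeline`, provided the marker directories exist.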
- Clone the repo (and ensure you have conda/mamba available).
- Create an environment for the model you want to use (see Environment creation).
- Install GenVarLoader if you will use 1000 Genomes or other GVL datasets (see conda/README.md; the pipeline notebooks use genvarloader).
- Run from the repo root: Jupyter notebooks assume the working directory is the repository root so that import src.vep_pipeline, src.onekg, etc. work. The notebooks set this automatically on first run.
- Pick a notebook from the table below that matches your goal (e.g. pipeline run, attribution, or data download).
Minimal example (after activating an env with dependencies):
cd /path/to/VEP_DNA
conda activate flashzoi # or spliceai, evo2, etc.
jupyter notebook notebooks/vep_dna_pipeline.ipynb

For the genomic image CNN (embeddings-as-images), see README_genomic_cnn.md and example_usage.py.
Conda environment files live in conda/. Create and activate an environment with:
conda env create -f conda/<name>.yml
conda activate <name>

| Environment file | Purpose |
|---|---|
| conda.yml | Base/general (pandas, numpy, torch, plotting). |
| spliceai.yml | SpliceAI + GenVarLoader. |
| flashzoi.yml | Flashzoi/Borzoi. |
| decima.yml | DECIMA. |
| evo2.yml | Evo2. |
| dnabert.yml | DNABERT. |
| gvl.yml | GenVarLoader-focused. |
| xarray.yml | xarray/zarr for array-backed results. |
| multimolecule.yml | Multimolecule/splice-related stack. |

- On HPC with CUDA via EasyBuild, you may need to load CUDA before creating envs, e.g. module load EBModules CUDA/11.7.0 (see conda/README.md).
- GenVarLoader can be installed with pip install git+https://github.com/mcvickerlab/GenVarLoader.git; some envs already include it.
- For GenVarLoader development (Pixi), see the “Installation GenVarLoader” section in conda/README.md.
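When setting up several environments, the create and activate commands can be generated from the file name. The sketch below is only a convenience illustration: `conda_commands` is a hypothetical helper, and it assumes the environment name equals the file name without `.yml`. The actual name is whatever the `name:` field inside the YAML file declares, so check the file if activation fails.

```python
from pathlib import PurePosixPath
from typing import List


def conda_commands(env_file: str) -> List[str]:
    """Return the create and activate commands for one environment file.

    Assumption: the environment name matches the file name without `.yml`.
    """
    path = PurePosixPath(env_file)
    if path.suffix != ".yml":
        raise ValueError(f"expected a .yml file, got {env_file!r}")
    return [
        f"conda env create -f {path}",
        f"conda activate {path.stem}",
    ]


for command in conda_commands("conda/flashzoi.yml"):
    print(command)
```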
Notebooks are in notebooks/. Run them from the repository root (they set the working directory automatically). The following table lists each notebook and summarizes its purpose.
| Notebook | Description |
|---|---|
| vep_dna_pipeline.ipynb | Core pipeline: run VEP models (with GVL) on sites; install notes for GenVarLoader and xarray; uses Geuvadis chr22 example. |
| vep_analysis.ipynb | VEP analysis: import and analyze non-null VEP results; integrate with ClinVar, onekg, tskit, and benchmarking. |
| vep_population_analysis.ipynb | Population-specific effects: pathogenic UTR variants and population structure. |
| variant_attribution_SpliceAI.ipynb | SpliceAI attribution: haplotype × WT variant matrix, VEP scores, Ridge-based joint effects, epistasis (within-model vs across-model); splicing region definitions. |
| variant_attribution_Flashzoi.ipynb | Flashzoi attribution: variant attribution workflow for Borzoi/Flashzoi (logits, PCA, cosine similarity, batching). |
| flashzoi.ipynb | Flashzoi/Borzoi: run model, PCA on track deltas, cosine similarity (logits and PCA), batch processing, track metadata. |
| DECIMA.ipynb | DECIMA: DECIMA VEP and GenVarLoader tracks; links to DECIMA API and tutorials. |
| Evo2.ipynb | Evo2: environment setup and usage for Evo2. |
| GVL.ipynb | GenVarLoader: load and work with GVL datasets (e.g. 1000 Genomes). |
| data_downloaders.ipynb | Data download: 1000 Genomes collections, FTP/manifest, VCF discovery via onekg. |
| splicevardb.ipynb | SpliceVarDB (lowercase): download and explore SpliceVarDB data. |
| SpliceVarDB.ipynb | SpliceVarDB (capitalized): SpliceVarDB × ClinVar overlap; download links for SpliceVarDB, ClinVar VCF, GENCODE exons; output splicevardb_x_clinvar_*. |
| UTRVar.ipynb | UTR variants: ClinVar UTR SNVs; load/filter/format with src.clinvar for downstream (e.g. GVL). |
| QTL_annotations.ipynb | QTL annotations: compare QTL effect sizes with REF vs non-REF VEPs; tissue matching and overlap caveats. |
| phyloGPN.ipynb | phyloGPN: example DNA sequences and phyloGPN (GPN) model/tokenizer usage. |
| datashader.ipynb | Datashader: large-scale visualization with Datashader/Bokeh/Panel; env setup in first cell. |
| COVR_test.ipynb | COVR / environment test: GPU and path setup for a COVR-related run. |
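Several Flashzoi entries in the table compare predictions by cosine similarity of track deltas. As a rough illustration of that comparison, and not the notebooks' actual code, the sketch below computes the delta between alternate and reference track values and the cosine similarity between two such deltas. The function names and the toy numbers are hypothetical; real inputs would be model outputs with many more tracks.

```python
import math
from typing import List, Sequence


def track_delta(alt_tracks: Sequence[float], ref_tracks: Sequence[float]) -> List[float]:
    """Per-track difference between alternate and reference predictions."""
    if len(alt_tracks) != len(ref_tracks):
        raise ValueError("alt and ref must have the same number of tracks")
    return [alt - ref for alt, ref in zip(alt_tracks, ref_tracks)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; NaN if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")
    return dot / (norm_u * norm_v)


# Toy values: three tracks for a reference and two alternate haplotypes.
ref = [1.0, 2.0, 3.0]
delta_hap1 = track_delta([2.0, 2.0, 3.0], ref)  # changes track 0 only
delta_hap2 = track_delta([1.0, 3.0, 3.0], ref)  # changes track 1 only
delta_hap3 = track_delta([3.0, 2.0, 3.0], ref)  # changes track 0, larger magnitude

print(cosine_similarity(delta_hap1, delta_hap2))
print(cosine_similarity(delta_hap1, delta_hap3))
```

Cosine similarity ignores magnitude, so two haplotypes that shift the same tracks by different amounts still score as aligned; comparing effect sizes needs a separate measure such as the delta norm.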
- Methods and epistasis: docs/methods_epistasis.md, docs/epistasis_assumptions_comparison.md.
- Genomic Image CNN: README_genomic_cnn.md, genomic_image_cnn.py, example_usage.py.
- Conda and GVL: conda/README.md.
- SpliceAI subpackage: src/SpliceAI/.
- Tree collections (GVL): src/README_tree_collections.md.
- 1000 Genomes: data portal; manifests and VCFs via onekg (see data_downloaders.ipynb).
- ClinVar: GRCh38 VCF.
- SpliceVarDB: SpliceVarDB — download and processing in SpliceVarDB.ipynb.
- GenVarLoader: GitHub.
If you use this repository, please cite the relevant model papers (SpliceAI, Borzoi/Flashzoi, DECIMA, Evo2, etc.) and GenVarLoader as appropriate.