VEP_DNA

Using DNA sequence models to compute Variant Effect Predictions (VEPs) across biobank-scale populations.

Overview

This repository provides a pipeline and analysis toolkit for running DNA-based variant effect predictors (e.g., SpliceAI, Flashzoi/Borzoi, DECIMA, Evo2) on population-scale genomic data. It integrates with GenVarLoader (GVL) to load haplotypes and reference data, runs VEP models on clinical variant sites (e.g., ClinVar), and supports downstream analysis such as variant attribution, epistasis testing, and population-specific effects.

Key capabilities:

Multi-model VEP pipeline: Run SpliceAI, Flashzoi, DECIMA, Evo2, and related models via a unified interface (src/vep_pipeline.py).
Population-scale data: Use 1000 Genomes and similar cohorts via GVL; scripts and notebooks cover data download and metadata.
Variant attribution & epistasis: Relate wild-type (WT) variants to clinical VEP scores (e.g., Ridge regression, epistasis tests); methods are documented in docs/.
Splicing & UTR resources: SpliceVarDB integration, splicing region definitions, and ClinVar UTR variant workflows.
Visualization & analysis: UMAP, Datashader, and custom plotting for VEP outputs and population structure.
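
The Ridge-based attribution idea mentioned above can be illustrated with a minimal sketch. It assumes a 0/1 matrix of WT variant presence per haplotype and one VEP score per haplotype. The function name `ridge_effects`, the toy data, and the closed-form solver are illustrative assumptions, not the repository's implementation (see docs/ for the actual methods).

```python
import numpy as np

def ridge_effects(X, y, alpha=1.0):
    """Closed-form ridge regression (hypothetical helper, not from src/).

    X: (n_haplotypes, n_variants) 0/1 matrix of WT variant presence.
    y: (n_haplotypes,) VEP score observed on each haplotype.
    Returns per-variant effect estimates and an unpenalized intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center the data so the intercept is fitted separately and not penalized.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Solve (Xc^T Xc + alpha * I) beta = Xc^T yc.
    n_variants = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_variants), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Toy data (made up): 200 haplotypes, 3 WT variants with known effects.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
true_beta = np.array([0.5, 0.0, -0.3])
y = 0.1 + X @ true_beta + rng.normal(0, 0.01, size=200)

beta, intercept = ridge_effects(X, y, alpha=1e-3)
```

With a small `alpha` and low noise, `beta` should land close to `true_beta`; larger `alpha` values shrink the estimates toward zero, which can help when many variants are in linkage.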

Each model typically has its own conda environment (see Environment creation); the pipeline is designed so you activate the relevant env and run the corresponding notebooks or scripts.

Repository structure

VEP_DNA/
├── conda/                 # Conda environment definitions (.yml)
├── data/                  # Reference data, ClinVar, SpliceVarDB, 1KG metadata
│   ├── 1KG/
│   ├── splicing/
│   └── UTR/
├── docs/                  # Method descriptions (epistasis, etc.)
├── metadata/              # IGSR population/sample metadata
├── manuscript/            # Manuscript PDFs
├── notebooks/             # Jupyter notebooks (see Notebooks section)
├── scripts/               # Standalone scripts (SLURM, visualization, pipelines)
├── src/                   # Python package: pipeline, models, analysis, utils
│   ├── analysis/          # Attribution, matrices
│   ├── benchmark/         # Benchmarking (e.g. ClinVar)
│   ├── SpliceAI/          # SpliceAI-related code and data
│   └── ...                # clinvar, onekg, vep_pipeline, flashzoi, etc.
├── example_usage.py       # Example for genomic_image_cnn
├── genomic_image_cnn.py   # CNN on genomic embeddings (see README_genomic_cnn.md)
├── plot_splice_regions.py # Splicing region diagrams
└── README.md

conda/: One YAML per environment (e.g. spliceai.yml, flashzoi.yml, gvl.yml). See conda/README.md for creation and optional HPC/CUDA notes.
data/: Preprocessed or downloaded inputs (e.g. ClinVar VCF, SpliceVarDB×ClinVar tables, 1KG metadata). Some assets are generated by notebooks; see the corresponding notebooks for download links.
docs/: Methods (e.g. epistasis, linear model for joint effects). See docs/methods_epistasis.md and docs/epistasis_assumptions_comparison.md.
notebooks/: Step-by-step examples for each model, data prep, and analysis (see Notebooks).
scripts/: SLURM submission, haplotype analysis, UMAP/datashader demos, splicing region generation.
src/: Core Python code; run from the repo root so that import src.* works (the notebooks call os.chdir to switch to the repo root).
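
The working-directory convention for src/ can be sketched as follows. This is a minimal illustration, assuming the repo root is the first parent directory that contains a src/ folder; the helper names `find_repo_root` and `chdir_to_repo_root` are hypothetical and are not taken from the notebooks.

```python
import os
from pathlib import Path

def find_repo_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No parent of {current} contains a '{marker}' directory"
    )

def chdir_to_repo_root(start=None):
    """Switch the process working directory to the repo root and return it."""
    root = find_repo_root(start or Path.cwd())
    os.chdir(root)
    return root
```

Calling `chdir_to_repo_root()` at the top of a notebook stored in notebooks/ would then make `import src.vep_pipeline` resolvable, provided the layout matches the tree shown above.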

Getting started

Clone the repo (and ensure you have conda/mamba available).
Create an environment for the model you want to use (see Environment creation).
Install GenVarLoader if you will use 1000 Genomes or other GVL datasets (see conda/README.md; the pipeline notebooks use genvarloader).
Run from the repo root: Jupyter notebooks assume the working directory is the repository root so that import src.vep_pipeline, src.onekg, etc. work. The notebooks set this automatically on first run.
Pick a notebook from the table below that matches your goal (e.g. pipeline run, attribution, or data download).

Minimal example (after activating an env with dependencies):

cd /path/to/VEP_DNA
conda activate flashzoi   # or spliceai, evo2, etc.
jupyter notebook notebooks/vep_dna_pipeline.ipynb

For the genomic image CNN (embeddings-as-images), see README_genomic_cnn.md and example_usage.py.

Environment creation

Conda environment files live in conda/. Create and activate an environment with:

conda env create -f conda/<name>.yml
conda activate <name>

| Environment file | Purpose |
| --- | --- |
| `conda.yml` | Base/general (pandas, numpy, torch, plotting). |
| `spliceai.yml` | SpliceAI + GenVarLoader. |
| `flashzoi.yml` | Flashzoi/Borzoi. |
| `decima.yml` | DECIMA. |
| `evo2.yml` | Evo2. |
| `dnabert.yml` | DNABERT. |
| `gvl.yml` | GenVarLoader-focused. |
| `xarray.yml` | xarray/zarr for array-backed results. |
| `multimolecule.yml` | Multimolecule/splice-related stack. |

On HPC with CUDA via EasyBuild, you may need to load CUDA before creating envs, e.g. module load EBModules CUDA/11.7.0 (see conda/README.md).
GenVarLoader can be installed with pip install git+https://github.com/mcvickerlab/GenVarLoader.git; some envs already include it.
For GenVarLoader development (Pixi), see the “Installation GenVarLoader” section in conda/README.md.
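
To see which environments are available and the command that would create each one, a small helper along these lines may be convenient. It is a sketch, not part of the repository: `conda_create_commands` is a hypothetical name, and it assumes each environment's name matches its file stem, as the `conda activate <name>` pattern above suggests.

```python
from pathlib import Path

def conda_create_commands(conda_dir="conda"):
    """Map each environment file stem to its `conda env create` command.

    Only *.yml files directly inside `conda_dir` are considered.
    """
    commands = {}
    for yml in sorted(Path(conda_dir).glob("*.yml")):
        commands[yml.stem] = f"conda env create -f {yml.as_posix()}"
    return commands

# Example use from the repo root:
# for name, cmd in conda_create_commands().items():
#     print(f"{name}: {cmd}")
```

The helper only builds command strings; running them is left to the shell so that HPC module loading can happen first.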

Notebooks

Notebooks are in notebooks/. Run them from the repository root (they set the working directory automatically). The table below lists each notebook and summarizes its purpose.

| Notebook | Description |
| --- | --- |
| vep_dna_pipeline.ipynb | Core pipeline: run VEP models (with GVL) on sites; install notes for GenVarLoader and xarray; uses Geuvadis chr22 example. |
| vep_analysis.ipynb | VEP analysis: import and analyze non-null VEP results; integrate with ClinVar, onekg, tskit, and benchmarking. |
| vep_population_analysis.ipynb | Population-specific effects: pathogenic UTR variants and population structure. |
| variant_attribution_SpliceAI.ipynb | SpliceAI attribution: haplotype × WT variant matrix, VEP scores, Ridge-based joint effects, epistasis (within-model vs across-model); splicing region definitions. |
| variant_attribution_Flashzoi.ipynb | Flashzoi attribution: variant attribution workflow for Borzoi/Flashzoi (logits, PCA, cosine similarity, batching). |
| flashzoi.ipynb | Flashzoi/Borzoi: run model, PCA on track deltas, cosine similarity (logits and PCA), batch processing, track metadata. |
| DECIMA.ipynb | DECIMA: DECIMA VEP and GenVarLoader tracks; links to DECIMA API and tutorials. |
| Evo2.ipynb | Evo2: environment setup and usage for Evo2. |
| GVL.ipynb | GenVarLoader: load and work with GVL datasets (e.g. 1000 Genomes). |
| data_downloaders.ipynb | Data download: 1000 Genomes collections, FTP/manifest, VCF discovery via `onekg`. |
| splicevardb.ipynb | SpliceVarDB (lowercase): download and explore SpliceVarDB data. |
| SpliceVarDB.ipynb | SpliceVarDB (capitalized): SpliceVarDB × ClinVar overlap; download links for SpliceVarDB, ClinVar VCF, GENCODE exons; output `splicevardb_x_clinvar_*`. |
| UTRVar.ipynb | UTR variants: ClinVar UTR SNVs; load/filter/format with `src.clinvar` for downstream (e.g. GVL). |
| QTL_annotations.ipynb | QTL annotations: compare QTL effect sizes with REF vs non-REF VEPs; tissue matching and overlap caveats. |
| phyloGPN.ipynb | phyloGPN: example DNA sequences and phyloGPN (GPN) model/tokenizer usage. |
| datashader.ipynb | Datashader: large-scale visualization with Datashader/Bokeh/Panel; env setup in first cell. |
| COVR_test.ipynb | COVR / environment test: GPU and path setup for a COVR-related run. |
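
Several of the Flashzoi notebooks compare variants by the cosine similarity of their track deltas. The sketch below illustrates that comparison on toy numbers. It assumes a "track delta" is the element-wise difference between alternate and reference predictions, which may differ from the exact definition used in the notebooks; `track_delta` and `cosine_similarity` are hypothetical helper names, and real outputs would typically be NumPy or torch arrays rather than lists.

```python
import math

def track_delta(alt, ref):
    """Element-wise difference between alternate and reference predictions."""
    if len(alt) != len(ref):
        raise ValueError("alt and ref must have the same length")
    return [a - r for a, r in zip(alt, ref)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (a convention chosen here).
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy predictions over three tracks (made-up numbers).
ref = [1.0, 2.0, 3.0]
delta_a = track_delta([2.0, 2.0, 3.0], ref)  # changes track 0 only
delta_b = track_delta([3.0, 2.0, 3.0], ref)  # same direction, larger size
delta_c = track_delta([1.0, 3.0, 3.0], ref)  # changes track 1 only

same_direction = cosine_similarity(delta_a, delta_b)
orthogonal = cosine_similarity(delta_a, delta_c)
```

Because cosine similarity ignores magnitude, `delta_a` and `delta_b` count as identical in direction even though one effect is twice as large; pairing it with a magnitude measure avoids losing that information.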

Additional resources

Methods and epistasis: docs/methods_epistasis.md, docs/epistasis_assumptions_comparison.md.
Genomic Image CNN: README_genomic_cnn.md, genomic_image_cnn.py, example_usage.py.
Conda and GVL: conda/README.md.
SpliceAI subpackage: src/SpliceAI/.
Tree collections (GVL): src/README_tree_collections.md.

Data and external links

1000 Genomes: data portal; manifests and VCFs via onekg (see data_downloaders.ipynb).
ClinVar: GRCh38 VCF.
SpliceVarDB: download and processing in SpliceVarDB.ipynb.
GenVarLoader: GitHub (https://github.com/mcvickerlab/GenVarLoader).
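
As an illustration of working with the ClinVar VCF listed above, the following sketch selects single-nucleotide records whose CLNSIG annotation includes a pathogenic or likely pathogenic classification. It uses only the standard library and does not reproduce `src.clinvar`; the function names and the toy records are hypothetical.

```python
import re

PATHOGENIC_TERMS = {"Pathogenic", "Likely_pathogenic"}

def parse_info(info):
    """Parse a VCF INFO string into a dict (flag entries map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def pathogenic_snvs(vcf_lines):
    """Yield (chrom, pos, ref, alt) for SNVs classified as (likely) pathogenic."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt, _qual, _filt, info = (
            line.rstrip("\n").split("\t")[:8]
        )
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue  # keep single-nucleotide substitutions only
        clnsig = str(parse_info(info).get("CLNSIG", ""))
        # Compare whole terms so that "Conflicting_..._of_pathogenicity"
        # is not picked up by a substring match.
        terms = set(re.split(r"[/|,]", clnsig))
        if terms & PATHOGENIC_TERMS:
            yield chrom, int(pos), ref, alt

# Toy records (made up) in VCF column order.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\t.\tCLNSIG=Pathogenic;CLNVC=single_nucleotide_variant",
    "1\t200\t.\tC\tT\t.\t.\tCLNSIG=Benign",
    "1\t300\t.\tAT\tA\t.\t.\tCLNSIG=Pathogenic",
    "2\t400\t.\tG\tA\t.\t.\tCLNSIG=Conflicting_classifications_of_pathogenicity",
    "2\t500\t.\tT\tC\t.\t.\tCLNSIG=Pathogenic/Likely_pathogenic",
]
selected = list(pathogenic_snvs(toy_vcf))
```

For a real file, the same generator can be fed from `gzip.open(path, "rt")`, since the ClinVar VCF is distributed compressed.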

If you use this repository, please cite the relevant model papers (SpliceAI, Borzoi/Flashzoi, DECIMA, Evo2, etc.) and GenVarLoader as appropriate.
