bschilder/VEP_DNA

VEP_DNA

Using DNA sequence models to compute Variant Effect Predictions (VEPs) across biobank-scale populations.


Overview

This repository provides a pipeline and analysis toolkit for running DNA-based variant effect predictors (e.g., SpliceAI, Flashzoi/Borzoi, DECIMA, Evo2) on population-scale genomic data. It integrates with GenVarLoader (GVL) to load haplotypes and reference data, runs VEP models on clinical variant sites (e.g., ClinVar), and supports downstream analysis such as variant attribution, epistasis testing, and population-specific effects.
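As a rough illustration of what "running a VEP model on a variant site" means, the sketch below scores a reference window and the same window carrying a single-nucleotide variant, then takes the difference. It is not the repository's pipeline: `apply_snv`, `variant_effect`, and the toy `gc_fraction` scorer are hypothetical names introduced here, and a real run would substitute a model such as SpliceAI or Flashzoi and haplotype sequences loaded through GVL.

```python
from typing import Callable


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with a single-nucleotide variant applied at 0-based pos."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def gc_fraction(seq: str) -> float:
    """Toy stand-in for a sequence model (assumption): fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


def variant_effect(
    ref_seq: str,
    pos: int,
    ref: str,
    alt: str,
    model: Callable[[str], float] = gc_fraction,
) -> float:
    """Score the variant as model(alternate sequence) - model(reference sequence)."""
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    return model(alt_seq) - model(ref_seq)


window = "ACGTACGTAC"  # made-up 10 bp window, not real genomic sequence
effect = variant_effect(window, pos=0, ref="A", alt="G")
print(f"toy variant effect: {effect:+.2f}")
```

Scoring the same variant on different haplotype backgrounds only requires passing a different `ref_seq`, which is the reason for keeping the sequence an explicit argument.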

Key capabilities:

  • Multi-model VEP pipeline: Run SpliceAI, Flashzoi, DECIMA, Evo2, and related models via a unified interface (src/vep_pipeline.py).
  • Population-scale data: Use 1000 Genomes and similar cohorts via GVL; scripts and notebooks cover data download and metadata.
  • Variant attribution & epistasis: Relate wild-type (WT) variants to clinical VEP scores (e.g., Ridge regression, epistasis tests); methods are documented in docs/.
  • Splicing & UTR resources: SpliceVarDB integration, splicing region definitions, and ClinVar UTR variant workflows.
  • Visualization & analysis: UMAP, Datashader, and custom plotting for VEP outputs and population structure.
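The attribution idea in the list above (relating the variants carried by each haplotype to a VEP score through Ridge regression, then checking for epistasis) can be sketched with NumPy alone. This is a minimal sketch on synthetic data, not the method documented in docs/: the closed-form `ridge_fit` helper, the simulated effect sizes, and the single product-term interaction check are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centre features so the intercept is not shrunk
    yc = y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic haplotype x variant matrix: 1 if the haplotype carries the variant.
rng = np.random.default_rng(0)
n_haplotypes, n_variants = 200, 5
X = rng.integers(0, 2, size=(n_haplotypes, n_variants))

# Simulated additive effects on the score (assumed values, no true interaction).
true_effects = np.array([0.5, 0.0, -0.3, 0.0, 0.1])
scores = X @ true_effects + rng.normal(0.0, 0.01, size=n_haplotypes)

# Joint (additive) attribution of the score to each variant.
coef, intercept = ridge_fit(X, scores, alpha=1.0)

# Crude epistasis check: add a product column for variants 0 and 2 and
# inspect its coefficient; a value near zero suggests additivity.
pair = (X[:, 0] * X[:, 2])[:, None]
coef_with_pair, _ = ridge_fit(np.hstack([X, pair]), scores, alpha=1.0)
interaction = coef_with_pair[-1]

print(np.round(coef, 2), round(float(interaction), 3))
```

On real data the matrix would come from the haplotypes in the cohort and the scores from a model run; the regularisation strength `alpha` is a tuning choice rather than a fixed value.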

Each model typically has its own conda environment (see Environment creation); the pipeline is designed so you activate the relevant env and run the corresponding notebooks or scripts.


Repository structure

VEP_DNA/
├── conda/                 # Conda environment definitions (.yml)
├── data/                  # Reference data, ClinVar, SpliceVarDB, 1KG metadata
│   ├── 1KG/
│   ├── splicing/
│   └── UTR/
├── docs/                  # Method descriptions (epistasis, etc.)
├── metadata/              # IGSR population/sample metadata
├── manuscript/            # Manuscript PDFs
├── notebooks/             # Jupyter notebooks (see Notebooks section)
├── scripts/               # Standalone scripts (SLURM, visualization, pipelines)
├── src/                   # Python package: pipeline, models, analysis, utils
│   ├── analysis/          # Attribution, matrices
│   ├── benchmark/         # Benchmarking (e.g. ClinVar)
│   ├── SpliceAI/          # SpliceAI-related code and data
│   └── ...                # clinvar, onekg, vep_pipeline, flashzoi, etc.
├── example_usage.py       # Example for genomic_image_cnn
├── genomic_image_cnn.py   # CNN on genomic embeddings (see README_genomic_cnn.md)
├── plot_splice_regions.py # Splicing region diagrams
└── README.md

  • conda/: One YAML per environment (e.g. spliceai.yml, flashzoi.yml, gvl.yml). See conda/README.md for creation and optional HPC/CUDA notes.
  • data/: Preprocessed or downloaded inputs (e.g. ClinVar VCF, SpliceVarDB×ClinVar tables, 1KG metadata). Some assets are generated by notebooks; see notebook docs for download links.
  • docs/: Methods (e.g. epistasis, linear model for joint effects). See docs/methods_epistasis.md and docs/epistasis_assumptions_comparison.md.
  • notebooks/: Step-by-step examples for each model, data prep, and analysis (see Notebooks).
  • scripts/: SLURM submission, haplotype analysis, UMAP/datashader demos, splicing region generation.
  • src/: Core Python code; run from the repo root so that import src.* works (the notebooks call os.chdir to switch to the repo root).
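For scripts launched from a subdirectory, the working-directory requirement described for src/ can be met with a small helper. The sketch below is one possible approach and is not taken from the notebooks: `find_repo_root` and `enter_repo_root` are hypothetical names, and using the presence of a src/ directory as the root marker is an assumption.

```python
import os
import sys
import tempfile
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a {marker}/ directory")


def enter_repo_root(start: Path) -> Path:
    """Change into the repo root and make `import src.*` resolvable."""
    root = find_repo_root(start)
    os.chdir(root)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Demo on a throwaway directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp).resolve() / "VEP_DNA"
    (fake_root / "src").mkdir(parents=True)
    (fake_root / "notebooks").mkdir()
    found = find_repo_root(fake_root / "notebooks")
    found_is_root = found == fake_root

print(found.name, found_is_root)
```

In a notebook or script, calling `enter_repo_root(Path.cwd())` before the first `import src...` line would have the same effect as starting from the root.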

Getting started

  1. Clone the repo (and ensure you have conda/mamba available).
  2. Create an environment for the model you want to use (see Environment creation).
  3. Install GenVarLoader if you will use 1000 Genomes or other GVL datasets (see conda/README.md; the pipeline notebooks use genvarloader).
  4. Run from the repo root: the notebooks import src.vep_pipeline, src.onekg, etc., and these imports resolve only when the working directory is the repository root. The notebooks set the working directory themselves when first run.
  5. Pick a notebook from the table below that matches your goal (e.g. pipeline run, attribution, or data download).

Minimal example (after activating an env with dependencies):

cd /path/to/VEP_DNA
conda activate flashzoi   # or spliceai, evo2, etc.
jupyter notebook notebooks/vep_dna_pipeline.ipynb

For the genomic image CNN (embeddings-as-images), see README_genomic_cnn.md and example_usage.py.


Environment creation

Conda environment files live in conda/. Create and activate an environment with:

conda env create -f conda/<name>.yml
conda activate <name>

| Environment file | Purpose |
| --- | --- |
| conda.yml | Base/general (pandas, numpy, torch, plotting). |
| spliceai.yml | SpliceAI + GenVarLoader. |
| flashzoi.yml | Flashzoi/Borzoi. |
| decima.yml | DECIMA. |
| evo2.yml | Evo2. |
| dnabert.yml | DNABERT. |
| gvl.yml | GenVarLoader-focused. |
| xarray.yml | xarray/zarr for array-backed results. |
| multimolecule.yml | Multimolecule/splice-related stack. |

  • On HPC with CUDA via EasyBuild, you may need to load CUDA before creating envs, e.g. module load EBModules CUDA/11.7.0 (see conda/README.md).
  • GenVarLoader can be installed with pip install git+https://github.com/mcvickerlab/GenVarLoader.git; some envs already include it.
  • For GenVarLoader development (Pixi), see the “Installation GenVarLoader” section in conda/README.md.
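When several environments are needed, the create command shown above can be generated per environment file. The helper below is a small sketch: `ENV_NAMES` simply mirrors the files listed in this section, `conda_create_command` is a hypothetical name, and the commands are only printed here rather than executed.

```python
import shlex
from pathlib import Path

# Environment names taken from the file list in this section.
ENV_NAMES = (
    "conda", "spliceai", "flashzoi", "decima", "evo2",
    "dnabert", "gvl", "xarray", "multimolecule",
)


def conda_create_command(name: str, conda_dir: str = "conda") -> list[str]:
    """Build the `conda env create -f conda/<name>.yml` argument list."""
    if name not in ENV_NAMES:
        raise ValueError(f"unknown environment {name!r}; expected one of {ENV_NAMES}")
    env_file = Path(conda_dir) / f"{name}.yml"
    return ["conda", "env", "create", "-f", env_file.as_posix()]


cmd = conda_create_command("spliceai")
print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would run the command; that step is left out so the sketch has no side effects.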

Notebooks

Notebooks are in notebooks/. Run them from the repository root (they set the working directory automatically). The following table lists each notebook and summarizes its purpose.

| Notebook | Description |
| --- | --- |
| vep_dna_pipeline.ipynb | Core pipeline: run VEP models (with GVL) on sites; install notes for GenVarLoader and xarray; uses Geuvadis chr22 example. |
| vep_analysis.ipynb | VEP analysis: import and analyze non-null VEP results; integrate with ClinVar, onekg, tskit, and benchmarking. |
| vep_population_analysis.ipynb | Population-specific effects: pathogenic UTR variants and population structure. |
| variant_attribution_SpliceAI.ipynb | SpliceAI attribution: haplotype × WT variant matrix, VEP scores, Ridge-based joint effects, epistasis (within-model vs across-model); splicing region definitions. |
| variant_attribution_Flashzoi.ipynb | Flashzoi attribution: variant attribution workflow for Borzoi/Flashzoi (logits, PCA, cosine similarity, batching). |
| flashzoi.ipynb | Flashzoi/Borzoi: run model, PCA on track deltas, cosine similarity (logits and PCA), batch processing, track metadata. |
| DECIMA.ipynb | DECIMA: DECIMA VEP and GenVarLoader tracks; links to DECIMA API and tutorials. |
| Evo2.ipynb | Evo2: environment setup and usage for Evo2. |
| GVL.ipynb | GenVarLoader: load and work with GVL datasets (e.g. 1000 Genomes). |
| data_downloaders.ipynb | Data download: 1000 Genomes collections, FTP/manifest, VCF discovery via onekg. |
| splicevardb.ipynb | SpliceVarDB (lowercase): download and explore SpliceVarDB data. |
| SpliceVarDB.ipynb | SpliceVarDB (capitalized): SpliceVarDB × ClinVar overlap; download links for SpliceVarDB, ClinVar VCF, GENCODE exons; output splicevardb_x_clinvar_*. |
| UTRVar.ipynb | UTR variants: ClinVar UTR SNVs; load/filter/format with src.clinvar for downstream (e.g. GVL). |
| QTL_annotations.ipynb | QTL annotations: compare QTL effect sizes with REF vs non-REF VEPs; tissue matching and overlap caveats. |
| phyloGPN.ipynb | phyloGPN: example DNA sequences and phyloGPN (GPN) model/tokenizer usage. |
| datashader.ipynb | Datashader: large-scale visualization with Datashader/Bokeh/Panel; env setup in first cell. |
| COVR_test.ipynb | COVR / environment test: GPU and path setup for a COVR-related run. |
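
The Flashzoi notebooks are described as comparing reference and alternate predictions through track deltas, cosine similarity, and PCA. The sketch below shows those three operations on random arrays so the shapes are easy to follow. It is an illustration under stated assumptions, not code from the notebooks: the array sizes, the helper names `cosine_similarity` and `pca_project`, and the simulated logits are all made up.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two flattened vectors (0.0 if either is zero)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def pca_project(matrix, n_components: int = 2):
    """Project rows onto the top principal components via SVD of the centred matrix."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Simulated per-variant model outputs: rows are variants, columns are tracks.
rng = np.random.default_rng(0)
n_variants, n_tracks = 8, 16
ref_logits = rng.normal(size=(n_variants, n_tracks))
alt_logits = ref_logits + rng.normal(scale=0.1, size=(n_variants, n_tracks))

deltas = alt_logits - ref_logits                      # track deltas (ALT - REF)
sim = cosine_similarity(ref_logits[0], alt_logits[0])  # similarity for variant 0
components = pca_project(deltas, n_components=2)       # 2-D embedding of the deltas

print(components.shape, round(sim, 3))
```

A cosine similarity close to 1 means the variant barely changes the predicted track profile; the PCA coordinates give a low-dimensional view for comparing variants or haplotypes.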

Additional resources


Data and external links

If you use this repository, please cite the relevant model papers (SpliceAI, Borzoi/Flashzoi, DECIMA, Evo2, etc.) and GenVarLoader as appropriate.
