To run VIPERA, an environment with Snakemake version 7.19 or later is needed (see the Snakemake docs for setup instructions).
This guide provides command-line instructions for running VIPERA with Snakemake versions prior to 8. All configuration parameters are fully cross-compatible. The original publication used Snakemake 7.32, but newer versions can also be used with only minor changes. For details, see the Snakemake migration guide. For example, existing profiles are cross-compatible as well, but note that the
--use-condaflag is deprecated starting with Snakemake 8. Instead, use--software-deployment-method conda.
- Needs:
- Snakemake 7.19 or later.
- One FASTA per sample.
- One BAM per sample.
- One metadata CSV with columns
ID,CollectionDate(YYYY-MM-DD),ResidenceCityandGISAIDEPI(can be empty).
- Setup: edit targets.yaml (set
SAMPLESandMETADATA, at least) or build it usingbuild_targets.py. LeaveCONTEXT_FASTA: nullto auto-download from GISAID (needsconfig/gisaid.yamlwith your username and password) or set a local FASTA path if download fails (see GISAID disclaimer). - Run:
snakemake (--use-conda | --sdm conda) -c4.
The workflow requires a set of FASTA files (one per target sample), a corresponding set of BAM files (also one per target sample), and a metadata table in CSV format with one row per sample. The metadata must include at least the following columns:
ID: unique sample identifier, used to match sequencing files with metadata.CollectionDate: the date the sample was collected (YYYY-MM-DD).ResidenceCity: the location where the sample was collected.GISAIDEPI: GISAID accession identifier (EPI_ISL_...or empty).
The path to these input files is set in two configuration files in YAML format:
config.yaml (for general workflow settings) and
targets.yaml (for specific dataset-related settings).
The latter must be modified by the user to point the SAMPLES and METADATA
parameters to your data. The OUTPUT_DIRECTORY parameter should point to your
desired results directory.
The script build_targets.py simplifies the process of creating
the targets configuration file. To run this script, you need to have PyYAML installed. It
takes a list of sample names, a directory with BAM and FASTA files, the path to
the metadata table and the name of your dataset as required inputs. Then, it searches the
directory for files that have the appropriate extensions and sample names and adds them
to the configuration file.
An example file could look like this:
OUTPUT_NAME:
"your-dataset-name"
SAMPLES:
sample1:
bam: "path/to/sorted/bam1.bam"
fasta: "path/to/sequence1.fasta"
sample2:
bam: "path/to/sorted/bam2.bam"
fasta: "path/to/sequence2.fasta"
...
METADATA:
"path/to/metadata.csv"
OUTPUT_DIRECTORY:
"output"
CONTEXT_FASTA:
null
MAPPING_REFERENCES_FASTA:
nullThis information may also be provided through the --config parameter.
Setting the CONTEXT_FASTA parameter to null (default) will enable
the automatic download of sequences from the GISAID EpiCoV SARS-CoV-2 database.
An unset parameter has the same effect.
To enable this, you must also sign up to the GISAID platform
and provide your credentials by creating and filling an additional configuration
file (default: config/gisaid.yaml) as follows:
USERNAME: "your-username"
PASSWORD: "your-password"A set of samples that meet the spatial, temporal and phylogenetic criteria
set through the download_context rule
will be retrieved automatically from GISAID. These criteria are:
- Location matching the place(s) of sampling of the target samples
- Collection date within the time window that includes 95% of the date distribution of the target samples (2.5% is trimmed at each end to account for extreme values) ± 2 weeks
- Pango lineage matching that of the target samples
Then, a series of checkpoint steps are executed for quality assurance:
- Remove context samples whose GISAID ID match any of the target samples
- Enforce a minimum number of samples to have at least as many possible combinations as random subsample replicates for the diversity assessment (set in config.yaml)
The workflow will continue its execution until completion if the obtained
context dataset passes these checkpoints. Otherwise, the execution will be
terminated and, to continue the analysis, an external context dataset must
be provided through the CONTEXT_FASTA parameter. This can be done
by editing targets.yaml or via the command line:
snakemake --config CONTEXT_FASTA="path/to/fasta"Important
The GISAID EpiCoV database is proprietary and not openly accessible. For details, refer
to the GISAID Terms of Use.
VIPERA uses GISAIDR to automate access to GISAID data.
However, this access can be unstable or occasionally fail due to changes in the platform.
Possible workarounds are documented in the GISAIDR repository (e.g. issues
#55 and #58).
If programmatic access fails, a suitable context dataset must be manually provided by setting the
CONTEXT_FASTA parameter to the path of a FASTA file.
As a last resort, some of the analyses can be allowed to run even if context-dependent rules
fail by passing the --keep-going flag to Snakemake.
To replicate our work, the automatic context dataset is available via
DOI: 10.55876/gis8.250718er (EPI_SET_250718er).
Read more about EPI_SETs here.
Setting MAPPING_REFERENCES_FASTA to null (default) will enable the automatic download of the
reference sequence(s) that were used to map the reads and generate the BAM files.
An unset parameter has the same effect.
If the required sequence is not available publically or the user already has it
at your disposal, it can be provided manually by setting the parameter to the
path of the reference FASTA file.
All of the following variables are pre-defined in config.yaml:
ALIGNMENT_REFERENCE: NCBI accession number of the reference record for sequence alignment.PROBLEMATIC_VCF: URL or path of a VCF file containing problematic genome positions for masking.GENETIC_CODE_JSON: path of a JSON file containing a genetic code for gene translation.TREE_MODEL: substitution model used by IQTREE (see docs).UFBOOT&SHALRT: settings for ultrafast bootstrap (see UFBoot) and Shimodaira–Hasegawa approximate likelihood ratio test bootstrap (see SH-aLRT) in IQTREE runs:REPS: number of replicates.THRESHOLD: value cutoff for visualization.
VC: variant calling configuration:MIN_QUALITY: minimum base quality forivar variants(option-q).MIN_FREQ: minimum frequency threshold forivar variants(option-t).MIN_DEPTH: minimum read depth forivar variants(option-m).MAX_DEPTH: maximum read depth forsamtools mpileup(option-d).
DEMIX: demixing configuration (uses Freyja, see also its docs):PATHOGEN: pathogen of interest forfreyja update(option--pathogen); must be 'SARS-CoV-2'.MIN_QUALITY: minimum quality forfreyja variants(option--minq).MAX_DEPTH: maximum read depth forsamtools mpileup(option-d).COV_CUTOFF: minimum depth to calculate the reported "coverage" (percent of sites with that depth) forfreyja demix(option--covcut).MIN_ABUNDANCE: minimum lineage estimated abundance forfreyja demix(option--eps).CONFIRMED_ONLY: exclude unconfirmed lineages infreyja demix(option--confirmedonly).DEPTH_CUTOFF: minimum depth on each site forfreyja demix(option--depthcutoff).RELAXED_MRCA: assign clusters using relaxed (as opposed to strict) MRCA, used withDEPTH_CUTOFF, forfreyja demix(option--relaxedmrca).RELAXED_MRCA_THRESH:RELAXED_MRCAthreshold forfreyja demix(option--relaxedthresh).AUTO_ADAPT: use error profile to set adaptive lasso penalty parameter forfreyja demix(option--autoadapt).
WINDOW: sliding window of nucleotide variants per site configuration:WIDTH: number of sites within windows.STEP: number of sites between windows.
GB_FEATURES: optional mapping to filter which features from the GenBank file are used by some analyses (e.g. ruleswindowandn_s_sites). IfGB_FEATURESis empty or unset, all features are used. Filtering is applied in order:INCLUDE: mapping of qualifier names to sequences of values. If present, only features that match at least one key/value pair inINCLUDEare included in the analyses. For example, havingINCLUDE: {gene: [S, N]}keeps features whosegenequalifier equalsSorN.EXCLUDE: mapping of qualifier names to sequences of values. AfterINCLUDEis applied, any feature that matches any key/value pair inEXCLUDEis omitted. For example, havingEXCLUDE: {gene: [S, N]}removes features whosegenequalifier equalsSorN.
ANNOTATION: settings for variant annotation and functional effect prediction using SnpEff, which usesALIGNMENT_REFERENCEfor selecting the database to annotate.SNPEFF_COLS: mapping of column names (which appear in result tables) to VCF fields (extracted after annotation with SnpSift). Some columns are hard-coded in the code, so removing them is not advised. Additional columns can be added as needed.FILTER_INCLUDE&FILTER_EXCLUDE: mapping of column names (fromSNPEFF_COLS) to lists of values used for filtering the annotated variants table.FILTER_INCLUDEis applied first, thenFILTER_EXCLUDE.FILTER_INCLUDE: keeps variants that match at least one listed value.FILTER_EXCLUDE: removes variant that matches any listed value.
VARIANT_NAME_PATTERN: string template used to build the variant name shown in the results table (columnVARIANT_NAME). The template is interpreted with glue and can use any column name fromSNPEFF_COLSand some R functions. For example,"{GENE}:{coalesce(HGVS_P, HGVS_C)}"creates names likeS:p.D614G(usingHGVS_Pwhen available, otherwiseHGVS_C).
GISAID: automatic context download configuration.CREDENTIALS: path of the GISAID credentials in YAML format.DATE_COLUMN: name of the column that contains sampling dates (YYYY-MM-DD) in the input target metadata.LOCATION_COLUMN: name of the column that contains sampling locations (e.g. city names) in the input target metadata.ACCESSION_COLUMN: name of the column that contains GISAID EPI identifiers in the input target metadata.
DIVERSITY_REPS: number of random sample subsets of the context dataset for the nucleotide diversity comparison.USE_BIONJ: use the BIONJ algorithm (Gascuel, 1997) instead of NJ (neighbor-joining; Saitou & Nei, 1987) to reconstruct phylogenetic trees from pairwise distances.COR: configuration for correlation analyses of allele frequency data over time and between variants. This parameter controls how correlation tests are performed using R'scor.testandcorfunctions (see R documentation).METHOD: correlation method to use. Valid options are "pearson" (default), "kendall", or "spearman".EXACT: boolean flag indicating whether to compute an exact p-value when possible. This option applies only to certain methods and may be set tonull(default) to let R decide automatically.
LOG_PY_FMT: logging format string for Python scripts.PLOTS: path of the R script that sets the design and style of data visualizations.PLOT_GENOME_REGIONS: path of a CSV file containing genome regions, e.g. SARS-CoV-2 non-structural protein (NSP) coordinates, for data visualization (columns:region,start,end).REPORT_QMD: path of the report template in Quarto markdown (QMD) format.REPORT_CSS: path of the report stylesheet definition in CSS format.
Snakemake enables easy visualization of workflows and rule relationships. The --rulegraph option outputs a DOT file that describes dependencies between rules. The example below produces an image using Graphviz:
snakemake --forceall --rulegraph | dot -Grankdir=LR -Tsvg >.rulegraph.svgThe same graph can also be rendered with other tools such as snakevision (v0.1.0).
snakemake --forceall --rulegraph | snakevision -s all -o .rulegraph_sv.svgThe --dag option emits an directed acyclic graph (DAG) that corresponds to the rule instances that would be executed for the target dataset. The example below produces an image using Graphviz (for data from our study):
snakemake --forceall --dag | dot -Grankdir=LR -Tsvg >.dag.svgTo run the analysis with the default configuration, run the following command
(change the -c/--cores argument to use a different number of CPUs):
snakemake --use-conda -c4To run the analysis in an HPC environment using SLURM, we provide a default SLURM profile configuration as an example that should be modified to fit your needs. Read more about Snakemake profiles here. To use the profile, install the Snakemake executor plugin for SLURM and run one of the following commands:
snakemake --slurm --profile profile/slurm # Snakemake v7
snakemake --profile profile/slurm # Snakemake v8+Additionally, we offer the option of running the workflow within a containerized
environment using a pre-built Docker image,
provided that Apptainer/Singularity
is available on the system. This eliminates the need for further conda package
downloads and environment configuration.
To do that, simply add the option --use-apptainer to any of the previous commands.
Using Apptainer for running VIPERA in the Windows Subsystem for Linux (WSL)
may encounter errors due to the default file permissions configuration, which
conflicts with Snakemake's containerized conda environment activation mechanism.
Thus, running the containerized VIPERA workflow on the WSL is not advised.
Additionally, certain known issues arise when utilizing non-default temporary
directories and Snakemake shadow directories. To address this issue, use the
default temporary directory (e.g. export TMPDIR=/tmp in Linux machines) and
specify the shadow prefix (--shadow-prefix /tmp) before executing the containerized workflow.