Commit 60d4bca

Merge pull request #1178: Optimize rule combine_samples
2 parents 137ffc4 + 39a781b commit 60d4bca

8 files changed

Lines changed: 14 additions & 6 deletions

docs/src/reference/change_log.md

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,8 @@ We also use this change log to document new features that maintain backward comp
 
 ## New features since last version update
 
+- 29 July 2025: Improved performance of calls to `augur filter`. This requires a minimum Augur version of 31.3.0. [PR 1178](https://github.com/nextstrain/ncov/pull/1178)
+
 ## v17 (17 July 2025)
 
 - 17 July 2025: Snakemake version 8 (or above) is now required. Various aspects of the workflow were incompatible with v8 including our support for remote files and have now been updated. The nextstrain runtimes have been correspondingly updated; see [the nextstrain-cli docs for how to upgrade these](https://docs.nextstrain.org/projects/cli/en/stable/commands/update/). [PR 1180](https://github.com/nextstrain/ncov/pull/1180)

nextstrain_profiles/100k/config-gisaid.yaml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ inputs:
     metadata: "s3://nextstrain-ncov-private/metadata.tsv.zst"
     aligned: "s3://nextstrain-ncov-private/sequences.fasta.zst"
     skip_sanitize_metadata: true
+    deduplicated: true
 
 builds:
   100k:

nextstrain_profiles/100k/config-open.yaml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ inputs:
     metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.zst"
     aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.zst"
     skip_sanitize_metadata: true
+    deduplicated: true
 builds:
   100k:
     subsampling_scheme: 100k_scheme

nextstrain_profiles/nextstrain-gisaid-21L/builds.yaml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ inputs:
     metadata: "results/gisaid_21L_metadata.tsv.zst"
     aligned: "results/gisaid_21L_aligned.fasta.zst"
     skip_sanitize_metadata: true
+    deduplicated: true
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.

nextstrain_profiles/nextstrain-gisaid/builds.yaml

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ inputs:
     metadata: "s3://nextstrain-ncov-private/metadata.tsv.zst"
     aligned: "s3://nextstrain-ncov-private/aligned.fasta.zst"
     skip_sanitize_metadata: true
+    deduplicated: true
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.

nextstrain_profiles/nextstrain-open/builds.yaml

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ inputs:
     metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.zst"
     aligned: "s3://nextstrain-data/files/ncov/open/aligned.fasta.zst"
     skip_sanitize_metadata: true
+    deduplicated: true
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.

workflow/schemas/config.schema.yaml

Lines changed: 2 additions & 0 deletions
@@ -29,6 +29,8 @@ properties:
           minLength: 1
         skip_sanitize_metadata:
           type: boolean
+        deduplicated:
+          type: boolean
       additionalProperties: false
 
   builds:

workflow/snakemake_rules/main_workflow.smk

Lines changed: 5 additions & 6 deletions
@@ -424,27 +424,26 @@ def _get_subsampled_files(wildcards):
     ]
 
 rule combine_samples:
-    message:
-        """
-        Combine and deduplicate FASTAs
-        """
     input:
         sequences=_get_unified_alignment,
         metadata=_get_unified_metadata,
         include=_get_subsampled_files,
     output:
         sequences = "results/{build_name}/{build_name}_subsampled_sequences.fasta.xz",
         metadata = "results/{build_name}/{build_name}_subsampled_metadata.tsv.xz"
+    params:
+        skip_checks = lambda w: "--skip-checks" if all(input_record.get("deduplicated", False) for input_record in config["inputs"].values()) else ""
     log:
-        "logs/subsample_regions_{build_name}.txt"
+        "logs/combine_samples_{build_name}.txt"
     benchmark:
-        "benchmarks/subsample_regions_{build_name}.txt"
+        "benchmarks/combine_samples_{build_name}.txt"
     conda: config["conda_environment"]
     shell:
         r"""
         augur filter \
             --sequences {input.sequences} \
             --metadata {input.metadata} \
+            {params.skip_checks} \
             --exclude-all \
             --include {input.include} \
             --output-sequences {output.sequences} \
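
Reviewer note: the new `skip_checks` param only emits `--skip-checks` when every entry in `config["inputs"]` sets `deduplicated: true`; a single input without the key turns the flag off for the whole build. The sketch below pulls that expression out into a standalone function so the behavior can be checked outside Snakemake. The function name `skip_checks_flag` and the sample records are hypothetical and not part of this commit; the sketch assumes, as the lambda does, that the inputs are a mapping keyed by input name.

```python
def skip_checks_flag(inputs):
    """Return "--skip-checks" only if every input record is marked deduplicated.

    Mirrors the lambda added to rule combine_samples; `inputs` stands in for
    config["inputs"], assumed to be a dict of per-input records.
    """
    all_deduplicated = all(
        record.get("deduplicated", False) for record in inputs.values()
    )
    return "--skip-checks" if all_deduplicated else ""


# Hypothetical records for illustration only.
all_marked = {
    "gisaid": {"metadata": "metadata.tsv.zst", "deduplicated": True},
    "open": {"metadata": "open.tsv.zst", "deduplicated": True},
}
one_unmarked = {
    "gisaid": {"metadata": "metadata.tsv.zst", "deduplicated": True},
    "custom": {"metadata": "custom.tsv"},  # no "deduplicated" key: treated as False
}

print(repr(skip_checks_flag(all_marked)))    # '--skip-checks'
print(repr(skip_checks_flag(one_unmarked)))  # ''
```

Because `all()` over an empty iterable is `True`, an empty inputs mapping would also yield `--skip-checks`. The schema excerpt shown in this diff does not say whether an empty `inputs` is allowed, so that case may never arise in practice.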
