Commit e0fd398
Merge pull request #53 from nextstrain/nextstrain-bot/update-vendored
[bot] Update ingest/vendored
2 parents 2d06e5d + 2d3b018 commit e0fd398

24 files changed · 553 additions & 95 deletions

ingest/build-configs/nextstrain-automation/upload.smk

Lines changed: 1 addition & 1 deletion

```diff
@@ -30,7 +30,7 @@ rule upload_to_s3:
         cloudfront_domain=config["cloudfront_domain"],
     shell:
         r"""
-        ./vendored/upload-to-s3 \
+        ./vendored/scripts/upload-to-s3 \
             {params.quiet} \
             {input.file_to_upload:q} \
             {params.s3_dst:q}/{wildcards.remote_file:q} \
```

ingest/rules/fetch_from_ncbi.smk

Lines changed: 1 addition & 1 deletion

```diff
@@ -95,7 +95,7 @@ rule fetch_ncbi_entrez_data:
         r"""
         exec &> >(tee {log:q})

-        vendored/fetch-from-ncbi-entrez \
+        vendored/scripts/fetch-from-ncbi-entrez \
             --term {params.term:q} \
             --output {output.genbank:q}
         """
```

ingest/vendored/.github/dependabot.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@
 # Each ecosystem is checked on a scheduled interval defined below. To trigger
 # a check manually, go to
 #
-#   https://github.com/nextstrain/ingest/network/updates
+#   https://github.com/nextstrain/shared/network/updates
 #
 # and look for a "Check for updates" button. You may need to click around a
 # bit first.
```

ingest/vendored/.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -11,5 +11,5 @@ jobs:
   shellcheck:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6
       - uses: nextstrain/.github/actions/shellcheck@master
```

ingest/vendored/.github/workflows/pre-commit.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -7,8 +7,8 @@ jobs:
   pre-commit:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v6
         with:
           python-version: "3.12"
       - uses: pre-commit/[email protected]
```

ingest/vendored/.gitrepo

Lines changed: 3 additions & 3 deletions

```diff
@@ -6,7 +6,7 @@
 [subrepo]
 	remote = https://github.com/nextstrain/ingest
 	branch = main
-	commit = 258ab8ce898a88089bc88caee336f8d683a0e79a
-	parent = a7af2c05fc4ccc822c8ef38f0001dc5e8bee803b
+	commit = c29898f7c32c3f85d65db235d23a78e776f89120
+	parent = 2d06e5de2f761a090d45e0bfcb8b1d510fffdc83
 	method = merge
-	cmdver = 0.4.7
+	cmdver = 0.4.9
```

ingest/vendored/README.md

Lines changed: 40 additions & 30 deletions

````diff
@@ -1,6 +1,6 @@
-# ingest
+# shared

-Shared internal tooling for pathogen data ingest. Used by our individual
+Shared internal tooling for pathogen workflows. Used by our individual
 pathogen repos which produce Nextstrain builds. Expected to be vendored by
 each pathogen repo using `git subrepo`.

@@ -9,47 +9,47 @@ Some tools may only live here temporarily before finding a permanent home in

 ## Vendoring

-Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
-(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3)
+Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor shared scripts.
+(See discussion on this decision in https://github.com/nextstrain/shared/issues/3)

 For a list of Nextstrain repos that are currently using this method, use [this
 GitHub code search](https://github.com/search?type=code&q=org%3Anextstrain+subrepo+%22remote+%3D+https%3A%2F%2Fgithub.com%2Fnextstrain%2Fingest%22).

 If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
-Then add the latest ingest scripts to the pathogen repo by running:
+Then add the latest shared scripts to the pathogen repo by running:

 ```
-git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
+git subrepo clone https://github.com/nextstrain/shared shared/vendored
 ```

-Any future updates of ingest scripts can be pulled in with:
+Any future updates of shared scripts can be pulled in with:

 ```
-git subrepo pull ingest/vendored
+git subrepo pull shared/vendored
 ```

 If you run into merge conflicts and would like to pull in a fresh copy of the
-latest ingest scripts, pull with the `--force` flag:
+latest shared scripts, pull with the `--force` flag:

 ```
-git subrepo pull ingest/vendored --force
+git subrepo pull shared/vendored --force
 ```

 > **Warning**
 > Beware of rebasing/dropping the parent commit of a `git subrepo` update

-`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
+`git subrepo` relies on metadata in the `shared/vendored/.gitrepo` file,
 which includes the hash for the parent commit in the pathogen repos.
 If this hash no longer exists in the commit history, there will be errors when
 running future `git subrepo pull` commands.

 If you run into an error similar to the following:
 ```
-$ git subrepo pull ingest/vendored
-git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
+$ git subrepo pull shared/vendored
+git-subrepo: Command failed: 'git branch subrepo/shared/vendored '.
 fatal: not a valid object name: ''
 ```
-Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
+Check the parent commit hash in the `shared/vendored/.gitrepo` file and make
 sure the commit exists in the commit history. Update to the appropriate parent
 commit hash if needed.

@@ -84,39 +84,49 @@ approach to "ingest" has been discussed in various internal places, including:

 ## Scripts

-Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.
+Scripts for supporting workflow automation that don’t really belong in any of our existing tools.

-- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
-- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
-- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
-- [notify-on-record-change](notify-on-recod-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
+- [assign-colors](scripts/assign-colors) - Generate colors.tsv for augur export based on ordering, color schemes, and what exists in the metadata. Used in the phylogenetic or nextclade workflows.
+- [notify-on-diff](scripts/notify-on-diff) - Send Slack message with diff of a local file and an S3 object
+- [notify-on-job-fail](scripts/notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
+- [notify-on-job-start](scripts/notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
+- [notify-on-record-change](scripts/notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
   If the S3 object's metadata does not have `recordcount`, then will attempt to download S3 object to count lines locally, which only supports `xz` compressed S3 objects.
-- [notify-slack](notify-slack) - Send message or file to Slack
-- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
-- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
-- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
+- [notify-slack](scripts/notify-slack) - Send message or file to Slack
+- [s3-object-exists](scripts/s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
+- [trigger](scripts/trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
+- [trigger-on-new-data](scripts/trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
   A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

+
 NCBI interaction scripts that are useful for fetching public metadata and sequences.

-- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
+- [fetch-from-ncbi-entrez](scripts/fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
   Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.

-Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.
+Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/shared/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/shared/issues/18.

 Potential Nextstrain CLI scripts

-- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
-- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
+- [sha256sum](scripts/sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
+- [cloudfront-invalidate](scripts/cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
   This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
-- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
+- [upload-to-s3](scripts/upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
   Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
   Adds the following user defined metadata to uploaded S3 object:
-  - `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
+  - `sha256sum` - hash of the file generated by [sha256sum](scripts/sha256sum)
   - `recordcount` - the line count of the file
-- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
+- [download-from-s3](scripts/download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
   Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.

+## Snakemake
+
+Snakemake workflow functions that are shared across many pathogen workflows that don’t really belong in any of our existing tools.
+
+- [config.smk](snakemake/config.smk) - Shared functions for handling workflow configs.
+- [remote_files.smk](snakemake/remote_files.smk) - Exposes the `path_or_url` function which will use Snakemake's storage plugins to download/upload files to remote providers as needed.
+
+
 ## Software requirements

 Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date.
````
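The skip-if-identical behaviour the README describes for `upload-to-s3` and `download-from-s3` amounts to comparing a local SHA-256 digest against the S3 object's `sha256sum` metadata. A minimal sketch of that check (the function names here are illustrative, not the vendored scripts' actual interface):

```python
import hashlib


def sha256sum(path, chunk_size=1 << 16):
    # Stream the file in chunks so large sequence files need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()


def should_skip_transfer(local_path, s3_metadata):
    # Skip the upload/download when the hash stored in the object's
    # user-defined metadata matches the local file's hash.
    return s3_metadata.get("sha256sum") == sha256sum(local_path)
```

An absent or stale `sha256sum` key simply fails the comparison, so the transfer proceeds, which matches the conservative behaviour the README describes.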

ingest/vendored/notify-slack

Lines changed: 0 additions & 56 deletions
This file was deleted.
ingest/vendored/scripts/assign-colors (new file)

Lines changed: 96 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Generate colors.tsv for augur export based on ordering, color schemes, and
traits that exists in the metadata.
"""
import argparse
import pandas as pd

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Assign colors based on defined ordering of traits.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--ordering', type=str, required=True,
        help="""Input TSV file defining the color ordering where the first
        column is the field and the second column is the trait in that field.
        Blank lines are ignored. Lines starting with '#' will be ignored as comments.""")
    parser.add_argument('--color-schemes', type=str, required=True,
        help="Input color schemes where each line is a different color scheme separated by tabs.")
    parser.add_argument('--metadata', type=str,
        help="""If provided, restrict colors to only those traits found in
        metadata. If the metadata includes a 'focal' column that only contains
        boolean values, then restrict colors to traits for rows where 'focal'
        is set to True.""")
    parser.add_argument('--output', type=str, required=True,
        help="Output colors TSV file to be passed to augur export.")
    args = parser.parse_args()

    assignment = {}
    with open(args.ordering) as f:
        for line in f.readlines():
            array = line.strip().split("\t")
            # Ignore empty lines or commented lines
            if not array or not array[0] or array[0].startswith('#'):
                continue
            # Throw a warning if encountering a line not matching the expected number of columns, ignore line
            elif len(array)!=2:
                print(f"WARNING: Could not decode color ordering line: {line}")
                continue
            # Otherwise, process color ordering where we expect 2 columns: name, traits
            else:
                name = array[0]
                trait = array[1]
                if name not in assignment:
                    assignment[name] = [trait]
                else:
                    assignment[name].append(trait)

    # if metadata supplied, go through and
    # 1. remove assignments that don't exist in metadata
    # 2. remove assignments that have 'focal' set to 'False' in metadata
    if args.metadata:
        metadata = pd.read_csv(args.metadata, delimiter='\t')
        for name, trait in assignment.items():
            if name in metadata:
                if 'focal' in metadata and metadata['focal'].dtype == 'bool':
                    focal_list = metadata.loc[metadata['focal'], name].unique()
                    subset_focal = [x for x in assignment[name] if x in focal_list]
                    assignment[name] = subset_focal
                else: # no 'focal' present
                    subset_present = [x for x in assignment[name] if x in metadata[name].unique()]
                    assignment[name] = subset_present


    schemes = {}
    counter = 0
    with open(args.color_schemes) as f:
        for line in f.readlines():
            counter += 1
            array = line.lstrip().rstrip().split("\t")
            schemes[counter] = array

    with open(args.output, 'w') as f:
        for trait_name, trait_array in assignment.items():
            if len(trait_array)==0:
                print(f"No traits found for {trait_name}")
                continue
            if len(schemes)<len(trait_array):
                print(f"WARNING: insufficient colours available for trait {trait_name} - reusing colours!")
                remain = len(trait_array)
                color_array = []
                while(remain>0):
                    if (remain>len(schemes)):
                        color_array = [*color_array, *schemes[len(schemes)]]
                        remain -= len(schemes)
                    else:
                        color_array = [*color_array, *schemes[remain]]
                        remain = 0
            else:
                color_array = schemes[len(trait_array)]

            zipped = list(zip(trait_array, color_array))
            for trait_value, color in zipped:
                f.write(trait_name + "\t" + trait_value + "\t" + color + "\n")
            f.write("\n")
```
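When a trait list is longer than the largest available colour scheme, the script's while-loop recycles whole schemes until every trait has a colour. That behaviour can be isolated in a small sketch (the `pick_colors` helper is illustrative, not part of the vendored script):

```python
def pick_colors(schemes, n_traits):
    """Mimic assign-colors' colour selection: `schemes` maps a count to a
    list of that many colours. If an exact-size scheme exists, use it;
    otherwise recycle, largest scheme first, until n_traits are covered."""
    if n_traits <= len(schemes):
        return schemes[n_traits]
    colors, remain = [], n_traits
    while remain > 0:
        if remain > len(schemes):
            # Take the largest scheme wholesale and keep going.
            colors += schemes[len(schemes)]
            remain -= len(schemes)
        else:
            # Finish with the scheme sized to the remainder.
            colors += schemes[remain]
            remain = 0
    return colors


schemes = {1: ["#111"], 2: ["#222", "#333"], 3: ["#444", "#555", "#666"]}
print(pick_colors(schemes, 2))  # exact-size scheme: ['#222', '#333']
print(pick_colors(schemes, 5))  # 3-colour scheme plus 2-colour scheme, recycled
```

Note that recycling means colours repeat across traits, which is exactly what the script's "insufficient colours available" warning flags.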
File renamed without changes.
