Commit 7e9774d

Merge PR #7: Medical accuracy, maintenance guidance, and repo hardening
Tighten PharmCAT and PRS maintenance guidance
2 parents cc8ccdc + af1b8c0 commit 7e9774d

16 files changed

Lines changed: 203 additions & 35 deletions

.github/FUNDING.yml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
github: geiserx
2+
patreon: geiser
3+
buy_me_a_coffee: geiser
4+
thanks_dev: u/gh/geiserx

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
## Summary
2+
3+
<!-- What does this PR change and why? -->
4+
5+
## Type of Change
6+
7+
- [ ] Bug fix
8+
- [ ] New feature
9+
- [ ] Documentation update
10+
- [ ] Infrastructure / CI change
11+
- [ ] Refactor / cleanup
12+
13+
## Checklist
14+
15+
- [ ] Relevant scripts/docs stay in sync
16+
- [ ] `shellcheck` passes for changed shell scripts
17+
- [ ] Smoke-test assumptions were checked for changed pipeline contracts
18+
- [ ] No personal paths, hostnames, or sample-specific defaults were introduced
19+
- [ ] No secrets or credentials were committed
20+
21+
## Testing
22+
23+
- [ ] Tested locally
24+
- [ ] CI passes
25+
- [ ] Not applicable
26+
27+
## Notes
28+
29+
<!-- Anything reviewers should pay special attention to? -->

.github/dependabot.yml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
version: 2
2+
3+
updates:
4+
- package-ecosystem: "github-actions"
5+
directory: "/"
6+
schedule:
7+
interval: "monthly"
8+
open-pull-requests-limit: 10
9+
commit-message:
10+
prefix: "ci"
11+
labels:
12+
- "ci"
13+
- "automated"
14+
groups:
15+
github-actions:
16+
patterns:
17+
- "*"
18+
github_actions_security:
19+
applies-to: security-updates
20+
patterns:
21+
- "*"
22+
ignore:
23+
- dependency-name: "*"
24+
update-types: ["version-update:semver-major"]

AGENTS.md

Lines changed: 20 additions & 7 deletions
@@ -116,13 +116,16 @@ User's FASTQ/BAM/VCF
116116
- Preprocessor outputs `.preprocessed.vcf.bgz` (NOT `.vcf`).
117117
- JSON output structure: `genes` is `{source → {gene_name → data}}` (dict of dicts), NOT a list. `sourceDiplotypes` contains `allele1`/`allele2` objects with `.name` field.
118118
- Star allele calls may differ from other pipelines (e.g., Sanitas hg19 vs our hg38 DeepVariant). PharmCAT 2.15.5 definitions update frequently.
119+
- **Pipeline pin vs upstream**: The pipeline is currently pinned to PharmCAT `2.15.5` for reproducibility, but upstream PharmCAT releases continue to ship new guideline content and parser-relevant format changes. Before bumping the Docker tag, revalidate both step 7 and step 27 end-to-end — the JSON structure and preprocessor flags have changed between major versions.
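The JSON shape described in the PharmCAT notes above can be walked roughly as follows. This is a minimal sketch, not the pipeline's actual step 27 parser: `SAMPLE_REPORT` is a hypothetical miniature report built only from the fields named in the notes (`genes`, `sourceDiplotypes`, `allele1`/`allele2`, `name`), and `extract_diplotypes` is an illustrative helper name.

```python
import json

# Hypothetical miniature report, shaped as described in the notes:
# genes -> {source -> {gene_name -> data}}, where sourceDiplotypes holds
# allele1/allele2 objects that carry a "name" field.
SAMPLE_REPORT = json.loads("""
{
  "genes": {
    "CPIC": {
      "CYP2C19": {
        "sourceDiplotypes": [
          {"allele1": {"name": "*1"}, "allele2": {"name": "*17"}}
        ]
      },
      "CYP2D6": {"sourceDiplotypes": []}
    }
  }
}
""")


def extract_diplotypes(report):
    """Return {(source, gene): ["a1/a2", ...]} from a dict-of-dicts genes block."""
    calls = {}
    for source, genes in report.get("genes", {}).items():
        for gene_name, data in genes.items():
            pairs = []
            for dip in data.get("sourceDiplotypes", []):
                # Assumption: an allele object may be missing or null for
                # uncalled genes, so fall back to "?" instead of raising.
                a1 = (dip.get("allele1") or {}).get("name", "?")
                a2 = (dip.get("allele2") or {}).get("name", "?")
                pairs.append(f"{a1}/{a2}")
            calls[(source, gene_name)] = pairs
    return calls


if __name__ == "__main__":
    for (source, gene), pairs in sorted(extract_diplotypes(SAMPLE_REPORT).items()):
        print(source, gene, ", ".join(pairs) or "no call")
```

Iterating `genes` as a list would fail on this structure, which is the main trap the note warns about.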
119120

120121
### plink2 (PRS / Ancestry)
121122
- **chrX requires sex info**: Use `--chr 1-22 --allow-extra-chr` for PRS/PCA (autosomal only).
122123
- **`--output-chr chrM`** preserves `chr` prefix in output. Without it, `--chr 1-22` strips prefix → variant IDs become `1:pos` instead of `chr1:pos`.
123124
- **`--set-all-var-ids '@:#'`**: The `@` placeholder includes the full contig name (including `chr`). Do NOT use `chr@:#` or you get `chrchr1:pos`.
124125
- **Scoring file duplicates**: Large PGS Catalog files (e.g., PGS000014 with 7M variants) contain duplicate variant:allele pairs. Deduplicate before `--score` or plink2 errors.
125126
- **LD pruning requires >=50 samples**. PCA requires >=2. Single-sample ancestry is fundamentally limited.
127+
- **PRS guardrail**: Raw PRS scores are NOT percentiles, absolute risks, or portable labels across tool versions. Never describe them that way unless you have an ancestry-matched reference cohort scored with the exact same PGS file and preprocessing.
128+
- **Ancestry guardrail**: Treat the current single-sample ancestry step as overlap/QC plus a starting point for downstream projection work, not as a population-placement tool by itself.
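The scoring-file deduplication mentioned in the plink2 notes above could look like the sketch below. It assumes rows have already been parsed into `(variant_id, effect_allele, weight)` tuples and that keeping the first occurrence is acceptable; the notes do not say which duplicate the pipeline keeps, so that policy and the sample rows are assumptions.

```python
def dedupe_scoring_rows(rows):
    """Keep the first row seen for each (variant_id, effect_allele) pair."""
    seen = set()
    kept = []
    for variant_id, allele, weight in rows:
        key = (variant_id, allele)
        if key in seen:
            continue  # later duplicates are dropped (assumed policy)
        seen.add(key)
        kept.append((variant_id, allele, weight))
    return kept


# Hypothetical rows using the chr-prefixed ID style produced by '@:#'.
EXAMPLE_ROWS = [
    ("chr1:12345", "A", 0.12),
    ("chr1:12345", "A", 0.12),   # exact duplicate pair
    ("chr1:12345", "G", -0.05),  # same position, different allele: kept
    ("chr2:67890", "T", 0.30),
]

if __name__ == "__main__":
    for row in dedupe_scoring_rows(EXAMPLE_ROWS):
        print("\t".join(str(field) for field in row))
```

Keying on the variant and allele together, rather than the variant alone, preserves multi-allelic entries that share a position.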
126129

127130
### bcftools
128131
- **`bcftools sort` requires `##contig` headers** — fails silently or errors on VCFs without them. Always inject contig headers from the reference `.fai` when building VCFs.
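Building the `##contig` lines from a `.fai` index, as the bcftools note above suggests, can be sketched like this. It relies only on the first two tab-separated columns of the index (sequence name and length); `contig_headers_from_fai` is an illustrative name, and the byte offsets in `EXAMPLE_FAI` are made-up placeholders.

```python
def contig_headers_from_fai(fai_text):
    """Build VCF ##contig header lines from the text of a .fai index."""
    headers = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        headers.append(f"##contig=<ID={name},length={length}>")
    return headers


# Illustrative index text: name, length, offset, bases per line, bytes per line.
# The offsets are placeholders, not values from a real index.
EXAMPLE_FAI = (
    "chr1\t248956422\t112\t100\t101\n"
    "chr2\t242193529\t251446211\t100\t101\n"
)

if __name__ == "__main__":
    print("\n".join(contig_headers_from_fai(EXAMPLE_FAI)))
```

The resulting lines belong in the VCF header before the `#CHROM` line.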
@@ -134,17 +137,27 @@ User's FASTQ/BAM/VCF
134137
### Cyrius (CYP2D6)
135138
- Returns `None/None` for both samples — common limitation of short-read WGS due to CYP2D7 homology and structural rearrangements.
136139

137-
## Database Update Cadence
140+
## Knowledge Base / Tool Update Cadence
138141

139-
| Database | Update Frequency | Re-run Steps | Time |
142+
| Resource / Tool | Update Frequency | Re-run Steps | Time |
140143
|---|---|---|---|
141-
| ClinVar | Monthly | 6 (ClinVar screen) | ~5 min |
142-
| VEP cache | Every 6 months (Ensembl release) | 13, 23 | ~3 hr |
143-
| PCGR/CPSR data | Annually | 17 | ~45 min |
144-
| PharmCAT | Check quarterly | 7, 27 | ~15 min |
145-
| PGS Catalog | Check quarterly | 25 | ~30 min |
144+
| ClinVar | Monthly full release (first Thursday) + optional weekly Monday deltas | 6 (ClinVar screen) | ~5 min |
145+
| Ensembl / VEP cache | Each Ensembl release (~6 months; release 115 current, 116 expected Apr 2026) | 13, 23 | ~3 hr |
146+
| PCGR/CPSR data | Annually or when upstream bundle changes materially | 17 | ~45 min |
147+
| PharmCAT upstream release | Check quarterly; latest known upstream was 3.2.0 (2026-02-25) while pipeline stays pinned to 2.15.5 | 7, 27 | ~15-30 min validation |
148+
| CPIC / ClinPGx guideline surface | Check quarterly and whenever a relevant drug-gene pair changes upstream | 27 | ~15 min code refresh |
149+
| PGS Catalog | Check quarterly against the latest release page; treat scoring-file version changes as result-changing events | 25 | ~30 min |
146150

147151
ClinVar is the highest-value update — new pathogenic classifications happen monthly.
152+
Before bumping PharmCAT, validate the preprocessor flags, JSON parsing in step 27, and any phenotype/diplotype changes on a known test sample.
153+
For a public pipeline, keep PGS IDs, PharmCAT Docker tags, and the CPIC lookup table explicitly versioned in git so result changes are auditable over time.
154+
155+
### Minimal Revalidation Before Publishing Updates
156+
157+
1. **ClinVar / VEP refresh**: run step 6 and step 23 on one known sample, then compare pathogenic hit counts and filtered clinical variant counts against the previous run.
158+
2. **PharmCAT / CPIC refresh**: run step 7 and step 27 on one known sample, then diff diplotypes, phenotypes, and recommendation text before accepting the update.
159+
3. **PGS Catalog refresh**: rerun step 25 and compare both `variants_used/variants_total` and raw score deltas. If the scoring file version changed, treat the new output as a new baseline, not as directly comparable to the old one.
160+
4. **Documentation refresh**: update pinned versions, cadence notes, and any changed interpretation guardrails in docs before merging.
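The "diff before accepting the update" idea in the revalidation steps above can be sketched as a comparison of per-gene calls from two runs. The input dictionaries here are hypothetical; in practice they would come from whatever the pipeline extracts from the old and new step 7 / step 27 outputs.

```python
def diff_calls(previous, current):
    """Return {gene: (old, new)} for genes whose call changed, appeared, or vanished."""
    changed = {}
    for gene in sorted(set(previous) | set(current)):
        old, new = previous.get(gene), current.get(gene)
        if old != new:
            changed[gene] = (old, new)
    return changed


# Hypothetical per-gene diplotype calls from the pinned and candidate versions.
OLD_RUN = {"CYP2C19": "*1/*17", "TPMT": "*1/*1", "NAT2": "*5/*6"}
NEW_RUN = {"CYP2C19": "*1/*17", "TPMT": "*1/*1", "NAT2": "*5/*6A", "DPYD": "Reference/Reference"}

if __name__ == "__main__":
    for gene, (old, new) in diff_calls(OLD_RUN, NEW_RUN).items():
        print(f"{gene}: {old} -> {new}")
```

The same function works for phenotype strings or recommendation text, since it only compares values by equality. An empty result indicates that nothing changed for the known sample.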
148161

149162
## Common Issues When Developing
150163

SECURITY.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
# Security Policy
2+
3+
## Reporting Security Issues
4+
5+
**Please do not report security vulnerabilities through public GitHub issues.**
6+
7+
Instead, please use GitHub's private vulnerability reporting:
8+
9+
1. Go to the **Security** tab of this repository
10+
2. Click **"Report a vulnerability"**
11+
3. Fill out the form with details
12+
13+
I will respond within **48 hours** and work with you to understand and address the issue.
14+
15+
### What to Include
16+
17+
- Type of issue (e.g., command injection, path traversal, arbitrary file overwrite)
18+
- Full paths of affected source files
19+
- Step-by-step instructions to reproduce
20+
- Proof-of-concept or exploit code (if possible)
21+
- Impact assessment and potential attack scenarios
22+
23+
## Supported Versions
24+
25+
Only the latest version receives security updates. Please always use the most recent release.
26+
27+
## Security Best Practices for Contributors
28+
29+
1. **Never commit secrets** — use environment variables
30+
2. **Validate all input** — especially sample names, paths, and user-provided files
31+
3. **Keep dependencies updated** — Dependabot is enabled on this repo
32+
4. **Prefer reproducible pins** for Docker images, databases, and tool versions
33+
34+
## Contact
35+
36+
For security questions that aren't vulnerabilities, open a regular issue or start a discussion in a pull request.

docs/07-pharmacogenomics.md

Lines changed: 23 additions & 5 deletions
@@ -8,6 +8,7 @@ Identifies which drugs work well, which need dose adjustments, and which to avoi
88

99
## Tool
1010
- **PharmCAT** v2.15.5 (Pharmacogenomics Clinical Annotation Tool, CPIC/PharmGKB)
11+
- This pipeline is intentionally pinned to `2.15.5` for reproducibility. Newer PharmCAT releases may exist upstream, but step 7 and step 27 should be revalidated together before changing versions.
1112

1213
## Docker Image
1314
```
@@ -19,23 +20,34 @@ pgkb/pharmcat:2.15.5
1920
SAMPLE=your_sample
2021
GENOME_DIR=/path/to/your/data
2122

22-
# PharmCAT needs the reference genome for VCF preprocessing
23+
# Step 1: preprocess the VCF against the GRCh38 reference
2324
docker run --rm \
2425
--cpus 2 --memory 4g \
2526
-v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
2627
-v ${GENOME_DIR}/reference:/ref \
2728
pgkb/pharmcat:2.15.5 \
28-
java -jar /pharmcat/pharmcat.jar \
29+
python3 /pharmcat/pharmcat_vcf_preprocessor.py \
2930
-vcf /data/${SAMPLE}.vcf.gz \
30-
-refFasta /ref/Homo_sapiens_assembly38.fasta \
31+
-refFna /ref/Homo_sapiens_assembly38.fasta \
3132
-o /data/ \
3233
-bf ${SAMPLE}
3334

34-
# Output: ${SAMPLE}.report.html (interactive HTML report)
35+
# Step 2: run PharmCAT on the preprocessed VCF
36+
docker run --rm \
37+
--cpus 2 --memory 4g \
38+
-v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
39+
pgkb/pharmcat:2.15.5 \
40+
java -jar /pharmcat/pharmcat.jar \
41+
-vcf /data/${SAMPLE}.preprocessed.vcf.bgz \
42+
-o /data/ \
43+
-bf ${SAMPLE} \
44+
-reporterJson
3545
```
3646

3747
## Output
3848
- HTML report with drug recommendations per gene
49+
- JSON report used by step 27 (`${SAMPLE}.report.json`)
50+
- Preprocessed VCF (`${SAMPLE}.preprocessed.vcf.bgz`) generated as an intermediate
3951
- Covers CYP2C19, CYP2D6, CYP2B6, CYP3A4/5, UGT1A1, DPYD, NAT2, TPMT, etc.
4052
- Star allele calls with metabolizer status (Poor/Intermediate/Normal/Rapid/Ultra-rapid)
4153

@@ -50,4 +62,10 @@ docker run --rm \
5062

5163
## Limitations
5264
- **CYP2D6** often returns `Not called` — gene has pseudogene homology that confounds VCF-based calling. Use Cyrius or StellarPGx (BAM-based) if CYP2D6 is critical.
53-
- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). When in doubt, trust PharmCAT + raw VCF over lab transcription.
65+
- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). Discrepancies can arise from different genome builds (hg19 vs hg38), different star allele definitions, or different variant calling pipelines. When a discrepancy matters clinically, compare both sets of raw variant calls and consult the PharmVar database for the current allele definitions — do not blindly trust either source.
66+
- PharmCAT output structure changes across releases. If you upgrade PharmCAT, re-test step 27 (`27-cpic-lookup.sh`) because it parses the JSON output directly.
67+
68+
## Maintenance
69+
- The pipeline is pinned to `pgkb/pharmcat:2.15.5` for reproducibility, but upstream PharmCAT keeps moving. Latest known upstream release when this doc was last checked was `3.2.0` (2026-02-25).
70+
- Treat **step 7 and step 27 as one upgrade unit**. If you bump PharmCAT, rerun both on a known sample and diff diplotypes, phenotypes, JSON structure, and CPIC recommendation text before merging.
71+
- Recheck CPIC / ClinPGx guidance at least quarterly, or sooner if a drug-gene pair you expose in step 27 gets a meaningful update upstream.

docs/09-str-expansions.md

Lines changed: 17 additions & 6 deletions
@@ -7,7 +7,7 @@ Screens for pathogenic repeat expansions — a class of mutations invisible to b
77
STR expansions cause ~40 known neurological/neuromuscular diseases including Huntington's, Fragile X, Friedreich's ataxia, ALS/FTD, myotonic dystrophy, and multiple spinocerebellar ataxias.
88

99
## Tool
10-
- **ExpansionHunter** v2.5.5 (Illumina)
10+
- **ExpansionHunter** v2.5.5 (Illumina) — note: v5.x is available upstream with an expanded catalog and `--variant-catalog` flag, but the Docker image used here ships v2.5.5
1111

1212
## Docker Image
1313
```
@@ -20,12 +20,23 @@ weisburd/expansionhunter:latest
2020
| Disease | Gene | Repeat Unit | Normal | Pathogenic |
2121
|---|---|---|---|---|
2222
| Huntington's | HTT | CAG | <27 | >35 |
23-
| Fragile X | FMR1 | CGG | <45 | >55 (premutation) / >200 (full) |
23+
| Fragile X | FMR1 | CGG | <45 | 55-200 (premutation) / >200 (full) |
2424
| Friedreich's Ataxia | FXN | GAA | <33 | >66 |
25-
| ALS/FTD | C9ORF72 | GGCCCC | <20 | >30 |
26-
| Myotonic Dystrophy | DMPK | CAG | <35 | >50 |
27-
| SCA1 | ATXN1 | TGC | <33 | >39 |
28-
| SCA2 | ATXN2 | GCT | <22 | >33 |
25+
| ALS/FTD | C9ORF72 | GGCCCC | <24 | >30 |
26+
| Myotonic Dystrophy 1 | DMPK | CTG | <35 | >50 |
27+
| SCA1 | ATXN1 | CAG | <33 | >39 |
28+
| SCA2 | ATXN2 | CAG | <22 | >33 |
29+
30+
## FMR1 Clinical Zones
31+
32+
FMR1 (Fragile X) has four distinct clinical zones — the intermediate zone (45-54 repeats) is often omitted but clinically relevant:
33+
34+
| Zone | Repeats | Clinical Significance |
35+
|---|---|---|
36+
| Normal | <45 | No risk |
37+
| Intermediate (gray zone) | 45-54 | Not affected, but repeats may expand in offspring. Genetic counseling recommended for carriers. |
38+
| Premutation | 55-200 | Risk of FXTAS (tremor/ataxia, males >50), FXPOI (premature ovarian insufficiency). Offspring at risk of full expansion. |
39+
| Full mutation | >200 | Fragile X syndrome (intellectual disability, behavioral features). Penetrance varies by sex and methylation. |
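The zone boundaries in the table above map onto a small lookup, sketched here. The thresholds are taken directly from the table; the function name `fmr1_zone` and the choice to reject negative counts are illustrative assumptions.

```python
def fmr1_zone(repeats: int) -> str:
    """Map an FMR1 CGG repeat count to the clinical zone from the table."""
    if repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if repeats < 45:
        return "Normal"
    if repeats <= 54:
        return "Intermediate (gray zone)"
    if repeats <= 200:
        return "Premutation"
    return "Full mutation"


if __name__ == "__main__":
    for count in (30, 45, 54, 55, 200, 201):
        print(count, fmr1_zone(count))
```

Checking the boundary values (44/45, 54/55, 200/201) is the quickest way to confirm the zones match the table.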
2940

3041
## Notes
3142
- This is v2.5.5, NOT v5.x. The flag is `--repeat-specs` (directory), not `--variant-catalog`

docs/17-cpsr.md

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,7 @@
11
# Step 17: Cancer Predisposition Screening with CPSR
22

33
## What This Does
4-
Screens germline variants against ACMG SF v3.2 (Secondary Findings) and curated cancer predisposition gene panels to identify clinically actionable cancer risk variants.
4+
Screens germline variants against curated cancer predisposition gene panels to identify clinically actionable cancer risk variants. CPSR uses its own panels sourced from Genomics England PanelApp and other curated databases — these are cancer-focused and distinct from the 81-gene ACMG SF v3.2 list (which also includes cardiac and metabolic genes not covered by CPSR).
55

66
## Why
77
ClinVar screening (step 6) finds known pathogenic variants, but CPSR applies ACMG/AMP classification criteria to novel or rare variants in cancer predisposition genes — catching variants ClinVar hasn't yet classified.
@@ -45,7 +45,7 @@ docker run --rm \
4545
## Panel Options
4646
| Panel ID | Description |
4747
|---|---|
48-
| 0 | Full ACMG SF v3.2 (81 genes) — recommended |
48+
| 0 | Comprehensive cancer superpanel (500+ genes) — recommended |
4949
| 1 | Adult-onset hereditary cancer |
5050
| 2 | Childhood-onset hereditary cancer |
5151
| 3 | Lynch syndrome |
@@ -61,6 +61,7 @@ docker run --rm \
6161

6262
## Notes
6363
- The 21GB data bundle only needs to be downloaded once — shared across all samples.
64+
- **Data bundle staleness:** The default bundle (`grch38.20220203`) dates from February 2022. ClinVar, CancerMine, and UniProt annotations inside the bundle are frozen at that date. Check the [PCGR releases page](https://github.com/sigven/pcgr/releases) periodically for updated bundles — newer bundles include more recent ClinVar classifications and gene-disease annotations.
6465
- Use `--panel_id 0` for comprehensive screening (the full cancer superpanel; this is not the ACMG SF gene list).
6566
- `--classify_all` ensures all variants in target genes get ACMG classification, not just known pathogenic.
6667
- CPSR is complementary to ClinVar screening — ClinVar finds known variants, CPSR classifies novel ones.

docs/25-prs.md

Lines changed: 9 additions & 0 deletions
@@ -80,11 +80,14 @@ The summary TSV contains a raw score for each condition. Here is what the column
8080
- They are NOT percentiles. A raw score of 0.5 does not mean 50th percentile.
8181
- They are NOT probabilities. A high score does not mean you will develop the condition.
8282
- They are NOT comparable across conditions. A score of 10 for CAD and 10 for T2D mean entirely different things.
83+
- They are NOT stable across arbitrary pipeline changes. If you change the PGS file version, genome build harmonization, or variant matching rules, you need to recompute and reinterpret the score.
8384

8485
### How to make them meaningful
8586

8687
Raw PRS become useful only when compared against a population distribution. To convert your score into a percentile, you need a reference panel of thousands of individuals with scores computed using the same scoring file. The PGS Catalog provides some population-level statistics, but full percentile calculation requires a reference cohort (not included in this pipeline).
8788

89+
Comparing two people is only defensible when both were scored with the same PGS ID, the same scoring file version, the same genome build conventions, and the same preprocessing. Even then, treat the comparison as directional rather than clinically calibrated unless you also have a matched reference distribution.
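Given a reference cohort scored with the same file and preprocessing, the conversion from raw score to percentile can be sketched as below. This assumes the reference scores are approximately normally distributed, which is a simplification; the `reference_scores` values are hypothetical and no such cohort ships with this pipeline.

```python
from statistics import NormalDist, mean, stdev


def score_to_percentile(raw_score, reference_scores):
    """Return (z, percentile) of raw_score against a reference distribution.

    Assumes reference_scores were computed with the same PGS file and
    preprocessing, and that they are roughly normally distributed.
    """
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    z = (raw_score - mu) / sigma
    return z, NormalDist().cdf(z) * 100


if __name__ == "__main__":
    # Hypothetical reference cohort with mean 0 and sample SD 1.
    reference_scores = [-1.0, 0.0, 1.0]
    z, pct = score_to_percentile(1.0, reference_scores)
    print(f"z = {z:.2f}, percentile = {pct:.1f}")
```

A score one standard deviation above the reference mean lands near the 84th percentile under the normal assumption, which is where the "top ~16%" rule of thumb comes from.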
90+
8891
As a rough guide:
8992
- Score near the population mean = average genetic risk
9093
- Score >1 standard deviation above mean = elevated risk (top ~16%)
@@ -110,6 +113,12 @@ Check the `Variants_Used / Variants_Total` ratio. If fewer than 50% of scoring v
110113
- The script prefers GRCh38-harmonized scoring files. If unavailable, it falls back to the original (which may be on GRCh37 and produce poor variant matching).
111114
- You can add more PGS IDs by editing the `PGS_IDS` associative array in the script. Browse available scores at [pgscatalog.org](https://www.pgscatalog.org/).
112115

116+
## Maintenance
117+
118+
- Recheck the PGS Catalog against its latest release page at least quarterly before treating this step as "current."
119+
- A scoring file update is a **result-changing event**. If the harmonized file version/date changes, rerun step 25 and treat the output as a new baseline.
120+
- If you publish or compare PRS results over time, keep the `PGS ID`, the harmonized scoring file version/date, and the pipeline commit together so score changes remain auditable.
121+
113122
## Links
114123

115124
- [PGS Catalog](https://www.pgscatalog.org/)

docs/27-cpic-lookup.md

Lines changed: 1 addition & 0 deletions
@@ -122,6 +122,7 @@ Only genes where you are NOT a normal metabolizer appear here. For each, the rep
122122
- Run this step after PharmCAT (step 7). For CYP2D6, also run Cyrius (step 21) and manually compare.
123123
- The output report is printed to stdout as well as written to file.
124124
- You can add or modify gene-drug pairs by editing the `CPIC_DRUGS` associative array in the script.
125+
- For maintenance, review the hard-coded CPIC table at least quarterly or whenever you bump PharmCAT, so the lookup stays aligned with current guideline pairs.
125126
- For the most up-to-date CPIC recommendations, always check [cpicpgx.org](https://cpicpgx.org/) directly.
126127

127128
## Links
