GeiserX
diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml
Lines changed: 4 additions & 0 deletions
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
Lines changed: 29 additions & 0 deletions
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
Lines changed: 24 additions & 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 20 additions & 7 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 36 additions & 0 deletions
diff --git a/docs/07-pharmacogenomics.md b/docs/07-pharmacogenomics.md
Lines changed: 23 additions & 5 deletions
diff --git a/docs/09-str-expansions.md b/docs/09-str-expansions.md
Lines changed: 17 additions & 6 deletions
diff --git a/docs/17-cpsr.md b/docs/17-cpsr.md
Lines changed: 3 additions & 2 deletions
diff --git a/docs/25-prs.md b/docs/25-prs.md
Lines changed: 9 additions & 0 deletions
diff --git a/docs/27-cpic-lookup.md b/docs/27-cpic-lookup.md
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1,4 @@
+github: geiserx
+patreon: geiser
+buy_me_a_coffee: geiser
+thanks_dev: u/gh/geiserx
@@ -0,0 +1,29 @@
+## Summary
+
+<!-- What does this PR change and why? -->
+
+## Type of Change
+
+- [ ] Bug fix
+- [ ] New feature
+- [ ] Documentation update
+- [ ] Infrastructure / CI change
+- [ ] Refactor / cleanup
+
+## Checklist
+
+- [ ] Relevant scripts/docs stay in sync
+- [ ] `shellcheck` passes for changed shell scripts
+- [ ] Smoke-test assumptions were checked for changed pipeline contracts
+- [ ] No personal paths, hostnames, or sample-specific defaults were introduced
+- [ ] No secrets or credentials were committed
+
+## Testing
+
+- [ ] Tested locally
+- [ ] CI passes
+- [ ] Not applicable
+
+## Notes
+
+<!-- Anything reviewers should pay special attention to? -->
@@ -0,0 +1,24 @@
+version: 2
+
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "monthly"
+    open-pull-requests-limit: 10
+    commit-message:
+      prefix: "ci"
+    labels:
+      - "ci"
+      - "automated"
+    groups:
+      github-actions:
+        patterns:
+          - "*"
+      github_actions_security:
+        applies-to: security-updates
+        patterns:
+          - "*"
+    ignore:
+      - dependency-name: "*"
+        update-types: ["version-update:semver-major"]
@@ -116,13 +116,16 @@ User's FASTQ/BAM/VCF
 - Preprocessor outputs `.preprocessed.vcf.bgz` (NOT `.vcf`).
 - JSON output structure: `genes` is `{source → {gene_name → data}}` (dict of dicts), NOT a list. `sourceDiplotypes` contains `allele1`/`allele2` objects with `.name` field.
 - Star allele calls may differ from other pipelines (e.g., Sanitas hg19 vs our hg38 DeepVariant). PharmCAT 2.15.5 definitions update frequently.
+- **Pipeline pin vs upstream**: The pipeline is currently pinned to PharmCAT `2.15.5` for reproducibility, but upstream PharmCAT releases continue to ship new guideline content and parser-relevant format changes. Before bumping the Docker tag, revalidate both step 7 and step 27 end-to-end — the JSON structure and preprocessor flags have changed between major versions.
 
 ### plink2 (PRS / Ancestry)
 - **chrX requires sex info**: Use `--chr 1-22 --allow-extra-chr` for PRS/PCA (autosomal only).
 - **`--output-chr chrM`** preserves `chr` prefix in output. Without it, `--chr 1-22` strips prefix → variant IDs become `1:pos` instead of `chr1:pos`.
 - **`--set-all-var-ids '@:#'`**: The `@` placeholder includes the full contig name (including `chr`). Do NOT use `chr@:#` or you get `chrchr1:pos`.
 - **Scoring file duplicates**: Large PGS Catalog files (e.g., PGS000014 with 7M variants) contain duplicate variant:allele pairs. Deduplicate before `--score` or plink2 errors.
 - **LD pruning requires >=50 samples**. PCA requires >=2. Single-sample ancestry is fundamentally limited.
+- **PRS guardrail**: Raw PRS scores are NOT percentiles, absolute risks, or portable labels across tool versions. Never describe them that way unless you have an ancestry-matched reference cohort scored with the exact same PGS file and preprocessing.
+- **Ancestry guardrail**: Treat the current single-sample ancestry step as overlap/QC plus a starting point for downstream projection work, not as a population-placement tool by itself.
 
 ### bcftools
 - **`bcftools sort` requires `##contig` headers** — fails silently or errors on VCFs without them. Always inject contig headers from the reference `.fai` when building VCFs.
@@ -134,17 +137,27 @@ User's FASTQ/BAM/VCF
 ### Cyrius (CYP2D6)
 - Returns `None/None` for both samples — common limitation of short-read WGS due to CYP2D7 homology and structural rearrangements.
 
-## Database Update Cadence
+## Knowledge Base / Tool Update Cadence
 
-| Database | Update Frequency | Re-run Steps | Time |
+| Resource / Tool | Update Frequency | Re-run Steps | Time |
 |---|---|---|---|
-| ClinVar | Monthly | 6 (ClinVar screen) | ~5 min |
-| VEP cache | Every 6 months (Ensembl release) | 13, 23 | ~3 hr |
-| PCGR/CPSR data | Annually | 17 | ~45 min |
-| PharmCAT | Check quarterly | 7, 27 | ~15 min |
-| PGS Catalog | Check quarterly | 25 | ~30 min |
+| ClinVar | Monthly full release (first Thursday) + optional weekly Monday deltas | 6 (ClinVar screen) | ~5 min |
+| Ensembl / VEP cache | Each Ensembl release (~6 months; release 115 current, 116 expected Apr 2026) | 13, 23 | ~3 hr |
+| PCGR/CPSR data | Annually or when upstream bundle changes materially | 17 | ~45 min |
+| PharmCAT upstream release | Check quarterly; latest known upstream was 3.2.0 (2026-02-25) while pipeline stays pinned to 2.15.5 | 7, 27 | ~15-30 min validation |
+| CPIC / ClinPGx guideline surface | Check quarterly and whenever a relevant drug-gene pair changes upstream | 27 | ~15 min code refresh |
+| PGS Catalog | Check quarterly against the latest release page; treat scoring-file version changes as result-changing events | 25 | ~30 min |
 
 ClinVar is the highest-value update — new pathogenic classifications happen monthly.
+Before bumping PharmCAT, validate the preprocessor flags, JSON parsing in step 27, and any phenotype/diplotype changes on a known test sample.
+For a public pipeline, keep PGS IDs, PharmCAT Docker tags, and the CPIC lookup table explicitly versioned in git so result changes are auditable over time.
+
+### Minimal Revalidation Before Publishing Updates
+
+1. **ClinVar / VEP refresh**: run step 6 and step 23 on one known sample, then compare pathogenic hit counts and filtered clinical variant counts against the previous run.
+2. **PharmCAT / CPIC refresh**: run step 7 and step 27 on one known sample, then diff diplotypes, phenotypes, and recommendation text before accepting the update.
+3. **PGS Catalog refresh**: rerun step 25 and compare both `variants_used/variants_total` and raw score deltas. If the scoring file version changed, treat the new output as a new baseline, not as directly comparable to the old one.
+4. **Documentation refresh**: update pinned versions, cadence notes, and any changed interpretation guardrails in docs before merging.
 
 ## Common Issues When Developing
 
 
@@ -0,0 +1,36 @@
+# Security Policy
+
+## Reporting Security Issues
+
+**Please do not report security vulnerabilities through public GitHub issues.**
+
+Instead, please use GitHub's private vulnerability reporting:
+
+1. Go to the **Security** tab of this repository
+2. Click **"Report a vulnerability"**
+3. Fill out the form with details
+
+I will respond within **48 hours** and work with you to understand and address the issue.
+
+### What to Include
+
+- Type of issue (e.g., command injection, path traversal, arbitrary file overwrite)
+- Full paths of affected source files
+- Step-by-step instructions to reproduce
+- Proof-of-concept or exploit code (if possible)
+- Impact assessment and potential attack scenarios
+
+## Supported Versions
+
+Only the latest version receives security updates. Please always use the most recent release.
+
+## Security Best Practices for Contributors
+
+1. **Never commit secrets** — use environment variables
+2. **Validate all input** — especially sample names, paths, and user-provided files
+3. **Keep dependencies updated** — Dependabot is enabled on this repo
+4. **Prefer reproducible pins** for Docker images, databases, and tool versions
+
+## Contact
+
+For security questions that aren't vulnerabilities, open a regular issue or start a discussion in a pull request.
@@ -8,6 +8,7 @@ Identifies which drugs work well, which need dose adjustments, and which to avoi
 
 ## Tool
 - **PharmCAT** v2.15.5 (Pharmacogenomics Clinical Annotation Tool, CPIC/PharmGKB)
+- This pipeline is intentionally pinned to `2.15.5` for reproducibility. Newer PharmCAT releases may exist upstream, but step 7 and step 27 should be revalidated together before changing versions.
 
 ## Docker Image
 ```
@@ -19,23 +20,34 @@ pgkb/pharmcat:2.15.5
 SAMPLE=your_sample
 GENOME_DIR=/path/to/your/data
 
-# PharmCAT needs the reference genome for VCF preprocessing
+# Step 1: preprocess the VCF against the GRCh38 reference
 docker run --rm \
   --cpus 2 --memory 4g \
   -v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
   -v ${GENOME_DIR}/reference:/ref \
   pgkb/pharmcat:2.15.5 \
-  java -jar /pharmcat/pharmcat.jar \
+  python3 /pharmcat/pharmcat_vcf_preprocessor.py \
     -vcf /data/${SAMPLE}.vcf.gz \
-    -refFasta /ref/Homo_sapiens_assembly38.fasta \
+    -refFna /ref/Homo_sapiens_assembly38.fasta \
     -o /data/ \
     -bf ${SAMPLE}
 
-# Output: ${SAMPLE}.report.html (interactive HTML report)
+# Step 2: run PharmCAT on the preprocessed VCF
+docker run --rm \
+  --cpus 2 --memory 4g \
+  -v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
+  pgkb/pharmcat:2.15.5 \
+  java -jar /pharmcat/pharmcat.jar \
+    -vcf /data/${SAMPLE}.preprocessed.vcf.bgz \
+    -o /data/ \
+    -bf ${SAMPLE} \
+    -reporterJson
 ```
 
 ## Output
 - HTML report with drug recommendations per gene
+- JSON report used by step 27 (`${SAMPLE}.report.json`)
+- Preprocessed VCF (`${SAMPLE}.preprocessed.vcf.bgz`) generated as an intermediate
 - Covers CYP2C19, CYP2D6, CYP2B6, CYP3A4/5, UGT1A1, DPYD, NAT2, TPMT, etc.
 - Star allele calls with metabolizer status (Poor/Intermediate/Normal/Rapid/Ultra-rapid)
 
@@ -50,4 +62,10 @@ docker run --rm \
 
 ## Limitations
 - **CYP2D6** often returns `Not called` — gene has pseudogene homology that confounds VCF-based calling. Use Cyrius or StellarPGx (BAM-based) if CYP2D6 is critical.
-- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). When in doubt, trust PharmCAT + raw VCF over lab transcription.
+- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). Discrepancies can arise from different genome builds (hg19 vs hg38), different star allele definitions, or different variant calling pipelines. When a discrepancy matters clinically, compare both sets of raw variant calls and consult the PharmVar database for the current allele definitions — do not blindly trust either source.
+- PharmCAT output structure changes across releases. If you upgrade PharmCAT, re-test step 27 (`27-cpic-lookup.sh`) because it parses the JSON output directly.
+
+## Maintenance
+- The pipeline is pinned to `pgkb/pharmcat:2.15.5` for reproducibility, but upstream PharmCAT keeps moving. Latest known upstream release when this doc was last checked was `3.2.0` (2026-02-25).
+- Treat **step 7 and step 27 as one upgrade unit**. If you bump PharmCAT, rerun both on a known sample and diff diplotypes, phenotypes, JSON structure, and CPIC recommendation text before merging.
+- Recheck CPIC / ClinPGx guidance at least quarterly, or sooner if a drug-gene pair you expose in step 27 gets a meaningful update upstream.
@@ -7,7 +7,7 @@ Screens for pathogenic repeat expansions — a class of mutations invisible to b
 STR expansions cause ~40 known neurological/neuromuscular diseases including Huntington's, Fragile X, Friedreich's ataxia, ALS/FTD, myotonic dystrophy, and multiple spinocerebellar ataxias.
 
 ## Tool
-- **ExpansionHunter** v2.5.5 (Illumina)
+- **ExpansionHunter** v2.5.5 (Illumina) — note: v5.x is available upstream with an expanded catalog and `--variant-catalog` flag, but the Docker image used here ships v2.5.5
 
 ## Docker Image
 ```
@@ -20,12 +20,23 @@ weisburd/expansionhunter:latest
 | Disease | Gene | Repeat Unit | Normal | Pathogenic |
 |---|---|---|---|---|
 | Huntington's | HTT | CAG | <27 | >35 |
-| Fragile X | FMR1 | CGG | <45 | >55 (premutation) / >200 (full) |
+| Fragile X | FMR1 | CGG | <45 | ≥55 (premutation) / >200 (full) |
 | Friedreich's Ataxia | FXN | GAA | <33 | >66 |
-| ALS/FTD | C9ORF72 | GGCCCC | <20 | >30 |
-| Myotonic Dystrophy | DMPK | CAG | <35 | >50 |
-| SCA1 | ATXN1 | TGC | <33 | >39 |
-| SCA2 | ATXN2 | GCT | <22 | >33 |
+| ALS/FTD | C9ORF72 | GGCCCC | <24 | >30 |
+| Myotonic Dystrophy 1 | DMPK | CTG | <35 | >50 |
+| SCA1 | ATXN1 | CAG | <33 | >39 |
+| SCA2 | ATXN2 | CAG | <22 | >33 |
+
+## FMR1 Clinical Zones
+
+FMR1 (Fragile X) has four distinct clinical zones — the intermediate zone (45-54 repeats) is often omitted but clinically relevant:
+
+| Zone | Repeats | Clinical Significance |
+|---|---|---|
+| Normal | <45 | No risk |
+| Intermediate (gray zone) | 45-54 | Not affected, but repeats may expand in offspring. Genetic counseling recommended for carriers. |
+| Premutation | 55-200 | Risk of FXTAS (tremor/ataxia, males >50), FXPOI (premature ovarian insufficiency). Offspring at risk of full expansion. |
+| Full mutation | >200 | Fragile X syndrome (intellectual disability, behavioral features). Penetrance varies by sex and methylation. |
 
 ## Notes
 - This is v2.5.5, NOT v5.x. The flag is `--repeat-specs` (directory), not `--variant-catalog`
 
@@ -1,7 +1,7 @@
 # Step 17: Cancer Predisposition Screening with CPSR
 
 ## What This Does
-Screens germline variants against ACMG SF v3.2 (Secondary Findings) and curated cancer predisposition gene panels to identify clinically actionable cancer risk variants.
+Screens germline variants against curated cancer predisposition gene panels to identify clinically actionable cancer risk variants. CPSR uses its own panels sourced from Genomics England PanelApp and other curated databases — these are cancer-focused and distinct from the 81-gene ACMG SF v3.2 list (which also includes cardiac and metabolic genes not covered by CPSR).
 
 ## Why
 ClinVar screening (step 6) finds known pathogenic variants, but CPSR applies ACMG/AMP classification criteria to novel or rare variants in cancer predisposition genes — catching variants ClinVar hasn't yet classified.
@@ -45,7 +45,7 @@ docker run --rm \
 ## Panel Options
 | Panel ID | Description |
 |---|---|
-| 0 | Full ACMG SF v3.2 (81 genes) — recommended |
+| 0 | Comprehensive cancer superpanel (500+ genes) — recommended |
 | 1 | Adult-onset hereditary cancer |
 | 2 | Childhood-onset hereditary cancer |
 | 3 | Lynch syndrome |
@@ -61,6 +61,7 @@ docker run --rm \
 
 ## Notes
 - The 21GB data bundle only needs to be downloaded once — shared across all samples.
+- **Data bundle staleness:** The default bundle (`grch38.20220203`) dates from February 2022. ClinVar, CancerMine, and UniProt annotations inside the bundle are frozen at that date. Check the [PCGR releases page](https://github.com/sigven/pcgr/releases) periodically for updated bundles — newer bundles include more recent ClinVar classifications and gene-disease annotations.
 - Use `--panel_id 0` for comprehensive screening (the full cancer predisposition superpanel, not the ACMG SF list).
 - `--classify_all` ensures all variants in target genes get ACMG classification, not just known pathogenic.
 - CPSR is complementary to ClinVar screening — ClinVar finds known variants, CPSR classifies novel ones.
 
@@ -80,11 +80,14 @@ The summary TSV contains a raw score for each condition. Here is what the column
 - They are NOT percentiles. A raw score of 0.5 does not mean 50th percentile.
 - They are NOT probabilities. A high score does not mean you will develop the condition.
 - They are NOT comparable across conditions. A score of 10 for CAD and 10 for T2D mean entirely different things.
+- They are NOT stable across arbitrary pipeline changes. If you change the PGS file version, genome build harmonization, or variant matching rules, you need to recompute and reinterpret the score.
 
 ### How to make them meaningful
 
 Raw PRS become useful only when compared against a population distribution. To convert your score into a percentile, you need a reference panel of thousands of individuals with scores computed using the same scoring file. The PGS Catalog provides some population-level statistics, but full percentile calculation requires a reference cohort (not included in this pipeline).
 
+Comparing two people is only defensible when both were scored with the same PGS ID, the same scoring file version, the same genome build conventions, and the same preprocessing. Even then, treat the comparison as directional rather than clinically calibrated unless you also have a matched reference distribution.
+
 As a rough guide:
 - Score near the population mean = average genetic risk
 - Score >1 standard deviation above mean = elevated risk (top ~16%)
@@ -110,6 +113,12 @@ Check the `Variants_Used / Variants_Total` ratio. If fewer than 50% of scoring v
 - The script prefers GRCh38-harmonized scoring files. If unavailable, it falls back to the original (which may be on GRCh37 and produce poor variant matching).
 - You can add more PGS IDs by editing the `PGS_IDS` associative array in the script. Browse available scores at [pgscatalog.org](https://www.pgscatalog.org/).
 
+## Maintenance
+
+- Recheck the PGS Catalog against its latest release page at least quarterly before treating this step as "current."
+- A scoring file update is a **result-changing event**. If the harmonized file version/date changes, rerun step 25 and treat the output as a new baseline.
+- If you publish or compare PRS results over time, keep the `PGS ID`, the harmonized scoring file version/date, and the pipeline commit together so score changes remain auditable.
+
 ## Links
 
 - [PGS Catalog](https://www.pgscatalog.org/)
 
@@ -122,6 +122,7 @@ Only genes where you are NOT a normal metabolizer appear here. For each, the rep
 - Run this step after PharmCAT (step 7). For CYP2D6, also run Cyrius (step 21) and manually compare.
 - The output report is printed to stdout as well as written to file.
 - You can add or modify gene-drug pairs by editing the `CPIC_DRUGS` associative array in the script.
+- For maintenance, review the hard-coded CPIC table at least quarterly or whenever you bump PharmCAT, so the lookup stays aligned with current guideline pairs.
 - For the most up-to-date CPIC recommendations, always check [cpicpgx.org](https://cpicpgx.org/) directly.
 
 ## Links