 - JSON output structure: `genes` is `{source → {gene_name → data}}` (a dict of dicts), NOT a list. `sourceDiplotypes` contains `allele1`/`allele2` objects, each with a `.name` field.
 - Star allele calls may differ from other pipelines (e.g., Sanitas hg19 vs. our hg38 DeepVariant calls). PharmCAT allele definitions also change between releases, so calls made with 2.15.5 may not match newer versions.
+- **Pipeline pin vs upstream**: The pipeline is currently pinned to PharmCAT `2.15.5` for reproducibility, but upstream PharmCAT releases continue to ship new guideline content and parser-relevant format changes. Before bumping the Docker tag, revalidate both step 7 and step 27 end-to-end, because the JSON structure and preprocessor flags have changed between major versions.
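A minimal sketch of walking the JSON structure described in the notes above. It assumes `sourceDiplotypes` is a list of objects; the helper name `iter_diplotypes` and the sample payload are hypothetical, and only the field names `genes`, `sourceDiplotypes`, `allele1`, `allele2`, and `name` come from the notes.

```python
from typing import Iterator, Tuple


def iter_diplotypes(report: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (source, gene, "allele1/allele2") from a PharmCAT-style report dict."""
    genes = report.get("genes", {})
    # Guard against treating `genes` as a list, which is the mistake the note warns about.
    if not isinstance(genes, dict):
        raise TypeError("expected `genes` to be a dict of dicts, not a list")
    for source, by_gene in genes.items():
        for gene, data in by_gene.items():
            # Assumption: `sourceDiplotypes` is a list of diplotype objects.
            for diplotype in data.get("sourceDiplotypes", []):
                allele1 = (diplotype.get("allele1") or {}).get("name")
                allele2 = (diplotype.get("allele2") or {}).get("name")
                yield source, gene, f"{allele1}/{allele2}"


# Hypothetical payload, for illustration only.
sample_report = {
    "genes": {
        "CPIC": {
            "CYP2C19": {
                "sourceDiplotypes": [
                    {"allele1": {"name": "*1"}, "allele2": {"name": "*2"}}
                ]
            }
        }
    }
}

calls = list(iter_diplotypes(sample_report))
```

The explicit type check is there so that a structure change in a newer PharmCAT release fails loudly instead of silently producing an empty result.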

 ### plink2 (PRS / Ancestry)
 - **chrX requires sex info**: Use `--chr 1-22 --allow-extra-chr` for PRS/PCA (autosomes only).
 - **`--output-chr chrM`** preserves the `chr` prefix in output. Without it, `--chr 1-22` strips the prefix, so variant IDs become `1:pos` instead of `chr1:pos`.
 - **`--set-all-var-ids '@:#'`**: The `@` placeholder expands to the full contig name (including `chr`). Do NOT use `chr@:#` or you get `chrchr1:pos`.
 - **Scoring file duplicates**: Large PGS Catalog files (e.g., PGS000014 with 7M variants) contain duplicate variant:allele pairs. Deduplicate before `--score`, or plink2 errors out.
+- **PRS guardrail**: Raw PRS values are NOT percentiles, absolute risks, or labels that carry over across tool versions. Never describe them that way unless you have an ancestry-matched reference cohort scored with the exact same PGS file and preprocessing.
+- **Ancestry guardrail**: Treat the current single-sample ancestry step as overlap/QC plus a starting point for downstream projection work, not as a population-placement tool by itself.
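The scoring-file deduplication mentioned above can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual code: it assumes rows have already been parsed into `(variant_id, effect_allele, weight)` tuples, and it keeps the first occurrence of each variant:allele pair. Both the row layout and the keep-first policy are choices made for this example.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float]  # (variant_id, effect_allele, weight); assumed layout


def dedupe_scoring_rows(rows: Iterable[Row]) -> List[Row]:
    """Drop repeated (variant_id, effect_allele) pairs, keeping the first one seen."""
    seen = set()
    kept = []
    for variant_id, allele, weight in rows:
        key = (variant_id, allele)
        if key in seen:
            continue  # a later duplicate of a pair we already kept
        seen.add(key)
        kept.append((variant_id, allele, weight))
    return kept


# Illustrative rows, not taken from a real PGS file.
rows = [
    ("chr1:1000", "A", 0.12),
    ("chr1:1000", "A", 0.12),   # exact duplicate pair
    ("chr1:1000", "G", -0.05),  # same position, different allele: kept
    ("chr2:2000", "T", 0.30),
]
deduped = dedupe_scoring_rows(rows)
```

If duplicates carry different weights, keeping the first one is arbitrary; in that case it may be safer to drop the variant entirely and record how many were dropped.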

 ### bcftools
 - **`bcftools sort` requires `##contig` headers**: it fails silently or errors on VCFs without them. Always inject contig headers from the reference `.fai` when building VCFs.
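One way to build the `##contig` lines from a `.fai` index is sketched below. It relies only on the standard faidx layout (tab-separated, contig name in column 1, length in column 2); the function name is hypothetical and the offsets in the example text are placeholders.

```python
from typing import List


def contig_headers_from_fai(fai_text: str) -> List[str]:
    """Turn the contents of a .fai index into VCF ##contig header lines."""
    headers = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        headers.append(f"##contig=<ID={name},length={int(length)}>")
    return headers


# Two GRCh38 contigs; columns 3-5 (offset, line bases, line width) are placeholders.
fai_example = (
    "chr1\t248956422\t112\t100\t101\n"
    "chr2\t242193529\t251446210\t100\t101\n"
)
headers = contig_headers_from_fai(fai_example)
```

The resulting lines belong in the VCF header before the `#CHROM` line.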
@@ -134,17 +137,27 @@ User's FASTQ/BAM/VCF
 ### Cyrius (CYP2D6)
 - Returns `None/None` for both samples, a common limitation of short-read WGS due to CYP2D7 homology and structural rearrangements.

-## Database Update Cadence
+## Knowledge Base / Tool Update Cadence

-| Database | Update Frequency | Re-run Steps | Time |
+| Resource / Tool | Update Frequency | Re-run Steps | Time |
 | PCGR/CPSR data | Annually or when upstream bundle changes materially | 17 | ~45 min |
+| PharmCAT upstream release | Check quarterly; latest known upstream was 3.2.0 (2026-02-25) while pipeline stays pinned to 2.15.5 | 7, 27 | ~15-30 min validation |
+| CPIC / ClinPGx guideline surface | Check quarterly and whenever a relevant drug-gene pair changes upstream | 27 | ~15 min code refresh |
+| PGS Catalog | Check quarterly against the latest release page; treat scoring-file version changes as result-changing events | 25 | ~30 min |

 ClinVar is the highest-value update: new pathogenic classifications appear monthly.
+Before bumping PharmCAT, validate the preprocessor flags, JSON parsing in step 27, and any phenotype/diplotype changes on a known test sample.
+For a public pipeline, keep PGS IDs, PharmCAT Docker tags, and the CPIC lookup table explicitly versioned in git so result changes are auditable over time.
+
+### Minimal Revalidation Before Publishing Updates
+
+1. **ClinVar / VEP refresh**: run step 6 and step 23 on one known sample, then compare pathogenic hit counts and filtered clinical variant counts against the previous run.
+2. **PharmCAT / CPIC refresh**: run step 7 and step 27 on one known sample, then diff diplotypes, phenotypes, and recommendation text before accepting the update.
+3. **PGS Catalog refresh**: rerun step 25 and compare both `variants_used/variants_total` and raw score deltas. If the scoring file version changed, treat the new output as a new baseline, not as directly comparable to the old one.
+4. **Documentation refresh**: update pinned versions, cadence notes, and any changed interpretation guardrails in docs before merging.
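The "diff diplotypes" part of the PharmCAT / CPIC refresh could look roughly like this. It is a sketch under assumptions: each run has already been reduced to a `{gene: diplotype}` mapping, and the function name and sample calls are invented for illustration.

```python
from typing import Dict, Optional, Tuple

Change = Tuple[Optional[str], Optional[str]]  # (before, after); None means absent


def diff_calls(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Change]:
    """Return the genes whose call differs between two runs."""
    changes = {}
    for gene in sorted(set(old) | set(new)):
        before, after = old.get(gene), new.get(gene)
        if before != after:
            changes[gene] = (before, after)
    return changes


# Illustrative calls from a pinned run and a candidate upgrade run.
old_calls = {"CYP2C19": "*1/*2", "CYP2C9": "*1/*1", "TPMT": "*1/*1"}
new_calls = {"CYP2C19": "*1/*2", "CYP2C9": "*1/*3", "SLCO1B1": "*1/*5"}
changes = diff_calls(old_calls, new_calls)
```

Genes that appear or disappear between runs show up with `None` on one side, which is usually the first sign that the JSON structure or the gene list changed upstream.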
+- This pipeline is intentionally pinned to `2.15.5` for reproducibility. Newer PharmCAT releases may exist upstream, but step 7 and step 27 should be revalidated together before changing versions.

 ## Docker Image
```
@@ -19,23 +20,34 @@ pgkb/pharmcat:2.15.5
 SAMPLE=your_sample
 GENOME_DIR=/path/to/your/data

-# PharmCAT needs the reference genome for VCF preprocessing
+# Step 1: preprocess the VCF against the GRCh38 reference
 docker run --rm \
   --cpus 2 --memory 4g \
   -v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
   -v ${GENOME_DIR}/reference:/ref \
   pgkb/pharmcat:2.15.5 \
-  java -jar /pharmcat/pharmcat.jar \
+  python3 /pharmcat/pharmcat_vcf_preprocessor.py \
     -vcf /data/${SAMPLE}.vcf.gz \
-    -refFasta /ref/Homo_sapiens_assembly38.fasta \
+    -refFna /ref/Homo_sapiens_assembly38.fasta \
     -o /data/ \
     -bf ${SAMPLE}

-# Output: ${SAMPLE}.report.html (interactive HTML report)
+# Step 2: run PharmCAT on the preprocessed VCF
+docker run --rm \
+  --cpus 2 --memory 4g \
+  -v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
+  pgkb/pharmcat:2.15.5 \
+  java -jar /pharmcat/pharmcat.jar \
+    -vcf /data/${SAMPLE}.preprocessed.vcf.bgz \
+    -o /data/ \
+    -bf ${SAMPLE} \
+    -reporterJson
```

 ## Output
 - HTML report with drug recommendations per gene
+- JSON report used by step 27 (`${SAMPLE}.report.json`)
+- Preprocessed VCF (`${SAMPLE}.preprocessed.vcf.bgz`) generated as an intermediate
 - Star allele calls with metabolizer status (Poor/Intermediate/Normal/Rapid/Ultra-rapid)

@@ -50,4 +62,10 @@ docker run --rm \

 ## Limitations
 - **CYP2D6** often returns `Not called`: the gene has pseudogene homology that confounds VCF-based calling. Use Cyrius or StellarPGx (BAM-based) if CYP2D6 is critical.
-- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). When in doubt, trust PharmCAT + raw VCF over lab transcription.
+- PharmCAT may disagree with lab reports on complex haplotypes (e.g., NAT2). Discrepancies can arise from different genome builds (hg19 vs hg38), different star allele definitions, or different variant calling pipelines. When a discrepancy matters clinically, compare both sets of raw variant calls and consult the PharmVar database for the current allele definitions. Do not blindly trust either source.
+- PharmCAT output structure changes across releases. If you upgrade PharmCAT, re-test step 27 (`27-cpic-lookup.sh`) because it parses the JSON output directly.
+
+## Maintenance
+- The pipeline is pinned to `pgkb/pharmcat:2.15.5` for reproducibility, but upstream PharmCAT keeps moving. Latest known upstream release when this doc was last checked was `3.2.0` (2026-02-25).
+- Treat **step 7 and step 27 as one upgrade unit**. If you bump PharmCAT, rerun both on a known sample and diff diplotypes, phenotypes, JSON structure, and CPIC recommendation text before merging.
+- Recheck CPIC / ClinPGx guidance at least quarterly, or sooner if a drug-gene pair you expose in step 27 gets a meaningful update upstream.
File: docs/09-str-expansions.md (17 additions, 6 deletions)
@@ -7,7 +7,7 @@ Screens for pathogenic repeat expansions — a class of mutations invisible to b
 STR expansions cause ~40 known neurological/neuromuscular diseases including Huntington's, Fragile X, Friedreich's ataxia, ALS/FTD, myotonic dystrophy, and multiple spinocerebellar ataxias.

 ## Tool
-- **ExpansionHunter** v2.5.5 (Illumina)
+- **ExpansionHunter** v2.5.5 (Illumina). Note: v5.x is available upstream with an expanded catalog and a `--variant-catalog` flag, but the Docker image used here ships v2.5.5.
File: docs/17-cpsr.md (3 additions, 2 deletions)
@@ -1,7 +1,7 @@
 # Step 17: Cancer Predisposition Screening with CPSR

 ## What This Does
-Screens germline variants against ACMG SF v3.2 (Secondary Findings) and curated cancer predisposition gene panels to identify clinically actionable cancer risk variants.
+Screens germline variants against curated cancer predisposition gene panels to identify clinically actionable cancer risk variants. CPSR uses its own panels sourced from Genomics England PanelApp and other curated databases. These are cancer-focused and distinct from the 81-gene ACMG SF v3.2 list (which also includes cardiac and metabolic genes not covered by CPSR).

 ## Why
 ClinVar screening (step 6) finds known pathogenic variants, but CPSR applies ACMG/AMP classification criteria to novel or rare variants in cancer predisposition genes, catching variants ClinVar hasn't yet classified.
@@ -45,7 +45,7 @@ docker run --rm \
 ## Panel Options
 | Panel ID | Description |
 |---|---|
-| 0 | Full ACMG SF v3.2 (81 genes), recommended |
+| 0 | Comprehensive cancer superpanel (500+ genes), recommended |
 | 1 | Adult-onset hereditary cancer |
 | 2 | Childhood-onset hereditary cancer |
 | 3 | Lynch syndrome |
@@ -61,6 +61,7 @@ docker run --rm \

 ## Notes
 - The 21 GB data bundle only needs to be downloaded once; it is shared across all samples.
+- **Data bundle staleness:** The default bundle (`grch38.20220203`) dates from February 2022. ClinVar, CancerMine, and UniProt annotations inside the bundle are frozen at that date. Check the [PCGR releases page](https://github.com/sigven/pcgr/releases) periodically for updated bundles; newer bundles include more recent ClinVar classifications and gene-disease annotations.
 - Use `--panel_id 0` for comprehensive screening (the full cancer superpanel).
 - `--classify_all` ensures all variants in target genes get ACMG classification, not just known pathogenic ones.
 - CPSR is complementary to ClinVar screening: ClinVar finds known variants, CPSR classifies novel ones.
File: docs/25-prs.md (9 additions, 0 deletions)
@@ -80,11 +80,14 @@ The summary TSV contains a raw score for each condition. Here is what the column
 - They are NOT percentiles. A raw score of 0.5 does not mean 50th percentile.
 - They are NOT probabilities. A high score does not mean you will develop the condition.
 - They are NOT comparable across conditions. A score of 10 for CAD and a score of 10 for T2D mean entirely different things.
+- They are NOT stable across arbitrary pipeline changes. If you change the PGS file version, genome build harmonization, or variant matching rules, you need to recompute and reinterpret the score.

 ### How to make them meaningful

 Raw PRS values become useful only when compared against a population distribution. To convert your score into a percentile, you need a reference panel of thousands of individuals with scores computed using the same scoring file. The PGS Catalog provides some population-level statistics, but full percentile calculation requires a reference cohort (not included in this pipeline).

+Comparing two people is only defensible when both were scored with the same PGS ID, the same scoring file version, the same genome build conventions, and the same preprocessing. Even then, treat the comparison as directional rather than clinically calibrated unless you also have a matched reference distribution.
+
 As a rough guide:
 - Score near the population mean = average genetic risk
 - Score >1 standard deviation above mean = elevated risk (top ~16%)
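The arithmetic behind the rough guide can be sketched as below. It assumes you already have the mean and standard deviation of a matched reference cohort (which this pipeline does not provide) and that scores in that cohort are approximately normally distributed. The function name and the numbers are illustrative only.

```python
import math
from typing import Tuple


def prs_percentile(raw_score: float, ref_mean: float, ref_sd: float) -> Tuple[float, float]:
    """Return (z-score, percentile) of a raw PRS against a reference distribution.

    Assumes the reference scores are roughly normal; the percentile is the
    normal CDF evaluated at the z-score, expressed on a 0-100 scale.
    """
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    z = (raw_score - ref_mean) / ref_sd
    percentile = 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, percentile


# Illustrative numbers: a score exactly one standard deviation above the mean.
z, pct = prs_percentile(raw_score=3.0, ref_mean=2.0, ref_sd=1.0)
```

A z-score of 1 lands at roughly the 84th percentile, which is where the "top ~16%" figure in the guide comes from.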
@@ -110,6 +113,12 @@ Check the `Variants_Used / Variants_Total` ratio. If fewer than 50% of scoring v
 - The script prefers GRCh38-harmonized scoring files. If unavailable, it falls back to the original (which may be on GRCh37 and produce poor variant matching).
 - You can add more PGS IDs by editing the `PGS_IDS` associative array in the script. Browse available scores at [pgscatalog.org](https://www.pgscatalog.org/).
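The `Variants_Used / Variants_Total` check referenced in the hunk context above can be expressed as a small guard. This is a sketch: the 50% threshold is the one the doc mentions, while the function name and the way the result is reported are assumptions.

```python
from typing import Tuple


def coverage_check(variants_used: int, variants_total: int, min_ratio: float = 0.5) -> Tuple[float, bool]:
    """Return (ratio, passes) for the share of scoring-file variants that matched."""
    if variants_total <= 0:
        raise ValueError("variants_total must be positive")
    ratio = variants_used / variants_total
    return ratio, ratio >= min_ratio


# Illustrative counts: one poorly matched score and one well matched score.
low_ratio, low_ok = coverage_check(400, 1000)
high_ratio, high_ok = coverage_check(900, 1000)
```

A failing check is often a hint that the fallback GRCh37 scoring file was used against GRCh38 calls.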

+## Maintenance
+
+- Recheck the PGS Catalog against its latest release page at least quarterly before treating this step as "current."
+- A scoring file update is a **result-changing event**. If the harmonized file version/date changes, rerun step 25 and treat the output as a new baseline.
+- If you publish or compare PRS results over time, keep the `PGS ID`, the harmonized scoring file version/date, and the pipeline commit together so score changes remain auditable.
File: docs/27-cpic-lookup.md (1 addition, 0 deletions)
@@ -122,6 +122,7 @@ Only genes where you are NOT a normal metabolizer appear here. For each, the rep
 - Run this step after PharmCAT (step 7). For CYP2D6, also run Cyrius (step 21) and manually compare.
 - The output report is printed to stdout as well as written to file.
 - You can add or modify gene-drug pairs by editing the `CPIC_DRUGS` associative array in the script.
+- For maintenance, review the hard-coded CPIC table at least quarterly or whenever you bump PharmCAT, so the lookup stays aligned with current guideline pairs.
 - For the most up-to-date CPIC recommendations, always check [cpicpgx.org](https://cpicpgx.org/) directly.
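To show the shape of the lookup, here is a Python stand-in for the idea behind the `CPIC_DRUGS` table. The real script is a shell script and its contents are not shown in this diff, so the table entries, the phenotype strings, and the function name below are all assumptions made for illustration. The drug lists are deliberately tiny and are not a complete set of CPIC pairs.

```python
from typing import Dict, List

# Illustrative gene -> drug pairs only; not the script's actual table.
CPIC_DRUGS: Dict[str, List[str]] = {
    "CYP2C19": ["clopidogrel"],
    "CYP2D6": ["codeine"],
    "SLCO1B1": ["simvastatin"],
}


def actionable_genes(phenotypes: Dict[str, str], table: Dict[str, List[str]]) -> Dict[str, dict]:
    """Keep only genes with a non-normal phenotype that also have table entries."""
    report = {}
    for gene, phenotype in phenotypes.items():
        if "normal" in phenotype.lower():
            continue  # normal metabolizer / normal function: nothing to report
        if gene in table:
            report[gene] = {"phenotype": phenotype, "drugs": table[gene]}
    return report


# Hypothetical phenotype calls for one sample.
phenotypes = {
    "CYP2C19": "Poor Metabolizer",
    "CYP2D6": "Normal Metabolizer",
    "SLCO1B1": "Decreased Function",
}
report = actionable_genes(phenotypes, CPIC_DRUGS)
```

Keeping the table as plain data makes the quarterly review a reviewable diff rather than a code change.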