Skip to content

Commit e91d459

Browse files
committed
feat: analysis workflow
1 parent ec634fe commit e91d459

2 files changed

Lines changed: 140 additions & 36 deletions

File tree

README.md

Lines changed: 13 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -49,46 +49,23 @@ pixi run -e analysis setup-r-packages
4949

5050
### Compare DivRef 1.1 against different gnomAD releases
5151

52-
1. Download the DivRef 1.1 index file.
52+
[DivRef 1.1](https://zenodo.org/records/14802613) states that:
5353

54-
```bash
55-
mkdir -p data/analysis/input
56-
wget -O data/analysis/input/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb \
57-
https://zenodo.org/records/14802613/files/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb
58-
```
54+
> DivRef is constructed by computing empirical phased haplotypes within 25 BPs over 0.5% allele frequency from the Human Genome Diversity Panel (HGDP) using the phased Hail dataset provided by the gnomAD team at the Broad Institute, merged with single variants over 0.5% AF from the gnomAD v4.1.0 summary release.
5955
60-
2. Extract locus, alleles, and default DivRef population AFs from the gnomAD 4.1 joint (exomes and genomes) call set and the gnomAD 3.1.2 HGDP+1KG subset call set. Restrict to `chr22`.
56+
Some gnomAD v4.1.0 variants that we expected to see represented in DivRef 1.1 were not.
57+
We checked all 'gnomAD_variant' variants on chr22 from DivRef 1.1 against the gnomAD 4.1 joint (exomes and genomes) variants and the gnomAD 3.1.2 HGDP+1KG subset variants.
58+
The latter is the source used for the haplotypes present in DivRef.
6159

6260
```bash
63-
pixi run divref extract-gnomad-single-afs \
64-
--contig chr22 \
65-
--gnomad-version JOINT_41 \
66-
--out-sites-hail-table data/analysis/input/chr22.joint_41.ht \
67-
--out-sites-tsv data/analysis/input/chr22.joint_41.tsv
68-
69-
pixi run divref extract-gnomad-single-afs \
70-
--contig chr22 \
71-
--gnomad-version HGDP_1KG_312 \
72-
--out-sites-hail-table data/analysis/input/chr22.hgdp_1kg_312.ht \
73-
--out-sites-tsv data/analysis/input/chr22.hgdp_1kg_312.ts
61+
pixi run -e analysis snakemake -j1 -s workflows/compare_divref_gnomad.smk
7462
```
7563

76-
3. Compare DivRef 1.1 chr22 gnomAD single variant sites to each of the two gnomAD releases.
64+
We found that all DivRef 1.1 variants were present in the gnomAD 3.1.2 HGDP+1KG subset, while 16 were missing from the gnomAD 4.1 joint set.
65+
We further compared the allele frequencies for the 5 populations recorded in the DivRef 1.1 DuckDB index against the frequencies for those populations in the two gnomAD sets, using the Hail tables as input.
7766

78-
```bash
79-
mkdir -p data/analysis/compare_divref_gnomad
80-
81-
pixi run Rscript scripts/compare_divref_gnomad.R \
82-
--contig chr22 \
83-
--divref_duckdb data/analysis/input/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb \
84-
--gnomad_tsv data/analysis/input/chr22.joint_41.tsv \
85-
--gnomad_label "gnomAD 4.1 joint AF" \
86-
--output_base data/analysis/compare_divref_gnomad/chr22.joint_41
87-
88-
pixi run Rscript scripts/compare_divref_gnomad.R \
89-
--contig chr22 \
90-
--divref_duckdb data/analysis/input/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb \
91-
--gnomad_tsv data/analysis/input/chr22.hgdp_1kg_312.tsv \
92-
--gnomad_label "gnomAD 3.1.2 HGDP+1KG AF" \
93-
--output_base data/analysis/compare_divref_gnomad/chr22.hgdp_1kg_312
94-
```
67+
The allele frequencies between DivRef 1.1 and the gnomAD 3.1.2 HGDP+1KG subset were very close, within a rounding error of 5e06, for all variants.
68+
69+
The allele frequencies between DivRef 1.1 and gnomAD 4.1 joint set were significantly different. 60,424 variants were found with an allele frequence difference >= 0.001.
70+
71+
We concluded that the DivRef 1.1 documentation was incorrect, and that the actual source of the gnomAD variants in the dataset was the gnomAD 3.1.2 HGDP+1KG subset, the same as for the haplotypes.
Lines changed: 127 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,127 @@
1+
####################################################################################################
2+
# Compares DivRef 1.1 single-variant sites against allele frequencies from different gnomAD
3+
# releases for chr22.
4+
####################################################################################################
5+
6+
from pathlib import Path
7+
8+
####################################################################################################
9+
# Inputs / constants
10+
####################################################################################################
11+
12+
OUTPUT_DIR: Path = Path("data/analysis")
13+
CONTIG: str = "chr22"
14+
DIVREF_DUCKDB_URL: str = (
15+
"https://zenodo.org/records/14802613/files/" "DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb"
16+
)
17+
GNOMAD_VERSIONS: list[str] = ["joint_41", "hgdp_1kg_312"]
18+
19+
# Maps filename wildcard → plot label used by compare_divref_gnomad.R
20+
GNOMAD_LABEL: dict[str, str] = {
21+
"joint_41": "gnomAD 4.1 joint AF",
22+
"hgdp_1kg_312": "gnomAD 3.1.2 HGDP+1KG AF",
23+
}
24+
25+
####################################################################################################
26+
# Rules
27+
####################################################################################################
28+
29+
ruleorder: compare_divref_gnomad > extract_gnomad_single_afs
30+
31+
rule all:
32+
input:
33+
expand(
34+
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs.png",
35+
gnomad_version=GNOMAD_VERSIONS,
36+
),
37+
expand(
38+
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.not_in_gnomad_afs.png",
39+
gnomad_version=GNOMAD_VERSIONS,
40+
),
41+
expand(
42+
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.divref_not_in_gnomad.tsv",
43+
gnomad_version=GNOMAD_VERSIONS,
44+
),
45+
expand(
46+
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.log",
47+
gnomad_version=GNOMAD_VERSIONS,
48+
)
49+
50+
51+
####################################################################################################
52+
# Downloads the DivRef 1.1 DuckDB index from Zenodo.
53+
####################################################################################################
54+
rule download_divref_index:
55+
output:
56+
duckdb=f"{OUTPUT_DIR}/input/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb",
57+
log:
58+
f"logs/compare_divref_gnomad/download_divref_index.log",
59+
params:
60+
url=DIVREF_DUCKDB_URL,
61+
shell:
62+
"""
63+
(
64+
wget --no-verbose -O {output.duckdb} {params.url}
65+
) &> {log}
66+
"""
67+
68+
69+
####################################################################################################
70+
# Extracts chr22 allele frequencies from a gnomAD sites table and writes a Hail table and a flat
71+
# TSV with one AF column per population.
72+
####################################################################################################
73+
rule extract_gnomad_single_afs:
74+
output:
75+
ht=directory(f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.ht"),
76+
tsv=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.tsv",
77+
log:
78+
f"logs/compare_divref_gnomad/extract_gnomad_single_afs.{{gnomad_version}}.log",
79+
params:
80+
contig=CONTIG,
81+
gnomad_version=lambda wildcards: wildcards.gnomad_version.upper(),
82+
shell:
83+
"""
84+
(
85+
divref extract-gnomad-single-afs \
86+
--contig {params.contig} \
87+
--gnomad-version {params.gnomad_version} \
88+
--out-sites-hail-table {output.ht} \
89+
--out-sites-tsv {output.tsv}
90+
) &> {log}
91+
"""
92+
93+
94+
####################################################################################################
95+
# Compares DivRef 1.1 gnomAD_variant sites to the extracted gnomAD allele frequencies.
96+
# If all DivRef variants are found in gnomAD the R script exits early; touch ensures Snakemake's
97+
# output checks are satisfied even in that case.
98+
####################################################################################################
99+
rule compare_divref_gnomad:
100+
input:
101+
duckdb=f"{OUTPUT_DIR}/input/DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb",
102+
tsv=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.tsv",
103+
output:
104+
af_diffs_png=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs.png",
105+
not_in_gnomad_png=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.not_in_gnomad_afs.png",
106+
not_in_gnomad_tsv=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.divref_not_in_gnomad.tsv",
107+
log:
108+
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.log",
109+
params:
110+
contig=CONTIG,
111+
gnomad_label=lambda wildcards: GNOMAD_LABEL[wildcards.gnomad_version],
112+
output_base=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}",
113+
shell:
114+
"""
115+
(
116+
Rscript scripts/compare_divref_gnomad.R \
117+
--contig {params.contig} \
118+
--divref_duckdb {input.duckdb} \
119+
--gnomad_tsv {input.tsv} \
120+
--gnomad_label '{params.gnomad_label}' \
121+
--output_base {params.output_base}
122+
123+
# The R script exits early when no DivRef variants are absent from gnomAD,
124+
# so touch ensures these outputs exist for Snakemake's file checks.
125+
touch {output.not_in_gnomad_png} {output.not_in_gnomad_tsv}
126+
) &> {log}
127+
"""

0 commit comments

Comments
 (0)