Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 15 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,18 +54,28 @@ pixi run -e analysis setup-r-packages
> DivRef is constructed by computing empirical phased haplotypes within 25 BPs over 0.5% allele frequency from the Human Genome Diversity Panel (HGDP) using the phased Hail dataset provided by the gnomAD team at the Broad Institute, merged with single variants over 0.5% AF from the gnomAD v4.1.0 summary release.

Some gnomAD v4.1.0 variants that we expected to see represented in DivRef 1.1 were not.
We checked all 'gnomAD_variant' variants on chr22 from DivRef 1.1 against the gnomAD 4.1 joint (exomes and genomes) variants and the gnomAD 3.1.2 HGDP+1KG subset variants.
The latter is the source used for the haplotypes present in DivRef.
We checked all 'gnomAD_variant' variants on chr22 from DivRef 1.1 against:

- gnomAD 3.1.2 HGDP+1KG subset
- gnomAD 3.1.2 genomes (~76K genomes)
- gnomAD 4.1 joint (~730K exomes and ~76K genomes)

gnomAD 3.1.2 HGDP+1KG subset is the source used for the haplotypes present in DivRef 1.1.

```bash
pixi run -e analysis snakemake -j1 -s workflows/compare_divref_gnomad.smk
```

We found that all DivRef 1.1 variants were present in the gnomAD 3.1.2 HGDP+1KG subset, while 16 were missing from the gnomAD 4.1 joint set.
**DivRef 1.1 variants present in gnomAD datasets**

We found that all DivRef 1.1 'gnomAD_variant' variants were present in the gnomAD 3.1.2 HGDP+1KG subset and in the gnomAD 3.1.2 genomes set, while 16 were missing from the gnomAD 4.1 joint set.

We further compared the allele frequencies for the 5 populations recorded in the DivRef 1.1 DuckDB index against the frequencies for those populations in the two gnomAD sets, using the Hail tables as input.

The allele frequencies between DivRef 1.1 and the gnomAD 3.1.2 HGDP+1KG subset were very close, within a rounding error of 5e06, for all variants.
- gnomAD 3.1.2 HGDP+1KG subset: within a rounding error of 5e-6, for all variants.
- gnomAD 3.1.2 genomes: 1,400 variants with an AF difference >= 0.001, all of which had lower AF in the genomes set compared to HGDP+1KG, indicating more stringent filtering in the genomes set
- gnomAD 4.1 joint: 60,424 variants with an AF difference >= 0.001, evenly distributed both lower and higher AF

The allele frequencies between DivRef 1.1 and gnomAD 4.1 joint set were significantly different. 60,424 variants were found with an allele frequence difference >= 0.001.
There are 506,983 DivRef 1.1 'gnomAD_variant' variants. When we run the original DivRef script [extract_gnomad_afs.py](https://github.com/e9genomics/human-diversity-reference/blob/main/scripts/extract_gnomad_afs.py) (uses gnomAD 3.1.2 HGDP+1KG sites as input, hard-coded) followed by lines 156-177 of [create_fasta_and_index.py](https://github.com/e9genomics/human-diversity-reference/blob/main/scripts/create_fasta_and_index.py) (lines 156-177) using the parameters specified in the [Makefile](https://github.com/e9genomics/human-diversity-reference/blob/main/Makefile), we also get 506,983 variants.

We concluded that the DivRef 1.1 documentation was incorrect, and that the actual source of the gnomAD variants in the dataset was the gnomAD 3.1.2 HGDP+1KG subset, the same as for the haplotypes.
48 changes: 44 additions & 4 deletions divref/divref/tools/extract_gnomad_single_afs.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,9 @@ class GnomadVersion(StrEnum):
"""Mapping of gnomAD versions to GCS sites tables."""

JOINT_41 = "gs://gcp-public-data--gnomad/release/4.1/ht/joint/gnomad.joint.v4.1.sites.ht"
GENOMES_312 = (
"gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht"
)
HGDP_1KG_312 = "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_variant_annotations.ht"


Expand Down Expand Up @@ -54,14 +57,21 @@ def extract_gnomad_single_afs(
ht: hl.Table
match gnomad_version:
case GnomadVersion.JOINT_41:
ht = extract_from_joint_41(
ht = extract_from_gnomad_41_joint(
contig=contig,
freq_threshold=freq_threshold,
populations=populations,
reference_genome=reference_genome,
)
case GnomadVersion.GENOMES_312:
ht = extract_from_gnomad_312_genomes(
contig=contig,
freq_threshold=freq_threshold,
populations=populations,
reference_genome=reference_genome,
)
case GnomadVersion.HGDP_1KG_312:
ht = extract_from_hgdp_1kg_312(
ht = extract_from_gnomad_312_hgdp_1kg(
contig=contig,
freq_threshold=freq_threshold,
populations=populations,
Expand All @@ -84,7 +94,7 @@ def extract_gnomad_single_afs(
).export(str(out_sites_tsv))


def extract_from_joint_41(
def extract_from_gnomad_41_joint(
contig: str,
freq_threshold: float,
populations: list[str],
Expand Down Expand Up @@ -114,7 +124,37 @@ def extract_from_joint_41(
return va.select("locus", "alleles", "pop_freqs")


def extract_from_hgdp_1kg_312(
def extract_from_gnomad_312_genomes(
contig: str,
freq_threshold: float,
populations: list[str],
reference_genome: str,
) -> hl.Table:
"""Use the gnomAD 3.1.2 genomes sites schema."""
va_all = hl.read_table(GnomadVersion.GENOMES_312.value)
interval = hl.parse_locus_interval(contig, reference_genome=reference_genome)
va = hl.filter_intervals(va_all, [interval])

freq_meta = va.globals.freq_meta.collect()[0]
map_to_index = {to_hashable_items(x): i for i, x in enumerate(freq_meta)}

pop_indices = []
for pop in populations:
idx = map_to_index.get(to_hashable_items({"group": "adj", "pop": pop}))
if idx is None:
raise ValueError(f"Population {pop!r} not found in gnomAD frequency metadata")
pop_indices.append(idx)

va = va.select_globals(pops=populations)
va = va.select(pop_freqs=hl.literal(pop_indices).map(lambda i: va.freq[i]))
if freq_threshold > 0:
va = va.filter(hl.any(lambda x: x.AF > freq_threshold, va.pop_freqs))
va = va.key_by()

return va.select("locus", "alleles", "pop_freqs")


def extract_from_gnomad_312_hgdp_1kg(
contig: str,
freq_threshold: float,
populations: list[str],
Expand Down
18 changes: 16 additions & 2 deletions scripts/compare_divref_gnomad.R
Original file line number Diff line number Diff line change
Expand Up @@ -113,11 +113,25 @@ p <- divref_in_gnomad_with_af_diffs %>%
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variants") +
labs(title = paste0("DivRef 1.1 'gnomAD_variant' variants found in ", opts$gnomad_label))
ylab("Variants")

ggsave(paste0(opts$output_base, ".af_diffs.png"), p, height = 10, width = 6)

p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variants")

ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
Comment on lines +120 to +133
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify the y-axis label for the aggregate histogram.

This new plot counts variant-population AF-difference observations, not distinct variants, because each variant contributes up to five rows after pivot_longer().

Proposed label tweak
-  ylab("Variants")
+  ylab("Variant-population comparisons")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variants")
ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variant-population comparisons")
ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/compare_divref_gnomad.R` around lines 120 - 133, The y-axis label on
the aggregate histogram built from divref_in_gnomad_with_af_diffs after
pivot_longer currently reads "Variants" but actually counts variant-population
AF-difference observations (each variant may contribute up to five rows); update
the ylab() call in the ggplot pipeline (the object p created after pivot_longer
and before ggsave) to a clearer label such as "Variant-population observations"
or "Variant-population observations (log scale)" so the axis accurately
describes the plotted counts.


# Count: variants with large AF differences ----

n_large_af_diff <- divref_in_gnomad_with_af_diffs %>%
Expand Down
8 changes: 7 additions & 1 deletion workflows/compare_divref_gnomad.smk
Original file line number Diff line number Diff line change
Expand Up @@ -14,11 +14,12 @@ CONTIG: str = "chr22"
DIVREF_DUCKDB_URL: str = (
"https://zenodo.org/records/14802613/files/" "DivRef-v1.1.haplotypes_gnomad_merge.index.duckdb"
)
GNOMAD_VERSIONS: list[str] = ["joint_41", "hgdp_1kg_312"]
GNOMAD_VERSIONS: list[str] = ["joint_41", "genomes_312", "hgdp_1kg_312"]

# Maps filename wildcard → plot label used by compare_divref_gnomad.R
GNOMAD_LABEL: dict[str, str] = {
"joint_41": "gnomAD 4.1 joint AF",
"genomes_312": "gnomAD 3.1.2 genomes AF",
"hgdp_1kg_312": "gnomAD 3.1.2 HGDP+1KG AF",
}

Expand All @@ -36,6 +37,10 @@ rule all:
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs.png",
gnomad_version=GNOMAD_VERSIONS,
),
expand(
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs_all.png",
gnomad_version=GNOMAD_VERSIONS,
),
expand(
f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.not_in_gnomad_afs.png",
gnomad_version=GNOMAD_VERSIONS,
Expand Down Expand Up @@ -104,6 +109,7 @@ rule compare_divref_gnomad:
tsv=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.tsv",
output:
af_diffs_png=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs.png",
af_diffs_all_png=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.af_diffs_all.png",
not_in_gnomad_png=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.not_in_gnomad_afs.png",
not_in_gnomad_tsv=f"{OUTPUT_DIR}/compare_divref_gnomad/{CONTIG}.{{gnomad_version}}.divref_not_in_gnomad.tsv",
log:
Expand Down
Loading