Skip to content

feat: add gnomAD 3.1.2 genomes to comparison#37

Open
ameynert wants to merge 5 commits intomainfrom
am_analysis2
Open

feat: add gnomAD 3.1.2 genomes to comparison#37
ameynert wants to merge 5 commits intomainfrom
am_analysis2

Conversation

@ameynert
Copy link
Copy Markdown
Collaborator

@ameynert ameynert commented Apr 22, 2026

Summary by CodeRabbit

  • New Features

    • Added support for comparing DivRef against the gnomAD 3.1.2 genomes release
    • Produces an additional aggregated allele-frequency differences plot (saved as a separate output image)
  • Documentation

    • Expanded dataset-specific allele-frequency comparison results and corrected rounding threshold
    • Added a consistency check confirming DivRef variant count matches upstream extraction outputs

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6aab77b-fe1b-442a-b771-ca87b5e71755

📥 Commits

Reviewing files that changed from the base of the PR and between 775ca56 and 5eb1cc4.

📒 Files selected for processing (2)
  • divref/divref/tools/extract_gnomad_single_afs.py
  • workflows/compare_divref_gnomad.smk
📝 Walkthrough

Walkthrough

Adds support for the gnomAD 3.1.2 genomes dataset to the gnomAD AF extraction and comparison pipeline, renames existing extractor entrypoints for clarity, implements a new genomes extractor that filters and builds population-frequency arrays, and updates the workflow and reporting to include the new release.

Changes

Cohort / File(s) Summary
Documentation
README.md
Expanded comparison to list three gnomAD datasets (3.1.2 HGDP+1KG, 3.1.2 genomes, 4.1 joint), reported presence/absence counts, dataset-specific AF difference bullets (including corrected 5e-6 threshold), and a consistency check for total gnomAD_variant count (506,983).
gnomAD Extraction Tool
divref/divref/tools/extract_gnomad_single_afs.py
Added GnomadVersion.GENOMES_312, implemented extract_from_gnomad_312_genomes(...), renamed extractors to extract_from_gnomad_41_joint and extract_from_gnomad_312_hgdp_1kg, and updated dispatcher to route GENOMES_312. Key logic: read genomes sites table, map freq_meta → indices, select populations, filter by AF threshold, return keyed table with locus, alleles, pop_freqs.
Workflow & Rules
workflows/compare_divref_gnomad.smk
Added genomes_312 to GNOMAD_VERSIONS and GNOMAD_LABEL, and declared new per-release output *.af_diffs_all.png in rule compare_divref_gnomad and rule all.
Visualization Script
scripts/compare_divref_gnomad.R
Removed custom plot title; added aggregated histogram of AF differences across populations and save as additional *.af_diffs_all.png output.

Sequence Diagram

sequenceDiagram
    participant Workflow as Workflow Engine
    participant Dispatcher as extract_gnomad_single_afs<br/>(dispatcher)
    participant Genomes as extract_from_gnomad_312_genomes<br/>(new extractor)
    participant GCS as gnomAD 3.1.2 Genomes Table (GCS)
    participant Output as Result Table

    Workflow->>Dispatcher: invoke with GENOMES_312 + params
    Dispatcher->>Genomes: dispatch to genomes extractor
    Genomes->>GCS: read contig interval & site rows
    GCS-->>Genomes: return variant records
    Genomes->>Genomes: map freq_meta → indices
    Genomes->>Genomes: select populations & build pop_freqs
    Genomes->>Genomes: filter variants by freq_threshold
    Genomes->>Output: key & return (locus, alleles, pop_freqs)
    Output-->>Workflow: processed table for downstream steps
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐇 I hopped through tables, mapped freq_meta to keys,
I filtered by AF and chased contig-wise breeze.
Genomes joined the party, arrays snug in my paws,
Read, map, filter, return — neat output because:
Happy carrots for code, and a small rabbit applause 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add gnomAD 3.1.2 genomes to comparison' clearly and accurately summarizes the main change: adding support for gnomAD 3.1.2 genomes dataset to the comparison workflow, as evidenced by the new enum value, extractor function, workflow updates, and R script changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_analysis2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
divref/divref/tools/extract_gnomad_single_afs.py (1)

97-184: Consider factoring out the three near-identical extractors.

The three extract_from_gnomad_* functions differ only in (a) the table URI, (b) the globals field holding freq_meta (joint_globals.freq_meta vs. freq_meta vs. gnomad_freq_meta), (c) the row field for frequencies (joint.freq vs. freq vs. gnomad_freq), and (d) the pop-metadata key (gen_anc vs. pop). A small config table keyed on GnomadVersion plus one shared helper would eliminate the copy/paste (which already produced the docstring bug on line 133) and make adding future releases a one-line change. Not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 97 - 184, The
three near-identical extractors (extract_from_gnomad_41_joint,
extract_from_gnomad_312_genomes, extract_from_gnomad_312_hgdp_1kg) should be
collapsed into one shared helper driven by a small config mapping keyed by
GnomadVersion that provides the table URI, the globals field name that contains
freq_meta (e.g., "joint_globals.freq_meta" vs "freq_meta" vs
"gnomad_freq_meta"), the row frequency field path (e.g., "joint.freq" vs "freq"
vs "gnomad_freq"), and the pop-metadata key name ("gen_anc" vs "pop"); implement
a single function (e.g., extract_from_gnomad) that reads
hl.read_table(GnomadVersion.value), pulls freq_meta via the configured globals
path, builds pop_indices using the configured pop key, selects pop_freqs by
mapping the configured row frequency field, applies the freq_threshold filter,
and returns the same table shape, then replace the three specific functions with
thin wrappers or calls into that helper and update the docstring(s) to the
correct description.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 127-133: The docstring for extract_from_gnomad_312_genomes is
incorrect (it mentions "HGDP+1KG sites schema"); update it to correctly describe
that this function targets the gnomAD v3.1.2 genomes schema (not the HGDP+1KG
subset), reference the correct table URI and the genomes frequency fields
(freq_meta / freq rather than gnoma d_freq_meta / gnomad_freq), and note any
schema-specific behavior (e.g., how populations and freq_threshold are applied).
Ensure the docstring in the extract_from_gnomad_312_genomes function clearly
states it uses the gnomAD 3.1.2 genomes schema and lists the relevant fields
used by the implementation.

In `@README.md`:
- Line 75: The README contains a typo in the reported rounding-error magnitude:
replace the incorrect scientific notation "5e06" with "5e-06" so the statement
reads approximately 5×10⁻⁶ (matching expected gnomAD AF tolerance); update the
sentence mentioning "gnomAD 3.1.2 HGDP+1KG subset: within a rounding error of
5e06" to use "5e-06" instead.

---

Nitpick comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 97-184: The three near-identical extractors
(extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes,
extract_from_gnomad_312_hgdp_1kg) should be collapsed into one shared helper
driven by a small config mapping keyed by GnomadVersion that provides the table
URI, the globals field name that contains freq_meta (e.g.,
"joint_globals.freq_meta" vs "freq_meta" vs "gnomad_freq_meta"), the row
frequency field path (e.g., "joint.freq" vs "freq" vs "gnomad_freq"), and the
pop-metadata key name ("gen_anc" vs "pop"); implement a single function (e.g.,
extract_from_gnomad) that reads hl.read_table(GnomadVersion.value), pulls
freq_meta via the configured globals path, builds pop_indices using the
configured pop key, selects pop_freqs by mapping the configured row frequency
field, applies the freq_threshold filter, and returns the same table shape, then
replace the three specific functions with thin wrappers or calls into that
helper and update the docstring(s) to the correct description.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5df1715b-4f22-4fe7-a870-7074be6a54a9

📥 Commits

Reviewing files that changed from the base of the PR and between 4d5073c and 0a2b00d.

📒 Files selected for processing (3)
  • README.md
  • divref/divref/tools/extract_gnomad_single_afs.py
  • workflows/compare_divref_gnomad.smk

Comment thread divref/divref/tools/extract_gnomad_single_afs.py Outdated
Comment thread README.md Outdated
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 21:57 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 23:34 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
divref/divref/tools/extract_gnomad_single_afs.py (2)

100-100: Accept an ordered sequence instead of a concrete list.

The helpers only iterate over populations and preserve order, so Sequence[str] is the more general practical type.

Proposed type refinement
+from collections.abc import Sequence
 from enum import StrEnum
 from enum import unique
 from pathlib import Path
-    populations: list[str],
+    populations: Sequence[str],

As per coding guidelines, “For Python function parameters, accept the most general type practical (e.g., Iterable over List)”.

Also applies to: 130-130, 160-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/extract_gnomad_single_afs.py` at line 100, The parameter
type annotation for the parameter named "populations" is currently annotated as
list[str] in several functions; change these to accept a more general ordered
sequence by using Sequence[str] instead (update all occurrences around the
previously shown positions, e.g., the signatures that declare populations:
list[str]); also add the necessary import from typing (Sequence) at the top of
the module so the annotations remain valid. Ensure you update every function
signature where populations: list[str] appears (including the other two
locations mentioned) and run type checks.

97-184: Expand the public helper docstrings to Google style.

These renamed/new public helpers currently use one-line docstrings, but they return hl.Table and can raise ValueError for unknown populations. Please add Args:, Returns:, and Raises: blocks.

Proposed shape for one helper
 def extract_from_gnomad_312_genomes(
     contig: str,
     freq_threshold: float,
     populations: list[str],
     reference_genome: str,
 ) -> hl.Table:
-    """Use the gnomAD 3.1.2 genomes sites schema."""
+    """Extract population frequencies from the gnomAD 3.1.2 genomes sites schema.
+
+    Args:
+        contig: Contig interval to extract.
+        freq_threshold: Minimum allele frequency in any selected population to retain.
+        populations: Population codes to extract, in output order.
+        reference_genome: Reference genome name used to parse the interval.
+
+    Returns:
+        Hail table with `locus`, `alleles`, and `pop_freqs`.
+
+    Raises:
+        ValueError: If a requested population is not present in the frequency metadata.
+    """

As per coding guidelines, “Use Google-style docstrings with Args:, Returns:, Yields:, and Raises: blocks on all public functions and classes in Python” and “Update docstrings if function signature or behavior changed in Python code”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 97 - 184,
Update the docstrings for the three public helper functions
extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes, and
extract_from_gnomad_312_hgdp_1kg to Google-style multi-line docstrings: include
an Args: section documenting contig (str), freq_threshold (float), populations
(list[str]), and reference_genome (str); a Returns: section specifying that the
function returns a hail Table (hl.Table) of loci with alleles and pop_freqs; and
a Raises: section noting that a ValueError is raised when a requested population
is not found in the gnomAD frequency metadata. Ensure the new docstrings replace
the existing one-line summaries and accurately reflect each function's specific
globals/field names (e.g., va.joint.freq, va.freq, va.gnomad_freq) where
relevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/compare_divref_gnomad.R`:
- Around line 120-133: The y-axis label on the aggregate histogram built from
divref_in_gnomad_with_af_diffs after pivot_longer currently reads "Variants" but
actually counts variant-population AF-difference observations (each variant may
contribute up to five rows); update the ylab() call in the ggplot pipeline (the
object p created after pivot_longer and before ggsave) to a clearer label such
as "Variant-population observations" or "Variant-population observations (log
scale)" so the axis accurately describes the plotted counts.

---

Nitpick comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Line 100: The parameter type annotation for the parameter named "populations"
is currently annotated as list[str] in several functions; change these to accept
a more general ordered sequence by using Sequence[str] instead (update all
occurrences around the previously shown positions, e.g., the signatures that
declare populations: list[str]); also add the necessary import from typing
(Sequence) at the top of the module so the annotations remain valid. Ensure you
update every function signature where populations: list[str] appears (including
the other two locations mentioned) and run type checks.
- Around line 97-184: Update the docstrings for the three public helper
functions extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes, and
extract_from_gnomad_312_hgdp_1kg to Google-style multi-line docstrings: include
an Args: section documenting contig (str), freq_threshold (float), populations
(list[str]), and reference_genome (str); a Returns: section specifying that the
function returns a hail Table (hl.Table) of loci with alleles and pop_freqs; and
a Raises: section noting that a ValueError is raised when a requested population
is not found in the gnomAD frequency metadata. Ensure the new docstrings replace
the existing one-line summaries and accurately reflect each function's specific
globals/field names (e.g., va.joint.freq, va.freq, va.gnomad_freq) where
relevant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 048474bb-1dcb-4cb6-8f19-677cf1a6b1ca

📥 Commits

Reviewing files that changed from the base of the PR and between 0a2b00d and 775ca56.

📒 Files selected for processing (4)
  • README.md
  • divref/divref/tools/extract_gnomad_single_afs.py
  • scripts/compare_divref_gnomad.R
  • workflows/compare_divref_gnomad.smk
🚧 Files skipped from review as they are similar to previous changes (2)
  • workflows/compare_divref_gnomad.smk
  • README.md

Comment on lines +120 to +133
p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variants")

ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify the y-axis label for the aggregate histogram.

This new plot counts variant-population AF-difference observations, not distinct variants, because each variant contributes up to five rows after pivot_longer().

Proposed label tweak
-  ylab("Variants")
+  ylab("Variant-population comparisons")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variants")
ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
p <- divref_in_gnomad_with_af_diffs %>%
select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>%
pivot_longer(
cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe),
names_to = "population", values_to = "diff_freq", names_prefix = "diff_"
) %>%
ggplot(aes(x = diff_freq)) +
geom_histogram() +
scale_y_log10() +
theme_bw() +
xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) +
ylab("Variant-population comparisons")
ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/compare_divref_gnomad.R` around lines 120 - 133, The y-axis label on
the aggregate histogram built from divref_in_gnomad_with_af_diffs after
pivot_longer currently reads "Variants" but actually counts variant-population
AF-difference observations (each variant may contribute up to five rows); update
the ylab() call in the ggplot pipeline (the object p created after pivot_longer
and before ggsave) to a clearer label such as "Variant-population observations"
or "Variant-population observations (log scale)" so the axis accurately
describes the plotted counts.

@ameynert ameynert deployed to github-actions-snakemake-linting April 22, 2026 23:44 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant