Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 41 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds support for the gnomAD 3.1.2 genomes dataset to the gnomAD AF extraction and comparison pipeline, renames existing extractor entrypoints for clarity, implements a new genomes extractor that filters and builds population-frequency arrays, and updates the workflow and reporting to include the new release. Changes
Sequence DiagramsequenceDiagram
participant Workflow as Workflow Engine
participant Dispatcher as extract_gnomad_single_afs<br/>(dispatcher)
participant Genomes as extract_from_gnomad_312_genomes<br/>(new extractor)
participant GCS as gnomAD 3.1.2 Genomes Table (GCS)
participant Output as Result Table
Workflow->>Dispatcher: invoke with GENOMES_312 + params
Dispatcher->>Genomes: dispatch to genomes extractor
Genomes->>GCS: read contig interval & site rows
GCS-->>Genomes: return variant records
Genomes->>Genomes: map freq_meta → indices
Genomes->>Genomes: select populations & build pop_freqs
Genomes->>Genomes: filter variants by freq_threshold
Genomes->>Output: key & return (locus, alleles, pop_freqs)
Output-->>Workflow: processed table for downstream steps
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
divref/divref/tools/extract_gnomad_single_afs.py (1)
97-184: Consider factoring out the three near-identical extractors.The three
extract_from_gnomad_*functions differ only in (a) the table URI, (b) the globals field holdingfreq_meta(joint_globals.freq_metavs.freq_metavs.gnomad_freq_meta), (c) the row field for frequencies (joint.freqvs.freqvs.gnomad_freq), and (d) the pop-metadata key (gen_ancvs.pop). A small config table keyed onGnomadVersionplus one shared helper would eliminate the copy/paste (which already produced the docstring bug on line 133) and make adding future releases a one-line change. Not blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 97 - 184, The three near-identical extractors (extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes, extract_from_gnomad_312_hgdp_1kg) should be collapsed into one shared helper driven by a small config mapping keyed by GnomadVersion that provides the table URI, the globals field name that contains freq_meta (e.g., "joint_globals.freq_meta" vs "freq_meta" vs "gnomad_freq_meta"), the row frequency field path (e.g., "joint.freq" vs "freq" vs "gnomad_freq"), and the pop-metadata key name ("gen_anc" vs "pop"); implement a single function (e.g., extract_from_gnomad) that reads hl.read_table(GnomadVersion.value), pulls freq_meta via the configured globals path, builds pop_indices using the configured pop key, selects pop_freqs by mapping the configured row frequency field, applies the freq_threshold filter, and returns the same table shape, then replace the three specific functions with thin wrappers or calls into that helper and update the docstring(s) to the correct description.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 127-133: The docstring for extract_from_gnomad_312_genomes is
incorrect (it mentions "HGDP+1KG sites schema"); update it to correctly describe
that this function targets the gnomAD v3.1.2 genomes schema (not the HGDP+1KG
subset), reference the correct table URI and the genomes frequency fields
(freq_meta / freq rather than gnoma d_freq_meta / gnomad_freq), and note any
schema-specific behavior (e.g., how populations and freq_threshold are applied).
Ensure the docstring in the extract_from_gnomad_312_genomes function clearly
states it uses the gnomAD 3.1.2 genomes schema and lists the relevant fields
used by the implementation.
In `@README.md`:
- Line 75: The README contains a typo in the reported rounding-error magnitude:
replace the incorrect scientific notation "5e06" with "5e-06" so the statement
reads approximately 5×10⁻⁶ (matching expected gnomAD AF tolerance); update the
sentence mentioning "gnomAD 3.1.2 HGDP+1KG subset: within a rounding error of
5e06" to use "5e-06" instead.
---
Nitpick comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 97-184: The three near-identical extractors
(extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes,
extract_from_gnomad_312_hgdp_1kg) should be collapsed into one shared helper
driven by a small config mapping keyed by GnomadVersion that provides the table
URI, the globals field name that contains freq_meta (e.g.,
"joint_globals.freq_meta" vs "freq_meta" vs "gnomad_freq_meta"), the row
frequency field path (e.g., "joint.freq" vs "freq" vs "gnomad_freq"), and the
pop-metadata key name ("gen_anc" vs "pop"); implement a single function (e.g.,
extract_from_gnomad) that reads hl.read_table(GnomadVersion.value), pulls
freq_meta via the configured globals path, builds pop_indices using the
configured pop key, selects pop_freqs by mapping the configured row frequency
field, applies the freq_threshold filter, and returns the same table shape, then
replace the three specific functions with thin wrappers or calls into that
helper and update the docstring(s) to the correct description.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5df1715b-4f22-4fe7-a870-7074be6a54a9
📒 Files selected for processing (3)
README.mddivref/divref/tools/extract_gnomad_single_afs.pyworkflows/compare_divref_gnomad.smk
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
divref/divref/tools/extract_gnomad_single_afs.py (2)
100-100: Accept an ordered sequence instead of a concrete list.The helpers only iterate over
populationsand preserve order, soSequence[str]is the more general practical type.Proposed type refinement
+from collections.abc import Sequence from enum import StrEnum from enum import unique from pathlib import Path- populations: list[str], + populations: Sequence[str],As per coding guidelines, “For Python function parameters, accept the most general type practical (e.g.,
IterableoverList)”.Also applies to: 130-130, 160-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/extract_gnomad_single_afs.py` at line 100, The parameter type annotation for the parameter named "populations" is currently annotated as list[str] in several functions; change these to accept a more general ordered sequence by using Sequence[str] instead (update all occurrences around the previously shown positions, e.g., the signatures that declare populations: list[str]); also add the necessary import from typing (Sequence) at the top of the module so the annotations remain valid. Ensure you update every function signature where populations: list[str] appears (including the other two locations mentioned) and run type checks.
97-184: Expand the public helper docstrings to Google style.These renamed/new public helpers currently use one-line docstrings, but they return
hl.Tableand can raiseValueErrorfor unknown populations. Please addArgs:,Returns:, andRaises:blocks.Proposed shape for one helper
def extract_from_gnomad_312_genomes( contig: str, freq_threshold: float, populations: list[str], reference_genome: str, ) -> hl.Table: - """Use the gnomAD 3.1.2 genomes sites schema.""" + """Extract population frequencies from the gnomAD 3.1.2 genomes sites schema. + + Args: + contig: Contig interval to extract. + freq_threshold: Minimum allele frequency in any selected population to retain. + populations: Population codes to extract, in output order. + reference_genome: Reference genome name used to parse the interval. + + Returns: + Hail table with `locus`, `alleles`, and `pop_freqs`. + + Raises: + ValueError: If a requested population is not present in the frequency metadata. + """As per coding guidelines, “Use Google-style docstrings with
Args:,Returns:,Yields:, andRaises:blocks on all public functions and classes in Python” and “Update docstrings if function signature or behavior changed in Python code”.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 97 - 184, Update the docstrings for the three public helper functions extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes, and extract_from_gnomad_312_hgdp_1kg to Google-style multi-line docstrings: include an Args: section documenting contig (str), freq_threshold (float), populations (list[str]), and reference_genome (str); a Returns: section specifying that the function returns a hail Table (hl.Table) of loci with alleles and pop_freqs; and a Raises: section noting that a ValueError is raised when a requested population is not found in the gnomAD frequency metadata. Ensure the new docstrings replace the existing one-line summaries and accurately reflect each function's specific globals/field names (e.g., va.joint.freq, va.freq, va.gnomad_freq) where relevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/compare_divref_gnomad.R`:
- Around line 120-133: The y-axis label on the aggregate histogram built from
divref_in_gnomad_with_af_diffs after pivot_longer currently reads "Variants" but
actually counts variant-population AF-difference observations (each variant may
contribute up to five rows); update the ylab() call in the ggplot pipeline (the
object p created after pivot_longer and before ggsave) to a clearer label such
as "Variant-population observations" or "Variant-population observations (log
scale)" so the axis accurately describes the plotted counts.
---
Nitpick comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Line 100: The parameter type annotation for the parameter named "populations"
is currently annotated as list[str] in several functions; change these to accept
a more general ordered sequence by using Sequence[str] instead (update all
occurrences around the previously shown positions, e.g., the signatures that
declare populations: list[str]); also add the necessary import from typing
(Sequence) at the top of the module so the annotations remain valid. Ensure you
update every function signature where populations: list[str] appears (including
the other two locations mentioned) and run type checks.
- Around line 97-184: Update the docstrings for the three public helper
functions extract_from_gnomad_41_joint, extract_from_gnomad_312_genomes, and
extract_from_gnomad_312_hgdp_1kg to Google-style multi-line docstrings: include
an Args: section documenting contig (str), freq_threshold (float), populations
(list[str]), and reference_genome (str); a Returns: section specifying that the
function returns a hail Table (hl.Table) of loci with alleles and pop_freqs; and
a Raises: section noting that a ValueError is raised when a requested population
is not found in the gnomAD frequency metadata. Ensure the new docstrings replace
the existing one-line summaries and accurately reflect each function's specific
globals/field names (e.g., va.joint.freq, va.freq, va.gnomad_freq) where
relevant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 048474bb-1dcb-4cb6-8f19-677cf1a6b1ca
📒 Files selected for processing (4)
README.mddivref/divref/tools/extract_gnomad_single_afs.pyscripts/compare_divref_gnomad.Rworkflows/compare_divref_gnomad.smk
🚧 Files skipped from review as they are similar to previous changes (2)
- workflows/compare_divref_gnomad.smk
- README.md
| p <- divref_in_gnomad_with_af_diffs %>% | ||
| select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>% | ||
| pivot_longer( | ||
| cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe), | ||
| names_to = "population", values_to = "diff_freq", names_prefix = "diff_" | ||
| ) %>% | ||
| ggplot(aes(x = diff_freq)) + | ||
| geom_histogram() + | ||
| scale_y_log10() + | ||
| theme_bw() + | ||
| xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) + | ||
| ylab("Variants") | ||
|
|
||
| ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6) |
There was a problem hiding this comment.
Clarify the y-axis label for the aggregate histogram.
This new plot counts variant-population AF-difference observations, not distinct variants, because each variant contributes up to five rows after pivot_longer().
Proposed label tweak
- ylab("Variants")
+ ylab("Variant-population comparisons")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| p <- divref_in_gnomad_with_af_diffs %>% | |
| select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>% | |
| pivot_longer( | |
| cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe), | |
| names_to = "population", values_to = "diff_freq", names_prefix = "diff_" | |
| ) %>% | |
| ggplot(aes(x = diff_freq)) + | |
| geom_histogram() + | |
| scale_y_log10() + | |
| theme_bw() + | |
| xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) + | |
| ylab("Variants") | |
| ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6) | |
| p <- divref_in_gnomad_with_af_diffs %>% | |
| select(variants, diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe) %>% | |
| pivot_longer( | |
| cols = c(diff_afr, diff_amr, diff_eas, diff_sas, diff_nfe), | |
| names_to = "population", values_to = "diff_freq", names_prefix = "diff_" | |
| ) %>% | |
| ggplot(aes(x = diff_freq)) + | |
| geom_histogram() + | |
| scale_y_log10() + | |
| theme_bw() + | |
| xlab(paste0(opts$gnomad_label, " - DivRef 1.1 AF")) + | |
| ylab("Variant-population comparisons") | |
| ggsave(paste0(opts$output_base, ".af_diffs_all.png"), p, height = 6, width = 6) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/compare_divref_gnomad.R` around lines 120 - 133, The y-axis label on
the aggregate histogram built from divref_in_gnomad_with_af_diffs after
pivot_longer currently reads "Variants" but actually counts variant-population
AF-difference observations (each variant may contribute up to five rows); update
the ylab() call in the ggplot pipeline (the object p created after pivot_longer
and before ggsave) to a clearer label such as "Variant-population observations"
or "Variant-population observations (log scale)" so the axis accurately
describes the plotted counts.
Summary by CodeRabbit
New Features
Documentation