feat: scripts to support analysis of DivRef 1.1 against gnomAD 4.1 and 3.1.2 HGDP+1KG subset #34
Conversation
📝 Walkthrough

Adds a new CLI tool to extract gnomAD per-contig allele frequencies, a Snakemake workflow and R script to compare those AFs against a DivRef DuckDB index, and environment/dependency updates plus README documentation for running the end-to-end analysis.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PixiCLI as "pixi / CLI"
    participant GCS as "GCS (gnomAD tables)"
    participant Hail as "Hail"
    participant FS as "Filesystem"
    participant DuckDB as "DivRef DuckDB"
    participant R as "Rscript"
    User->>PixiCLI: pixi run -e analysis setup-r-packages
    PixiCLI->>FS: install R packages
    User->>PixiCLI: pixi run -e analysis extract_gnomad_single_afs (version, contig)
    PixiCLI->>GCS: read gnomAD sites table (version)
    GCS->>Hail: stream sites table
    Hail->>Hail: filter contig, populations, freq_threshold
    Hail->>FS: write Hail table + export TSV
    User->>FS: download DivRef DuckDB index
    FS->>DuckDB: store index
    User->>R: Rscript compare_divref_gnomad.R (DuckDB, gnomAD TSV, contig)
    R->>DuckDB: query DivRef variants for contig
    R->>FS: read gnomAD TSV
    R->>R: join, compute AF diffs, generate plots, write missing-variants TSV
    R->>FS: write outputs (PNGs, TSVs, logs)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 4
🧹 Nitpick comments (1)
divref/divref/tools/extract_gnomad_single_afs.py (1)

22-51: Document the new public Python APIs, or make the schema helpers private.

`extract_gnomad_single_afs`, `extract_from_joint_41`, and `extract_from_hgdp_1kg_312` are all public names, but the new module only gives them partial/one-line docstrings. If the schema-specific helpers are internal, prefix them with `_`; otherwise please add full Google-style `Args`/`Returns`/`Raises` blocks. As per coding guidelines, "Use Google-style docstrings with `Args:`, `Returns:`, `Yields:`, and `Raises:` blocks on all public functions and classes in Python".

Also applies to: 87-122
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 22 - 51, The three newly exported functions extract_gnomad_single_afs, extract_from_joint_41, and extract_from_hgdp_1kg_312 currently have only one-line/partial docstrings; either make them internal by renaming to _extract_gnomad_single_afs, _extract_from_joint_41, and _extract_from_hgdp_1kg_312, or add full Google-style docstrings for each public function including Args:, Returns:, and Raises: sections that describe parameters (types and defaults), return values, and possible exceptions (e.g., IO or schema errors) so they comply with the project docstring guideline. Ensure the docstrings match the function signatures (including freq_threshold, populations, reference_genome, gcs_credentials_path) and update any references/imports if you change the exported names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 108-110: The filter comparisons use x.AF > freq_threshold which
excludes zero-AF variants when freq_threshold is 0; change the comparisons to
x.AF >= freq_threshold (update the va.filter call that references pop_freqs and
the similar occurrences around lines 137-139) so that AF == 0 is included when
freq_threshold is 0 while preserving expected behavior for other thresholds;
ensure you update all instances (e.g., where va.select_globals,
va.select(pop_freqs=...), and va.filter(...) are used) to use >= consistently.
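The inclusion issue is easy to demonstrate with plain Python (a minimal stand-in for the Hail filter expressions; the AF values here are made up):

```python
afs = [0.0, 0.0003, 0.25]  # example allele frequencies, including a zero-AF site
freq_threshold = 0.0

# Strict comparison silently drops AF == 0 sites when the threshold is 0.
strict = [af for af in afs if af > freq_threshold]

# Inclusive comparison keeps them, as the review suggests.
inclusive = [af for af in afs if af >= freq_threshold]

assert 0.0 not in strict
assert 0.0 in inclusive
```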
In `@README.md`:
- Around line 69-74: The README example uses --out-sites-tsv pointing to
data/analysis/input/chr22.hgdp_1kg_312.ts (missing the "v"), which breaks the
next step that expects .tsv; update the command in the pixi run divref
extract-gnomad-single-afs example so the --out-sites-tsv argument uses
data/analysis/input/chr22.hgdp_1kg_312.tsv (add the "v") to match the downstream
reader.
- Around line 81-93: The commands invoke Rscript via "pixi run Rscript ..." but
Rscript is provided only by the analysis feature defined in pixi.toml, so update
the examples to explicitly run in that environment: run the same pixi run
commands but ensure they target the analysis feature/environment (e.g., invoke
the pixi feature/profile that enables Rscript as defined in pixi.toml) so that
"Rscript" is available when running the scripts; reference the existing "pixi
run", "Rscript" and "pixi.toml" entries to locate where to change the README
examples.
In `@scripts/compare_divref_gnomad.R`:
- Around line 43-48: The SQL currently interpolates opts$contig into sprintf
before calling dbGetQuery, which is vulnerable; instead, change the dbGetQuery
call to use a parameterized query (use a placeholder like ? for contig and bind
the value) and pass opts$contig via the params argument so the DB handles
quoting/escaping; update the call that currently references dbGetQuery, sprintf,
and opts$contig to use a parameterized statement (still filtering source =
'gnomAD_variant' or bind that value as well) so inputs from the CLI are not
spliced directly into the SQL.
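The same injection-safe pattern, sketched in Python using the stdlib `sqlite3` driver purely for illustration (DuckDB's client APIs accept a `?` placeholder plus a parameter list in the same way; table and column names below are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE variants (chrom TEXT, source TEXT, af REAL)")
con.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [("chr22", "gnomAD_variant", 0.01), ("chr21", "gnomAD_variant", 0.5)],
)

contig = "chr22"  # user-supplied, e.g. from the CLI

# Placeholders let the driver handle quoting; nothing is spliced into the SQL.
rows = con.execute(
    "SELECT chrom, af FROM variants WHERE source = ? AND chrom = ?",
    ("gnomAD_variant", contig),
).fetchall()
```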
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d1b3907d-a424-43ad-b570-bf7c7f3872c1
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- README.md
- divref/divref/main.py
- divref/divref/tools/extract_gnomad_single_afs.py
- pixi.toml
- scripts/compare_divref_gnomad.R
Actionable comments posted: 1
🧹 Nitpick comments (1)
workflows/compare_divref_gnomad.smk (1)

73-91: Make population columns explicit to avoid schema drift.

The extraction call currently depends on extractor defaults. `scripts/compare_divref_gnomad.R` (provided snippet, Line 49) uses a fixed population set (afr, amr, eas, sas, nfe), so explicitly passing that set here would make this workflow reproducible if tool defaults change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workflows/compare_divref_gnomad.smk` around lines 73 - 91, The rule extract_gnomad_single_afs currently relies on extractor defaults for population columns; update its shell command to explicitly pass the population set used by scripts/compare_divref_gnomad.R by adding the populations argument (e.g., --populations "afr,amr,eas,sas,nfe") to the divref extract-gnomad-single-afs invocation, or add a new params entry (e.g., populations="afr,amr,eas,sas,nfe") and interpolate it into the shell command so the extraction always produces the afr, amr, eas, sas, nfe columns regardless of tool defaults.
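One way to pin the population set, sketched as a helper that assembles the CLI invocation. The `--populations` flag name follows the review's own suggestion, and the `pixi run divref` prefix mirrors the README examples quoted above; the actual tool interface may differ:

```python
# Population columns the R script expects; pinning them here guards against
# drift if the extractor's defaults ever change.
POPULATIONS = ["afr", "amr", "eas", "sas", "nfe"]

def build_extract_command(version: str, contig: str) -> str:
    """Assemble the extraction CLI call with the population set made explicit."""
    return (
        "pixi run divref extract-gnomad-single-afs"
        f" --gnomad-version {version}"
        f" --contig {contig}"
        f" --populations {','.join(POPULATIONS)}"
    )

cmd = build_extract_command("3.1.2", "chr22")
```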
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@workflows/compare_divref_gnomad.smk`:
- Around line 54-66: The rule download_divref_index currently writes the remote
binary directly to the output path ({output.duckdb}) without integrity checks;
change it to download to a temporary file, fetch the corresponding checksum or
signature (e.g., {params.url}.sha256 or .asc), verify the artifact using
sha256sum -c or gpg/openssl as appropriate, and only move the verified temp file
to {output.duckdb} on success; ensure verification failures are logged to {log}
and cause the shell command to exit non‑zero so the workflow fails fast (update
the shell block that references params.url, output.duckdb, and log accordingly).
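The download-verify-move pattern can be sketched in Python. A local file copy stands in for the network fetch, and the `.sha256` sidecar is the review's suggested convention, not necessarily something the DivRef release actually publishes:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def verified_install(artifact: Path, expected_sha256: str, dest: Path) -> None:
    """Stage `artifact` in a temp file, verify its SHA-256, then move it into place.

    Raises ValueError on a digest mismatch, so a calling workflow fails fast
    and the unverified bytes never reach the output path.
    """
    tmp = Path(tempfile.mkdtemp()) / "download.tmp"
    shutil.copyfile(artifact, tmp)  # stands in for the actual curl/wget download
    digest = hashlib.sha256(tmp.read_bytes()).hexdigest()
    if digest != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch: {digest} != {expected_sha256}")
    shutil.move(str(tmp), str(dest))  # only verified bytes reach dest
```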
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1fc949b-bcd6-462d-9bbd-ef070bdd750d
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- README.md
- pixi.toml
- scripts/compare_divref_gnomad.R
- workflows/compare_divref_gnomad.smk
✅ Files skipped from review due to trivial changes (2)
- README.md
- pixi.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/compare_divref_gnomad.R
Summary

- `extract-gnomad-single-afs`, a new CLI tool supporting both gnomAD 4.1 joint and 3.1.2 HGDP+1KG schemas, selectable via `--gnomad-version`. Unlike the existing `extract-gnomad-afs`, this tool also writes a flat TSV (variant, per-population AFs) alongside the Hail table, for use in downstream R analysis.
- `scripts/compare_divref_gnomad.R`, a parameterised R script that compares DivRef 1.1 single-variant sites against a gnomAD allele frequency TSV. Accepts `--contig`, `--divref_duckdb`, `--gnomad_tsv`, `--gnomad_label`, and `--output_base`. Produces AF difference histograms, a not-in-gnomAD AF histogram, and a variant list TSV. Logs variant counts at each stage; exits early if all DivRef variants are present in gnomAD.
- `pixi.toml`: promotes `python` and `openjdk` to top-level dependencies, splits workflow and analysis tooling into separate `workflow` and `analysis` features/environments, and adds R dependencies (`r-base`, `r-optparse`, `r-tidyverse`, `r-logger`) to the `analysis` environment.
- A `setup-r-packages` pixi task to install `duckdb` and `duckplyr` from CRAN (no conda-forge build available for osx-arm64).
- `CLAUDE.md`: fixes inaccuracies in the tool pipeline order, shared module descriptions, and repository layout.
- Updates `README.md`.