Skip to content

feat: scripts to support analysis of DivRef 1.1 against gnomAD 4.1 and 3.1.2 HGDP+1KG subset#34

Merged
ameynert merged 8 commits intomainfrom
am_analysis
Apr 22, 2026
Merged

feat: scripts to support analysis of DivRef 1.1 against gnomAD 4.1 and 3.1.2 HGDP+1KG subset#34
ameynert merged 8 commits intomainfrom
am_analysis

Conversation

@ameynert
Copy link
Copy Markdown
Collaborator

@ameynert ameynert commented Apr 22, 2026

Summary

  • Adds extract-gnomad-single-afs, a new CLI tool supporting both gnomAD 4.1 joint and 3.1.2 HGDP+1KG schemas, selectable via --gnomad-version. Unlike the existing extract-gnomad-afs, this tool also writes a flat TSV (variant, per-population AFs) alongside the Hail table, for use in downstream R analysis.
  • Adds scripts/compare_divref_gnomad.R, a parameterised R script that compares DivRef 1.1 single-variant sites against a gnomAD allele frequency TSV. Accepts --contig, --divref_duckdb, --gnomad_tsv, --gnomad_label, and --output_base. Produces AF difference histograms, a not-in-gnomAD AF histogram, and a variant list TSV. Logs variant counts at each stage; exits early if all DivRef variants are present in gnomAD.
  • Restructures pixi.toml: promotes python and openjdk to top-level dependencies, splits workflow and analysis tooling into separate workflow and analysis features/environments, and adds R dependencies (r-base, r-optparse, r-tidyverse, r-logger) to the analysis environment.
  • Adds a setup-r-packages pixi task to install duckdb and duckplyr from CRAN (no conda-forge build available for osx-arm64).
  • Updates CLAUDE.md to fix inaccuracies in the tool pipeline order, shared module descriptions, and repository layout.
  • Documents the comparison analysis workflow end-to-end in README.md.

Summary by CodeRabbit

  • New Features

    • New CLI command to extract gnomAD allele-frequency data per contig and produce synchronized table + TSV outputs.
    • New end-to-end analysis workflow that compares DivRef 1.1 variant sites against multiple gnomAD releases and generates comparison plots and TSV reports of mismatches.
  • Documentation

    • Added an "Analysis" section with environment setup and step-by-step instructions to run the comparison workflow and view results.
  • Chores

    • Added an analysis environment and pinned R packages plus a PyArrow dependency to support the workflow.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 1 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 3 minutes and 1 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fc5e5f5a-fa8c-483b-9c1a-8e1caae6de37

📥 Commits

Reviewing files that changed from the base of the PR and between e91d459 and 2cc5eab.

📒 Files selected for processing (3)
  • divref/divref/tools/extract_gnomad_single_afs.py
  • scripts/compare_divref_gnomad.R
  • workflows/compare_divref_gnomad.smk
📝 Walkthrough

Walkthrough

Adds a new CLI tool to extract gnomAD per-contig allele frequencies, a Snakemake workflow and R script to compare those AFs against a DivRef DuckDB index, and environment/dependency updates plus README documentation for running the end-to-end analysis.

Changes

Cohort / File(s) Summary
Documentation
README.md
Added an "Analysis" section documenting pixi environment setup, running the Snakemake workflow, commands used, and observed comparison outcomes between DivRef 1.1 and gnomAD releases.
CLI Tool Registration
divref/divref/main.py
Registered a new CLI subcommand by importing and adding extract_gnomad_single_afs to the tool registry.
gnomAD AF Extraction
divref/divref/tools/extract_gnomad_single_afs.py
New module adding GnomadVersion enum and extract_gnomad_single_afs(...) plus schema-specific extractors. Initializes Hail, reads gnomAD sites tables from GCS, validates populations, filters by contig and AF threshold, writes a Hail table and a TSV with variant and per-population AFs.
Workflow & Orchestration
workflows/compare_divref_gnomad.smk
New Snakemake workflow defining rules to download DivRef DuckDB, run the CLI extractor for multiple gnomAD versions, and invoke the R comparison script to produce plots, missing-variant TSVs, and logs.
Analysis Script
scripts/compare_divref_gnomad.R
New Rscript that reads DivRef from DuckDB and gnomAD TSV, coerces/fills AFs, joins by variant, computes per-population AF differences, writes plots, counts variants with
Environment Configuration
pixi.toml
Added pyarrow constraint to workflow deps; added a new analysis feature with pinned R packages and an analysis environment that enables workflow+analysis.

Sequence Diagram

sequenceDiagram
    participant User
    participant PixiCLI as "pixi / CLI"
    participant GCS as "GCS (gnomAD tables)"
    participant Hail as "Hail"
    participant FS as "Filesystem"
    participant DuckDB as "DivRef DuckDB"
    participant R as "Rscript"

    User->>PixiCLI: pixi run -e analysis setup-r-packages
    PixiCLI->>FS: install R packages

    User->>PixiCLI: pixi run -e analysis extract_gnomad_single_afs (version, contig)
    PixiCLI->>GCS: read gnomAD sites table (version)
    GCS->>Hail: stream sites table
    Hail->>Hail: filter contig, populations, freq_threshold
    Hail->>FS: write Hail table + export TSV

    User->>FS: download DivRef DuckDB index
    FS->>DuckDB: store index

    User->>R: Rscript compare_divref_gnomad.R (DuckDB, gnomAD TSV, contig)
    R->>DuckDB: query DivRef variants for contig
    R->>FS: read gnomAD TSV
    R->>R: join, compute AF diffs, generate plots, write missing-variants TSV
    R->>FS: write outputs (PNGs, TSVs, logs)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hopped through clouds of genomic streams,

pulling AFs and chasing dreams.
DuckDB met gnomAD on my trail,
R drew the plots, each tiny detail.
Hooray — a carrot for every matched tale!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately and specifically describes the primary change: adding scripts to support analysis of DivRef 1.1 against gnomAD 4.1 and 3.1.2 HGDP+1KG subset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_analysis

Comment @coderabbitai help to get the list of available commands and usage tips.

@ameynert ameynert changed the base branch from am_wf_haplotypes to am_wf_create_fasta April 22, 2026 16:36
@ameynert ameynert force-pushed the am_wf_create_fasta branch from 1f54eef to 51c3481 Compare April 22, 2026 17:24
Base automatically changed from am_wf_create_fasta to main April 22, 2026 17:35
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 17:43 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
divref/divref/tools/extract_gnomad_single_afs.py (1)

22-51: Document the new public Python APIs, or make the schema helpers private.

extract_gnomad_single_afs, extract_from_joint_41, and extract_from_hgdp_1kg_312 are all public names, but the new module only gives them partial/one-line docstrings. If the schema-specific helpers are internal, prefix them with _; otherwise please add full Google-style Args/Returns/Raises blocks.

As per coding guidelines, "Use Google-style docstrings with Args:, Returns:, Yields:, and Raises: blocks on all public functions and classes in Python".

Also applies to: 87-122

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/extract_gnomad_single_afs.py` around lines 22 - 51, The
three newly exported functions extract_gnomad_single_afs, extract_from_joint_41,
and extract_from_hgdp_1kg_312 currently have only one-line/partial docstrings;
either make them internal by renaming to _extract_gnomad_single_afs,
_extract_from_joint_41, and _extract_from_hgdp_1kg_312, or add full Google-style
docstrings for each public function including Args:, Returns:, and Raises:
sections that describe parameters (types and defaults), return values, and
possible exceptions (e.g., IO or schema errors) so they comply with the project
docstring guideline. Ensure the docstrings match the function signatures
(including freq_threshold, populations, reference_genome, gcs_credentials_path)
and update any references/imports if you change the exported names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 108-110: The filter comparisons use x.AF > freq_threshold which
excludes zero-AF variants when freq_threshold is 0; change the comparisons to
x.AF >= freq_threshold (update the va.filter call that references pop_freqs and
the similar occurrences around lines 137-139) so that AF == 0 is included when
freq_threshold is 0 while preserving expected behavior for other thresholds;
ensure you update all instances (e.g., where va.select_globals,
va.select(pop_freqs=...), and va.filter(...) are used) to use >= consistently.

In `@README.md`:
- Around line 69-74: The README example uses --out-sites-tsv pointing to
data/analysis/input/chr22.hgdp_1kg_312.ts (missing the "v"), which breaks the
next step that expects .tsv; update the command in the pixi run divref
extract-gnomad-single-afs example so the --out-sites-tsv argument uses
data/analysis/input/chr22.hgdp_1kg_312.tsv (add the "v") to match the downstream
reader.
- Around line 81-93: The commands invoke Rscript via "pixi run Rscript ..." but
Rscript is provided only by the analysis feature defined in pixi.toml, so update
the examples to explicitly run in that environment: run the same pixi run
commands but ensure they target the analysis feature/environment (e.g., invoke
the pixi feature/profile that enables Rscript as defined in pixi.toml) so that
"Rscript" is available when running the scripts; reference the existing "pixi
run", "Rscript" and "pixi.toml" entries to locate where to change the README
examples.

In `@scripts/compare_divref_gnomad.R`:
- Around line 43-48: The SQL currently interpolates opts$contig into sprintf
before calling dbGetQuery, which is vulnerable; instead, change the dbGetQuery
call to use a parameterized query (use a placeholder like ? for contig and bind
the value) and pass opts$contig via the params argument so the DB handles
quoting/escaping; update the call that currently references dbGetQuery, sprintf,
and opts$contig to use a parameterized statement (still filtering source =
'gnomAD_variant' or bind that value as well) so inputs from the CLI are not
spliced directly into the SQL.

---

Nitpick comments:
In `@divref/divref/tools/extract_gnomad_single_afs.py`:
- Around line 22-51: The three newly exported functions
extract_gnomad_single_afs, extract_from_joint_41, and extract_from_hgdp_1kg_312
currently have only one-line/partial docstrings; either make them internal by
renaming to _extract_gnomad_single_afs, _extract_from_joint_41, and
_extract_from_hgdp_1kg_312, or add full Google-style docstrings for each public
function including Args:, Returns:, and Raises: sections that describe
parameters (types and defaults), return values, and possible exceptions (e.g.,
IO or schema errors) so they comply with the project docstring guideline. Ensure
the docstrings match the function signatures (including freq_threshold,
populations, reference_genome, gcs_credentials_path) and update any
references/imports if you change the exported names.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1b3907d-a424-43ad-b570-bf7c7f3872c1

📥 Commits

Reviewing files that changed from the base of the PR and between d3897ad and b342877.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • README.md
  • divref/divref/main.py
  • divref/divref/tools/extract_gnomad_single_afs.py
  • pixi.toml
  • scripts/compare_divref_gnomad.R

Comment thread divref/divref/tools/extract_gnomad_single_afs.py Outdated
Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread scripts/compare_divref_gnomad.R Outdated
@ameynert ameynert had a problem deploying to github-actions-snakemake-linting April 22, 2026 18:14 — with GitHub Actions Failure
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 18:15 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
workflows/compare_divref_gnomad.smk (1)

73-91: Make population columns explicit to avoid schema drift.

The extraction call currently depends on extractor defaults. scripts/compare_divref_gnomad.R (provided snippet, Line 49) uses a fixed population set (afr, amr, eas, sas, nfe), so explicitly passing that set here would make this workflow reproducible if tool defaults change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflows/compare_divref_gnomad.smk` around lines 73 - 91, The rule
extract_gnomad_single_afs currently relies on extractor defaults for population
columns; update its shell command to explicitly pass the population set used by
scripts/compare_divref_gnomad.R by adding the populations argument (e.g.,
--populations "afr,amr,eas,sas,nfe") to the divref extract-gnomad-single-afs
invocation, or add a new params entry (e.g., populations="afr,amr,eas,sas,nfe")
and interpolate it into the shell command so the extraction always produces the
afr, amr, eas, sas, nfe columns regardless of tool defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@workflows/compare_divref_gnomad.smk`:
- Around line 54-66: The rule download_divref_index currently writes the remote
binary directly to the output path ({output.duckdb}) without integrity checks;
change it to download to a temporary file, fetch the corresponding checksum or
signature (e.g., {params.url}.sha256 or .asc), verify the artifact using
sha256sum -c or gpg/openssl as appropriate, and only move the verified temp file
to {output.duckdb} on success; ensure verification failures are logged to {log}
and cause the shell command to exit non‑zero so the workflow fails fast (update
the shell block that references params.url, output.duckdb, and log accordingly).

---

Nitpick comments:
In `@workflows/compare_divref_gnomad.smk`:
- Around line 73-91: The rule extract_gnomad_single_afs currently relies on
extractor defaults for population columns; update its shell command to
explicitly pass the population set used by scripts/compare_divref_gnomad.R by
adding the populations argument (e.g., --populations "afr,amr,eas,sas,nfe") to
the divref extract-gnomad-single-afs invocation, or add a new params entry
(e.g., populations="afr,amr,eas,sas,nfe") and interpolate it into the shell
command so the extraction always produces the afr, amr, eas, sas, nfe columns
regardless of tool defaults.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1fc949b-bcd6-462d-9bbd-ef070bdd750d

📥 Commits

Reviewing files that changed from the base of the PR and between b342877 and e91d459.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • README.md
  • pixi.toml
  • scripts/compare_divref_gnomad.R
  • workflows/compare_divref_gnomad.smk
✅ Files skipped from review due to trivial changes (2)
  • README.md
  • pixi.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/compare_divref_gnomad.R

Comment thread workflows/compare_divref_gnomad.smk
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 18:17 — with GitHub Actions Inactive
@ameynert ameynert merged commit 4d5073c into main Apr 22, 2026
4 checks passed
@ameynert ameynert deleted the am_analysis branch April 22, 2026 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant