feat: create FASTA and index#35

Merged
ameynert merged 7 commits into main from am_wf_create_fasta on Apr 22, 2026

@ameynert ameynert commented Apr 22, 2026

Summary by CodeRabbit

  • New Features

    • Workflow now produces per-chromosome DivRef FASTA and DuckDB index files and integrates automatic reference download and indexing.
  • Configuration

    • Added configurable parameters: reference genome source, reference base name, allele-frequency filter, sequence window size, version, and temporary directory path; schema now requires version.
  • Bug Fixes / Reliability

    • Improved checks for presence/readability of reference index and tightened temporary-directory handling; fixed population-frequency field usage during index creation.

coderabbitai Bot commented Apr 22, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

This PR adds reference genome download/indexing and final per-chromosome outputs to the DivRef workflow, tightens tool validation and tmp_dir checks, updates schema/config with new properties, and adds samtools/pyarrow dependencies and three Snakemake rules to fetch, index, and process the reference FASTA.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Tool validation & field refs**<br>`divref/divref/tools/create_fasta_and_index.py` | Replaced `assert_fasta_indexed()` with direct `.fai` readability check; removed Hail table rename and switched annotation field usage from `empirical_AN` to `empirical_AC`; changed tmp dir check from writable to existence before `hl.init()`. |
| **Workflow rules & parameters**<br>`workflows/generate_divref.smk` | Added global config-derived variables and replaced prior `rule all` targets with per-chromosome DivRef FASTA/DuckDB outputs; added `download_reference_genome`, `index_reference_genome`, and `create_fasta_and_index` rules wiring reference URI, `.fai`, and tool invocation. |
| **Configuration files**<br>`workflows/config/config.yml`, `workflows/config/config_schema.yml` | Added `version: "0.1-dev"` to config; extended schema with `reference_genome_uri`, `reference_genome_base_name`, `hgdp_1kg_min_estimated_gnomad_haplotype_af`, `sequence_window_size`, `version` (required), and `tmp_dir` with defaults and constraints. |
| **Dependencies / manifest**<br>`pixi.toml` | Added `samtools = ">=1.23.1,<2"` and `pyarrow = ">=23.0.1,<24"` to the `snakemake-minimal` feature dependencies. |
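The tool-validation change in the first row can be illustrated with a minimal sketch. The PR's actual helper bodies are not shown in this conversation, so everything below is an assumption: `assert_path_is_readable` borrows its name from the sequence diagram but its implementation here is hypothetical, and `assert_dir_exists` and `fai_path_for` are illustrative names only. The sketch relies on the fact that `samtools faidx` by default writes its index next to the FASTA with `.fai` appended.

```python
import os
import tempfile
from pathlib import Path


def assert_path_is_readable(path: Path) -> None:
    """Raise if `path` is not an existing, readable file (hypothetical implementation)."""
    if not path.is_file():
        raise FileNotFoundError(f"Expected a file at {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"File is not readable: {path}")


def assert_dir_exists(path: Path) -> None:
    """Existence-only check; it deliberately does not test writability."""
    if not path.is_dir():
        raise NotADirectoryError(f"Expected an existing directory at {path}")


def fai_path_for(fasta: Path) -> Path:
    """Return the default index path that `samtools faidx` uses: FASTA name plus '.fai'."""
    return fasta.with_name(fasta.name + ".fai")


# Small demonstration against throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "reference.fasta"
    fasta.write_text(">chr1\nACGT\n")
    fai = fai_path_for(fasta)
    # Stand-in for real `samtools faidx` output: name, length, offset, line bases, line width.
    fai.write_text("chr1\t4\t6\t4\t5\n")
    assert_dir_exists(Path(tmp))   # the tmp dir check: existence only
    assert_path_is_readable(fai)   # the direct .fai readability check
```

Checking the `.fai` file directly keeps the validation independent of any library-specific notion of an indexed FASTA, which may be why the PR moved away from `assert_fasta_indexed()`; that motivation is a guess, not something stated in the review.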

Sequence Diagram(s)

sequenceDiagram
    participant SM as Snakemake
    participant Fetch as DownloadReference
    participant SAM as SamtoolsIndexer
    participant Tool as CreateFastaAndIndex
    participant Out as Outputs

    SM->>Fetch: start (reference_genome_uri)
    Fetch->>Fetch: gsutil cp + gunzip
    Fetch-->>SM: reference.fasta

    SM->>SAM: start (reference.fasta)
    SAM->>SAM: samtools faidx -> reference.fai
    SAM-->>SM: reference.fai

    SM->>Tool: start (reference.fasta, reference.fai, haplotypes.ht, variants.ht, params)
    Tool->>Tool: assert_path_is_readable(reference.fai)
    Tool->>Tool: build per-chr FASTA + DuckDB index (use empirical_AC)
    Tool-->>Out: per-chr FASTA + DuckDB indexes
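
The download and indexing stages in the diagram can be sketched as plain command lists. This is not the workflow's Snakemake code, which is not shown here; the function name `reference_commands` and the example bucket URI are hypothetical, and the sketch assumes the reference URI points at a gzip-compressed FASTA, as the `gsutil cp + gunzip` step suggests.

```python
import shlex
from pathlib import Path


def reference_commands(reference_uri: str, fasta: Path) -> list[list[str]]:
    """Return the commands for the download, decompress, and index steps, in order."""
    gz = fasta.with_name(fasta.name + ".gz")
    return [
        ["gsutil", "cp", reference_uri, str(gz)],  # fetch the compressed FASTA
        ["gunzip", str(gz)],                       # strips ".gz", leaving the FASTA
        ["samtools", "faidx", str(fasta)],         # writes the FASTA path plus ".fai"
    ]


# Hypothetical URI, for illustration only.
for cmd in reference_commands(
    "gs://example-bucket/reference.fasta.gz", Path("reference.fasta")
):
    print(shlex.join(cmd))
```

Building the commands as argument lists rather than one shell string avoids quoting problems if a path contains spaces; in a Snakemake rule the same three steps would typically live in `shell:` directives instead.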

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped to fetch a fasta from the cloud,
I indexed the bases smart and proud,
Per-chromosome files, tidy and neat,
ACs aligned, tmp checks complete,
A carrot-sized pipeline, ready to sprout. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately reflects the main addition: new workflow rules to create FASTA and index files as outputs of the workflow. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 17:19 — with GitHub Actions Inactive
@ameynert ameynert changed the title WIP: create FASTA and index feat: create FASTA and index Apr 22, 2026
@ameynert ameynert marked this pull request as ready for review April 22, 2026 17:19
Base automatically changed from am_wf_haplotypes to main April 22, 2026 17:23
@ameynert ameynert force-pushed the am_wf_create_fasta branch from 1f54eef to 51c3481 Compare April 22, 2026 17:24
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 17:24 — with GitHub Actions Inactive
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@workflows/generate_divref.smk`:
- Line 21: The VERSION variable is incorrectly annotated as int; change its
annotation to str so it matches the config schema and values (replace "VERSION:
int = config[\"version\"]" with "VERSION: str = config[\"version\"]"), and if
any downstream code expects an int either convert where used or add an explicit
cast/validation (e.g., str(config["version"]) or typing.cast(str,
config["version"])) so static type checkers and runtime behavior remain
consistent.
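
A minimal sketch of the suggested fix, assuming the only relevant facts are that `config["version"]` is required by the schema and holds a string such as `"0.1-dev"`. The helper `read_version` is hypothetical and the `config` dict below is a stand-in for Snakemake's global `config` object.

```python
from typing import Any, Mapping


def read_version(config: Mapping[str, Any]) -> str:
    """Return config["version"] as a string, failing early if the key is missing."""
    if "version" not in config:
        raise KeyError("config is missing the required 'version' key")
    # Explicit str() keeps the runtime value consistent with the annotation,
    # even if a YAML value such as `1` is parsed as an int.
    return str(config["version"])


# Stand-in for Snakemake's global `config` object.
config = {"version": "0.1-dev"}
VERSION: str = read_version(config)
```

The explicit conversion matters mostly for unquoted YAML values: a bare `1` would load as an integer, so the annotation alone would not guarantee a string at runtime.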
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe6a4230-7ba0-4201-a7d8-85073576691e

📥 Commits

Reviewing files that changed from the base of the PR and between 89e871e and 51c3481.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • divref/divref/tools/create_fasta_and_index.py
  • pixi.toml
  • workflows/config/config.yml
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk

Comment thread workflows/generate_divref.smk Outdated
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 22, 2026 17:31 — with GitHub Actions Inactive
@ameynert ameynert merged commit ce8f7a4 into main Apr 22, 2026
3 of 4 checks passed
@ameynert ameynert deleted the am_wf_create_fasta branch April 22, 2026 17:35