
Commit b77b3ce (parent 01c4407)

chore: update claude.md

1 file changed: 30 additions & 15 deletions

CLAUDE.md

@@ -30,6 +30,7 @@ uv run --directory divref mypy divref/ # Type-check only
 pixi run fix-and-check-all # Fix and check toolkit + Snakemake linting
 pixi run lint --check # Validate Snakemake files with snakefmt
 pixi run download # Run the Snakemake download workflow
+pixi run setup-gcs # Download GCS connector JAR (required once for Hail on GCS)
 ```

 ## Architecture
@@ -40,13 +41,17 @@ pixi run download # Run the Snakemake download workflow
 divref/                  # Python package (uv-managed)
   divref/
     main.py              # CLI entry point; registers tools in _tools list
-    haplotype.py         # Shared Hail utilities (HailPath alias, haplotype helpers)
+    alias.py             # HailPath type alias (str; accepts local, gs://, hdfs://)
+    defaults.py          # Package-wide constants: POPULATIONS, REFERENCE_GENOME, freq thresholds
+    hail.py              # Hail initialization with GCS connector setup
+    haplotype.py         # Shared Hail utilities for haplotype sequence/windowing
     tools/               # One module per CLI subcommand
   tests/                 # pytest tests
   pyproject.toml         # Package deps, ruff/mypy/pytest config
 workflows/               # Snakemake workflows
-  download.smk           # Template download workflow
-  config/config.yml      # Workflow configuration
+  generate_divref.smk    # Main workflow (extract → haplotypes → reference download)
+  create_test_data.smk   # Generates gnomAD subset for unit tests
+  config/config.yml      # Workflow configuration (chromosomes, populations, paths)
 pixi.toml                # Workspace config (snakemake + hail environments)
 ```
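
The `alias.py` / `defaults.py` split in the tree above can be sketched in a few lines (a hypothetical illustration; only the names `HailPath`, `POPULATIONS`, and `REFERENCE_GENOME` come from the diff, and the constant values here are placeholders):

```python
# HailPath is documented as a plain `str` alias: Hail accepts local paths,
# gs:// URLs, and hdfs:// URLs through the same string-typed arguments.
HailPath = str

# Package-wide constants kept in one module (placeholder values,
# not the repo's real defaults).
POPULATIONS: tuple[str, ...] = ("afr", "eas", "nfe")
REFERENCE_GENOME = "GRCh38"

def path_scheme(path: HailPath) -> str:
    """Hypothetical helper: report which storage scheme a HailPath uses."""
    return path.split("://", 1)[0] if "://" in path else "local"
```

Because `HailPath` is just `str`, it documents intent in signatures without adding any runtime cost.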

@@ -64,23 +69,33 @@ divref <tool-name> --arg value # Invokes the registered tool
 ### Tool Pipeline (execution order)

 The tools implement a data pipeline:
-1. `create_gnomad_sites_vcf` → gnomAD sites VCF
-2. `extract_gnomad_afs` → allele frequencies
-3. `compute_haplotypes` → groups variants into haplotype windows using Hail
-4. `compute_haplotype_statistics` → haplotype distributions
-5. `compute_variation_ratios` → variant pattern statistics
-6. `create_fasta_and_index` → outputs FASTA + DuckDB index (final deliverable)
-7. `remap_divref` → maps haplotype coordinates to reference genome
+1. `extract_gnomad_afs` / `extract_gnomad_single_afs` → per-population allele frequency Hail table
+2. `extract_sample_metadata` → simplified sample→population mapping table
+3. `create_gnomad_sites_vcf` → VCF of variants above AF threshold (uses output of step 1)
+4. `compute_haplotypes` → groups phased variants into haplotype windows using Hail
+5. `compute_haplotype_statistics` → haplotype count distributions
+6. `compute_variation_ratios` → per-sample variant counts at multiple freq thresholds
+7. `create_fasta_and_index` → FASTA sequences + DuckDB index (final deliverable)
+8. `remap_divref` → maps haplotype coordinates back to reference genome (post-CALITAS step)
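
The `_tools` registration in `main.py` (noted in the file tree) presumably maps subcommand names to callables in this pipeline order; the following is a minimal hypothetical sketch of that dispatch pattern, with stub bodies and invented signatures:

```python
from typing import Callable

# Stub tools standing in for the real implementations; each real tool
# would accept CLI arguments and run one pipeline step.
def extract_gnomad_afs() -> str:
    return "ran extract_gnomad_afs"

def create_gnomad_sites_vcf() -> str:
    return "ran create_gnomad_sites_vcf"

# Registry pattern: main.py keeps a _tools list, and the CLI resolves
# the requested subcommand by function name.
_tools: list[Callable[[], str]] = [extract_gnomad_afs, create_gnomad_sites_vcf]

def run_tool(name: str) -> str:
    """Dispatch a subcommand by name, as a CLI entry point might."""
    by_name = {tool.__name__: tool for tool in _tools}
    return by_name[name]()
```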

-### Key Shared Module: `haplotype.py`
+`extract_gnomad_single_afs` is an alternative to `extract_gnomad_afs` supporting both gnomAD v4.1 (JOINT) and v3.1.2 (HGDP+1KG) table schemas.

-- `HailPath = str` — type alias for paths accepted by Hail (local, `gs://`, `hdfs://`)
-- `get_haplo_sequence(context_size, variants)` — builds haplotype sequence strings with flanking reference context
-- `split_haplotypes(ht, window_size)` — splits multi-variant haplotypes at gaps > `window_size` bases
+### Key Shared Modules
+
+**`haplotype.py`**
+- `get_haplo_sequence(context_size, variants)` — builds haplotype sequence strings with flanking reference context; handles SNPs, insertions, deletions
+- `split_haplotypes(ht, window_size)` — splits multi-variant haplotypes at gaps ≥ `window_size` bases; discards sub-haplotypes with <2 variants
+- `variant_distance(v1, v2)` — reference bases between two variants (accounts for indel length)
+
+**`compute_haplotypes.py` two-window strategy**: To avoid systematic edge artefacts, the tool runs two overlapping window passes (offset by `window_size / 2`) and unions the results. Intermediate `.1.ht` / `.2.ht` files are cleaned up after the merge.
+
+**`hail.py`**: `hail_init(gcs_credentials_path)` — sets `GOOGLE_APPLICATION_CREDENTIALS`, verifies GCS connector JAR (installed via `pixi run setup-gcs`), then calls `hl.init()` with Spark GCS config.
+
+**`defaults.py`**: `POPULATIONS`, `REFERENCE_GENOME`, `VARIATION_RATIO_FREQUENCY_THRESHOLDS` — defaults shared across tools.
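
The two-window strategy can be illustrated with plain integers (a Hail-free sketch of the idea, not the tool's actual implementation):

```python
def assign_windows(positions: list[int], window_size: int, offset: int = 0) -> list[list[int]]:
    """Group positions into fixed-size windows whose edges are shifted by `offset`."""
    windows: dict[int, list[int]] = {}
    for pos in positions:
        windows.setdefault((pos - offset) // window_size, []).append(pos)
    return list(windows.values())

def two_pass_windows(positions: list[int], window_size: int) -> list[list[int]]:
    """Run two passes offset by half a window and union the groupings:
    variants that a window edge splits in one pass share a window in the other."""
    return (assign_windows(positions, window_size)
            + assign_windows(positions, window_size, offset=window_size // 2))
```

For example, positions 98 and 102 straddle the edge of a 100-base window in the first pass but fall into a single window in the offset pass, so the union contains their joint haplotype.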

 ### Data Models (`remap_divref.py`)

-Pydantic `frozen=True` models: `Variant`, `ReferenceMapping`, `Haplotype` — used for type-safe coordinate remapping.
+Pydantic `frozen=True` models: `Variant`, `ReferenceMapping`, `Haplotype` — used for type-safe coordinate remapping. `Haplotype` uses field aliases to match mixedCase column names in the DuckDB index created by `create_fasta_and_index`.

 ## Git Workflow
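
The frozen-model-with-aliases pattern described in the added text could look like this in Pydantic v2 (field and alias names are invented for illustration; the actual DuckDB column names are not shown in the diff):

```python
from pydantic import BaseModel, ConfigDict, Field

class Haplotype(BaseModel):
    """Illustrative frozen model; `startPos` stands in for whatever
    mixedCase column names the real DuckDB index uses."""
    model_config = ConfigDict(frozen=True, populate_by_name=True)

    chrom: str
    start_pos: int = Field(alias="startPos")

# Rows read from the index validate against the mixedCase aliases:
hap = Haplotype.model_validate({"chrom": "chr1", "startPos": 12345})
```

`frozen=True` makes instances hashable and rejects mutation after construction, which suits coordinate records passed between pipeline tools.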
