Commit 759453d

visze, github-actions[bot], and BennyKrup authored
feat(Assignment): Long read support with pbmm2 mapper (#247)
* chore(master): release 0.6.0
* Add long-read barcode assignment via pbmm2
* Add pbmm2 support and update QC report logic for long-read assignments
* Add pbmm2 and pysam support in Dockerfile with new conda environment
* Refactor pbmm2 rules to ensure consistent conda environment usage and streamline parameter definitions
* using linker instead of pattern
* Add pyproject.toml for snakefmt configuration and update Snakemake rules for improved logging and parameter handling
* updating docs
* Enable summary saving for Super Linter in GitHub Actions workflow
* snakefmt
* add strand_sensitive
* test
* Add required field for enable in alignment tool and adjust strand_sensitive requirement
* Set default value for strand_sensitive.enable to false in config schema
* fas
* fasdfsda
* fasdfasd
* fasf
* Add resource configuration for assignment_mapping_pbmm2_align
* Remove unused output generation for counts in all_experiments rule

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: BennyKrup <krupkinbenyamin@gmail.com>
1 parent 451a049 commit 759453d

19 files changed

Lines changed: 592 additions & 257 deletions

.github/workflows/main.yml

Lines changed: 2 additions & 0 deletions
@@ -40,7 +40,9 @@ jobs:
 VALIDATE_YAML: true
 YAML_CONFIG_FILE: .yamllint.yml
 VALIDATE_SNAKEMAKE_SNAKEFMT: true
+SNAKEMAKE_SNAKEFMT_CONFIG_FILE: pyproject.toml
 VALIDATE_R: true
+SAVE_SUPER_LINTER_SUMMARY: true

 Linting:
 runs-on: ubuntu-latest

.release-please-manifest.json

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 {
-".": "0.5.9"
+".": "0.6.0"
 }

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -1,5 +1,33 @@
 # Changelog

+## [0.6.0](https://github.com/kircherlab/MPRAsnakeflow/compare/v0.5.9...v0.6.0) (2026-02-11)
+
+
+### ⚠ BREAKING CHANGES
+
+* renaming output files to use dots instead of underscores as file separators ([#239](https://github.com/kircherlab/MPRAsnakeflow/issues/239))
+* assignment adapter removal by length. Assignment config, adapter and forward read (now FWD), changed ([#237](https://github.com/kircherlab/MPRAsnakeflow/issues/237))
+
+### Features
+
+* assignment adapter removal by length. Assignment config, adapter and forward read (now FWD), changed ([#237](https://github.com/kircherlab/MPRAsnakeflow/issues/237)) ([521735e](https://github.com/kircherlab/MPRAsnakeflow/commit/521735e9b6911114bb8382fc4e7bac4dcab89b5f))
+* configurable bwa ([#244](https://github.com/kircherlab/MPRAsnakeflow/issues/244)) ([9550e22](https://github.com/kircherlab/MPRAsnakeflow/commit/9550e223bf66f3aeeef0c249148962b571a2f61b))
+* enhance trimming functionality and update config schema for adapter specifications ([798cebb](https://github.com/kircherlab/MPRAsnakeflow/commit/798cebb8ad24b6d2b38816522fe86072d1b0df04))
+* experiment adapter trimming and option to do BC (also UMI if available) selection from end of the read (FWD only) ([#238](https://github.com/kircherlab/MPRAsnakeflow/issues/238)) ([04dd683](https://github.com/kircherlab/MPRAsnakeflow/commit/04dd6831d243bf22508b390b7c1926f744eb4759))
+* fastq-join as option for merging reads (assignment workflow) ([#243](https://github.com/kircherlab/MPRAsnakeflow/issues/243)) ([093e288](https://github.com/kircherlab/MPRAsnakeflow/commit/093e288fd9a6f38df2be2f7e25ae18ecba0d3f7a))
+* implement adapter trimming functionality in experiment rules ([9fd32ce](https://github.com/kircherlab/MPRAsnakeflow/commit/9fd32cee97dc69217f88f994249ea92ed0dd5b5e))
+
+
+### Bug Fixes
+
+* correct parameter name in check_version function ([7cd50a5](https://github.com/kircherlab/MPRAsnakeflow/commit/7cd50a52f3eaa8baae4d2d6937219b58c492d8d7))
+* snakemake reverted default value handling ([#236](https://github.com/kircherlab/MPRAsnakeflow/issues/236)) ([fa5109b](https://github.com/kircherlab/MPRAsnakeflow/commit/fa5109baacc8252c9f407c9cc54e080ca72e32e4))
+
+
+### Code Refactoring
+
+* renaming output files to use dots instead of underscores as file separators ([#239](https://github.com/kircherlab/MPRAsnakeflow/issues/239)) ([0546082](https://github.com/kircherlab/MPRAsnakeflow/commit/0546082a2edca83566dcb2283db61c423533524f))
+
 ## [0.5.9](https://github.com/kircherlab/MPRAsnakeflow/compare/v0.5.8...v0.5.9) (2026-01-07)

Dockerfile

Lines changed: 15 additions & 0 deletions
@@ -192,6 +192,20 @@ COPY workflow/envs/quarto.yaml /conda-envs/b933cc1aa7c25db04635e7ec0e37f80e/envi
 RUN mkdir -p /conda-envs/1891509f8d9a8a89487739b14cd6dbef
 COPY workflow/envs/mpralib.yaml /conda-envs/1891509f8d9a8a89487739b14cd6dbef/environment.yaml

+# Conda environment:
+# source: workflow/envs/pbmm2_pysam.yaml
+# prefix: /conda-envs/2308b21c334f9613fdb840777a17d2b9
+# ---
+# channels:
+# - conda-forge
+# - bioconda
+# dependencies:
+# - pbmm2
+# - pysam
+# - biopython
+# - python>=3.10
+RUN mkdir -p /conda-envs/2308b21c334f9613fdb840777a17d2b9
+COPY workflow/envs/pbmm2_pysam.yaml /conda-envs/2308b21c334f9613fdb840777a17d2b9/environment.yaml

 # Step 2: Generate conda environments

@@ -214,6 +228,7 @@ RUN conda env create --no-default-packages --prefix /conda-envs/a4e1b935cbca52df
 RUN conda env create --no-default-packages --prefix /conda-envs/b933cc1aa7c25db04635e7ec0e37f80e --file /conda-envs/b933cc1aa7c25db04635e7ec0e37f80e/environment.yaml
 RUN conda env create --no-default-packages --prefix /conda-envs/ae3e37bf43cbb30416a885168e10c552 --file /conda-envs/ae3e37bf43cbb30416a885168e10c552/environment.yaml
 RUN conda env create --no-default-packages --prefix /conda-envs/1891509f8d9a8a89487739b14cd6dbef --file /conda-envs/1891509f8d9a8a89487739b14cd6dbef/environment.yaml
+RUN conda env create --no-default-packages --prefix /conda-envs/2308b21c334f9613fdb840777a17d2b9 --file /conda-envs/2308b21c334f9613fdb840777a17d2b9/environment.yaml

 # cleanup when version changed
 ARG VERSION

config/example_assignment_pbmm2.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+version: "0.6"
+
+assignments:
+  exampleLongRead:
+    bc_length: 15
+    long_read_input: resources/long_read/CEBPRE_10k.bam
+    design_file: resources/long_read/CEBPRE_reference.fasta
+    alignment_tool:
+      tool: pbmm2
+      configs:
+        preset: SUBREAD
+        min_concordance: 0.9
+        alignment_start: 1
+        sequence_length: 303
+    linker: GCAAAGTGAACACATCGCTAAGCGAAAGCTAAG # linker sequence in the read; the BC is expected after it
+    configs:
+      test:
+        min_support: 1
+        fraction: 0.51
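
The example config for the long-read assignment implies a few constraints: long-read input is paired with `tool: pbmm2`, and the documentation in this commit marks `linker` as required for long-read data. The following is a minimal sketch of such a sanity check, assuming the config has already been loaded into a Python dict. The function name `check_long_read_assignment` and the rule that `long_read_input` implies pbmm2 are assumptions made for illustration; MPRAsnakeflow validates configs against its own schema, which is not shown in this diff.

```python
# Hypothetical helper for illustration; not part of MPRAsnakeflow.
def check_long_read_assignment(name, assignment):
    """Return a list of problems found in one entry under `assignments`."""
    problems = []
    # The docs give bbmap as the default alignment tool.
    tool = assignment.get("alignment_tool", {}).get("tool", "bbmap")
    if "long_read_input" in assignment:
        # Assumption: long-read input is only mapped with pbmm2.
        if tool != "pbmm2":
            problems.append(f"{name}: long_read_input requires tool 'pbmm2', got '{tool}'")
        # The docs mark `linker` as required for long-read data.
        if not assignment.get("linker"):
            problems.append(f"{name}: linker is required for long-read data")
    for key in ("bc_length", "design_file"):
        if key not in assignment:
            problems.append(f"{name}: missing '{key}'")
    return problems


# Values copied from the example config in this commit.
example = {
    "bc_length": 15,
    "long_read_input": "resources/long_read/CEBPRE_10k.bam",
    "design_file": "resources/long_read/CEBPRE_reference.fasta",
    "alignment_tool": {
        "tool": "pbmm2",
        "configs": {
            "preset": "SUBREAD",
            "min_concordance": 0.9,
            "alignment_start": 1,
            "sequence_length": 303,
        },
    },
    "linker": "GCAAAGTGAACACATCGCTAAGCGAAAGCTAAG",
    "configs": {"test": {"min_support": 1, "fraction": 0.51}},
}

print(check_long_read_assignment("exampleLongRead", example))  # prints []
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.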

docs/1_getting_started/config.rst

Lines changed: 9 additions & 5 deletions
@@ -45,13 +45,13 @@ For each assignment you want to process, you must give it a name like :code:`exa
 :split_number:
 To parallelize mapping for assignment, the reads are split into :code:`split_number` files. For example, setting it to 300 means that the reads are split into 300 files, and each file is mapped in parallel. This is only useful when using a cluster. When running the workflow on a single machine, the default value should be used. The default is set to :code:`1`. (For technical reasons, when multiple assignments are defined, all will be set to the maximum defined in the config.)
 :tool:
-Alignment tool that is used. Currently, :code:`bbmap`, :code:`bwa`, :code:`bwa-additional-filtering`, and :code:`exact` are supported. Default is :code:`bbmap`.
+Alignment tool that is used. Currently, :code:`bbmap`, :code:`bwa`, :code:`bwa-additional-filtering`, :code:`exact`, and :code:`pbmm2` are supported. Default is :code:`bbmap`.
 :configs:
 Configurations of the alignment tool selected.

-:sequence_length (exact, bbmap):
+:sequence_length (exact, bbmap, pbmm2):
 Defines the :code:`sequence_length`, which is the length of a sequence alignment to an oligo in the design file. Only one length design is supported.
-:alignment_start (exact, bbmap):
+:alignment_start (exact, bbmap, pbmm2):
 Defines the start of the alignment in an oligo. When using adapters, you must set the length of the adapter. Otherwise, 1 will be the choice for most cases.
 :sequence_length (bwa, bwa-additional-filtering):
 Defines the :code:`min` and :code:`max` of a :code:`sequence_length` specification. :code:`sequence_length` is the length of a sequence alignment to an oligo in the design file. Because there can be insertions and deletions, we recommend varying it slightly around the exact length (e.g., ±5). This option enables designs with multiple sequence lengths.
Defines the :code:`min` and :code:`max` of a :code:`sequence_length` specification. :code:`sequence_length` is the length of a sequence alignment to an oligo in the design file. Because there can be insertions and deletions, we recommend varying it slightly around the exact length (e.g., ±5). This option enables designs with multiple sequence lengths.
@@ -69,15 +69,19 @@ For each assignment you want to process, you must give it a name like :code:`exa
 (Optional) Threshold of mismatches we investigate if we should try to rescue. Default is :code:`3`.
 :verbose (bwa-additional-filtering):
 (Optional) Print which alignments were rescued and which could not be rescued. Default is :code:`false`.
+:preset (pbmm2):
+(Optional) Preset for pbmm2 alignment. Default is :code:`SUBREAD`.
+:min_concordance (pbmm2):
+(Optional) Minimum concordance for pbmm2 alignment. Default is :code:`0.9`.

 :bc_length:
 Length of the barcode. Must match the length of :code:`BC`.
 :BC_rev_comp:
 (Optional) If set to :code:`true`, the barcode is reverse complemented. Default is :code:`false`.
 :linker_length:
-(Optional) Length of the linker. Only needed if you don't have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. The fixed length is used for the linker after a fixed length of BC. The recommended option is :code:`linker` by defining the exact linker sequence and using cutadapt for trimming.
+(Optional) Length of the linker. Only needed if you don't have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. The fixed length is used for the linker after a fixed length of BC. The recommended option is :code:`linker` by defining the exact linker sequence and using cutadapt for trimming.
 :linker:
-(Optional) Length of the linker. Only needed if you don't have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. Uses cutadapt to trim the linker to get the barcode as well as the start of the insert.
+(Required for long read, otherwise optional) The exact linker between BC and oligo. *Short read data:* Only needed if you don't have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. Uses cutadapt to trim the linker to get the barcode as well as the start of the insert. *Long read data:* Required! BC will be taken after the linker.
 :FWD:
 List of forward-read files in gzipped fastq format. The full or relative path to the files should be used. The same order in FWD, BC, and REV is important.
 :REV:
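
The `alignment_start` and `sequence_length` options documented in this file can be read as a filter on each alignment: a single integer means an exact match is expected (exact, bbmap, pbmm2), while a `min`/`max` pair allows a range (as described for `sequence_length` with bwa). The following is a minimal sketch of that reading. The function names `in_range` and `keeps_alignment` are hypothetical, and the 1-based start coordinate is an assumption based on the documented default of 1; this is not the workflow's actual filtering code.

```python
# Hypothetical helpers for illustration; not taken from MPRAsnakeflow.
def in_range(value, spec):
    """Check a value against an exact integer or a {'min': ..., 'max': ...} spec."""
    if isinstance(spec, dict):
        return spec["min"] <= value <= spec["max"]
    return value == spec


def keeps_alignment(ref_start, aligned_length, alignment_start, sequence_length):
    """Keep an alignment only if its start and length satisfy the config.

    ref_start is assumed to be the 1-based start position on the oligo.
    """
    return in_range(ref_start, alignment_start) and in_range(aligned_length, sequence_length)


# Exact settings as in the pbmm2 example config (alignment_start 1, sequence_length 303).
print(keeps_alignment(1, 303, 1, 303))
# Range-style sequence_length as described for bwa (exact length plus or minus 5).
print(keeps_alignment(1, 300, 1, {"min": 298, "max": 308}))
```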

docs/2_workflows/assignment.rst

Lines changed: 9 additions & 1 deletion
@@ -64,7 +64,12 @@ Example of an assignment file using exact matches and read 1 with BC, linker, an
 .. literalinclude:: ../../config/example_assignment_exact_linker.yaml
 :language: yaml

-If you want to use the strand sensitivity option (e.g., testing enhancers in both directions), you can add the following to the config file: :code:`strand_sensitive: {enable: true}`. Otherwise, MPRAsnakeflow will give you an error because it cannot handle the same sequences in both sense and antisense directions. This is an issue with the mappers because they do not consider the strand and will always call your read ambiguous due to multiple matches.
+Example of an assignment file using long read data with pbmm2 mapping:
+
+.. literalinclude:: ../../config/example_assignment_pbmm2.yaml
+:language: yaml
+
+If you want to use the strand sensitivity option (e.g., testing enhancers in both directions), you can add the following to the config file: :code:`strand_sensitive: {enable: true}`. Otherwise, MPRAsnakeflow will give you an error because it cannot handle the same sequences in both sense and antisense directions. This is an issue with the mappers because they do not consider the strand and will always call your read ambiguous due to multiple matches. **Not available for long read data.**

 Snakemake
 ============================
@@ -118,6 +123,9 @@ Rules run by Snakemake in the assignment utility:
 - **assignment_mapping_bwa_ref**: Create mapping reference for BWA from design file.
 - **assignment_mapping_exact**: Map the reads to the reference and sort using exact match.
 - **assignment_mapping_exact_reference**: Create reference to map the exact design
+- **assignment_mapping_pbmm2_align**: Align long reads (BAM or FASTA) to reference using pbmm2.
+- **assignment_mapping_pbmm2_getBCs**: Extract barcodes from aligned long reads. Produces the standard barcode TSV for downstream collection and filtering.
+- **assignment_mapping_pbmm2_index**: Create pbmm2 index from design reference.
 - **assignment_merge_NGmerge**: Merge the FWD, REV and BC fastq files into one using NGmerge.
 - **assignment_merge_fastqjoin**: Merge the FWD, REV and BC fastq files into one using fastq-join.
 - **assignment_preprocessing_adapter_remove**: Remove adapter sequence from the reads (3' or 5'). Uses cutadapt to trim adapters based on the primer direction.
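
The rule list names `assignment_mapping_pbmm2_getBCs` as the step that extracts barcodes from aligned long reads, and the config documentation in this commit states that for long-read data the barcode is taken after the linker. The following is a minimal sketch of that idea on a plain read string, assuming an exact linker match and a barcode of `bc_length` bases immediately downstream. The function name `barcode_after_linker` is hypothetical; the actual rule works on pbmm2 alignments and its implementation is not shown in this diff.

```python
# Hypothetical helper for illustration; not the code of the getBCs rule.
def barcode_after_linker(read_seq, linker, bc_length):
    """Return the bc_length bases directly after the first exact linker match.

    Returns None if the linker is absent or the read ends before a full
    barcode could be read.
    """
    pos = read_seq.find(linker)
    if pos == -1:
        return None
    start = pos + len(linker)
    barcode = read_seq[start:start + bc_length]
    if len(barcode) < bc_length:
        return None
    return barcode


# Linker and bc_length from the example config; the read itself is made up.
linker = "GCAAAGTGAACACATCGCTAAGCGAAAGCTAAG"
read = "TTT" + linker + "ACGTACGTACGTACG" + "GG"
print(barcode_after_linker(read, linker, 15))  # prints ACGTACGTACGTACG
```

An exact string match keeps the sketch short; real long reads contain sequencing errors, so a production implementation would likely need approximate matching or alignment coordinates.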

profiles/default/config.yaml

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ set-resources:
   assignment_mapping_bbmap:
     runtime: 240
     mem: 10G
+  assignment_mapping_pbmm2_align:
+    runtime: 240
   assignment_collect:
     runtime: 2160
     mem: 10G

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[tool.snakefmt]
+line_length = 127

resources/long_read/CEBPRE_10k.bam

4.34 MB
Binary file not shown.
