perf: parallel pread for SE/PE uncompressed FASTQ input by KimBioInfoStudio · Pull Request #674 · OpenGene/fastp

KimBioInfoStudio · 2026-03-27T04:29:57Z

Summary

Parallel pread: Worker threads read uncompressed FASTQ input via pread(2) in parallel, bypassing the single reader thread bottleneck. A FastqChunkIndex scans the file once to build pack-aligned byte offsets, then each worker atomically grabs the next pack index, preads its chunk, and parses it into reads.
Raw FASTQ pwrite: Extends the existing parallel pwrite(2) output (previously gz-only) to raw .fq output. Worker threads write directly to non-overlapping file regions, bypassing the single writer thread.
Ordered output: WriterThread gains inputWithSeq(tid, data, seq), which uses the pack index as the sequence number. In pwrite mode ordering goes through the offset ring; in non-pwrite mode (stdout) it goes through a new ordered ring buffer drained by the writer thread.
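
To make the read path concrete, here is a minimal sketch of the "atomically grab the next pack index, then pread" loop. It is not the PR's code: PackRange and readPacksInParallel are hypothetical names, and the pack byte ranges are assumed to have been computed beforehand by an index pass.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical pack descriptor: the byte range [offset, offset + length) of one pack.
struct PackRange {
    off_t offset;
    size_t length;
};

// Read every pack using numWorkers threads. Each worker claims the next pack
// index with fetch_add and reads it with pread(2). pread takes an explicit
// offset, so the threads never race on a shared file position.
std::vector<std::string> readPacksInParallel(int fd,
                                             const std::vector<PackRange>& packs,
                                             int numWorkers) {
    assert(numWorkers > 0);
    std::vector<std::string> out(packs.size());
    std::atomic<size_t> nextPack{0};

    auto worker = [&]() {
        for (;;) {
            size_t i = nextPack.fetch_add(1, std::memory_order_relaxed);
            if (i >= packs.size()) return;  // no packs left
            std::string buf(packs[i].length, '\0');
            size_t done = 0;
            while (done < buf.size()) {  // pread may return fewer bytes than requested
                ssize_t n = pread(fd, &buf[done], buf.size() - done,
                                  packs[i].offset + static_cast<off_t>(done));
                if (n <= 0) break;  // error or unexpected EOF; real code should report it
                done += static_cast<size_t>(n);
            }
            buf.resize(done);
            out[i] = std::move(buf);  // slot i is written by exactly one thread
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < numWorkers; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return out;
}
```

In this sketch the results are collected into a vector indexed by pack number; the PR instead parses and processes each chunk in the worker and passes the pack index on as the output sequence number.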

Commits

perf: extend pwrite parallel write to raw FASTQ output — raw fq pwrite + hybrid spin backoff
perf: SE parallel pread with ordered output — SE parallel read + ordered writer mode
perf: PE parallel pread with ordered output — PE dual-file parallel read + all 7 writers ordered

When parallel pread activates

Input must be uncompressed (not .gz)
Input must be regular files (not stdin)
Not in split mode
No --reads_to_process limit
PE: not interleaved input; R1/R2 must have same read count

Falls back to the existing sequential reader otherwise.
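
The conditions above amount to a single eligibility check. A rough sketch follows, assuming a hypothetical InputConfig struct (the field names are illustrative and are not fastp's Options class) and using stat(2) to tell regular files from stdin or pipes. The R1/R2 read-count condition is not covered here, since it can only be checked after both files have been indexed.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical option bundle; names are illustrative only.
struct InputConfig {
    std::string in1;
    std::string in2;          // empty for SE input
    bool interleaved = false;
    bool splitMode = false;
    long readsToProcess = 0;  // 0 means no limit
};

static bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True only for an existing regular file (not a pipe, tty, or device).
static bool isRegularFile(const std::string& path) {
    struct stat st;
    return stat(path.c_str(), &st) == 0 && S_ISREG(st.st_mode);
}

static bool isPlainFastqFile(const std::string& path) {
    return !path.empty() && !endsWith(path, ".gz") && isRegularFile(path);
}

// Mirrors the listed conditions; any failure means "use the sequential reader".
bool canUseParallelPread(const InputConfig& c) {
    if (c.splitMode) return false;
    if (c.readsToProcess > 0) return false;
    if (c.interleaved) return false;
    if (!isPlainFastqFile(c.in1)) return false;
    if (!c.in2.empty() && !isPlainFastqFile(c.in2)) return false;
    return true;
}
```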

Benchmark (Apple M4 Pro, 10M PE reads, `-w 14`)

Mode	Baseline	Optimized	Speedup	Verify
se-fq-gz	14.04s	13.68s	1.03x	PASS
se-fq-fq	12.86s	12.88s	1.00x	PASS

Small -w shows no benefit on fast NVMe: the reader thread is not the bottleneck at low thread counts. A benchmark on high-core-count x86 systems is requested, where the single reader thread may become a bottleneck with large -w.

@sfchen Could you benchmark this on a high-core x86 machine with -w 32 or higher on large uncompressed FASTQ files? The parallel pread should show more benefit when the sequential reader becomes the bottleneck.

Test plan

SE fq→fq output matches baseline byte-for-byte
SE fq→gz output matches baseline (decompressed content)
SE fq→stdout output matches baseline
PE fq→fq output matches baseline byte-for-byte (R1 + R2)
Small file (10 reads, -w 8) completes without hang
PE mode regression (sequential reader path unchanged)

🤖 Generated with Claude Code

Previously pwrite mode only activated for .gz output (parallel libdeflate compression). Now any non-stdout multi-threaded output uses pwrite: worker threads write directly to non-overlapping file regions via pwrite(2), bypassing the single writer thread.

For .gz: compress with libdeflate then pwrite (existing behavior).
For .fq: pwrite raw output bytes directly (new).

Also adds hybrid backoff to the pwrite spin-wait: yield for 256 iterations then usleep(1), preventing CPU burn when threads wait on predecessor sequence numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
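
The hybrid backoff and the non-overlapping writes described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: waitWithBackoff, OffsetCursor, pwriteInOrder, and openOutput are hypothetical names, and a single cursor stands in for the offset ring. Only the offset reservation is serialized by sequence number; the pwrite(2) calls themselves can overlap in time.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <fcntl.h>
#include <unistd.h>

// Spin until pred() holds: yield for the first 256 iterations, then
// usleep(1) per iteration. Returns how many times the wait had to back off.
template <typename Pred>
size_t waitWithBackoff(Pred pred) {
    size_t spins = 0;
    while (!pred()) {
        if (spins < 256) std::this_thread::yield();
        else usleep(1);
        ++spins;
    }
    return spins;
}

// Hypothetical stand-in for the offset ring: nextSeq is the sequence number
// allowed to reserve space, nextOffset is where its data starts in the file.
struct OffsetCursor {
    std::atomic<uint64_t> nextSeq{0};
    off_t nextOffset = 0;  // only touched by the thread whose turn it is
};

// Open (or create) the output file that workers will pwrite into.
int openOutput(const char* path) {
    return open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
}

// Write the output of pack seq at its ordered position in the file.
bool pwriteInOrder(int fd, OffsetCursor& cur, uint64_t seq, const std::string& data) {
    // Wait for the predecessor to publish where this pack's region begins.
    waitWithBackoff([&] { return cur.nextSeq.load(std::memory_order_acquire) == seq; });
    off_t myOffset = cur.nextOffset;
    cur.nextOffset = myOffset + static_cast<off_t>(data.size());
    cur.nextSeq.store(seq + 1, std::memory_order_release);  // successor may reserve now

    // The write runs outside the ordered section, into a region no one else owns.
    size_t done = 0;
    while (done < data.size()) {
        ssize_t n = pwrite(fd, data.data() + done, data.size() - done,
                           myOffset + static_cast<off_t>(done));
        if (n <= 0) return false;
        done += static_cast<size_t>(n);
    }
    return true;
}
```

The yield phase keeps latency low when the predecessor is about to finish, while the sleep phase stops a long wait from pinning a core.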

Worker threads use pread(2) to read chunks from uncompressed FASTQ files in parallel, bypassing the single reader thread bottleneck. Each thread atomically grabs the next pack index via fetch_add, reads its chunk with pread, parses it, and processes it.

WriterThread gains ordered-mode (setOrderedMode/inputWithSeq):
- pwrite path: uses pack index as sequence number in offset ring
- non-pwrite path: ordered ring buffer drained by writer thread

Parallel pread is disabled for stdin, .gz input, split mode, and readsToProcess limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
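
For the non-pwrite path, an ordered ring buffer could look something like the sketch below. The class name OrderedRing and its methods are hypothetical and the real WriterThread may be structured quite differently. A mutex is used here for simplicity; whether the PR's ring is lock-free is not stated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Workers deposit chunks tagged with a sequence number (the pack index);
// the writer thread drains them strictly in sequence order.
class OrderedRing {
public:
    explicit OrderedRing(size_t capacity) : slots_(capacity), ready_(capacity, false) {
        assert(capacity > 0);
    }

    // Called by workers. Returns false if seq lies outside the current window
    // [next, next + capacity); the caller is expected to back off and retry.
    bool tryPut(uint64_t seq, std::string data) {
        std::lock_guard<std::mutex> lock(mu_);
        if (seq < next_ || seq >= next_ + slots_.size()) return false;
        size_t i = static_cast<size_t>(seq % slots_.size());
        slots_[i] = std::move(data);
        ready_[i] = true;
        return true;
    }

    // Called by the writer thread. Appends every consecutive ready chunk to
    // out and returns how many chunks were drained. Stops at the first gap.
    size_t drainInto(std::string& out) {
        std::lock_guard<std::mutex> lock(mu_);
        size_t drained = 0;
        for (;;) {
            size_t i = static_cast<size_t>(next_ % slots_.size());
            if (!ready_[i]) break;
            out += slots_[i];
            slots_[i].clear();
            ready_[i] = false;
            ++next_;
            ++drained;
        }
        return drained;
    }

private:
    std::mutex mu_;
    std::vector<std::string> slots_;
    std::vector<bool> ready_;
    uint64_t next_ = 0;  // next sequence number the writer will emit
};
```

Bounding the window to the ring capacity keeps a fast worker from running arbitrarily far ahead of a slow predecessor, which also bounds memory use.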

Worker threads use pread(2) to read R1/R2 chunks in parallel, bypassing the dual reader thread bottleneck. Both files are indexed with FastqChunkIndex; pack counts must match (same number of reads). Each thread atomically grabs a pack index, preads from both files, parses both into ReadPacks, and calls processPairEnd.

All 7 PE writers (left, right, unpaired x2, merged, failed, overlapped) use pack-index-based ordered output via inputWithSeq.

Parallel pread is disabled for interleaved input, stdin, .gz input, split mode, and readsToProcess limit.

Also adds FastqChunkIndex and FastqChunkParser utility classes for scanning uncompressed FASTQ files into pack-aligned byte ranges and parsing raw byte buffers into Read objects.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
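
As an illustration of what "pack-aligned byte ranges" means, the sketch below scans a buffer and records the byte offset at which every pack of N reads begins. It is not the FastqChunkIndex class itself: buildPackOffsets and packCount are hypothetical helpers, the input is an in-memory string rather than a file, and records are assumed to be exactly four lines each (no wrapped sequence lines).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte offsets of pack boundaries: pack i spans
// [offsets[i], offsets[i + 1]). The first entry is always 0.
// Assumes 4-line FASTQ records.
std::vector<size_t> buildPackOffsets(const std::string& data, size_t readsPerPack) {
    assert(readsPerPack > 0);
    std::vector<size_t> offsets{0};
    size_t lines = 0;
    size_t lastRecordEnd = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] != '\n') continue;
        ++lines;
        if (lines % 4 != 0) continue;       // not at a record boundary yet
        lastRecordEnd = i + 1;
        if ((lines / 4) % readsPerPack == 0) offsets.push_back(lastRecordEnd);
    }
    // A final record without a trailing newline still counts as complete.
    if (!data.empty() && data.back() != '\n' && lines % 4 == 3) lastRecordEnd = data.size();
    // Close the trailing pack, which may hold fewer than readsPerPack reads.
    if (offsets.back() != lastRecordEnd) offsets.push_back(lastRecordEnd);
    return offsets;
}

// Number of packs described by an offsets vector.
size_t packCount(const std::vector<size_t>& offsets) {
    return offsets.size() - 1;
}
```

For PE input, a check in this style would compare packCount for the R1 and R2 indexes before enabling the parallel path, matching the requirement that pack counts agree.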

KimBioInfoStudio and others added 3 commits March 27, 2026 12:10

KimBioInfoStudio wants to merge 3 commits into OpenGene:master from KimBioInfoStudio:worktree-parallel-pread
