
perf: parallel pread for SE/PE uncompressed FASTQ input #674

Open
KimBioInfoStudio wants to merge 3 commits into OpenGene:master from KimBioInfoStudio:worktree-parallel-pread


@KimBioInfoStudio (Member)

Summary

  • Parallel pread: Worker threads read uncompressed FASTQ input via pread(2) in parallel, bypassing the single reader thread bottleneck. A FastqChunkIndex scans the file once to build pack-aligned byte offsets, then each worker atomically grabs the next pack index, preads its chunk, and parses it into reads.
  • Raw FASTQ pwrite: Extends the existing parallel pwrite(2) output (previously gz-only) to raw .fq output. Worker threads write directly to non-overlapping file regions, bypassing the single writer thread.
  • Ordered output: WriterThread gains inputWithSeq(tid, data, seq), which uses the pack index as the sequence number. In pwrite mode the ordering goes through the offset ring; in non-pwrite mode (stdout) it goes through a new ordered ring buffer drained by the writer thread.
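
The worker loop described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: ChunkRange, preadFully, and readAllChunks are hypothetical names, the index is passed in ready-made, and parsing is replaced by storing the raw bytes under the pack index.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical index entry: a byte range that starts and ends on a record boundary.
struct ChunkRange {
    off_t offset;
    size_t length;
};

// pread(2) may return fewer bytes than requested, so loop until the range is complete.
static bool preadFully(int fd, char* buf, size_t len, off_t off) {
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (n == 0) return false; // file is shorter than the index claims
        done += (size_t)n;
    }
    return true;
}

// Each worker claims the next pack index with fetch_add and reads that chunk.
// out[i] receives the raw bytes of pack i, so results stay addressable by pack index.
static bool readAllChunks(const char* path, const std::vector<ChunkRange>& index,
                          std::vector<std::string>& out, int workers) {
    assert(workers > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    out.assign(index.size(), std::string());
    std::atomic<size_t> next{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < workers; t++) {
        pool.emplace_back([&] {
            for (;;) {
                size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= index.size()) break;
                std::string buf(index[i].length, '\0');
                if (!preadFully(fd, &buf[0], buf.size(), index[i].offset)) {
                    ok.store(false);
                    break;
                }
                out[i] = std::move(buf); // a real worker would parse and process here
            }
        });
    }
    for (auto& th : pool) th.join();
    close(fd);
    return ok.load();
}
```

Because pread takes an explicit offset, all workers can share one file descriptor without seeking, and the pack index doubles as the sequence number for ordered output.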

Commits

  1. perf: extend pwrite parallel write to raw FASTQ output — raw fq pwrite + hybrid spin backoff
  2. perf: SE parallel pread with ordered output — SE parallel read + ordered writer mode
  3. perf: PE parallel pread with ordered output — PE dual-file parallel read + all 7 writers ordered

When parallel pread activates

  • Input must be uncompressed (not .gz)
  • Input must be regular files (not stdin)
  • Not in split mode
  • No --reads_to_process limit
  • PE: not interleaved input; R1/R2 must have same read count

Falls back to the existing sequential reader otherwise.
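
As a rough illustration of the activation rules above, the check could look like the sketch below. InputConfig and canUseParallelPread are hypothetical names, not fastp's option handling, and the R1/R2 read-count comparison is left out because it is only known after both files have been indexed.

```cpp
#include <cassert>
#include <string>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical option bundle used only for this sketch.
struct InputConfig {
    std::string in1;
    std::string in2;          // empty for SE
    bool interleaved = false;
    bool splitMode = false;
    long readsToProcess = 0;  // 0 means no limit
};

static bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Pipes and devices (including stdin) are rejected because pread needs a seekable file.
static bool isRegularFile(const std::string& path) {
    int fd = open(path.c_str(), O_RDONLY | O_NONBLOCK);
    if (fd < 0) return false;
    struct stat st;
    bool regular = fstat(fd, &st) == 0 && S_ISREG(st.st_mode);
    close(fd);
    return regular;
}

static bool inputEligible(const std::string& path) {
    return !path.empty() && !endsWith(path, ".gz") && isRegularFile(path);
}

// Returns true when every listed condition holds; callers fall back to the
// sequential reader otherwise.
static bool canUseParallelPread(const InputConfig& c) {
    if (c.splitMode || c.readsToProcess > 0 || c.interleaved) return false;
    if (!inputEligible(c.in1)) return false;
    if (!c.in2.empty() && !inputEligible(c.in2)) return false;
    return true;
}
```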

Benchmark (Apple M4 Pro, 10M PE reads, -w 14)

| Mode | Baseline | Optimized | Speedup | Verify |
|---|---|---|---|---|
| se-fq-gz | 14.04s | 13.68s | 1.03x | PASS |
| se-fq-fq | 12.86s | 12.88s | 1.00x | PASS |

With a small -w on fast NVMe storage the change shows no benefit, because the reader thread is not the bottleneck at low thread counts. A benchmark on a high-core-count x86 system, where the single reader thread may become the bottleneck at large -w, is requested.

@sfchen Could you benchmark this on a high-core x86 machine with -w 32 or higher on large uncompressed FASTQ files? The parallel pread should show more benefit when the sequential reader becomes the bottleneck.

Test plan

  • SE fq→fq output matches baseline byte-for-byte
  • SE fq→gz output matches baseline (decompressed content)
  • SE fq→stdout output matches baseline
  • PE fq→fq output matches baseline byte-for-byte (R1 + R2)
  • Small file (10 reads, -w 8) completes without hang
  • PE mode regression (sequential reader path unchanged)

🤖 Generated with Claude Code

KimBioInfoStudio and others added 3 commits March 27, 2026 12:10
Previously pwrite mode only activated for .gz output (parallel
libdeflate compression). Now any non-stdout multi-threaded output
uses pwrite: worker threads write directly to non-overlapping
file regions via pwrite(2), bypassing the single writer thread.

For .gz: compress with libdeflate then pwrite (existing behavior).
For .fq: pwrite raw output bytes directly (new).
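
One way to picture the non-overlapping regions is the sketch below, under stated assumptions: OffsetSequencer and pwriteFully are hypothetical names, and a single turn counter stands in for the offset ring used by the actual code. Each worker reserves its byte range in pack order, then writes outside the wait.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hands out file regions in sequence order. Only the holder of the current
// turn touches nextOffset, so a plain off_t is enough.
struct OffsetSequencer {
    std::atomic<size_t> turn{0};
    off_t nextOffset = 0;

    // Blocks until all predecessors have reserved, then returns this pack's offset.
    off_t reserve(size_t seq, size_t len) {
        while (turn.load(std::memory_order_acquire) != seq) std::this_thread::yield();
        off_t mine = nextOffset;
        nextOffset += (off_t)len;
        turn.store(seq + 1, std::memory_order_release);
        return mine;
    }
};

// pwrite(2) may write fewer bytes than requested, so loop until done.
static bool pwriteFully(int fd, const char* buf, size_t len, off_t off) {
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        done += (size_t)n;
    }
    return true;
}
```

A worker would call reserve(seq, data.size()) and then pwriteFully(fd, data.data(), data.size(), offset). The reservation is serialized, but the writes themselves run concurrently because the regions never overlap.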

Also adds hybrid backoff to the pwrite spin-wait: yield for 256
iterations then usleep(1), preventing CPU burn when threads wait
on predecessor sequence numbers.
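
A minimal sketch of the hybrid backoff described above, assuming the wait is on a shared atomic sequence counter (waitForTurn and its parameters are hypothetical names, not identifiers from this commit):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <unistd.h>

// Waits until `turn` equals `mySeq`. The first 256 polls only yield, which
// keeps latency low when the predecessor is almost done; after that each poll
// sleeps for 1 microsecond so a long wait does not burn a core.
// Returns the number of polls that found the turn not yet reached.
static size_t waitForTurn(const std::atomic<size_t>& turn, size_t mySeq) {
    size_t polls = 0;
    while (turn.load(std::memory_order_acquire) != mySeq) {
        if (polls < 256) {
            std::this_thread::yield();
        } else {
            usleep(1);
        }
        polls++;
    }
    return polls;
}
```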

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Worker threads use pread(2) to read chunks from uncompressed FASTQ
files in parallel, bypassing the single reader thread bottleneck.
Each thread atomically grabs the next pack index via fetch_add,
reads its chunk with pread, parses it, and processes it.

WriterThread gains ordered-mode (setOrderedMode/inputWithSeq):
- pwrite path: uses pack index as sequence number in offset ring
- non-pwrite path: ordered ring buffer drained by writer thread

Parallel pread is disabled for stdin, .gz input, split mode, and
readsToProcess limit.
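
The non-pwrite ordered path can be pictured as a ring indexed by sequence number, as in this sketch. OrderedRing, put, and tryTake are hypothetical names and the real WriterThread may differ in layout and waiting strategy; the assumption here is many producers, one consumer, and each sequence number produced exactly once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class OrderedRing {
public:
    explicit OrderedRing(size_t capacity) : slots_(capacity), ready_(capacity) {
        assert(capacity > 0);
        for (auto& r : ready_) r.store(false);
    }

    // Producer side: publish `data` for sequence number `seq`. Waits while
    // seq is more than one ring length ahead of the consumer.
    void put(size_t seq, std::string data) {
        while (seq >= next_.load(std::memory_order_acquire) + slots_.size()) {
            std::this_thread::yield();
        }
        size_t i = seq % slots_.size();
        slots_[i] = std::move(data);
        ready_[i].store(true, std::memory_order_release);
    }

    // Consumer side (single writer thread): take the next item in sequence
    // order, or return false if it has not been published yet.
    bool tryTake(std::string& out) {
        size_t seq = next_.load(std::memory_order_relaxed);
        size_t i = seq % slots_.size();
        if (!ready_[i].load(std::memory_order_acquire)) return false;
        out = std::move(slots_[i]);
        ready_[i].store(false, std::memory_order_relaxed);
        next_.store(seq + 1, std::memory_order_release);
        return true;
    }

private:
    std::vector<std::string> slots_;
    std::vector<std::atomic<bool>> ready_;
    std::atomic<size_t> next_{0};
};
```

Workers can finish packs in any order; the writer thread only ever sees them in pack-index order, which is what keeps stdout output identical to the sequential path.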

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Worker threads use pread(2) to read R1/R2 chunks in parallel,
bypassing the dual reader thread bottleneck. Both files are indexed
with FastqChunkIndex; pack counts must match (same number of reads).
Each thread atomically grabs a pack index, preads from both files,
parses both into ReadPacks, and calls processPairEnd.

All 7 PE writers (left, right, unpaired x2, merged, failed,
overlapped) use pack-index-based ordered output via inputWithSeq.

Parallel pread is disabled for interleaved input, stdin, .gz input,
split mode, and readsToProcess limit.

Also adds FastqChunkIndex and FastqChunkParser utility classes for
scanning uncompressed FASTQ files into pack-aligned byte ranges
and parsing raw byte buffers into Read objects.
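
To illustrate what scanning into pack-aligned byte ranges could involve, here is a simplified sketch. It assumes strict four-line FASTQ records with '\n' line endings and works on an in-memory buffer; PackRange and buildPackIndex are hypothetical names, and the real FastqChunkIndex presumably scans the file itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: one pack's byte range within the input.
struct PackRange {
    size_t offset;
    size_t length;
};

// Counts newlines rather than looking for '@', because '@' is also a valid
// quality character. Every 4 lines is one read; every readsPerPack reads
// closes one pack.
static std::vector<PackRange> buildPackIndex(const std::string& data, size_t readsPerPack) {
    assert(readsPerPack > 0);
    std::vector<PackRange> packs;
    const size_t linesPerPack = readsPerPack * 4;
    size_t packStart = 0;
    size_t lines = 0;
    for (size_t pos = 0; pos < data.size(); pos++) {
        if (data[pos] != '\n') continue;
        if (++lines == linesPerPack) {
            packs.push_back({packStart, pos + 1 - packStart});
            packStart = pos + 1;
            lines = 0;
        }
    }
    // The last pack may hold fewer reads than readsPerPack.
    if (packStart < data.size()) {
        packs.push_back({packStart, data.size() - packStart});
    }
    return packs;
}
```

For PE input, indexing R1 and R2 with the same readsPerPack yields equal pack counts only when both files hold the same number of reads, which matches the precondition stated above.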

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>