perf: parallel pread for SE/PE uncompressed FASTQ input #674
Open
KimBioInfoStudio wants to merge 3 commits into OpenGene:master from
Previously pwrite mode only activated for .gz output (parallel libdeflate compression). Now any non-stdout multi-threaded output uses pwrite: worker threads write directly to non-overlapping file regions via pwrite(2), bypassing the single writer thread.

- For .gz: compress with libdeflate, then pwrite (existing behavior).
- For .fq: pwrite raw output bytes directly (new).

Also adds hybrid backoff to the pwrite spin-wait: yield for 256 iterations, then usleep(1), preventing CPU burn when threads wait on predecessor sequence numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
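To make the mechanism in this commit message concrete, here is a minimal sketch of ordered offset claiming with the hybrid backoff, followed by pwrite(2) into non-overlapping regions. It is not the code from this PR: `OffsetTicket`, `pwriteAll`, and `writeChunksInParallel` are hypothetical names, and the sketch assumes each chunk already carries a dense sequence number starting at 0.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not fastp's class): hands out file offsets in
// sequence-number order so that regions never overlap.
struct OffsetTicket {
    std::atomic<uint64_t> nextSeq{0};  // sequence number allowed to claim next
    uint64_t nextOffset = 0;           // only touched by the thread whose turn it is

    // Wait until it is `seq`'s turn, then reserve `len` bytes.
    uint64_t claim(uint64_t seq, size_t len) {
        unsigned spins = 0;
        while (nextSeq.load(std::memory_order_acquire) != seq) {
            // Hybrid backoff: yield for the first 256 iterations, then sleep.
            if (++spins <= 256) std::this_thread::yield();
            else usleep(1);
        }
        uint64_t off = nextOffset;
        nextOffset += len;
        nextSeq.store(seq + 1, std::memory_order_release);  // hand off to successor
        return off;
    }
};

// Write the whole buffer at `off`, retrying on short writes.
bool pwriteAll(int fd, const char* buf, size_t len, off_t off) {
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n <= 0) return false;  // error handling kept minimal for the sketch
        buf += n;
        len -= static_cast<size_t>(n);
        off += n;
    }
    return true;
}

// Demo driver: one thread per chunk, started in reverse order so that
// early threads really do have to wait on their predecessors.
std::string writeChunksInParallel(const std::vector<std::string>& chunks,
                                  const char* path) {
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    assert(fd >= 0);
    OffsetTicket ticket;
    std::vector<std::thread> workers;
    for (size_t i = chunks.size(); i-- > 0;) {
        workers.emplace_back([&, i] {
            uint64_t off = ticket.claim(i, chunks[i].size());
            // The actual write happens outside the wait, in parallel.
            pwriteAll(fd, chunks[i].data(), chunks[i].size(),
                      static_cast<off_t>(off));
        });
    }
    for (auto& t : workers) t.join();

    // Read the file back so the caller can check the byte order.
    std::string out(ticket.nextOffset, '\0');
    ssize_t n = out.empty() ? 0 : pread(fd, &out[0], out.size(), 0);
    if (n != static_cast<ssize_t>(out.size())) out.clear();
    close(fd);
    return out;
}
```

Only the offset reservation is serialized; the pwrite calls themselves overlap, which is what lets the workers bypass a single writer thread.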
Worker threads use pread(2) to read chunks from uncompressed FASTQ files in parallel, bypassing the single reader thread bottleneck. Each thread atomically grabs the next pack index via fetch_add, reads its chunk with pread, parses it, and processes it.

WriterThread gains ordered-mode (setOrderedMode/inputWithSeq):
- pwrite path: uses pack index as sequence number in offset ring
- non-pwrite path: ordered ring buffer drained by writer thread

Parallel pread is disabled for stdin, .gz input, split mode, and readsToProcess limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
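The claim-and-read loop described above could look roughly like the following. This is an illustrative sketch rather than the PR's implementation: `ByteRange` and `preadChunksInParallel` are hypothetical names, the byte ranges are assumed to have been computed beforehand, and parsing is left out so that only the fetch_add plus pread(2) pattern is shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical description of one pack: where it starts and how long it is.
struct ByteRange {
    off_t offset;
    size_t length;
};

// Each worker claims the next pack index with fetch_add and reads that pack
// with pread(2). pread takes an explicit offset, so threads can share one
// file descriptor without racing on a file position.
std::vector<std::string> preadChunksInParallel(const char* path,
                                               const std::vector<ByteRange>& ranges,
                                               int nThreads) {
    assert(nThreads > 0);
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);

    std::vector<std::string> results(ranges.size());
    std::atomic<size_t> nextPack{0};

    auto worker = [&] {
        for (;;) {
            size_t i = nextPack.fetch_add(1, std::memory_order_relaxed);
            if (i >= ranges.size()) break;  // no packs left

            std::string buf(ranges[i].length, '\0');
            size_t done = 0;
            while (done < buf.size()) {  // loop to handle short reads
                ssize_t n = pread(fd, &buf[done], buf.size() - done,
                                  ranges[i].offset + static_cast<off_t>(done));
                if (n <= 0) break;  // EOF or error: keep what was read
                done += static_cast<size_t>(n);
            }
            buf.resize(done);
            // Slot i belongs to this worker only, so no lock is needed.
            results[i] = std::move(buf);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    close(fd);
    return results;
}
```

Because every result is stored under its pack index, that same index can later serve as the sequence number for ordered output.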
Worker threads use pread(2) to read R1/R2 chunks in parallel, bypassing the dual reader thread bottleneck. Both files are indexed with FastqChunkIndex; pack counts must match (same number of reads). Each thread atomically grabs a pack index, preads from both files, parses both into ReadPacks, and calls processPairEnd.

All 7 PE writers (left, right, unpaired x2, merged, failed, overlapped) use pack-index-based ordered output via inputWithSeq.

Parallel pread is disabled for interleaved input, stdin, .gz input, split mode, and readsToProcess limit.

Also adds FastqChunkIndex and FastqChunkParser utility classes for scanning uncompressed FASTQ files into pack-aligned byte ranges and parsing raw byte buffers into Read objects.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
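As a rough illustration of what "pack-aligned byte ranges" means, the sketch below scans an in-memory buffer and cuts it every `readsPerPack` records. It is not the FastqChunkIndex from this PR: `PackRange`, `buildPackIndex`, and `packCountsMatch` are hypothetical names, and the sketch assumes strict 4-line FASTQ records with `\n` line endings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: byte offset and length of one pack.
struct PackRange {
    size_t offset;
    size_t length;
};

// Scan the buffer once, counting newlines. Every readsPerPack * 4 lines
// closes a pack, so each range starts and ends on a record boundary.
std::vector<PackRange> buildPackIndex(const std::string& data, size_t readsPerPack) {
    assert(readsPerPack > 0);
    std::vector<PackRange> packs;
    const size_t linesPerPack = readsPerPack * 4;  // assumes 4-line records
    size_t start = 0;
    size_t lines = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] != '\n') continue;
        if (++lines == linesPerPack) {
            packs.push_back({start, i + 1 - start});
            start = i + 1;
            lines = 0;
        }
    }
    // The last pack may hold fewer reads than readsPerPack.
    if (start < data.size()) packs.push_back({start, data.size() - start});
    return packs;
}

// For paired-end input, pack i of R1 must pair with pack i of R2, so the
// two indexes need the same number of packs before parallel reading starts.
bool packCountsMatch(const std::vector<PackRange>& r1,
                     const std::vector<PackRange>& r2) {
    return r1.size() == r2.size();
}
```

With an index like this, a worker that holds pack index i can read `packs[i]` from R1 and from R2 independently of every other worker.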
Summary
- Worker threads read uncompressed FASTQ input with pread(2) in parallel, bypassing the single reader thread bottleneck. A FastqChunkIndex scans the file once to build pack-aligned byte offsets, then each worker atomically grabs the next pack index, preads its chunk, and parses it into reads.
- Extends parallel pwrite(2) output (previously gz-only) to raw .fq output. Worker threads write directly to non-overlapping file regions, bypassing the single writer thread.
- WriterThread gains inputWithSeq(tid, data, seq), which uses the pack index as the sequence number: in pwrite mode via the offset ring, and in non-pwrite mode (stdout) via a new ordered ring buffer drained by the writer thread.

Commits
- perf: extend pwrite parallel write to raw FASTQ output (raw fq pwrite + hybrid spin backoff)
- perf: SE parallel pread with ordered output (SE parallel read + ordered writer mode)
- perf: PE parallel pread with ordered output (PE dual-file parallel read + all 7 writers ordered)

When parallel pread activates
- Input is uncompressed FASTQ (not .gz)
- Input is a file, not stdin
- Input is not interleaved (PE)
- Split mode is off
- No --reads_to_process limit

Falls back to the existing sequential reader otherwise.

Benchmark (Apple M4 Pro, 10M PE reads, -w 14)

Small -w shows no benefit on fast NVMe: the reader thread is not the bottleneck at low thread counts. Requesting a benchmark on high-core-count x86 systems, where the single reader thread may become a bottleneck with large -w.

@sfchen Could you benchmark this on a high-core x86 machine with -w 32 or higher on large uncompressed FASTQ files? The parallel pread should show more benefit when the sequential reader becomes the bottleneck.

Test plan
-w 8) completes without hang

🤖 Generated with Claude Code