[s3] Add range_chunk_size param to read using multiple GET requests#887
Merged
This commit adds a new `range_chunk_size` parameter to the S3 reader that allows reading large files in smaller chunks. This is useful when you only need to read small portions of large S3 files, as it prevents S3-compatible storage systems from queueing up the entire file internally.

Key changes:
- Added `range_chunk_size` parameter to `smart_open.s3.open()` and related classes
- Modified `_SeekableRawReader` to support chunked reading with proper boundary handling
- Added comprehensive test coverage including adversarial testing for retry logic
- Improved error handling for edge cases (negative offsets, empty files, etc.)

When `range_chunk_size` is None (default), behavior is unchanged: a single request is made for the whole file to minimize per-request costs on S3.
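To make the difference between the two modes concrete, here is a minimal sketch of how a reader could split a read into bounded `Range` headers. It is an illustration only, not the code from this PR: the helper name `iter_range_headers` and its `start`/`size` arguments are hypothetical, and only the `range_chunk_size` name comes from the description above.

```python
from typing import Iterator, Optional


def iter_range_headers(
    start: int, size: int, range_chunk_size: Optional[int] = None
) -> Iterator[str]:
    """Yield HTTP Range header values covering bytes [start, size).

    Hypothetical helper for illustration; not part of smart_open.
    """
    if start < 0 or start >= size:
        return  # nothing to read: empty object, or position at/after EOF
    if range_chunk_size is None:
        # Default mode: one open-ended request for the rest of the object.
        yield f"bytes={start}-"
        return
    if range_chunk_size <= 0:
        raise ValueError("range_chunk_size must be positive")
    pos = start
    while pos < size:
        # The end of an HTTP byte range is inclusive, hence the -1.
        stop = min(pos + range_chunk_size, size) - 1
        yield f"bytes={pos}-{stop}"
        pos = stop + 1


chunked = list(iter_range_headers(0, 10, range_chunk_size=4))
open_ended = list(iter_range_headers(5, 10))
```

With a chunk size set, every request names an explicit end byte, so the server never has to prepare more than one chunk at a time; with `None`, a single open-ended request is issued.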
🏠 Remote-Dev: homespace
4f917ff to 4c2587e
c9397c4 to 22261fd
…o chunked_s3
* 'develop' of https://github.com/piskvorky/smart_open:
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
ddelange commented Oct 9, 2025
Comment on lines +608 to +611
#
# range request may not always return partial content, see:
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests#partial_request_responses
#
Collaborator, Author
apart from unindenting one level here (hence the ugly diff), the changes below are minimal.
we just handle the 200 case properly: @batterseapower I simplified your discarding bytes logic and moved it into the 200 block since that's the only scenario where it's needed. the AdversarialClient was adjusted accordingly (and also unindented).
Collaborator, Author
diff fixed by merging #889 -- it includes the discarding bytes logic in the 200 case (valid improvement independent of this PR)
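As a rough sketch of the "discarding bytes in the 200 case" idea discussed in this thread: when a server ignores the `Range` header and answers 200 with the full body, the reader can skip the leading bytes itself. The function name `body_at_offset` and its arguments are assumptions made for this illustration; the merged code in #889 may be structured differently.

```python
import io
from typing import BinaryIO


def body_at_offset(status: int, body: BinaryIO, start: int) -> BinaryIO:
    """Return a stream positioned at byte `start` of the object.

    Hypothetical helper for illustration; not the smart_open implementation.
    """
    if status == 206:
        # Partial Content: the server honoured the Range header,
        # so the body already begins at `start`.
        return body
    if status == 200:
        # Full content: the Range header was ignored, so read and
        # discard the first `start` bytes in bounded pieces.
        remaining = start
        while remaining > 0:
            skipped = body.read(min(remaining, 64 * 1024))
            if not skipped:
                break  # body ended before reaching `start`
            remaining -= len(skipped)
        return body
    raise ValueError(f"unexpected HTTP status {status}")


full = body_at_offset(200, io.BytesIO(b"0123456789"), 4).read()
partial = body_at_offset(206, io.BytesIO(b"456789"), 4).read()
```

Both calls leave the caller with the same bytes, which is why the discarding step only needs to live in the 200 branch.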
…o chunked_s3 * 'develop' of https://github.com/piskvorky/smart_open: [s3] Improve handling of InvalidRange and seek on empty file (#889)
…o chunked_s3 * 'develop' of https://github.com/piskvorky/smart_open: Simplify CI, use uv (#890)
Collaborator, Author
@mpenkov @piskvorky @batterseapower please have a look 🙏
…o chunked_s3
* 'develop' of https://github.com/piskvorky/smart_open:
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
ddelange added a commit that referenced this pull request Oct 20, 2025
* develop:
  Update CHANGELOG.md
  Use compression.zstd (PEP-784) (#895)
  Drop python 3.8, add python 3.14 (#896)
  [s3] Add range_chunk_size param to read using multiple GET requests (#887)
  Run tests in parallel (#893)
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
  Simplify CI, use uv (#890)
  [s3] Improve handling of InvalidRange and seek on empty file (#889)
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
  Remove travis leftover (#881)
  Disambiguate URI examples in README.rst (#879)
Released v7.4.0
Motivation
Fixes #725. Setting `range_chunk_size` > 0 protects S3 servers from open-ended Range headers by giving every GET request an explicit end byte.
Based on #883, many thanks @batterseapower!
Tests
Work in progress
Checklist
`python update_helptext.py` in case there are API changes
Workflow