Skip to content

[s3] Add range_chunk_size param to read using multiple GET requests#887

Merged
ddelange merged 22 commits intodevelopfrom
chunked_s3
Oct 20, 2025
Merged

[s3] Add range_chunk_size param to read using multiple GET requests#887
ddelange merged 22 commits intodevelopfrom
chunked_s3

Conversation

@ddelange
Copy link
Copy Markdown
Collaborator

@ddelange ddelange commented Oct 9, 2025

Please pick a concise, informative and complete title for your PR.

The title is important because it will appear in our change log.

Motivation

Please explain the motivation behind this PR.

If you're fixing a bug, link to the issue using a supported keyword like "Fixes #{issue_number}".

If you're adding a new feature, then consider opening a ticket and discussing it with the maintainers before you actually do the hard work.

Fixes #725. Setting range_chunk_size > 0 will protect S3 servers from open range headers.

Based on #883, many thanks @batterseapower!

Tests

If you're fixing a bug, consider test-driven development:

  1. Create a unit test that demonstrates the bug. The test should fail.
  2. Implement your bug fix.
  3. The test you created should now pass.

If you're implementing a new feature, include unit tests for it.

Make sure all existing unit tests pass.
You can run them locally using:

pytest tests

If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

Work in progress

If you're still working on your PR, mark the PR as draft PR.

We'll skip reviewing it for the time being.

Once it's ready, mark the PR as "ready for review", and ping one of the maintainers (e.g. mpenkov).

Checklist

Before you mark the PR as "ready for review", please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Run python update_helptext.py in case there are API changes
  • Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.

Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

batterseapower and others added 10 commits September 16, 2025 14:19
This commit adds a new `range_chunk_size` parameter to the S3 reader that allows
reading large files in smaller chunks. This is useful when you only need to read
small portions of large S3 files, as it prevents S3-compatible storage systems
from queueing up the entire file internally.

Key changes:
- Added range_chunk_size parameter to smart_open.s3.open() and related classes
- Modified _SeekableRawReader to support chunked reading with proper boundary handling
- Added comprehensive test coverage including adversarial testing for retry logic
- Improved error handling for edge cases (negative offsets, empty files, etc.)

When range_chunk_size is None (default), behavior is unchanged - single request
for the whole file to minimize per-request costs on S3.
🏠 Remote-Dev: homespace
🏠 Remote-Dev: homespace
🏠 Remote-Dev: homespace
🏠 Remote-Dev: homespace
@ddelange ddelange force-pushed the chunked_s3 branch 7 times, most recently from 4f917ff to 4c2587e Compare October 9, 2025 17:34
@ddelange ddelange changed the title [s3] Add range_chunk_size to read from using multiple GET requests [s3] Add range_chunk_size to read using multiple GET requests Oct 9, 2025
@ddelange ddelange force-pushed the chunked_s3 branch 12 times, most recently from c9397c4 to 22261fd Compare October 9, 2025 21:12
@ddelange ddelange changed the title [s3] Add range_chunk_size to read using multiple GET requests [s3] Add range_chunk_size param to read using multiple GET requests Oct 9, 2025
…o chunked_s3

* 'develop' of https://github.com/piskvorky/smart_open:
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
Comment thread smart_open/s3.py Outdated
Comment thread smart_open/s3.py
Comment on lines +608 to +611
#
# range request may not always return partial content, see:
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests#partial_request_responses
#
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from unindenting one level here (hence the ugly diff), the changes below are minimal.

we just handle the 200 case properly: @batterseapower I simplified your discarding bytes logic and moved it into the 200 block since that's the only scenario where it's needed. the AdversarialClient was adjusted accordingly (and also unindented).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff fixed by merging #889 -- it includes the discarding bytes logic in the 200 case (valid improvement independent of this PR)

@ddelange ddelange marked this pull request as ready for review October 9, 2025 21:49
@ddelange ddelange requested review from mpenkov and piskvorky October 11, 2025 19:07
@ddelange
Copy link
Copy Markdown
Collaborator Author

@mpenkov @piskvorky @batterseapower please have a look 🙏

…o chunked_s3

* 'develop' of https://github.com/piskvorky/smart_open:
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
@ddelange ddelange merged commit 804e8a3 into develop Oct 20, 2025
6 checks passed
@ddelange ddelange deleted the chunked_s3 branch October 20, 2025 08:35
@ddelange ddelange restored the chunked_s3 branch October 20, 2025 08:36
@ddelange ddelange deleted the chunked_s3 branch October 20, 2025 08:36
ddelange added a commit that referenced this pull request Oct 20, 2025
* develop:
  Update CHANGELOG.md
  Use compression.zstd (PEP-784) (#895)
  Drop python 3.8, add python 3.14 (#896)
  [s3] Add range_chunk_size param to read using multiple GET requests (#887)
  Run tests in parallel (#893)
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
  Simplify CI, use uv (#890)
  [s3] Improve handling of InvalidRange and seek on empty file (#889)
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
  Remove travis leftover (#881)
  Disambiguate URI examples in README.rst (#879)
@github-actions
Copy link
Copy Markdown

Released v7.4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[s3] Ranged read http header leaves TO bytes blank, causing S3 filesystems to read full file length

2 participants