fix: handle file names exceeding OS limit in cache (#539)#572
Open
Ashut0sh-mishra wants to merge 4 commits intoallenai:mainfrom
Open
fix: handle file names exceeding OS limit in cache (#539)#572Ashut0sh-mishra wants to merge 4 commits intoallenai:mainfrom
Ashut0sh-mishra wants to merge 4 commits intoallenai:mainfrom
Conversation
Long entity names or cache keys could exceed the 255-character filesystem limit causing OSError. Changed url_to_filename() to only preserve the file extension instead of the full trailing URL component, keeping filenames under 143 bytes (eCryptfs limit). Added backward-compat lookup for old-style cache entries. Fixes allenai#539 Co-authored-by: nik464 <[email protected]>
cthoyt
reviewed
Apr 15, 2026
Replaced url.split('/') with PurePosixPath().name as
suggested in review - avoids hardcoded path separator
so it works on Windows too.
cthoyt
reviewed
Apr 15, 2026
Per cthoyt's review - urlparse() is the right tool for decomposing URLs, not pathlib.
cthoyt
reviewed
Apr 15, 2026
Author
|
Hi @cthoyt, I’ve addressed all feedback. Could you please review when you have time? Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #539
Traced through the OSError traceback — it lands on
open(cache_path, "wb") inside get_from_cache(). The root
cause is url_to_filename() appending the full trailing URL
component (e.g. tfidf_vectors_sparse.npz) to a double-sha256
hash, producing filenames up to ~155 characters. That exceeds
eCryptfs's 143-byte NAME_MAX.
Fixed url_to_filename() to only keep the file extension
(.npz, .bin, etc.) instead of the whole filename, capping
output at 133 chars worst case. Also added
_find_legacy_cache_path() so existing cached files are still
found without re-downloading.
Changes:
All 6 tests pass (3 existing + 3 new).
Co-authored-by: nik464 [email protected]