
fix: handle file names exceeding OS limit in cache (#539) #572

Open
Ashut0sh-mishra wants to merge 4 commits into allenai:main from Ashut0sh-mishra:fix/file-name-too-long-539

Conversation

@Ashut0sh-mishra

Fixes #539

Traced through the OSError traceback — it lands on
open(cache_path, "wb") inside get_from_cache(). The root
cause is url_to_filename() appending the full trailing URL
component (e.g. tfidf_vectors_sparse.npz) to a double-sha256
hash, producing filenames up to ~155 characters. That exceeds
eCryptfs's 143-byte NAME_MAX.
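
The length arithmetic can be checked with a small sketch. It assumes the old scheme joins a sha256 hex digest of the URL, a sha256 hex digest of the ETag, and the trailing URL component with dots; `legacy_url_to_filename`, the URL, and the ETag value are hypothetical stand-ins, not the code from `scispacy/file_cache.py`.

```python
from hashlib import sha256
from typing import Optional

ECRYPTFS_NAME_MAX = 143  # bytes, the limit cited in the description above


def legacy_url_to_filename(url: str, etag: Optional[str] = None) -> str:
    """Assumed old scheme: sha256(url)[.sha256(etag)].<trailing URL component>."""
    last_part = url.split("/")[-1]
    filename = sha256(url.encode("utf-8")).hexdigest()  # 64 hex characters
    if etag:
        # One dot plus another 64 hex characters.
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename + "." + last_part


# Hypothetical URL and ETag, chosen only to reproduce the arithmetic.
name = legacy_url_to_filename(
    "https://example.org/data/tfidf_vectors_sparse.npz", etag="abc123"
)
print(len(name))  # 64 + 1 + 64 + 1 + 24 = 154
print(len(name) > ECRYPTFS_NAME_MAX)  # True
```

Under these assumptions the name is 154 characters, in line with the "~155" figure, and well under the common 255-byte limit but over the eCryptfs one.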

Fixed url_to_filename() to only keep the file extension
(.npz, .bin, etc.) instead of the whole filename, capping
output at 133 chars worst case. Also added
_find_legacy_cache_path() so existing cached files are still
found without re-downloading.

Changes:

  • scispacy/file_cache.py — updated url_to_filename()
  • tests/test_file_cache.py — added 3 new tests

All 6 tests pass (3 existing + 3 new).

Co-authored-by: nik464 [email protected]

Cache file names built by url_to_filename() could exceed the file
name length limit on some filesystems, causing OSError. Changed
url_to_filename() to only preserve the file extension instead of the
full trailing URL component, keeping filenames under 143 bytes
(eCryptfs limit). Added backward-compat lookup for old-style cache
entries.

Fixes allenai#539

Co-authored-by: nik464 <[email protected]>
Comment thread scispacy/file_cache.py Outdated
Replaced url.split('/') with PurePosixPath().name as
suggested in review - avoids hardcoded path separator
so it works on Windows too.
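
For illustration, a short comparison of the two ways of taking the trailing component. The URLs are hypothetical, and this is only a sketch of how `PurePosixPath` treats a URL string, not the code in the commit.

```python
from pathlib import PurePosixPath

url = "https://example.org/data/tfidf_vectors_sparse.npz"  # hypothetical URL

split_name = url.split("/")[-1]
path_name = PurePosixPath(url).name
suffix = PurePosixPath(url).suffix
print(split_name)  # tfidf_vectors_sparse.npz
print(path_name)   # tfidf_vectors_sparse.npz
print(suffix)      # .npz

# The two approaches differ when the URL ends with a slash.
trailing = "https://example.org/data/"
split_trailing = trailing.split("/")[-1]
path_trailing = PurePosixPath(trailing).name
print(repr(split_trailing))  # ''
print(path_trailing)         # data
```

Neither form strips a query string or fragment, so a URL ending in `file.npz?download=1` would yield the suffix `.npz?download=1`.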
Ashut0sh-mishra requested a review from cthoyt on April 15, 2026, 09:55
Comment thread scispacy/file_cache.py Outdated
Per cthoyt's review - urlparse() is the right tool for
decomposing URLs, not pathlib.
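
A hedged sketch of what extension extraction with `urlparse()` might look like. `url_extension` is a hypothetical helper and the URLs are made up; the PR's actual implementation may differ.

```python
import posixpath
from urllib.parse import urlparse


def url_extension(url: str) -> str:
    """Return the extension of the last path segment, ignoring query and fragment."""
    path = urlparse(url).path  # e.g. "/data/tfidf_vectors_sparse.npz"
    return posixpath.splitext(posixpath.basename(path))[1]


print(url_extension("https://example.org/data/tfidf_vectors_sparse.npz?download=1#top"))  # .npz
print(repr(url_extension("https://example.org/data/")))  # ''
print(url_extension("https://example.org/model.tar.gz"))  # .gz
```

Because `urlparse()` separates the path from the query string and fragment, the extension stays clean even when the download URL carries parameters.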
Comment thread tests/test_file_cache.py Outdated
Ashut0sh-mishra requested a review from cthoyt on April 15, 2026, 10:43
@Ashut0sh-mishra
Author

Hi @cthoyt,

I’ve addressed all feedback. Could you please review when you have time?

Thanks!

Development

Successfully merging this pull request may close these issues.

File name too long