Skip to content

fix: _copy_tree respects .dockerignore patterns to avoid OSError on long paths#3166

Open
dreamcreated wants to merge 1 commit intoopendatahub-io:mainfrom
dreamcreated:fix/sandbox-dockerignore-copy-tree
Open

fix: _copy_tree respects .dockerignore patterns to avoid OSError on long paths#3166
dreamcreated wants to merge 1 commit intoopendatahub-io:mainfrom
dreamcreated:fix/sandbox-dockerignore-copy-tree

Conversation

@dreamcreated
Copy link
Copy Markdown

@dreamcreated dreamcreated commented Mar 21, 2026

Summary

Fixes #3150

The sandbox script's _copy_tree walked all directories via os.walk(followlinks=True) without filtering, including content-addressed directory stores like .pnpm-store that can have deeply recursive path structures. This caused:

OSError: [Errno 63] File name too long: '.../jupyter/utils/addons/.pnpm-store/v10/projects/.../.pnpm-store/v10/projects/.../.pnpm-store/...'

Changes

  • _load_dockerignore(root) — parse .dockerignore, return non-comment, non-empty lines. No new dependencies (stdlib only).
  • _ignored_dir_names(root) — extract bare directory names from **/name/ patterns only. Root-relative patterns (e.g. ci/, bin/) are intentionally excluded since they only apply at the repo root, not inside arbitrary prerequisite subdirectories.
  • _copy_tree — accepts optional dir_ignore_names: set[str] parameter; prunes matching entries from dirnames in-place so os.walk never descends into them.
  • setup_sandbox — computes the ignore set once from .dockerignore and passes it to all _copy_tree calls.

Tests

Added three new test classes:

  • TestLoadDockerignore — covers missing file, comment skipping, and normal lines
  • TestIgnoredDirNames — covers globstar extraction, exclusion of root-relative patterns, exclusion of nested path patterns
  • TestCopyTreeWithIgnore — covers node_modules, .pnpm-store, and no-ignore baseline

Current .dockerignore patterns handled

**/node_modules/     → skips "node_modules" dirs at any depth
**/.mypy_cache/      → skips ".mypy_cache" dirs at any depth
**/.pytest_cache/    → skips ".pytest_cache" dirs at any depth
**/__pycache__/      → skips "__pycache__" dirs at any depth

Note: .pnpm-store is not yet in the root .dockerignore — it was added separately in PR #3142. Once that merges, this fix will also handle it automatically.

Summary by CodeRabbit

  • New Features

    • Added .dockerignore-based directory exclusion during sandbox setup to skip specified directories.
  • Performance

    • Sandbox copy now avoids descending into ignored directories, speeding up setup and reducing created files.
  • Tests

    • Expanded tests for dockerignore parsing, globstar and dot-directory handling, missing file behavior, and copy behavior with/without ignores.

@github-actions github-actions bot added the review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel label Mar 21, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 21, 2026

Hi @dreamcreated. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: f5aa85fd-4968-4f23-b583-941951b7936f

📥 Commits

Reviewing files that changed from the base of the PR and between d619dbd and 79544dc.

📒 Files selected for processing (2)
  • scripts/sandbox.py
  • scripts/sandbox_tests.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/sandbox_tests.py
  • scripts/sandbox.py

📝 Walkthrough

Walkthrough

Added .dockerignore-aware directory pruning to the sandbox copy logic. New helpers _load_dockerignore(root) (parses .dockerignore, skips comments/blanks) and _ignored_dir_names(root) (extracts bare directory names from patterns like **/<name>/ or <name>/) were introduced. _copy_tree gained an optional dir_ignore_names: set[str] | None parameter and now filters dirnames in-place during os.walk to avoid descending into ignored directories. setup_sandbox computes the ignore set once and passes it to _copy_tree calls. Tests were added to exercise the new behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title uses imperative mood ('fix') and clearly describes the main change, but lacks a ticket reference (JIRA/GitHub issue) as a strong suggestion.
Description check ✅ Passed Description provides clear summary, changes, tests, and rationale, but is missing the self-checklist and merge criteria completions required by the template.
Linked Issues check ✅ Passed Code changes fully implement requirements from #3150: added _load_dockerignore and _ignored_dir_names functions, modified _copy_tree with dir_ignore_names parameter, and integrated into setup_sandbox with proper tests.
Out of Scope Changes check ✅ Passed All changes directly address #3150: helpers for .dockerignore parsing, ignore-set computation, tree-copy filtering, and comprehensive unit tests—no out-of-scope modifications detected.
Branch Prefix Policy ✅ Passed PR targets main branch with title 'fix: _copy_tree respects .dockerignore patterns to avoid OSError' correctly lacking branch prefix.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from daniellutz and ysok March 21, 2026 02:14
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign daniellutz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added size/l and removed size/l labels Mar 21, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
scripts/sandbox_tests.py (2)

64-119: Missing test coverage for edge cases in pattern parsing.

Consider adding tests for:

  1. Malformed patterns like **/ (empty name after stripping)
  2. Patterns with trailing whitespace before newline
  3. .dockerignore with Windows-style line endings (\r\n)

These edge cases could surface in real repositories.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/sandbox_tests.py` around lines 64 - 119, Add unit tests to cover edge
cases in .dockerignore parsing: create tests (e.g., in TestLoadDockerignore)
that write malformed patterns like "**/\n" and patterns with trailing whitespace
("bin/ \n") and Windows CRLF endings ("\r\n") then assert
_load_dockerignore(root) strips whitespace, ignores empty entries, and
normalizes line endings so returned list contains only valid patterns like
"bin/"; similarly add tests in TestIgnoredDirNames that write "**/\n", "bin/
\r\n" and CRLF files and assert _ignored_dir_names(root) does not include empty
names, correctly includes "bin" for recursive patterns, and ignores
root-relative or nested-path patterns as before. Ensure tests reference
_load_dockerignore and _ignored_dir_names so parsing logic is exercised for
these edge cases.

122-164: No symlink traversal test for _copy_tree with dir_ignore_names.

The implementation uses followlinks=True. A test confirming that ignored directory filtering works correctly when symlinks are involved (e.g., a symlink named node_modules pointing elsewhere) would strengthen confidence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/sandbox_tests.py` around lines 122 - 164, Add a test to cover symlink
traversal with dir_ignore_names: create a real directory outside the src (e.g.,
outside tmp_path/src) containing files, then inside src create a symlink named
"node_modules" (or ".pnpm-store") pointing to that external directory and call
_copy_tree with dir_ignore_names including that symlink name; assert that the
symlinked directory and its contents are not copied into dst. Place this new
test alongside TestCopyTreeWithIgnore (e.g.,
test_symlink_named_ignored_dir_not_copied) to ensure _copy_tree's
followlinks=True does not cause ignored names to be traversed and copied.
scripts/sandbox.py (2)

119-128: Negation patterns (!pattern) are silently ignored.

.dockerignore supports negation (e.g., !**/important/). The current implementation silently drops these. If a repo uses negation to un-ignore specific directories, they'll still be excluded here.

This is documented as intentional, but consider logging a warning when negation patterns are encountered so users aren't surprised.

Suggested change
 def _ignored_dir_names(root: pathlib.Path) -> set[str]:
     ...
     ignored: set[str] = set()
     for pattern in _load_dockerignore(root):
+        if pattern.startswith("!"):
+            log.debug("negation pattern ignored (unsupported)", pattern=pattern)
+            continue
         # Only accept "**/<name>/" – i.e., starts with "**/" and ends with "/"
         # with no additional "/" in between.
         if not pattern.startswith("**/") or not pattern.endswith("/"):
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/sandbox.py` around lines 119 - 128, The loop that builds `ignored`
from `_load_dockerignore` silently skips negation patterns like
`!**/important/`; update the loop in the same function (the block that iterates
`for pattern in _load_dockerignore(root):`) to detect patterns that start with
"!" and emit a warning including the raw pattern (e.g., via logging.warning or
the module logger) before continuing, so users are informed that negation
patterns are intentionally ignored; keep the existing behavior of not treating
negations as un-ignore.

103-106: Variable shadowing reduces readability.

line is reused for both the raw and stripped value. Minor, but a distinct name like stripped clarifies intent.

Suggested change
     for line in dockerignore.read_text().splitlines():
-        line = line.strip()
-        if line and not line.startswith("#"):
-            lines.append(line)
+        stripped = line.strip()
+        if stripped and not stripped.startswith("#"):
+            lines.append(stripped)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/sandbox.py` around lines 103 - 106, The loop that reads dockerignore
currently reuses the variable name `line` for both the raw and stripped value,
reducing readability; change the loop to iterate with a distinct name (e.g.,
`raw_line`) and assign `stripped = raw_line.strip()`, then use `stripped` in the
conditional and `lines.append(stripped)` so the original raw text and the
processed value are not shadowed (update occurrences around the for loop that
currently references `line` and `lines`).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/sandbox_tests.py`:
- Around line 64-119: Add unit tests to cover edge cases in .dockerignore
parsing: create tests (e.g., in TestLoadDockerignore) that write malformed
patterns like "**/\n" and patterns with trailing whitespace ("bin/ \n") and
Windows CRLF endings ("\r\n") then assert _load_dockerignore(root) strips
whitespace, ignores empty entries, and normalizes line endings so returned list
contains only valid patterns like "bin/"; similarly add tests in
TestIgnoredDirNames that write "**/\n", "bin/ \r\n" and CRLF files and assert
_ignored_dir_names(root) does not include empty names, correctly includes "bin"
for recursive patterns, and ignores root-relative or nested-path patterns as
before. Ensure tests reference _load_dockerignore and _ignored_dir_names so
parsing logic is exercised for these edge cases.
- Around line 122-164: Add a test to cover symlink traversal with
dir_ignore_names: create a real directory outside the src (e.g., outside
tmp_path/src) containing files, then inside src create a symlink named
"node_modules" (or ".pnpm-store") pointing to that external directory and call
_copy_tree with dir_ignore_names including that symlink name; assert that the
symlinked directory and its contents are not copied into dst. Place this new
test alongside TestCopyTreeWithIgnore (e.g.,
test_symlink_named_ignored_dir_not_copied) to ensure _copy_tree's
followlinks=True does not cause ignored names to be traversed and copied.

In `@scripts/sandbox.py`:
- Around line 119-128: The loop that builds `ignored` from `_load_dockerignore`
silently skips negation patterns like `!**/important/`; update the loop in the
same function (the block that iterates `for pattern in
_load_dockerignore(root):`) to detect patterns that start with "!" and emit a
warning including the raw pattern (e.g., via logging.warning or the module
logger) before continuing, so users are informed that negation patterns are
intentionally ignored; keep the existing behavior of not treating negations as
un-ignore.
- Around line 103-106: The loop that reads dockerignore currently reuses the
variable name `line` for both the raw and stripped value, reducing readability;
change the loop to iterate with a distinct name (e.g., `raw_line`) and assign
`stripped = raw_line.strip()`, then use `stripped` in the conditional and
`lines.append(stripped)` so the original raw text and the processed value are
not shadowed (update occurrences around the for loop that currently references
`line` and `lines`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7a60ba32-b509-43b5-b8b2-2a76206bc399

📥 Commits

Reviewing files that changed from the base of the PR and between 7cbb4b0 and d619dbd.

📒 Files selected for processing (2)
  • scripts/sandbox.py
  • scripts/sandbox_tests.py

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 22, 2026

Codecov Report

❌ Patch coverage is 0% with 105 lines in your changes missing coverage. Please review.
✅ Project coverage is 3.23%. Comparing base (25f8889) to head (79544dc).

Files with missing lines Patch % Lines
scripts/sandbox_tests.py 0.00% 80 Missing ⚠️
scripts/sandbox.py 0.00% 25 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##            main   #3166      +/-   ##
========================================
- Coverage   3.32%   3.23%   -0.10%     
========================================
  Files         31      31              
  Lines       3550    3651     +101     
  Branches     569     576       +7     
========================================
  Hits         118     118              
- Misses      3430    3531     +101     
  Partials       2       2              
Flag Coverage Δ
python 3.23% <0.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
scripts/sandbox.py 0.00% <0.00%> (ø)
scripts/sandbox_tests.py 0.00% <0.00%> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25f8889...79544dc. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jiridanek
Copy link
Copy Markdown
Member

/lgtm, it's what the ticket asked for; I'm still a bit unhappy about the amount of code this requires; but I'd be even unhappier bringing in random deps like https://github.com/mherrmann/gitignore_parser, so, ok...

I think we'll merge it

The sandbox script's _copy_tree walked all directories via os.walk,
including content-addressed stores like .pnpm-store that have deeply
recursive path structures. This caused:

  OSError: [Errno 63] File name too long

Fixes opendatahub-io#3150

Changes:
- Add _load_dockerignore(root) to parse .dockerignore lines
- Add _ignored_dir_names(root) to extract bare dir names from **/<name>/
  patterns (only globstar patterns apply at arbitrary depth; root-relative
  patterns like ci/ or bin/ are intentionally excluded)
- Modify _copy_tree to accept dir_ignore_names: set[str] and prune
  matching entries from dirnames in-place before os.walk descends
- Update setup_sandbox to compute the ignore set once and pass it to
  all _copy_tree calls
- Add TestLoadDockerignore, TestIgnoredDirNames, TestCopyTreeWithIgnore
  test classes covering the new helpers and the pruning behaviour

No new stdlib imports are required.
@jiridanek jiridanek force-pushed the fix/sandbox-dockerignore-copy-tree branch from d619dbd to 79544dc Compare March 31, 2026 08:03
@openshift-ci openshift-ci bot added size/l and removed size/l labels Mar 31, 2026
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai the path comparison logic is fragile and will cause problems down the read, confirm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiridanek — confirmed, there are several fragility points in the logic:


1. !-negation patterns are silently swallowed (most impactful)

.dockerignore supports negation with !:

**/node_modules/
!**/node_modules/special-keep/

_load_dockerignore retains !**/node_modules/special-keep/ (it's not a comment), but _ignored_dir_names drops it because it doesn't startswith("**/"). The original **/node_modules/ is still added to the ignore set, so special-keep is incorrectly pruned. Negations are silently lost with no warning.

Fix: strip a leading ! before the startswith check, and skip adding the base pattern when a negation for it exists — or at minimum emit a log.warning that negation is unsupported.


2. if dir_ignore_names: — falsy empty set skips pruning

if dir_ignore_names:          # empty set() is falsy
    dirnames[:] = [...]

If _ignored_dir_names returns set() (no matching patterns), the guard is False and pruning is skipped — which is the correct outcome here — but the intent reads as "was a set passed at all?" A cleaner guard is:

if dir_ignore_names is not None:
    dirnames[:] = [d for d in dirnames if d not in dir_ignore_names]

This also makes the None-vs-empty-set distinction explicit.


3. Magic-number slice pattern[3:-1]

name = pattern[3:-1]  # strip leading "**/" and trailing "/"

This works today but is brittle. removeprefix / removesuffix is both safer and self-documenting:

name = pattern.removeprefix("**/").removesuffix("/")

4. Case-sensitivity on macOS

d not in dir_ignore_names is a plain string equality check. On a case-insensitive filesystem (macOS default), a directory named Node_Modules would not match node_modules in the set and would slip through. This is a minor/theoretical issue for this use case but worth noting.


Summary table:

Issue Severity Correct behavior today?
! negation silently ignored Medium ❌ incorrect pruning possible
if dir_ignore_names: vs is not None Low ✅ but misleading
pattern[3:-1] magic slice Low ✅ but fragile to malformed input
Case-insensitive FS Low/Theoretical ❌ missed pruning on macOS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-ok-to-test review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel size/l

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sandbox script _copy_tree should respect .dockerignore patterns

3 participants