fix: pipeline hangs when submitting from compute nodes#450
cmeesters merged 4 commits into snakemake:main from
Conversation
When running snakemake from within a SLURM job (e.g., an interactive session on a compute node), the pipeline would submit jobs but never detect their completion, hanging forever.

The `RemoteExecutor` base class starts a status-checking daemon thread in `__init__` before `__post_init__` is called. The SLURM plugin's `warn_on_jobcontext()` in `__post_init__` would sleep 5 seconds and then delete SLURM environment variables, but by then the daemon thread had already started and would silently die after its first polling cycle.

Fix: move the SLURM environment detection and cleanup into `__init__`, before `super().__init__()` starts the daemon thread. Remove the now-unnecessary `warn_on_jobcontext()` method and its 5-second sleep.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
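The ordering problem described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the class names mirror the description, but `SLURM_VARS`, `_start_status_thread`, and `saw_slurm_env` are hypothetical stand-ins for the real daemon-thread startup and SLURM environment handling.

```python
import os

# Illustrative subset of the inherited SLURM job-context variables.
SLURM_VARS = ["SLURM_JOB_ID", "SLURM_CPUS_ON_NODE"]

class RemoteExecutor:
    def __init__(self):
        # The base class starts its status-checking daemon thread here,
        # *before* any subclass __post_init__ hook gets a chance to run.
        self._start_status_thread()
        self.__post_init__()

    def __post_init__(self):
        pass

    def _start_status_thread(self):
        # Stand-in for the real polling thread: record whether it would
        # have started while the inherited SLURM variables were still set.
        self.saw_slurm_env = any(v in os.environ for v in SLURM_VARS)

class Executor(RemoteExecutor):
    def __init__(self):
        # The fix: scrub the inherited SLURM job context *before* the
        # base class starts its daemon thread. Doing this in
        # __post_init__ (as warn_on_jobcontext() did) is too late.
        for var in SLURM_VARS:
            os.environ.pop(var, None)
        super().__init__()
```

With the cleanup in `__init__`, the daemon thread never observes the inherited job context, regardless of when `__post_init__` runs.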
🧹 Nitpick comments (1)

tests/test_cli.py (1)

37-50: Please add a regression test that hits `Executor.__init__()`. These tests still build the object with `Executor.__new__()` and call `__post_init__()` directly, so the moved cleanup path in `Executor.__init__()` is never exercised. Please add one test that instantiates `Executor(...)` with `SLURM_JOB_ID` set and patches `RemoteExecutor.__init__()` to assert the environment is already cleaned before base initialization.
📒 Files selected for processing (2): snakemake_executor_plugin_slurm/__init__.py, tests/test_cli.py
Thanks for this PR! At the Snakemake Hackathon I noticed that even when unsetting all
@jayhesselberth I am actually fine with this PR. Will you apply

What I meant by my last remark: if you have an order of commands which solves the start-within-jobcontext issue, I am eager to learn.
@cmeesters in our case, it was a combination of this fix and not having
cmeesters left a comment
@jayhesselberth Ok, I will fix the formatting prior to the next release, but will merge it already.