fix: pipeline hangs when submitting from compute nodes #450

Merged
cmeesters merged 4 commits into snakemake:main from jayhesselberth:fix/compute-node-hang
Apr 17, 2026

@jayhesselberth (Contributor) commented Apr 5, 2026

When running snakemake from within a SLURM job (e.g., an interactive session on a compute node), the pipeline would submit jobs but never detect their completion, hanging forever.

The RemoteExecutor base class starts a status-checking daemon thread in __init__ before __post_init__ is called. The SLURM plugin's warn_on_jobcontext() in __post_init__ would sleep 5 seconds and then delete SLURM environment variables, but by then the daemon thread had already started and would silently die after its first polling cycle.

Fix: move the SLURM environment detection and cleanup into __init__, before super().__init__() starts the daemon thread. Remove the now unnecessary warn_on_jobcontext() method and its 5-second sleep.
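
The ordering problem described above can be illustrated with a small stand-alone sketch. It does not use the real snakemake classes: `BaseExecutor`, `LateCleanup`, and `EarlyCleanup` are hypothetical stand-ins, and the body of `delete_slurm_environment()` here (dropping every `SLURM_`-prefixed variable) is an assumption made for illustration. The sketch only records whether `SLURM_JOB_ID` is still set at the moment the base class starts its thread.

```python
import os
import threading


class BaseExecutor:
    """Hypothetical stand-in for a base class that starts a poller in __init__."""

    def __init__(self):
        # Record what the environment looks like when the thread is started.
        self.slurm_seen_at_thread_start = "SLURM_JOB_ID" in os.environ
        self._poller = threading.Thread(target=self._poll, daemon=True)
        self._poller.start()
        # The subclass hook only runs after the thread is already running.
        self.__post_init__()

    def __post_init__(self):
        pass

    def _poll(self):
        pass  # a real poller would loop over job status checks


def delete_slurm_environment():
    # Assumed behaviour for this sketch: drop every SLURM_* variable.
    for key in list(os.environ):
        if key.startswith("SLURM_"):
            del os.environ[key]


class LateCleanup(BaseExecutor):
    """Old ordering: cleanup happens in __post_init__, after thread start."""

    def __post_init__(self):
        delete_slurm_environment()


class EarlyCleanup(BaseExecutor):
    """New ordering: cleanup happens before super().__init__()."""

    def __init__(self):
        delete_slurm_environment()
        super().__init__()


def demo():
    results = {}
    for cls in (LateCleanup, EarlyCleanup):
        os.environ["SLURM_JOB_ID"] = "12345"  # pretend we are inside a job
        results[cls.__name__] = cls().slurm_seen_at_thread_start
    return results
```

In this sketch `demo()` reports that `LateCleanup` starts its thread with `SLURM_JOB_ID` still set, while `EarlyCleanup` does not, which is the ordering difference the fix relies on.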

Summary by CodeRabbit

  • Bug Fixes
    • Cleaner SLURM environment detection and immediate cleanup during executor startup, with an earlier warning when a SLURM job context is present to improve job submission reliability.
  • Tests
    • Test suite updated to align with the revised executor initialization and warning behavior; expectations remain unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

@coderabbitai (Bot, Contributor) commented Apr 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d122b4d7-cf3a-40e6-8911-e207c23c22df

📥 Commits

Reviewing files that changed from the base of the PR and between 809a21e and bbe651a.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake_executor_plugin_slurm/__init__.py

Walkthrough

Executor now performs SLURM-job-context detection and calls delete_slurm_environment() during __init__ (before super().__init__), removing the previous warn_on_jobcontext and its delayed cleanup. Tests were updated to stop mocking the removed warning method.

Changes

| Cohort / File(s) | Summary |
|---|---|
| SLURM Executor Initialization: snakemake_executor_plugin_slurm/__init__.py | Added Executor.__init__(self, workflow, logger) that checks SLURM_JOB_ID, logs a warning, and calls delete_slurm_environment() before super().__init__. Removed warn_on_jobcontext and its __post_init__ invocation. Minor formatting change to the "PREEMPTED" warning string and simplified tuple assignment in check_active_jobs. |
| Tests: tests/test_cli.py | Removed mocks of Executor.warn_on_jobcontext in tests test_jobname_prefix_applied and test_jobname_prefix_validation; tests now rely on real initialization behavior (still patching uuid.uuid4 where applicable). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudged init early, sniffed SLURM in the air,

Swept the env away with a twitch and a care,
No sleepy delay, no post-time tumble,
Fresh start, light paws — the runtime won't grumble!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: pipeline hangs when submitting from compute nodes' directly and specifically addresses the core issue: preventing hangs when submitting from compute nodes by fixing the order of SLURM environment cleanup. |

@coderabbitai (Bot, Contributor) left a comment

🧹 Nitpick comments (1)
tests/test_cli.py (1)

37-50: Please add a regression test that hits Executor.__init__().

These tests still build the object with Executor.__new__() and call __post_init__() directly, so the moved cleanup path in Executor.__init__() is never exercised. Please add one test that instantiates Executor(...) with SLURM_JOB_ID set and patches RemoteExecutor.__init__() to assert the environment is already cleaned before base initialization.
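
A regression test along these lines might look like the sketch below. To stay self-contained it defines minimal stand-ins for `RemoteExecutor`, `Executor`, and `delete_slurm_environment()` instead of importing the plugin; those definitions, the omitted warning log, and the test name are assumptions for illustration, not the project's actual code. The patched base `__init__` records whether `SLURM_JOB_ID` is still present when it is called.

```python
import os
from unittest.mock import patch


class RemoteExecutor:
    """Hypothetical stand-in for the real base class."""

    def __init__(self, workflow, logger):
        self.workflow = workflow
        self.logger = logger


def delete_slurm_environment():
    # Assumed behaviour for this sketch: drop every SLURM_* variable.
    for key in list(os.environ):
        if key.startswith("SLURM_"):
            del os.environ[key]


class Executor(RemoteExecutor):
    """Stand-in mirroring the described fix: clean up, then call super()."""

    def __init__(self, workflow, logger):
        if "SLURM_JOB_ID" in os.environ:
            delete_slurm_environment()  # warning log omitted in this sketch
        super().__init__(workflow, logger)


def test_env_cleaned_before_base_init():
    seen = {}

    def fake_base_init(self, workflow, logger):
        # Runs in place of RemoteExecutor.__init__ and inspects the env.
        seen["slurm_job_id_present"] = "SLURM_JOB_ID" in os.environ

    # patch.dict restores os.environ when the block exits.
    with patch.dict(os.environ, {"SLURM_JOB_ID": "12345"}):
        with patch.object(RemoteExecutor, "__init__", fake_base_init):
            Executor(workflow=None, logger=None)

    assert seen == {"slurm_job_id_present": False}
```

Against the real plugin, the same pattern would patch the imported `RemoteExecutor.__init__` and build `Executor(...)` with whatever workflow and logger fixtures the existing tests already use.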

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 37 - 50, Add a regression test that
constructs the real Executor by calling Executor(...) (not using __new__ +
__post_init__) with SLURM_JOB_ID set in os.environ, and patch
RemoteExecutor.__init__ to assert that os.environ lacks SLURM_JOB_ID (i.e., the
cleanup in Executor.__init__ ran) before delegating to the original
RemoteExecutor.__init__; use the same test helpers as other tests (e.g.,
_make_executor or patch) and ensure the test fails if SLURM_JOB_ID is not
removed so the moved cleanup path in Executor.__init__ is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/test_cli.py`:
- Around line 37-50: Add a regression test that constructs the real Executor by
calling Executor(...) (not using __new__ + __post_init__) with SLURM_JOB_ID set
in os.environ, and patch RemoteExecutor.__init__ to assert that os.environ lacks
SLURM_JOB_ID (i.e., the cleanup in Executor.__init__ ran) before delegating to
the original RemoteExecutor.__init__; use the same test helpers as other tests
(e.g., _make_executor or patch) and ensure the test fails if SLURM_JOB_ID is not
removed so the moved cleanup path in Executor.__init__ is exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: de74c8aa-668c-4c49-9b3f-0ed28df0bb6c

📥 Commits

Reviewing files that changed from the base of the PR and between 7fa975f and 3d024a0.

📒 Files selected for processing (2)
  • snakemake_executor_plugin_slurm/__init__.py
  • tests/test_cli.py

@cmeesters (Member) commented

Thanks for this PR!

At the Snakemake Hackathon I noticed that even when unsetting all $SLURM... env vars before starting Snakemake within a job, all jobs are submitted with only one thread. I did not find the time to investigate. Are you experiencing the same issue? If not, what did you do? Perhaps we can profit from that experience.

@cmeesters (Member) commented

@jayhesselberth I am actually fine with this PR. Will you apply black on the code to fix the formatting issue?

What I meant by my last remark: if you have a sequence of commands that solves the start-within-job-context issue, I am eager to learn about it.

@jayhesselberth (Contributor, Author) commented

@cmeesters in our case, it was a combination of this fix and not having sacct set up correctly on some of our compute nodes (some couldn't talk to slurmdbctl).
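
One rough way to check the accounting side of this from a given node is to run `sacct` there and look at its exit status. The sketch below is an assumption-laden illustration, not part of the plugin: the helper name `sacct_reachable` is hypothetical, and it assumes that `sacct` exits with a non-zero status when it cannot reach the accounting database, which should be confirmed on the cluster in question.

```python
import subprocess


def sacct_reachable(run=subprocess.run, timeout=30):
    """Return True if a minimal sacct query succeeds on this node.

    Assumption: sacct exits non-zero when accounting storage is unreachable.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    # Minimal query: allocations only, no header, starting from now.
    cmd = ["sacct", "--allocations", "--noheader", "--starttime", "now"]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # sacct is not installed here, or the query hung.
        return False
    return result.returncode == 0
```

Calling `sacct_reachable()` inside an interactive job on each suspect node gives a quick yes/no before starting the workflow from there.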

@cmeesters self-requested a review April 17, 2026 08:44

@cmeesters (Member) left a comment

@jayhesselberth OK, I will fix the formatting before the next release, but I will merge this now.

@cmeesters cmeesters merged commit a09a027 into snakemake:main Apr 17, 2026
4 of 5 checks passed