fix: pipeline hangs when submitting from compute nodes by jayhesselberth · Pull Request #450 · snakemake/snakemake-executor-plugin-slurm

jayhesselberth · 2026-04-05T00:32:28Z

When running snakemake from within a SLURM job (e.g., an interactive session on a compute node), the pipeline would submit jobs but never detect their completion, hanging forever.

The RemoteExecutor base class starts a status-checking daemon thread in __init__ before __post_init__ is called. The SLURM plugin's warn_on_jobcontext() in __post_init__ would sleep 5 seconds and then delete SLURM environment variables, but by then the daemon thread had already started and would silently die after its first polling cycle.

Fix: move the SLURM environment detection and cleanup into __init__, before super().__init__() starts the daemon thread. Remove the now unnecessary warn_on_jobcontext() method and its 5-second sleep.
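The ordering issue described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyRemoteExecutor`, `EarlyCleanupExecutor`, and `demo` are hypothetical names, not the plugin's or Snakemake's classes, and the toy does not reproduce the actual failure mode of the status thread. It only shows that a subclass which cleans the environment before calling `super().__init__()` guarantees the worker thread never observes `SLURM_JOB_ID`.

```python
import os
import threading


class ToyRemoteExecutor:
    """Hypothetical stand-in for a base class that starts a daemon thread in __init__."""

    def __init__(self):
        self.seen_job_id = None
        self._done = threading.Event()
        # The worker starts here, before the __post_init__ hook below is called.
        self._thread = threading.Thread(target=self._poll_once, daemon=True)
        self._thread.start()
        self.__post_init__()

    def __post_init__(self):
        # Any cleanup placed here would run after the worker has already started.
        pass

    def _poll_once(self):
        # Record what the worker sees in the environment on its first cycle.
        self.seen_job_id = os.environ.get("SLURM_JOB_ID")
        self._done.set()

    def wait(self):
        self._done.wait(timeout=5)
        return self.seen_job_id


class EarlyCleanupExecutor(ToyRemoteExecutor):
    """Cleans the environment first, then lets the base class start its thread."""

    def __init__(self):
        if "SLURM_JOB_ID" in os.environ:
            # Assumption: cleanup means dropping every SLURM_* variable.
            for key in [k for k in os.environ if k.startswith("SLURM_")]:
                del os.environ[key]
        super().__init__()


def demo():
    # Simulate being inside a SLURM job, then build the executor.
    os.environ["SLURM_JOB_ID"] = "12345"
    executor = EarlyCleanupExecutor()
    return executor.wait()
```

Calling `demo()` returns `None`, because the variable is removed before the thread is created. With the cleanup in `__post_init__` instead, the result would depend on thread scheduling.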

Summary by CodeRabbit

Bug Fixes
- Cleaner SLURM environment detection and immediate cleanup during executor startup, with an earlier warning when a SLURM job context is present to improve job submission reliability.
Tests
- Test suite updated to align with the revised executor initialization and warning behavior; expectations remain unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

coderabbitai · 2026-04-05T00:32:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d122b4d7-cf3a-40e6-8911-e207c23c22df

📥 Commits

Reviewing files that changed from the base of the PR and between 809a21e and bbe651a.

📒 Files selected for processing (1)

snakemake_executor_plugin_slurm/__init__.py

🚧 Files skipped from review as they are similar to previous changes (1)

snakemake_executor_plugin_slurm/__init__.py

Walkthrough

Executor now performs SLURM-job-context detection and calls delete_slurm_environment() during __init__ (before super().__init__), removing the previous warn_on_jobcontext and its delayed cleanup. Tests were updated to stop mocking the removed warning method.

Changes

| Cohort / File(s) | Summary |
|---|---|
| SLURM Executor Initialization `snakemake_executor_plugin_slurm/__init__.py` | Added `Executor.__init__(self, workflow, logger)` that checks `SLURM_JOB_ID`, logs a warning, and calls `delete_slurm_environment()` before `super().__init__`. Removed `warn_on_jobcontext` and its `__post_init__` invocation. Minor formatting change to the `"PREEMPTED"` warning string and simplified tuple assignment in `check_active_jobs`. |
| Tests `tests/test_cli.py` | Removed mocks of `Executor.warn_on_jobcontext` in tests `test_jobname_prefix_applied` and `test_jobname_prefix_validation`; tests now rely on real initialization behavior (still patching `uuid.uuid4` where applicable). |
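As a rough illustration of the detect, warn, and clean sequence summarized in the table, the sketch below uses a hypothetical helper, `clean_slurm_job_context`. It is not the plugin's `delete_slurm_environment()`; the warning text and the assumption that every `SLURM_`-prefixed variable is removed are illustrative only. The helper takes the environment as a mapping so it can be exercised without touching `os.environ`.

```python
import logging
import os


def clean_slurm_job_context(environ, logger):
    """Hypothetical helper: detect a SLURM job context, warn, and strip SLURM_* keys.

    Returns True if a job context was detected and cleaned, False otherwise.
    """
    if "SLURM_JOB_ID" not in environ:
        return False
    # Illustrative wording; the plugin's actual message may differ.
    logger.warning(
        "Running inside a SLURM job context; removing SLURM_* environment variables."
    )
    # Collect keys first so the mapping is not mutated while iterating over it.
    for key in [k for k in environ if k.startswith("SLURM_")]:
        del environ[key]
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    clean_slurm_job_context(os.environ, logging.getLogger("sketch"))
```

In an executor, a call like this would sit at the top of `__init__`, ahead of `super().__init__(...)`.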

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudged init early, sniffed SLURM in the air,

Swept the env away with a twitch and a care,
No sleepy delay, no post-time tumble,
Fresh start, light paws — the runtime won't grumble!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: pipeline hangs when submitting from compute nodes' directly and specifically addresses the core issue: preventing hangs when submitting from compute nodes by fixing the order of SLURM environment cleanup. |

coderabbitai

🧹 Nitpick comments (1)

tests/test_cli.py (1)
37-50: Please add a regression test that hits Executor.__init__().

These tests still build the object with Executor.__new__() and call __post_init__() directly, so the moved cleanup path in Executor.__init__() is never exercised. Please add one test that instantiates Executor(...) with SLURM_JOB_ID set and patches RemoteExecutor.__init__() to assert the environment is already cleaned before base initialization.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 37 - 50, Add a regression test that
constructs the real Executor by calling Executor(...) (not using __new__ +
__post_init__) with SLURM_JOB_ID set in os.environ, and patch
RemoteExecutor.__init__ to assert that os.environ lacks SLURM_JOB_ID (i.e., the
cleanup in Executor.__init__ ran) before delegating to the original
RemoteExecutor.__init__; use the same test helpers as other tests (e.g.,
_make_executor or patch) and ensure the test fails if SLURM_JOB_ID is not
removed so the moved cleanup path in Executor.__init__ is exercised.
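A regression test along the lines requested could look roughly like the following. This is a sketch using hypothetical stand-in classes (`BaseExecutor` in place of `RemoteExecutor`, and a toy `Executor`), since the real constructor arguments and test helpers are not shown in this thread. It relies only on `unittest.mock.patch.object` and `unittest.mock.patch.dict` from the standard library.

```python
import os
from unittest.mock import patch


class BaseExecutor:
    """Hypothetical stand-in for RemoteExecutor."""

    def __init__(self, workflow, logger):
        self.workflow = workflow
        self.logger = logger


class Executor(BaseExecutor):
    """Hypothetical stand-in for the plugin's Executor with early cleanup."""

    def __init__(self, workflow, logger):
        if "SLURM_JOB_ID" in os.environ:
            # Assumption: cleanup removes every SLURM_* variable.
            for key in [k for k in os.environ if k.startswith("SLURM_")]:
                del os.environ[key]
        super().__init__(workflow, logger)


def test_env_cleaned_before_base_init():
    original_init = BaseExecutor.__init__
    calls = []

    def checking_init(self, workflow, logger):
        # Fails the test if cleanup has not run before base initialization.
        assert "SLURM_JOB_ID" not in os.environ
        calls.append(True)
        original_init(self, workflow, logger)

    # patch.dict restores os.environ on exit, even though the code under test
    # deletes keys from it.
    with patch.dict(os.environ, {"SLURM_JOB_ID": "12345"}):
        with patch.object(BaseExecutor, "__init__", checking_init):
            Executor(workflow=None, logger=None)

    # Confirm the patched base __init__ was actually reached exactly once.
    assert calls == [True]
```

Unlike building the object with `Executor.__new__()` and calling `__post_init__()` directly, this goes through the real constructor path, so moving the cleanup back out of `__init__` would make the assertion inside `checking_init` fail.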

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: de74c8aa-668c-4c49-9b3f-0ed28df0bb6c

📥 Commits

Reviewing files that changed from the base of the PR and between 7fa975f and 3d024a0.

📒 Files selected for processing (2)

snakemake_executor_plugin_slurm/__init__.py
tests/test_cli.py

cmeesters · 2026-04-05T07:16:40Z

Thanks for this PR!

At the Snakemake Hackathon I noticed that even when unsetting all $SLURM... env vars before starting Snakemake within a job, all jobs are submitted with only one thread. I did not find the time to investigate. Are you experiencing the same issue? If not, what did you do? Perhaps we can profit from that experience.

cmeesters · 2026-04-15T13:20:09Z

@jayhesselberth I am actually fine with this PR. Will you apply black on the code to fix the formatting issue?

What I meant by my last remark: if you have a sequence of commands that solves the start-within-job-context issue, I am eager to learn it.

jayhesselberth · 2026-04-15T15:48:36Z

@cmeesters in our case, it was a combination of this fix and not having sacct set up correctly on some of our compute nodes (some couldn't talk to slurmdbctl).

cmeesters

@jayhesselberth OK, I will fix the formatting prior to the next release, but I will merge it now.

coderabbitai Bot reviewed Apr 5, 2026

style: format with black

809a21e

cmeesters added 2 commits April 16, 2026 22:36

Merge branch 'main' into fix/compute-node-hang

bbe651a

Merge branch 'main' into fix/compute-node-hang

4cf6bff

cmeesters self-requested a review April 17, 2026 08:44

cmeesters approved these changes Apr 17, 2026

cmeesters merged commit a09a027 into snakemake:main Apr 17, 2026
4 of 5 checks passed

snakemake-bot mentioned this pull request Apr 17, 2026

chore(main): release 2.6.2 #453

Open

cmeesters merged 4 commits into snakemake:main from jayhesselberth:fix/compute-node-hang