Use case
nemo-skills' `RayExecutor` is a downstream `Executor` subclass that needs to be passable to `JobGroup`. Today `JobGroup.__post_init__` (around `nemo_run/run/job.py:265`) asserts `isinstance(self.executor, JobGroup.SUPPORTED_EXECUTORS)` where `SUPPORTED_EXECUTORS = [SlurmExecutor, DockerExecutor, LocalExecutor]`. Any downstream package adding a new `Executor` type (Ray, Kubernetes, etc.) hits this assertion and cannot construct a `JobGroup`.
Why this matters now
nemo-skills now ships `RayExecutor` and uses `JobGroup` for multi-script eval-generation flows (vLLM + sandbox + client co-located). The Ray multi-script path is a separate architectural concern (a single Ray submission is a single container, whereas `JobGroup` was designed for heterogeneous group semantics on Slurm). The immediate question is whether downstream `Executor` subclasses are a supported extension point at all.
Current workaround
(Marking this clearly as a workaround, not a proposed PR.) A downstream project patches the assertion at runtime via a class-name string sniff (`type(self.executor).__name__ == "RayExecutor"`) to avoid a circular import. The sniff is intentionally narrow, but matching on a class-name string is not idiomatic for upstream.
Proposed designs
Please indicate preference before we send a PR.
1. `SUPPORTED_EXECUTORS` extension hook
Downstream packages register their Executor subclass at import time:
```python
from nemo_run.run.job import JobGroup
JobGroup.SUPPORTED_EXECUTORS = (*JobGroup.SUPPORTED_EXECUTORS, RayExecutor)
```
- Pro: explicit, discoverable.
- Con: requires downstream packages to mutate a class attribute, which feels brittle.
2. Sentinel attribute on the Executor subclass
Downstream marks compatibility:
```python
class RayExecutor(Executor):
    # Opt-in marker that JobGroup would look for
    _jobgroup_compatible = True
```
…and `__post_init__` checks `getattr(executor, "_jobgroup_compatible", False)` in addition to `isinstance(executor, SUPPORTED_EXECUTORS)`.
- Pro: no mutation of upstream state.
- Con: requires JobGroup to know about the sentinel.
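A minimal sketch of how the combined check in option 2 could look. The classes here are local stand-ins, not the real nemo_run types, and `is_jobgroup_compatible` is a hypothetical helper name used only for illustration:

```python
class Executor:
    """Stand-in for the upstream Executor base class."""


class LocalExecutor(Executor):
    """Stand-in for an executor that upstream already lists."""


class RayExecutor(Executor):
    """Stand-in for a downstream executor that opts in via the sentinel."""

    _jobgroup_compatible = True


class UnmarkedExecutor(Executor):
    """Stand-in for a downstream executor that has not opted in."""


# A tuple, so it can be passed directly as the second argument to isinstance().
SUPPORTED_EXECUTORS = (LocalExecutor,)


def is_jobgroup_compatible(executor: Executor) -> bool:
    """Hypothetical helper: accept listed types or sentinel-marked subclasses."""
    if isinstance(executor, SUPPORTED_EXECUTORS):
        return True
    # Missing attribute is treated as "not compatible".
    return bool(getattr(executor, "_jobgroup_compatible", False))
```

With this shape, upstream keeps its explicit list and downstream only sets one class attribute; the cost is that the attribute name becomes part of the public contract.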
3. `Executor.supports_job_group()` classmethod on the base
Defaults to `False`, overridable downstream. Same shape as option 2 but more discoverable in IDEs.
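A rough sketch of option 3 under the same caveat: the classes are stand-ins, and `check_executor` is a hypothetical function standing in for the assertion in `__post_init__`. Whether the built-in executors would override the classmethod or stay in `SUPPORTED_EXECUTORS` is an open design choice; this sketch assumes they override it.

```python
class Executor:
    """Stand-in for the upstream Executor base class."""

    @classmethod
    def supports_job_group(cls) -> bool:
        # Assumed default: subclasses are not JobGroup-compatible unless they opt in.
        return False


class SlurmExecutor(Executor):
    """Stand-in for a built-in executor that opts in."""

    @classmethod
    def supports_job_group(cls) -> bool:
        return True


class RayExecutor(Executor):
    """Stand-in for a downstream executor that opts in."""

    @classmethod
    def supports_job_group(cls) -> bool:
        return True


class OtherExecutor(Executor):
    """Stand-in for a subclass that keeps the default."""


def check_executor(executor: Executor) -> None:
    """Hypothetical replacement for the isinstance assertion."""
    if not type(executor).supports_job_group():
        raise TypeError(f"Unsupported executor type: {type(executor).__name__}")
```

Here `check_executor(RayExecutor())` passes silently, while `check_executor(OtherExecutor())` raises `TypeError`.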
Note on JobGroup.launch path
Even with the assertion relaxed, `JobGroup.launch` calls `nemo_run.run.torchx_backend.launcher.launch(executor=...)`, which routes through `EXECUTOR_MAPPING` in `torchx_backend/schedulers/api.py:30`. That mapping has no Ray entry, so `get_executor_str(RayExecutor)` raises `KeyError`. Relaxing the assertion is therefore necessary but not sufficient for a Ray multi-script JobGroup to launch end-to-end. The pragmatic answer for the multi-script case is a multi-pool architecture (pre-host components in separate Ray submissions; collapse multi-script to single-script), but the assertion remains too strict in principle for any downstream Executor subclass.
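The failure mode can be illustrated with a toy mapping. This is not the upstream code: the classes are stand-ins, the scheduler string is a placeholder, and the assumption that the lookup is keyed on the executor's type is ours.

```python
class Executor:
    """Stand-in for the upstream Executor base class."""


class SlurmExecutor(Executor):
    """Stand-in for an executor that has a mapping entry."""


class RayExecutor(Executor):
    """Stand-in for a downstream executor with no mapping entry."""


# Placeholder value; the real scheduler names are not reproduced here.
EXECUTOR_MAPPING = {SlurmExecutor: "slurm-placeholder"}


def get_executor_str(executor: Executor) -> str:
    """Toy lookup, assumed to be keyed on the executor's exact type."""
    return EXECUTOR_MAPPING[type(executor)]


def can_launch(executor: Executor) -> bool:
    """Report whether the toy mapping has a scheduler for this executor."""
    try:
        get_executor_str(executor)
    except KeyError:
        return False
    return True
```

In this toy version a `RayExecutor` instance passes any relaxed construction check but `can_launch` still returns `False`, which mirrors the two-step gap described above.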
Reference
Prior PR #410 was the last touch on `nemo_run/run/job.py`. Searching closed issues for "Unsupported executor type" returned 0 hits, so this is a fresh report.
Ask
Which of the three designs (or a fourth) would you accept as a PR? Happy to send code once direction is confirmed.