(#2615) use NV_GPU for visible device list #2718

Merged: bghira merged 1 commit into main from bugfix/klein2-early-oom on May 19, 2026

@bghira bghira commented May 19, 2026

Closes #2615

This pull request changes how the training workflow selects devices and launches processes, and adds tests for the new behavior. The main changes are normalizing device selection logic, respecting provider GPU assignments, preventing duplicate accelerator flags, and expanding test coverage for these cases.

Device selection and environment handling:

  • Added a _normalize_visible_device_list helper that parses and validates device lists read from environment variables, so device selection behaves consistently.
  • Updated the process launch logic to fall back to the normalized provider GPU assignment (from NV_GPU or NVIDIA_VISIBLE_DEVICES) when CUDA_VISIBLE_DEVICES is unset.
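The fallback described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the function names, the handling of the special values `all`, `none`, and `void`, the de-duplication step, and the order in which NV_GPU and NVIDIA_VISIBLE_DEVICES are checked are all assumptions.

```python
from typing import Mapping, Optional


def normalize_visible_device_list(raw: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for _normalize_visible_device_list.

    Returns a cleaned comma-separated device list, or None when the
    value cannot be used as an explicit device list.
    """
    if raw is None:
        return None
    # Split on commas and drop empty entries and surrounding whitespace.
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        return None
    # Assumption: container-runtime keywords are not explicit device lists.
    if any(t.lower() in {"all", "none", "void"} for t in tokens):
        return None
    # Assumption: duplicates are dropped while preserving order.
    unique = []
    for token in tokens:
        if token not in unique:
            unique.append(token)
    return ",".join(unique)


def resolve_visible_devices(env: Mapping[str, str]) -> Optional[str]:
    """Prefer CUDA_VISIBLE_DEVICES; otherwise use a provider assignment."""
    if "CUDA_VISIBLE_DEVICES" in env:
        # An explicit setting always wins and is passed through untouched.
        return env["CUDA_VISIBLE_DEVICES"]
    # Assumed lookup order for the provider variables.
    for name in ("NV_GPU", "NVIDIA_VISIBLE_DEVICES"):
        normalized = normalize_visible_device_list(env.get(name))
        if normalized is not None:
            return normalized
    return None
```

Passing the environment as a mapping, rather than reading `os.environ` directly, keeps the sketch easy to exercise in tests, for example `resolve_visible_devices({"NV_GPU": "2,3"})`.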

Accelerate launch command improvements:

  • Added logic to automatically append --multi_gpu to the accelerate launch command when multiple processes are requested, unless a mutually exclusive accelerator selector (like --use_fsdp) is already present in extra args, preventing duplicate or conflicting flags.
  • Refactored the code for parsing extra accelerator arguments and selector detection, improving maintainability and correctness.
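A minimal sketch of the flag-appending rule above, under stated assumptions: the function names, the `EXCLUSIVE_SELECTORS` tuple (only `--use_fsdp` is named in this PR; `--use_deepspeed` is added here as a plausible example), the argument ordering, and the script name are hypothetical and do not come from the merged code.

```python
import shlex
from typing import List, Sequence

# Assumed set of mutually exclusive accelerator selectors.
EXCLUSIVE_SELECTORS = ("--multi_gpu", "--use_fsdp", "--use_deepspeed")


def has_accelerator_selector(extra_args: Sequence[str]) -> bool:
    """Return True if any parsed argument is an exclusive selector."""
    for arg in extra_args:
        # Compare only the flag name, so "--flag=value" forms still match.
        if arg.split("=", 1)[0] in EXCLUSIVE_SELECTORS:
            return True
    return False


def build_launch_command(num_processes: int, extra_args: str, script: str) -> List[str]:
    """Build an accelerate launch command without duplicate selectors."""
    parsed = shlex.split(extra_args)
    cmd = ["accelerate", "launch", f"--num_processes={num_processes}"]
    # Add --multi_gpu only for multi-process runs with no selector already set.
    if num_processes > 1 and not has_accelerator_selector(parsed):
        cmd.append("--multi_gpu")
    cmd.extend(parsed)
    cmd.append(script)
    return cmd
```

Including `--multi_gpu` in the selector tuple means a user who already passes it through extra args does not end up with the flag twice.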

Test enhancements:

  • Refactored and expanded tests in tests/test_trainer.py to cover single- and multi-GPU selection, provider GPU assignment fallback, and prevention of duplicate accelerator flags.

@bghira bghira merged commit 01b0a8e into main May 19, 2026
3 checks passed
@bghira bghira deleted the bugfix/klein2-early-oom branch May 19, 2026 21:00

Development

Successfully merging this pull request may close these issues.

Multi GPU Klein9B training early OOM

2 participants