Releases: deepspeedai/DeepSpeed
v0.18.9 Patch Release
What's Changed
- Respect `$TRITON_HOME` by @Flamefire in #7907
- Add Feature Universal Checkpoint for AutoTP by @nathon-lee in #7908
- fix: remove unnecessary shell=True in ROCm GPU architecture detection by @instantraaamen in #7915
- Don't detect local GPU if `$DS_IGNORE_CUDA_DETECTION` is set by @Flamefire in #7896
- Add HuggingFace tp_plan support for AutoTP by @delock in #7901
- fix: handle non-existent path in is_nfs_path for Triton autotune cache by @Krishnachaitanyakc in #7921
- Fix backward compatibility of torch.amp.custom_fwd for PyTorch < 2.4 by @tohtana in #7920
- Extending Muon Optimizer Support for ZeRO Stage 3 by @PKUWZP in #7919
- Add news item for ASPLOS 2026 Best Paper Award by @PKUWZP in #7923
- fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905) by @xylian86 in #7906
- AGENTS.md: Add pre-commit command to existing CI requirements line by @delock in #7930
- Update README with latest news from DeepSpeed by @PKUWZP in #7931
- Merging AutoSP into DeepSpeed by @neeldani in #7860
- Add fallback to full test by @tohtana in #7933
- Remove Microsoft Corporation copyright from AGENTS.md and CLAUDE.md by @PKUWZP in #7932
- Update version.txt for latest incoming release 0.18.9 by @loadams in #7935
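The `$DS_IGNORE_CUDA_DETECTION` change above makes local GPU detection opt-out via an environment variable. A minimal sketch of that pattern (the function name is illustrative, not DeepSpeed's actual API):

```python
import os

def should_detect_gpu(env=None):
    # Hypothetical sketch: an env-var kill switch disables local
    # GPU detection entirely when DS_IGNORE_CUDA_DETECTION is set.
    env = os.environ if env is None else env
    return "DS_IGNORE_CUDA_DETECTION" not in env
```

This style of switch is useful on build or login nodes that have no GPUs but must still install and import the package.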
New Contributors
- @instantraaamen made their first contribution in #7915
- @Krishnachaitanyakc made their first contribution in #7921
- @neeldani made their first contribution in #7860
Full Changelog: v0.18.8...v0.18.9
v0.18.8 Patch Release
What's Changed
- Suppress see_memory_usage logs by @sfc-gh-truwase in #7891
- [Bloom] Fix hangs of bloom test by @k-artem in #7890
- double reduction user-friendly error by @stas00 in #7895
- Fix async_io ops building error on Huawei Ascend NPU by @huangyifan0610 in #7894
- Fix Evoformer's multi-arch dispatch root cause by @tohtana in #7881
- fix: Validate fp16.loss_scale is finite and non-negative by @nathon-lee in #7889
- Add AGENTS.md and CLAUDE.md with project rules for AI coding agents by @delock in #7902
- fix(zero3): use current_stream() instead of default_stream() for grad… by @michaelroyzen in #7898
- Update version by @loadams in #7903
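The `fp16.loss_scale` validation in #7889 guards against configs that would silently break fp16 training. A hedged sketch of that kind of check (names are illustrative, not the PR's actual code):

```python
import math

def validate_loss_scale(loss_scale):
    # Hypothetical sketch: reject non-numeric, NaN/inf, and
    # negative loss scales before fp16 training starts.
    if isinstance(loss_scale, bool) or not isinstance(loss_scale, (int, float)):
        raise TypeError("loss_scale must be a number")
    if not math.isfinite(loss_scale) or loss_scale < 0:
        raise ValueError(f"invalid loss_scale: {loss_scale}")
    return float(loss_scale)
```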
New Contributors
- @huangyifan0610 made their first contribution in #7894
- @michaelroyzen made their first contribution in #7898
Full Changelog: v0.18.7...v0.18.8
v0.18.7 Patch Release
What's Changed
- Update version post release by @loadams in #7850
- Z1/2 init: flatten params on device by @ksugama in #7828
- Enable shm_comm support for arm by @phalani-paladugu in #7800
- Add news entry for DeepSpeed updates by @PKUWZP in #7854
- Add EXAONE 4.0 model support for Inference V2 by @Bias92 in #7853
- Fix ROCm BF16 conversion intrinsics in inference v2 (#7843) by @tohtana in #7846
- Fix compilation of Evoformer by @Flamefire in #7862
- Throw error when parameter is modified in GatheredParameters by @tohtana in #7832
- Fix Zero-3 static scale assertion in fp16 test by @tohtana in #7866
- Schedule nightly full test by @tohtana in #7870
- Fix broken links and add AutoTP Training tutorial to sidebar nav by @tohtana in #7874
- fix: replace 35 bare `except` clauses with `except Exception` by @haosenwang1018 in #7873
- perf: use deque for FIFO queues in sequence parallel, superoffload, and compile by @giulio-leone in #7880
- Fix: only add parameter with grads to parameter group by @delock in #7869
- Fix no-grad grad-fn lookup in ZeRO hook counting on PyTorch 2.3 (#7830) by @tohtana in #7841
- Fix import deepspeed crash on PyTorch v2.3 + Python 3.12 by @tohtana in #7875
- XPU use stock pytorch instead of Intel Extension for PyTorch by @delock in #7877
- Remove amp() from abstract accelerator by @delock in #7879
- Add document section explaining autocast nesting by @tohtana in #7883
- Fix hook count performance regression from v0.18.5 by @tohtana in #7886
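The deque change in #7880 swaps list-based FIFO queues for `collections.deque`. A small sketch of why that matters:

```python
from collections import deque

# list.pop(0) shifts every remaining element (O(n) per pop);
# deque.popleft() is O(1), which matters for long FIFO queues
# such as pending-work lists in sequence parallel or offloading.
queue = deque()
for item in ("fwd-0", "fwd-1", "fwd-2"):
    queue.append(item)
first = queue.popleft()  # oldest item comes out first
```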
New Contributors
- @ksugama made their first contribution in #7828
- @phalani-paladugu made their first contribution in #7800
- @Bias92 made their first contribution in #7853
- @haosenwang1018 made their first contribution in #7873
- @giulio-leone made their first contribution in #7880
Full Changelog: v0.18.6...v0.18.7
v0.18.6 Patch Release
What's Changed
- Update version.txt to 0.18.6 after latest release by @loadams in #7826
- Fix leaf module race condition by @tohtana in #7825
- Skip sequence parallel operations during eval by @jp1924 in #7821
- Support custom partitioning patterns for AutoTP by @tohtana in #7806
- Fix gradient is ready with z2 by @sfc-gh-truwase in #7829
- Fix AutoTP custom patterns: respect use_default_specs by @tohtana in #7827
- Support new python 3.14 annotation handling by @sdvillal in #7831
- fix: replace deprecated fractions.gcd with math.gcd by @Mr-Neutr0n in #7845
- Fix bf16 gradient norm divergence with ZeRO stage 0 by @tohtana in #7839
- Replace torch.jit.script with torch.compile (#7835) by @tohtana in #7840
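The `fractions.gcd` fix in #7845 matters because that function was removed in Python 3.9; `math.gcd` is the standard replacement. A minimal sketch:

```python
import math
from functools import reduce

# fractions.gcd was deprecated in Python 3.5 and removed in 3.9;
# math.gcd is the drop-in replacement. reduce() folds it over a
# sequence to get the GCD of more than two values.
def gcd_of(values):
    return reduce(math.gcd, values)
```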
New Contributors
- @jp1924 made their first contribution in #7821
- @Mr-Neutr0n made their first contribution in #7845
Full Changelog: v0.18.5...v0.18.6
v0.18.5 Patch Release
What's Changed
- Update version.txt after 0.18.4 release by @loadams in #7765
- Various fixes to run on mps by @jeffra in #7767
- Update workflow trigger by @tohtana in #7768
- fix: delete using namespace std. by @nathon-lee in #7766
- fix: update Megatron-DeepSpeed tutorial to match current repo structure by @nathon-lee in #7761
- Add timeout to test workflows by @tohtana in #7774
- Remove cron/PR triggers for outdated V100 tests by @loadams in #7777
- [Docs] Fix `docs/_pages/config-json.md` format by @ooooo-create in #7779
- Update CLA to refer to DCO by @loadams in #7778
- Fix multiprocessing testcase by @k-artem in #7743
- fix: skip compressed allreduce for empty tensors by @T1mn in #7769
- docs: update README.md by @eltociear in #7781
- Fix gradient checkpointing with use_reentrant=True / PyTorch-style backward / ZeRO-3 by @tohtana in #7780
- Fix Ulysses PEFT test by @tohtana in #7784
- Fix Evoformer compilation by @sdvillal in #7760
- fix checkpointing/loading of z0+bf16 by @tohtana in #7786
- Add sequential allgather optimization for ZeRO-3 by @aeeeeeep in #7661
- Fix AutoTP test numerical tolerance with rtol by @tohtana in #7794
- Fix backward for pipeline engine by @tohtana in #7787
- Skip empty parameters in gradient reduction by @tohtana in #7789
- Fix issue with BF16 optimizer selection by @tohtana in #7788
- Fix BF16_Optimizer being used without ZeRO by @tohtana in #7790
- Add full test suite workflow by @tohtana in #7795
- Fix Muon optimizer module path by @tohtana in #7802
- Fix ping-pong buffer index reset and removing redundant stream sync by @undersilence in #7805
- Fix ZeRO stage to choose BF16 optimizer in test by @tohtana in #7803
- Run Evoformer tests sequentially by @tohtana in #7810
- Improve engine's cleanup by @tohtana in #7813
- Ignore evoformer test by @tohtana in #7815
- Fix typos in accelerator setup guide by @nathon-lee in #7818
- Raise clear error on in-place GatheredParameters edits without modifier_rank by @tohtana in #7817
- [Bugfix] Resolve Rank index out of range during BWD when sp_size < world_size in Ulysses by @Flink-ddd in #7809
- Update PyTorch to v2.9 for modal tests by @tohtana in #7816
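The AutoTP test fix in #7794 switches a numerical comparison to a relative tolerance. A sketch of the idea (illustrative, not the test's actual code):

```python
import math

# Hypothetical sketch: a relative tolerance scales with the
# magnitude of the expected value, unlike a fixed absolute bound,
# so large activations are not held to an unrealistically tight
# absolute threshold.
def allclose_rtol(actual, expected, rtol=1e-3):
    return all(math.isclose(a, e, rel_tol=rtol)
               for a, e in zip(actual, expected))
```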
New Contributors
- @ooooo-create made their first contribution in #7779
- @T1mn made their first contribution in #7769
- @sdvillal made their first contribution in #7760
- @undersilence made their first contribution in #7805
Full Changelog: v0.18.4...v0.18.5
v0.18.4 Patch Release
What's Changed
- Update version by @sfc-gh-truwase in #7719
- Disable deterministic option in compile tests by @tohtana in #7720
- Fix SuperOffloadOptimizer_Stage3 crash due to missing param_names parameter by @ImaGoodFella in #7715
- [AMD][ROCm] Improve support of AMD by @k-artem in #7448
- fix typo by @stas00 in #7722
- Skip none in backward hook by @tohtana in #7725
- [Engine] Only scale gradients if scale_wrt_gas is True by @kashif in #7724
- Fix testcases that depend on Triton by @k-artem in #7731
- Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL by @xylian86 in #7727
- Fix #7733: Replace torch.sqrt with math.sqrt in scale_lr for sqrt method by @Rakshit-gen in #7735
- replace moe checkpoint dp_world_size with seq_dp_world_size by @wukong1992 in #7732
- [BUG] Fix UlyssesSPAttentionHF.register_with_transformers() crash with PEFT models by @Rakshit-gen in #7737
- Add core api update blog by @tohtana in #7738
- Fix Nebula checkpoint engine commit() API mismatch by @Rakshit-gen in #7740
- Fix DecoupledCheckpointEngine deadlock and improve reliability by @Rakshit-gen in #7742
- Fix OnebitLamb NaN propagation with empty parameters by @Rakshit-gen in #7736
- fix: remove premature MPI environment variable check in OpenMPIRunner by @leejianwoo-collab in #7751
- Enable python 3.11 and 3.12 tests by @loadams in #7007
- Add CI workflow to run tests on AWS by @tohtana in #7753
- Add fallback to BF16 support check by @tohtana in #7754
- Fix DeepCompile for PyTorch 2.8/2.9 compatibility by @tohtana in #7755
- Removed amp testcases by @k-artem in #7745
- fix: avoid IndexError in BF16_Optimizer.destroy() when using DummyOptim by @leejianwoo-collab in #7763
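The #7735 fix replaces `torch.sqrt` with `math.sqrt` when scaling a plain-float learning rate. A hedged sketch (the function name and signature are illustrative, not DeepSpeed's actual API):

```python
import math

# Hypothetical sketch: for a Python float, math.sqrt avoids
# constructing a tensor just to take a square root; torch.sqrt
# expects a tensor argument and would fail on a plain float.
def scale_lr(base_lr, world_size, method="sqrt"):
    if method == "sqrt":
        return base_lr * math.sqrt(world_size)
    return base_lr * world_size
```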
New Contributors
- @ImaGoodFella made their first contribution in #7715
- @k-artem made their first contribution in #7448
- @kashif made their first contribution in #7724
- @Rakshit-gen made their first contribution in #7735
- @leejianwoo-collab made their first contribution in #7751
Full Changelog: v0.18.3...v0.18.4
v0.18.3 Patch Release
What's Changed
- Update version.txt after release by @loadams in #7675
- [modal ci] fixes by @stas00 in #7676
- leaf modules: explain better by @stas00 in #7674
- disable nv-lightning-v100.yml CI by @stas00 in #7681
- allow separate learning rates "muon_lr" and "adam_lr" for the Muon optimizer by @delock in #7658
- see_mem_usage: make always work by @stas00 in #7688
- make debug utils more resilient by @stas00 in #7690
- zero stage 1-2: don't pin memory if not configured by @stas00 in #7689
- modal ci: fix group concurrency by @stas00 in #7691
- Use pytorch utils to detect ninja by @Emrys-Merlin in #7687
- Update SECURITY.md to point to GitHub reporting rather than Microsoft by @loadams in #7692
- Add Qwen2.5 to AutoTP model list by @delock in #7696
- Trust intel server for XPU tests by @tohtana in #7698
- PyTorch-compatible backward API by @tohtana in #7665
- Add news about Ray x DeepSpeed Meetup by @PKUWZP in #7704
- Put Muon optimizer momentum buffer on GPU by @delock in #7648
- [ROCm] Relax tolerances for FP8 unit test for fp16 and bf16 cases by @rraminen in #7655
- Fix ds_secondary_tensor being left dirty when loading a model or ZeRO checkpoint with ZeRO++ by @zhengchenyu in #7707
- fix: skip aio wait when swap tensors is empty by @xylian86 in #7712
- Low-precision master params/grads/optimizer states by @tohtana in #7700
- Enabled compiled autograd for backward pass by @deepcharm in #7667
- Wall clock timers API by @sfc-gh-truwase in #7714
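The separate "muon_lr" / "adam_lr" support in #7658 (and keeping embeddings and lm_head off Muon, per #7641) amounts to routing parameters into distinct optimizer param groups. A hedged sketch under those assumptions (names and routing rule are illustrative):

```python
# Hypothetical sketch: split named parameters into two optimizer
# param groups so Muon and Adam can each get their own learning
# rate; embedding and lm_head parameters stay on Adam.
def build_param_groups(named_params, muon_lr=0.02, adam_lr=1e-3):
    muon = {"params": [], "lr": muon_lr}
    adam = {"params": [], "lr": adam_lr}
    for name, param in named_params:
        target = adam if ("embed" in name or "lm_head" in name) else muon
        target["params"].append(param)
    return [muon, adam]
```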
New Contributors
- @Emrys-Merlin made their first contribution in #7687
Full Changelog: v0.18.2...v0.18.3
v0.18.2 Patch Release
What's Changed
- Update version after 0.18.1 release by @loadams in #7647
- Deduplicate fp32 weights under torch autocast and ZeRO3 by @eternalNight in #7651
- ulysses mpu: additional api by @stas00 in #7649
- ALST/UlyssesSP: more intuitive API wrt variable seqlen by @stas00 in #7656
- Fix misplaced overflow handling return in fused_optimizer.py by @rraminen in #7645
- [bug]: fixed comm_dtype in extra_large_param_to_reduce by @therealnaveenkamal in #7660
- UlyssesSP: TiledMLP doc - recomputes forward twice by @stas00 in #7664
- resolved a 0-dim tensor slicing bug from _get_state_without_padding by @therealnaveenkamal in #7659
- Fix typo in pytorch-profiler.md documentation by @kunheek in #7652
- README refresh by @sfc-gh-truwase in #7668
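The 0-dim slicing fix in #7659 reflects that optimizer state entries can be scalars (which cannot be sliced) as well as padded flat vectors. A hedged sketch of the guard, using plain Python values in place of tensors:

```python
# Hypothetical sketch: optimizer state entries may be scalars
# (e.g. a step counter, not sliceable) or flat vectors padded to
# an alignment boundary; only the latter should have trailing
# padding stripped.
def strip_padding(state, numel):
    if isinstance(state, (int, float)):
        return state  # 0-dim scalar state: nothing to slice
    return state[:numel]
```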
Full Changelog: v0.18.1...v0.18.2
v0.18.1 Patch Release
What's Changed
- Add ZenFlow code for Stage 3 by @JoshWoo2003 in #7516
- [XPU][CI] recover xpu-max1100 workflow by @Liangliang-Ma in #7630
- Take **kwargs in init of DeepSpeedZeroOptimizer subclasses by @eternalNight in #7634
- add support for tensor learning rate (vs scalar) by @NirSonnenschein in #7633
- Fix illegal memory access with multi_tensor_apply size above INT_MAX by @wangyan-mms in #7639
- No Muon optimizer for embedding and lm_head layers by @delock in #7641
- z2: report param name and not zero id in assert by @stas00 in #7637
- z2: don't pass `dtype` to `report_ipg_memory_usage` by @stas00 in #7636
- Ulysses HF Accelerate integration by @stas00 in #7638
- Add DataStates-LLM: Asynchronous Checkpointing Engine Support by @mauryaavinash95 in #7166
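The tensor learning-rate support in #7633 means a learning rate may now arrive as a 0-dim tensor rather than a scalar. A hedged sketch of normalizing both forms (illustrative, not the PR's actual code):

```python
# Hypothetical sketch: accept either a Python scalar or a 0-dim
# tensor-like value (anything exposing .item()) as a learning
# rate, normalizing to a plain float.
def resolve_lr(lr):
    return float(lr.item()) if hasattr(lr, "item") else float(lr)
```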
New Contributors
- @JoshWoo2003 made their first contribution in #7516
- @wangyan-mms made their first contribution in #7639
Full Changelog: v0.18.0...v0.18.1
v0.18.0
What's Changed
- Update version.txt post 0.17.6 release by @loadams in #7572
- DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling… by @juyterman1000 in #7489
- logging: Also set log level of logger handlers by @eternalNight in #7576
- Deepcompile: Fix bugs when applying deepcompile to VLA-like models by @eternalNight in #7569
- Broadcast fp16 overflow in Z1 by @sfc-gh-truwase in #7580
- Deepcompile: Make size of activation to free configurable by @eternalNight in #7582
- SuperOffload Release by @xylian86 in #7559
- Include init file for superoffload folder by @nguyen599 in #7591
- disables ZeRO checkpoint loading path when stage=0 by @therealnaveenkamal in #7586
- Simplify leaf module hook by @tohtana in #7592
- Fix the universal checkpoint issue for stage3 when there are multiple subgroups. by @zhengchenyu in #7585
- Change current_device() to current_device_name() by @delock in #7600
- Fixed a universal checkpoint loading error in multi-machine mode by @zhengchenyu in #7601
- DeepCompile: Specify tensor aliasing in C++ op schema by @eternalNight in #7597
- DeepCompile: Fuse allgather and downcast by @eternalNight in #7588
- Add blog for SuperOffload by @xylian86 in #7594
- Add venv to .gitignore by @zhengchenyu in #7605
- Handle the case of DeepCompile's enabled but not activated by @tohtana in #7603
- DeepCompile: Fix IPG bucket clearing by @eternalNight in #7610
- Minor fix in the SuperOffload blog by @xylian86 in #7612
- Fixed the issue where a universal checkpoint cannot be loaded for stage 3 when the world size is expanded by @zhengchenyu in #7599
- Fixed save_checkpoint race when consolidating NVMe offloaded tensors by @H1manshu21 in #7613
- [wall_clock_breakdown] always log stats when enabled by @stas00 in #7617
- DeepCompile: Use min_cut_rematerialization for partitioning joint graphs by @eternalNight in #7609
- Show mismatching values when DeepCompile test fails by @tohtana in #7618
- Improve leaf module interface (enable via config, relax matching criteria, add document, etc.) by @tohtana in #7604
- add print_dist util by @stas00 in #7621
- Super offload blog Chinese version by @delock in #7620
- Enable grad scaler for ZeRO-0 + torch.autocast path by @tohtana in #7619
- Blog of zenflow binding study by @delock in #7614
- Clarify document of leaf module config by @tohtana in #7623
- [TiledMLP] moe support by @stas00 in #7622
- Update email address by @sfc-gh-truwase in #7624
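The fp16 overflow broadcast in #7580 ensures every ZeRO-1 rank agrees on whether to skip a step. A hedged sketch of the logic, with a plain `any()` standing in for the actual collective:

```python
import math

# Hypothetical sketch: each rank checks its own gradients for
# non-finite values, then the flags are OR-combined so all ranks
# make the same skip/step decision. A real implementation would
# broadcast or all-reduce the flag; any() stands in for that here.
def local_overflow(grads):
    return any(not math.isfinite(g) for g in grads)

def global_overflow(per_rank_grads):
    return any(local_overflow(g) for g in per_rank_grads)
```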
New Contributors
- @juyterman1000 made their first contribution in #7489
- @nguyen599 made their first contribution in #7591
- @zhengchenyu made their first contribution in #7585
- @H1manshu21 made their first contribution in #7613
Full Changelog: v0.17.6...v0.18.0