Releases: deepspeedai/DeepSpeed
v0.18.9 Patch Release
What's Changed
- Respect `$TRITON_HOME` by @Flamefire in #7907
- Add Feature Universal Checkpoint for AutoTP by @nathon-lee in #7908
- fix: remove unnecessary shell=True in ROCm GPU architecture detection by @instantraaamen in #7915
- Don't detect local GPU if `$DS_IGNORE_CUDA_DETECTION` is set by @Flamefire in #7896
- Add HuggingFace tp_plan support for AutoTP by @delock in #7901
- fix: handle non-existent path in is_nfs_path for Triton autotune cache by @Krishnachaitanyakc in #7921
- Fix backward compatibility of torch.amp.custom_fwd for PyTorch < 2.4 by @tohtana in #7920
- Extending Muon Optimizer Support for ZeRO Stage 3 by @PKUWZP in #7919
- Add news item for ASPLOS 2026 Best Paper Award by @PKUWZP in #7923
- fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905) by @xylian86 in #7906
- AGENTS.md: Add pre-commit command to existing CI requirements line by @delock in #7930
- Update README with latest news from DeepSpeed by @PKUWZP in #7931
- Merging AutoSP into DeepSpeed by @neeldani in #7860
- Add fallback to full test by @tohtana in #7933
- Remove Microsoft Corporation copyright from AGENTS.md and CLAUDE.md by @PKUWZP in #7932
- Update version.txt for latest incoming release 0.18.9 by @loadams in #7935
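The `$DS_IGNORE_CUDA_DETECTION` change above makes local GPU detection opt-out via an environment variable. A minimal sketch of that pattern (the function name is illustrative, not DeepSpeed's actual API):

```python
import os

def should_detect_gpu(env=None):
    # Hypothetical sketch: an env-var kill switch disables local
    # GPU detection entirely when DS_IGNORE_CUDA_DETECTION is set.
    env = os.environ if env is None else env
    return "DS_IGNORE_CUDA_DETECTION" not in env
```

This style of switch is useful on build or login nodes that have no GPUs but must still install and import the package.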
New Contributors
- @instantraaamen made their first contribution in #7915
- @Krishnachaitanyakc made their first contribution in #7921
- @neeldani made their first contribution in #7860
Full Changelog: v0.18.8...v0.18.9
v0.18.8 Patch Release
What's Changed
- Suppress see_memory_usage logs by @sfc-gh-truwase in #7891
- [Bloom] Fix hangs of bloom test by @k-artem in #7890
- double reduction user-friendly error by @stas00 in #7895
- Fix async_io ops building error on Huawei Ascend NPU by @huangyifan0610 in #7894
- Fix Evoformer's multi-arch dispatch root cause by @tohtana in #7881
- fix: Validate fp16.loss_scale is finite and non-negative by @nathon-lee in #7889
- Add AGENTS.md and CLAUDE.md with project rules for AI coding agents by @delock in #7902
- fix(zero3): use current_stream() instead of default_stream() for grad… by @michaelroyzen in #7898
- Update version by @loadams in #7903
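The `fp16.loss_scale` validation in #7889 guards against configs that would silently break fp16 training. A hedged sketch of that kind of check (names are illustrative, not the PR's actual code):

```python
import math

def validate_loss_scale(loss_scale):
    # Hypothetical sketch: reject non-numeric, NaN/inf, and
    # negative loss scales before fp16 training starts.
    if isinstance(loss_scale, bool) or not isinstance(loss_scale, (int, float)):
        raise TypeError("loss_scale must be a number")
    if not math.isfinite(loss_scale) or loss_scale < 0:
        raise ValueError(f"invalid loss_scale: {loss_scale}")
    return float(loss_scale)
```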
New Contributors
- @huangyifan0610 made their first contribution in #7894
- @michaelroyzen made their first contribution in #7898
Full Changelog: v0.18.7...v0.18.8
v0.18.7 Patch Release
What's Changed
- Update version post release by @loadams in #7850
- Z1/2 init: flatten params on device by @ksugama in #7828
- Enable shm_comm support for arm by @phalani-paladugu in #7800
- Add news entry for DeepSpeed updates by @PKUWZP in #7854
- Add EXAONE 4.0 model support for Inference V2 by @Bias92 in #7853
- Fix ROCm BF16 conversion intrinsics in inference v2 (#7843) by @tohtana in #7846
- Fix compilation of Evoformer by @Flamefire in #7862
- Throw error when parameter is modified in GatheredParameters by @tohtana in #7832
- Fix Zero-3 static scale assertion in fp16 test by @tohtana in #7866
- Schedule nightly full test by @tohtana in #7870
- Fix broken links and add AutoTP Training tutorial to sidebar nav by @tohtana in #7874
- fix: replace 35 bare `except` clauses with `except Exception` by @haosenwang1018 in #7873
- perf: use deque for FIFO queues in sequence parallel, superoffload, and compile by @giulio-leone in #7880
- Fix: only add parameter with grads to parameter group by @delock in #7869
- Fix no-grad grad-fn lookup in ZeRO hook counting on PyTorch 2.3 (#7830) by @tohtana in #7841
- Fix import deepspeed crash on PyTorch v2.3 + Python 3.12 by @tohtana in #7875
- XPU use stock pytorch instead of Intel Extension for PyTorch by @delock in #7877
- Remove amp() from abstract accelerator by @delock in #7879
- Add document section explaining autocast nesting by @tohtana in #7883
- Fix hook count performance regression from v0.18.5 by @tohtana in #7886
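The deque change in #7880 swaps list-based FIFO queues for `collections.deque`. A small sketch of why that matters:

```python
from collections import deque

# list.pop(0) shifts every remaining element (O(n) per pop);
# deque.popleft() is O(1), which matters for long FIFO queues
# such as pending-work lists in sequence parallel or offloading.
queue = deque()
for item in ("fwd-0", "fwd-1", "fwd-2"):
    queue.append(item)
first = queue.popleft()  # oldest item comes out first
```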
New Contributors
- @ksugama made their first contribution in #7828
- @phalani-paladugu made their first contribution in #7800
- @Bias92 made their first contribution in #7853
- @haosenwang1018 made their first contribution in #7873
- @giulio-leone made their first contribution in #7880
Full Changelog: v0.18.6...v0.18.7
v0.18.6 Patch Release
What's Changed
- Update version.txt to 0.18.6 after latest release by @loadams in #7826
- Fix leaf module race condition by @tohtana in #7825
- Skip sequence parallel operations during eval by @jp1924 in #7821
- Support custom partitioning patterns for AutoTP by @tohtana in #7806
- Fix gradient is ready with z2 by @sfc-gh-truwase in #7829
- Fix AutoTP custom patterns: respect use_default_specs by @tohtana in #7827
- Support new python 3.14 annotation handling by @sdvillal in #7831
- fix: replace deprecated fractions.gcd with math.gcd by @Mr-Neutr0n in #7845
- Fix bf16 gradient norm divergence with ZeRO stage 0 by @tohtana in #7839
- Replace torch.jit.script with torch.compile (#7835) by @tohtana in #7840
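The `fractions.gcd` fix in #7845 matters because that function was removed in Python 3.9; `math.gcd` is the standard replacement. A minimal sketch:

```python
import math
from functools import reduce

# fractions.gcd was deprecated in Python 3.5 and removed in 3.9;
# math.gcd is the drop-in replacement. reduce() folds it over a
# sequence to get the GCD of more than two values.
def gcd_of(values):
    return reduce(math.gcd, values)
```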
New Contributors
- @jp1924 made their first contribution in #7821
- @Mr-Neutr0n made their first contribution in #7845
Full Changelog: v0.18.5...v0.18.6
v0.18.5 Patch Release
What's Changed
- Update version.txt after 0.18.4 release by @loadams in #7765
- Various fixes to run on mps by @jeffra in #7767
- Update workflow trigger by @tohtana in #7768
- fix: delete using namespace std. by @nathon-lee in #7766
- fix: update Megatron-DeepSpeed tutorial to match current repo structure by @nathon-lee in #7761
- Add timeout to test workflows by @tohtana in #7774
- Remove cron/PR triggers for outdated V100 tests by @loadams in #7777
- [Docs] Fix `docs/_pages/config-json.md` format by @ooooo-create in #7779
- Update CLA to refer to DCO by @loadams in #7778
- Fix multiprocessing testcase by @k-artem in #7743
- fix: skip compressed allreduce for empty tensors by @T1mn in #7769
- docs: update README.md by @eltociear in #7781
- Fix gradient checkpointing with use_reentrant=True / PyTorch-style backward / ZeRO-3 by @tohtana in #7780
- Fix Ulysses PEFT test by @tohtana in #7784
- Fix Evoformer compilation by @sdvillal in #7760
- fix checkpointing/loading of z0+bf16 by @tohtana in #7786
- Add sequential allgather optimization for ZeRO-3 by @aeeeeeep in #7661
- Fix AutoTP test numerical tolerance with rtol by @tohtana in #7794
- Fix backward for pipeline engine by @tohtana in #7787
- Skip empty parameters in gradient reduction by @tohtana in #7789
- Fix issue with BF16 optimizer selection by @tohtana in #7788
- Fix BF16_Optimizer being used without ZeRO by @tohtana in #7790
- Add full test suite workflow by @tohtana in #7795
- Fix Muon optimizer module path by @tohtana in #7802
- Fix ping-pong buffer index reset and removing redundant stream sync by @undersilence in #7805
- Fix ZeRO stage to choose BF16 optimizer in test by @tohtana in #7803
- Run Evoformer tests sequentially by @tohtana in #7810
- Improve engine's cleanup by @tohtana in #7813
- Ignore evoformer test by @tohtana in #7815
- Fix typos in accelerator setup guide by @nathon-lee in #7818
- Raise clear error on in-place GatheredParameters edits without modifier_rank by @tohtana in #7817
- [Bugfix] Resolve Rank index out of range during BWD when sp_size < world_size in Ulysses by @Flink-ddd in #7809
- Update PyTorch to v2.9 for modal tests by @tohtana in #7816
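The AutoTP test fix in #7794 switches a numerical comparison to a relative tolerance. A sketch of the idea (illustrative, not the test's actual code):

```python
import math

# Hypothetical sketch: a relative tolerance scales with the
# magnitude of the expected value, unlike a fixed absolute bound,
# so large activations are not held to an unrealistically tight
# absolute threshold.
def allclose_rtol(actual, expected, rtol=1e-3):
    return all(math.isclose(a, e, rel_tol=rtol)
               for a, e in zip(actual, expected))
```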
New Contributors
- @ooooo-create made their first contribution in #7779
- @T1mn made their first contribution in #7769
- @sdvillal made their first contribution in #7760
- @undersilence made their first contribution in #7805
Full Changelog: v0.18.4...v0.18.5
v0.18.4 Patch Release
What's Changed
- Update version by @sfc-gh-truwase in #7719
- Disable deterministic option in compile tests by @tohtana in #7720
- Fix SuperOffloadOptimizer_Stage3 crash due to missing param_names parameter by @ImaGoodFella in #7715
- [AMD][ROCm] Improve support of AMD by @k-artem in #7448
- fix typo by @stas00 in #7722
- Skip none in backward hook by @tohtana in #7725
- [Engine] Only scale gradients if scale_wrt_gas is True by @kashif in #7724
- Fix testcases that depend on Triton by @k-artem in #7731
- Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL by @xylian86 in #7727
- Fix #7733: Replace torch.sqrt with math.sqrt in scale_lr for sqrt method by @Rakshit-gen in #7735
- replace moe checkpoint dp_world_size with seq_dp_world_size by @wukong1992 in #7732
- [BUG] Fix UlyssesSPAttentionHF.register_with_transformers() crash with PEFT models by @Rakshit-gen in #7737
- Add core api update blog by @tohtana in #7738
- Fix Nebula checkpoint engine commit() API mismatch by @Rakshit-gen in #7740
- Fix DecoupledCheckpointEngine deadlock and improve reliability by @Rakshit-gen in #7742
- Fix OnebitLamb NaN propagation with empty parameters by @Rakshit-gen in #7736
- fix: remove premature MPI environment variable check in OpenMPIRunner by @leejianwoo-collab in #7751
- Enable python 3.11 and 3.12 tests by @loadams in #7007
- Add CI workflow to run tests on AWS by @tohtana in #7753
- Add fallback to BF16 support check by @tohtana in #7754
- Fix DeepCompile for PyTorch 2.8/2.9 compatibility by @tohtana in #7755
- Removed amp testcases by @k-artem in #7745
- fix: avoid IndexError in BF16_Optimizer.destroy() when using DummyOptim by @leejianwoo-collab in #7763
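The #7735 fix replaces `torch.sqrt` with `math.sqrt` when scaling a plain-float learning rate. A hedged sketch (the function name and signature are illustrative, not DeepSpeed's actual API):

```python
import math

# Hypothetical sketch: for a Python float, math.sqrt avoids
# constructing a tensor just to take a square root; torch.sqrt
# expects a tensor argument and would fail on a plain float.
def scale_lr(base_lr, world_size, method="sqrt"):
    if method == "sqrt":
        return base_lr * math.sqrt(world_size)
    return base_lr * world_size
```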
New Contributors
- @ImaGoodFella made their first contribution in #7715
- @k-artem made their first contribution in #7448
- @kashif made their first contribution in #7724
- @Rakshit-gen made their first contribution in #7735
- @leejianwoo-collab made their first contribution in #7751
Full Changelog: v0.18.3...v0.18.4
v0.18.3 Patch Release
What's Changed
- Update version.txt after release by @loadams in #7675
- [modal ci] fixes by @stas00 in #7676
- leaf modules: explain better by @stas00 in #7674
- disable nv-lightning-v100.yml CI by @stas00 in #7681
- allow separate learning rates "muon_lr" and "adam_lr" for the Muon optimizer by @delock in #7658
- see_mem_usage: make always work by @stas00 in #7688
- make debug utils more resilient by @stas00 in #7690
- zero stage 1-2: don't pin memory if not configured by @stas00 in #7689
- modal ci: fix group concurrency by @stas00 in #7691
- Use pytorch utils to detect ninja by @Emrys-Merlin in #7687
- Update SECURITY.md to point to GitHub reporting rather than Microsoft by @loadams in #7692
- Add Qwen2.5 to AutoTP model list by @delock in #7696
- Trust intel server for XPU tests by @tohtana in #7698
- PyTorch-compatible backward API by @tohtana in #7665
- Add news about Ray x DeepSpeed Meetup by @PKUWZP in #7704
- Put Muon optimizer momentum buffer on GPU by @delock in #7648
- [ROCm] Relax tolerances for FP8 unit test for fp16 and bf16 cases by @rraminen in #7655
- Fix ds_secondary_tensor being left dirty when loading a model or ZeRO checkpoint with ZeRO++ by @zhengchenyu in #7707
- fix: skip aio wait when swap tensors is empty by @xylian86 in #7712
- Low-precision master params/grads/optimizer states by @tohtana in #7700
- Enabled compiled autograd for backward pass by @deepcharm in #7667
- Wall clock timers API by @sfc-gh-truwase in #7714
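The separate "muon_lr" / "adam_lr" support in #7658 (and keeping embeddings and lm_head off Muon, per #7641) amounts to routing parameters into distinct optimizer param groups. A hedged sketch under those assumptions (names and routing rule are illustrative):

```python
# Hypothetical sketch: split named parameters into two optimizer
# param groups so Muon and Adam can each get their own learning
# rate; embedding and lm_head parameters stay on Adam.
def build_param_groups(named_params, muon_lr=0.02, adam_lr=1e-3):
    muon = {"params": [], "lr": muon_lr}
    adam = {"params": [], "lr": adam_lr}
    for name, param in named_params:
        target = adam if ("embed" in name or "lm_head" in name) else muon
        target["params"].append(param)
    return [muon, adam]
```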
New Contributors
- @Emrys-Merlin made their first contribution in #7687
Full Changelog: v0.18.2...v0.18.3
v0.18.2 Patch Release
What's Changed
- Update version after 0.18.1 release by @loadams in #7647
- Deduplicate fp32 weights under torch autocast and ZeRO3 by @eternalNight in #7651
- ulysses mpu: additional api by @stas00 in #7649
- ALST/UlyssesSP: more intuitive API wrt variable seqlen by @stas00 in #7656
- Fix misplaced overflow handling return in fused_optimizer.py by @rraminen in #7645
- [bug]: fixed comm_dtype in extra_large_param_to_reduce by @therealnaveenkamal in #7660
- UlyssesSP: TiledMLP doc - recomputes forward twice by @stas00 in #7664
- resolved a 0-dim tensor slicing bug from _get_state_without_padding by @therealnaveenkamal in #7659
- Fix typo in pytorch-profiler.md documentation by @kunheek in #7652
- README refresh by @sfc-gh-truwase in #7668
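The 0-dim slicing fix in #7659 reflects that optimizer state entries can be scalars (which cannot be sliced) as well as padded flat vectors. A hedged sketch of the guard, using plain Python values in place of tensors:

```python
# Hypothetical sketch: optimizer state entries may be scalars
# (e.g. a step counter, not sliceable) or flat vectors padded to
# an alignment boundary; only the latter should have trailing
# padding stripped.
def strip_padding(state, numel):
    if isinstance(state, (int, float)):
        return state  # 0-dim scalar state: nothing to slice
    return state[:numel]
```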
Full Changelog: v0.18.1...v0.18.2
v0.18.1 Patch Release
What's Changed
- Add ZenFlow code for Stage 3 by @JoshWoo2003 in #7516
- [XPU][CI] recover xpu-max1100 workflow by @Liangliang-Ma in #7630
- Take **kwargs in init of DeepSpeedZeroOptimizer subclasses by @eternalNight in #7634
- add support for tensor learning rate (vs scalar) by @NirSonnenschein in #7633
- Fix illegal memory access with multi_tensor_apply size above INT_MAX by @wangyan-mms in #7639
- No Muon optimizer for embedding and lm_head layers by @delock in #7641
- z2: report param name and not zero id in assert by @stas00 in #7637
- z2: don't pass `dtype` to `report_ipg_memory_usage` by @stas00 in #7636
- Ulysses HF Accelerate integration by @stas00 in #7638
- Add DataStates-LLM: Asynchronous Checkpointing Engine Support by @mauryaavinash95 in #7166
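The tensor learning-rate support in #7633 means a learning rate may now arrive as a 0-dim tensor rather than a scalar. A hedged sketch of normalizing both forms (illustrative, not the PR's actual code):

```python
# Hypothetical sketch: accept either a Python scalar or a 0-dim
# tensor-like value (anything exposing .item()) as a learning
# rate, normalizing to a plain float.
def resolve_lr(lr):
    return float(lr.item()) if hasattr(lr, "item") else float(lr)
```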
New Contributors
- @JoshWoo2003 made their first contribution in #7516
- @wangyan-mms made their first contribution in #7639
Full Changelog: v0.18.0...v0.18.1
v0.18.0
What's Changed
- Update version.txt post 0.17.6 release by @loadams in #7572
- DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling… by @juyterman1000 in #7489
- logging: Also set log level of logger handlers by @eternalNight in #7576
- Deepcompile: Fix bugs when applying deepcompile to VLA-like models by @eternalNight in #7569
- Broadcast fp16 overflow in Z1 by @sfc-gh-truwase in #7580
- Deepcompile: Make size of activation to free configurable by @eternalNight in #7582
- SuperOffload Release by @xylian86 in #7559
- Include init file for superoffload folder by @nguyen599 in #7591
- disables ZeRO checkpoint loading path when stage=0 by @therealnaveenkamal in #7586
- Simplify leaf module hook by @tohtana in #7592
- Fix the universal checkpoint issue for stage3 when there are multiple subgroups. by @zhengchenyu in #7585
- Change current_device() to current_device_name() by @delock in #7600
- Fixed a universal checkpoint loading error in multi-machine mode by @zhengchenyu in #7601
- DeepCompile: Specify tensor aliasing in C++ op schema by @eternalNight in #7597
- DeepCompile: Fuse allgather and downcast by @eternalNight in #7588
- Add blog for SuperOffload by @xylian86 in #7594
- Add venv to .gitignore by @zhengchenyu in #7605
- Handle the case of DeepCompile's enabled but not activated by @tohtana in #7603
- DeepCompile: Fix IPG bucket clearing by @eternalNight in #7610
- Minor fix in the SuperOffload blog by @xylian86 in #7612
- Fixed the issue where a universal checkpoint cannot be loaded for stage 3 when the world size is expanded by @zhengchenyu in #7599
- Fixed save_checkpoint race when consolidating NVMe offloaded tensors by @H1manshu21 in #7613
- [wall_clock_breakdown] always log stats when enabled by @stas00 in #7617
- DeepCompile: Use min_cut_rematerialization for partitioning joint graphs by @eternalNight in #7609
- Show mismatching values when DeepCompile test fails by @tohtana in #7618
- Improve leaf module interface (enable via config, relax matching criteria, add document, etc.) by @tohtana in #7604
- add print_dist util by @stas00 in #7621
- Super offload blog Chinese version by @delock in #7620
- Enable grad scaler for ZeRO-0 + torch.autocast path by @tohtana in #7619
- Blog of zenflow binding study by @delock in #7614
- Clarify document of leaf module config by @tohtana in #7623
- [TiledMLP] moe support by @stas00 in #7622
- Update email address by @sfc-gh-truwase in #7624
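The fp16 overflow broadcast in #7580 ensures every ZeRO-1 rank agrees on whether to skip a step. A hedged sketch of the logic, with a plain `any()` standing in for the actual collective:

```python
import math

# Hypothetical sketch: each rank checks its own gradients for
# non-finite values, then the flags are OR-combined so all ranks
# make the same skip/step decision. A real implementation would
# broadcast or all-reduce the flag; any() stands in for that here.
def local_overflow(grads):
    return any(not math.isfinite(g) for g in grads)

def global_overflow(per_rank_grads):
    return any(local_overflow(g) for g in per_rank_grads)
```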
New Contributors
- @juyterman1000 made their first contribution in #7489
- @nguyen599 made their first contribution in #7591
- @zhengchenyu made their first contribution in #7585
- @H1manshu21 made their first contribution in #7613
Full Changelog: v0.17.6...v0.18.0