Commit 6c59d54
Fix hook count performance regression from v0.18.5 (#7886)
Fixes performance regressions reported in #7882 and #7885.
PR #7780 added dynamic hook count computation for reentrant
checkpointing correctness, but placed the call inside every gradient
hook closure. For a model with n parameter tensors, the count is
therefore recomputed n times per backward pass, which adds
significant overhead.
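The cost pattern described above can be illustrated with a toy model. This is only a sketch: `make_hook`, the dict-based parameters, and the assumption that the counting helper scans every parameter are hypothetical stand-ins, not DeepSpeed code.

```python
# Toy model of the regression. Only the name
# count_used_parameters_in_backward comes from the commit message;
# its body here is an assumed stand-in.
calls = 0


def count_used_parameters_in_backward(params):
    """Stand-in helper: assumed to scan every parameter once per call."""
    global calls
    calls += 1
    return sum(1 for p in params if p["requires_grad"])


def make_hook(params):
    # The count is recomputed inside the closure, so it runs on every
    # hook firing rather than once per backward pass.
    def hook():
        return count_used_parameters_in_backward(params)

    return hook


params = [{"requires_grad": True} for _ in range(147)]
hooks = [make_hook(params) for _ in params]

for hook in hooks:  # one simulated backward pass
    hook()

print(calls)  # 147: one recount per parameter tensor
```

Under the stated assumption that each recount scans all n parameters, one backward pass does on the order of n² work instead of n.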
Summary:
1. Added `should_refresh_expected_hook_count()` predicate that returns
true only at backward phase boundaries (first hook, or new reentrant
phase), so `count_used_parameters_in_backward()` is called once per
phase instead of once per hook.
2. Applied this predicate in ZeRO-1/2 (stage_1_and_2.py) and both ZeRO-3
hook sites (stage3.py), reusing the `cached_max_expected_hooks_seen`
value when refresh isn't needed.
3. Changed `enter_backward()` to reset hook counters on first real
backward entry, preventing pollution from pre-user-backward autograd
calls (e.g., TiledFusedLogitsLoss).
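The three changes above can be sketched roughly as follows. This is a minimal illustration, not the DeepSpeed implementation: the `HookCountTracker` class, its fields (`hooks_fired`, `seen_phase`, `backward_active`), the `phase` argument, `on_grad_hook`, and `exit_backward` are all hypothetical, and taking a running `max` is an assumption suggested by the name `cached_max_expected_hooks_seen`.

```python
class HookCountTracker:
    """Hypothetical illustration of refresh-at-phase-boundary caching."""

    def __init__(self, count_fn):
        # count_fn stands in for count_used_parameters_in_backward().
        self.count_fn = count_fn
        self.hooks_fired = 0
        self.seen_phase = None
        self.cached_max_expected_hooks_seen = 0
        self.backward_active = False

    def enter_backward(self):
        # Reset counters only on the first real backward entry, so hooks
        # fired by earlier autograd calls do not pollute the counts.
        if not self.backward_active:
            self.backward_active = True
            self.hooks_fired = 0
            self.seen_phase = None
            self.cached_max_expected_hooks_seen = 0

    def exit_backward(self):
        self.backward_active = False

    def should_refresh_expected_hook_count(self, phase):
        # True only at a boundary: first hook, or a new reentrant phase.
        return self.hooks_fired == 0 or phase != self.seen_phase

    def on_grad_hook(self, phase):
        if self.should_refresh_expected_hook_count(phase):
            self.cached_max_expected_hooks_seen = max(
                self.cached_max_expected_hooks_seen, self.count_fn()
            )
            self.seen_phase = phase
        self.hooks_fired += 1
        return self.cached_max_expected_hooks_seen


calls = 0


def fake_count():
    """Counts how often the expensive recount would have run."""
    global calls
    calls += 1
    return 147


tracker = HookCountTracker(fake_count)
tracker.enter_backward()
for _ in range(100):
    tracker.on_grad_hook(phase=0)  # first reentrant phase
for _ in range(47):
    tracker.on_grad_hook(phase=1)  # second reentrant phase
tracker.exit_backward()

print(calls)  # 2: one recount per phase instead of one per hook
```

In this toy run, 147 hook firings across two phases trigger two recounts, which is the once-per-phase behavior the summary describes.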
With a 24-layer transformer (~267M params, 147 parameter tensors), ZeRO-2,
8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations:
- Before fix: 0.1265s/iter
- After fix: 0.0505s/iter
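As a quick check on what these two timings imply, the relative improvement can be computed directly. The variable names are illustrative; the inputs are the figures reported above.

```python
before_s = 0.1265  # seconds per iteration before the fix
after_s = 0.0505   # seconds per iteration after the fix

speedup = before_s / after_s
saved_pct = 100 * (before_s - after_s) / before_s

print(f"{speedup:.2f}x faster, {saved_pct:.1f}% less time per iteration")
# prints "2.50x faster, 60.1% less time per iteration"
```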
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com>
1 parent 4dba1e2 commit 6c59d54
File tree
4 files changed, +155 -5 lines changed
- deepspeed/runtime
  - zero
- tests/unit/v1/zero