
Commit 98920c1

so does it need whitespaces??
1 parent 1e54e07 commit 98920c1

11 files changed

Lines changed: 39 additions & 20 deletions


torchzero/modules/clipping/clipping.py

Lines changed: 0 additions & 1 deletion
@@ -383,7 +383,6 @@ class Centralize(Transform):
     Standard gradient centralization:
 
     .. code-block:: python
-
         opt = tz.Modular(
             model.parameters(),
             tz.m.Centralize(dim=0),

torchzero/modules/line_search/line_search.py

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,7 @@ class LineSearch(Module, ABC):
 
     Examples:
         #### Basic line search
+
         This evaluates all step sizes in a range by using the :code:`self.evaluate_step_size` method.
 
         .. code-block:: python
@@ -64,6 +65,7 @@ def search(self, update, var):
                 return best_step_size
 
         #### Using external solver via self.make_objective
+
         Here we let :code:`scipy.optimize.minimize_scalar` solver find the best step size via :code:`self.make_objective`
 
         .. code-block:: python
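For reference, the two approaches named in these headings could be sketched roughly as follows; the `evaluate_step_size` and `make_objective` signatures are assumptions inferred from the docstring text above, not verified against the library:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from torchzero.modules.line_search.line_search import LineSearch  # path per the diff above

    class GridLineSearch(LineSearch):
        # assumed sketch: evaluate a fixed range of step sizes and keep the best one
        def search(self, update, var):
            step_sizes = np.linspace(0.0, 1.0, 25)
            losses = [self.evaluate_step_size(s, var=var) for s in step_sizes]  # assumed signature
            return float(step_sizes[int(np.argmin(losses))])

    class ScipyLineSearch(LineSearch):
        # assumed sketch: hand an objective over the step size to scipy's scalar minimizer
        def search(self, update, var):
            objective = self.make_objective(var=var)  # assumed signature
            return minimize_scalar(objective).x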

torchzero/modules/momentum/cautious.py

Lines changed: 6 additions & 0 deletions
@@ -57,7 +57,9 @@ class Cautious(Transform):
 
     Examples:
         Cautious Adam
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.Adam(),
@@ -171,7 +173,9 @@ class ScaleByGradCosineSimilarity(Transform):
 
     Examples:
         Scaled Adam
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.Adam(),
@@ -207,7 +211,9 @@ class ScaleModulesByCosineSimilarity(Module):
 
     Example:
         Adam scaled by similarity to RMSprop
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.ScaleModulesByCosineSimilarity(
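The example hunks stop at `tz.m.Adam(),`; a plausible completion of the "Cautious Adam" chain, with an assumed learning-rate module, might look like:

    opt = tz.Modular(
        bench.parameters(),
        tz.m.Adam(),
        tz.m.Cautious(),  # roughly: mask update entries whose sign disagrees with the gradient
        tz.m.LR(1e-2),    # hypothetical learning-rate module, value illustrative
    )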

torchzero/modules/momentum/matrix_momentum.py

Lines changed: 4 additions & 8 deletions
@@ -15,14 +15,12 @@ class MatrixMomentum(Module):
     :code:`mu` is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable.
 
     .. note::
-        Because MatrixMomentum relies on extra autograd, in most cases it should be the first module in the chain.
-
-    .. note::
-        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
+        In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         mu (float, optional): this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
@@ -105,14 +103,12 @@ class AdaptiveMatrixMomentum(Module):
     This version estimates mu via a simple heuristic: ||s||/||y||, where s is parameter difference, y is gradient difference.
 
     .. note::
-        Because AdaptiveMatrixMomentum relies on extra autograd, in most cases it should be the first module in the chain.
-
-    .. note::
-        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
+        In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
 
     Args:
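The added line about the ``backward`` argument describes the closure these HVP-based modules expect; a minimal sketch of such a closure (model, data and loss function are placeholders, and the exact convention should be checked against the torchzero documentation):

    def closure(backward=True):
        loss = loss_fn(model(inputs), targets)
        if backward:
            opt.zero_grad()
            loss.backward()
        return loss

    loss = opt.step(closure)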

torchzero/modules/optimizers/adahessian.py

Lines changed: 2 additions & 1 deletion
@@ -71,14 +71,15 @@ class AdaHessian(Module):
     This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of squared randomized hessian diagonal estimates.
 
     .. note::
-        Because AdaHessian relies on extra autograd, in most cases it should be the first module in the chain. Use the :code:`inner` argument if you wish to apply AdaHessian preconditioning to another module's output.
+        In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply AdaHessian preconditioning to another module's output.
 
     .. note::
         If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         beta1 (float, optional): first momentum. Defaults to 0.9.
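By analogy with the BFGS/SR1 examples further down in this commit, the :code:`inner` argument mentioned in the note might be used like this (a sketch, not verified against the library; `tz.m.LR` is an assumed module):

    opt = tz.Modular(
        model.parameters(),
        tz.m.AdaHessian(inner=tz.m.EMA(0.9)),  # precondition an EMA of gradients, mirroring BFGS(inner=...)
        tz.m.LR(1e-2),                         # hypothetical learning-rate module
    )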

torchzero/modules/optimizers/sophia_h.py

Lines changed: 2 additions & 1 deletion
@@ -40,14 +40,15 @@ class SophiaH(Module):
     This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized hessian diagonal estimates, and the update is agressively clipped.
 
     .. note::
-        Because SophiaH relies on extra autograd, in most cases it should be the first module in the chain. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.
+        In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.
 
     .. note::
         If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         beta1 (float, optional): first momentum. Defaults to 0.96.
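The unchanged :code:`hvp_method` note applies when the incoming update comes from a gradient estimator; a hedged sketch (the estimator module name here is purely hypothetical, and module order should be checked against the documentation):

    opt = tz.Modular(
        model.parameters(),
        tz.m.GaussianSmoothing(),            # hypothetical gradient-estimator module
        tz.m.SophiaH(hvp_method="central"),  # finite-difference HVPs instead of autograd
        tz.m.LR(1e-2),                       # hypothetical learning-rate module
    )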

torchzero/modules/quasi_newton/lsr1.py

Lines changed: 8 additions & 4 deletions
@@ -75,10 +75,14 @@ def lsr1_(
 class LSR1(Module):
     """Limited Memory SR1 algorithm. A line search is recommended.
 
-    Notes:
-        - L-SR1 provides a better estimate of true hessian, however it is significantly more unstable compared to L-BFGS.
-        - L-SR1 update rule uses a nested loop, computationally with history size `n` it is similar to L-BFGS with history size `n!` (n factorial). On small problems BFGS and SR1 may be faster than limited-memory versions.
-        - directions L-SR1 generates are not guaranteed to be descent directions. This can be alleviated in multiple ways,
+    .. note::
+        L-SR1 provides a better estimate of true hessian, however it is significantly more unstable compared to L-BFGS.
+
+    .. note::
+        L-SR1 update rule uses a nested loop, computationally with history size `n` it is similar to L-BFGS with history size `n!` (n factorial). On small problems BFGS and SR1 may be faster than limited-memory versions.
+
+    .. note::
+        directions L-SR1 generates are not guaranteed to be descent directions. This can be alleviated in multiple ways,
         for example using :code:`tz.m.StrongWolfe(plus_minus=True)` line search, or modifying the direction with :code:`tz.m.Cautious` or :code:`tz.m.ScaleByGradCosineSimilarity`.
 
     Args:
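A short sketch of the mitigation the last note describes, pairing L-SR1 with the plus-minus Strong Wolfe line search it mentions (default constructor arguments are assumed):

    opt = tz.Modular(
        model.parameters(),
        tz.m.LSR1(),                        # L-SR1 direction; not guaranteed to be a descent direction
        tz.m.StrongWolfe(plus_minus=True),  # line search variant suggested in the note above
    )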

torchzero/modules/quasi_newton/quasi_newton.py

Lines changed: 11 additions & 1 deletion
@@ -59,7 +59,9 @@ class HessianUpdateStrategy(TensorwiseTransform, ABC):
 
     Example:
         Implementing BFGS method that maintains an estimate of the hessian inverse (H):
+
         .. code-block:: python
+
             class BFGS(HessianUpdateStrategy):
                 def __init__(
                     self,
@@ -99,6 +101,7 @@ def update_H(self, H, s, y, p, g, p_prev, g_prev, state, settings):
                 term2 = num2.div_(sy)
                 H += term1.sub_(term2)
                 return H
+
     """
     def __init__(
         self,
@@ -227,7 +230,6 @@ class HUpdateStrategy(HessianUpdateStrategy):
     Refer to :code:`HessianUpdateStrategy` documentation.
 
     Example:
-
         Implementing BFGS method that maintains an estimate of the hessian inverse (H):
 
         .. code-block:: python
@@ -324,15 +326,19 @@ class BFGS(HUpdateStrategy):
 
     Examples:
         BFGS with strong-wolfe line search:
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.BFGS(),
                 tz.m.StrongWolfe()
             )
 
         BFGS preconditioning applied to momentum:
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.BFGS(inner=tz.m.EMA(0.9)),
@@ -403,15 +409,19 @@ class SR1(HUpdateStrategy):
 
     Examples:
         SR1 with strong-wolfe line search
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.SR1(),
                 tz.m.StrongWolfe()
             )
 
         BFGS preconditioning applied to momentum
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.SR1(inner=tz.m.EMA(0.9)),

torchzero/modules/second_order/newton.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ class Newton(Module):
     """Exact newton's method via autograd.
 
     .. note::
-        In most cases Newton should be the first module in the chain because it relies on extra autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
+        In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
 
     .. note::
         This module requires the a closure passed to the optimizer step,

torchzero/modules/second_order/newton_cg.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ class NewtonCG(Module):
     differentiation or approximated using finite differences.
 
     .. note::
-        In most cases NewtonCG should be the first module in the chain because it relies on extra autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
+        In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
