
Commit 98920c1

so does it need whitespaces??
1 parent 1e54e07 commit 98920c1

11 files changed

Lines changed: 39 additions & 20 deletions


torchzero/modules/clipping/clipping.py

Lines changed: 0 additions & 1 deletion
@@ -383,7 +383,6 @@ class Centralize(Transform):
     Standard gradient centralization:
 
     .. code-block:: python
-
         opt = tz.Modular(
             model.parameters(),
             tz.m.Centralize(dim=0),

torchzero/modules/line_search/line_search.py

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,7 @@ class LineSearch(Module, ABC):
 
     Examples:
         #### Basic line search
+
         This evaluates all step sizes in a range by using the :code:`self.evaluate_step_size` method.
 
         .. code-block:: python
@@ -64,6 +65,7 @@ def search(self, update, var):
                 return best_step_size
 
         #### Using external solver via self.make_objective
+
         Here we let :code:`scipy.optimize.minimize_scalar` solver find the best step size via :code:`self.make_objective`
 
         .. code-block:: python
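For reference, the two approaches named in these headings could be sketched roughly as follows; the `evaluate_step_size` and `make_objective` signatures are assumptions inferred from the docstring text above, not verified against the library:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from torchzero.modules.line_search.line_search import LineSearch  # path per the diff above

    class GridLineSearch(LineSearch):
        # assumed sketch: evaluate a fixed range of step sizes and keep the best one
        def search(self, update, var):
            step_sizes = np.linspace(0.0, 1.0, 25)
            losses = [self.evaluate_step_size(s, var=var) for s in step_sizes]  # assumed signature
            return float(step_sizes[int(np.argmin(losses))])

    class ScipyLineSearch(LineSearch):
        # assumed sketch: hand an objective over the step size to scipy's scalar minimizer
        def search(self, update, var):
            objective = self.make_objective(var=var)  # assumed signature
            return minimize_scalar(objective).x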

torchzero/modules/momentum/cautious.py

Lines changed: 6 additions & 0 deletions
@@ -57,7 +57,9 @@ class Cautious(Transform):
 
     Examples:
         Cautious Adam
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.Adam(),
@@ -171,7 +173,9 @@ class ScaleByGradCosineSimilarity(Transform):
 
     Examples:
         Scaled Adam
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.Adam(),
@@ -207,7 +211,9 @@ class ScaleModulesByCosineSimilarity(Module):
 
     Example:
         Adam scaled by similarity to RMSprop
+
         .. code-block:: python
+
             opt = tz.Modular(
                 bench.parameters(),
                 tz.m.ScaleModulesByCosineSimilarity(
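The example hunks stop at `tz.m.Adam(),`; a plausible completion of the "Cautious Adam" chain, with an assumed learning-rate module, might look like:

    opt = tz.Modular(
        bench.parameters(),
        tz.m.Adam(),
        tz.m.Cautious(),  # roughly: mask update entries whose sign disagrees with the gradient
        tz.m.LR(1e-2),    # hypothetical learning-rate module, value illustrative
    )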

torchzero/modules/momentum/matrix_momentum.py

Lines changed: 4 additions & 8 deletions
@@ -15,14 +15,12 @@ class MatrixMomentum(Module):
     :code:`mu` is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable.
 
     .. note::
-        Because MatrixMomentum relies on extra autograd, in most cases it should be the first module in the chain.
-
-    .. note::
-        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
+        In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         mu (float, optional): this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
@@ -105,14 +103,12 @@ class AdaptiveMatrixMomentum(Module):
     This version estimates mu via a simple heuristic: ||s||/||y||, where s is parameter difference, y is gradient difference.
 
     .. note::
-        Because AdaptiveMatrixMomentum relies on extra autograd, in most cases it should be the first module in the chain.
-
-    .. note::
-        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
+        In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
 
     Args:
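The added line about the ``backward`` argument describes the closure these HVP-based modules expect; a minimal sketch of such a closure (model, data and loss function are placeholders, and the exact convention should be checked against the torchzero documentation):

    def closure(backward=True):
        loss = loss_fn(model(inputs), targets)
        if backward:
            opt.zero_grad()
            loss.backward()
        return loss

    loss = opt.step(closure)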

torchzero/modules/optimizers/adahessian.py

Lines changed: 2 additions & 1 deletion
@@ -71,14 +71,15 @@ class AdaHessian(Module):
     This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of squared randomized hessian diagonal estimates.
 
     .. note::
-        Because AdaHessian relies on extra autograd, in most cases it should be the first module in the chain. Use the :code:`inner` argument if you wish to apply AdaHessian preconditioning to another module's output.
+        In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply AdaHessian preconditioning to another module's output.
 
     .. note::
         If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         beta1 (float, optional): first momentum. Defaults to 0.9.
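By analogy with the BFGS/SR1 examples further down in this commit, the :code:`inner` argument mentioned in the note might be used like this (a sketch, not verified against the library; `tz.m.LR` is an assumed module):

    opt = tz.Modular(
        model.parameters(),
        tz.m.AdaHessian(inner=tz.m.EMA(0.9)),  # precondition an EMA of gradients, mirroring BFGS(inner=...)
        tz.m.LR(1e-2),                         # hypothetical learning-rate module
    )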

torchzero/modules/optimizers/sophia_h.py

Lines changed: 2 additions & 1 deletion
@@ -40,14 +40,15 @@ class SophiaH(Module):
     This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized hessian diagonal estimates, and the update is agressively clipped.
 
     .. note::
-        Because SophiaH relies on extra autograd, in most cases it should be the first module in the chain. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.
+        In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.
 
     .. note::
         If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".
 
     .. note::
         This module requires the a closure passed to the optimizer step,
         as it needs to re-evaluate the loss and gradients for calculating HVPs.
+        The closure must accept a ``backward`` argument (refer to documentation).
 
     Args:
         beta1 (float, optional): first momentum. Defaults to 0.96.
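The unchanged :code:`hvp_method` note applies when the incoming update comes from a gradient estimator; a hedged sketch (the estimator module name here is purely hypothetical, and module order should be checked against the documentation):

    opt = tz.Modular(
        model.parameters(),
        tz.m.GaussianSmoothing(),            # hypothetical gradient-estimator module
        tz.m.SophiaH(hvp_method="central"),  # finite-difference HVPs instead of autograd
        tz.m.LR(1e-2),                       # hypothetical learning-rate module
    )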

torchzero/modules/quasi_newton/lsr1.py

Lines changed: 8 additions & 4 deletions
@@ -75,10 +75,14 @@ def lsr1_(
 class LSR1(Module):
     """Limited Memory SR1 algorithm. A line search is recommended.
 
-    Notes:
-        - L-SR1 provides a better estimate of true hessian, however it is significantly more unstable compared to L-BFGS.
-        - L-SR1 update rule uses a nested loop, computationally with history size `n` it is similar to L-BFGS with history size `n!` (n factorial). On small problems BFGS and SR1 may be faster than limited-memory versions.
-        - directions L-SR1 generates are not guaranteed to be descent directions. This can be alleviated in multiple ways,
+    .. note::
+        L-SR1 provides a better estimate of true hessian, however it is significantly more unstable compared to L-BFGS.
+
+    .. note::
+        L-SR1 update rule uses a nested loop, computationally with history size `n` it is similar to L-BFGS with history size `n!` (n factorial). On small problems BFGS and SR1 may be faster than limited-memory versions.
+
+    .. note::
+        directions L-SR1 generates are not guaranteed to be descent directions. This can be alleviated in multiple ways,
         for example using :code:`tz.m.StrongWolfe(plus_minus=True)` line search, or modifying the direction with :code:`tz.m.Cautious` or :code:`tz.m.ScaleByGradCosineSimilarity`.
 
     Args:
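A short sketch of the mitigation the last note describes, pairing L-SR1 with the plus-minus Strong Wolfe line search it mentions (default constructor arguments are assumed):

    opt = tz.Modular(
        model.parameters(),
        tz.m.LSR1(),                        # L-SR1 direction; not guaranteed to be a descent direction
        tz.m.StrongWolfe(plus_minus=True),  # line search variant suggested in the note above
    )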

torchzero/modules/quasi_newton/quasi_newton.py

Lines changed: 11 additions & 1 deletion
@@ -59,7 +59,9 @@ class HessianUpdateStrategy(TensorwiseTransform, ABC):
 
     Example:
         Implementing BFGS method that maintains an estimate of the hessian inverse (H):
+
         .. code-block:: python
+
             class BFGS(HessianUpdateStrategy):
                 def __init__(
                     self,
@@ -99,6 +101,7 @@ def update_H(self, H, s, y, p, g, p_prev, g_prev, state, settings):
                 term2 = num2.div_(sy)
                 H += term1.sub_(term2)
                 return H
+
     """
     def __init__(
         self,
@@ -227,7 +230,6 @@ class HUpdateStrategy(HessianUpdateStrategy):
     Refer to :code:`HessianUpdateStrategy` documentation.
 
     Example:
-
         Implementing BFGS method that maintains an estimate of the hessian inverse (H):
 
         .. code-block:: python
@@ -324,15 +326,19 @@ class BFGS(HUpdateStrategy):
 
     Examples:
         BFGS with strong-wolfe line search:
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.BFGS(),
                 tz.m.StrongWolfe()
             )
 
         BFGS preconditioning applied to momentum:
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.BFGS(inner=tz.m.EMA(0.9)),
@@ -403,15 +409,19 @@ class SR1(HUpdateStrategy):
 
     Examples:
         SR1 with strong-wolfe line search
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.SR1(),
                 tz.m.StrongWolfe()
             )
 
         BFGS preconditioning applied to momentum
+
         .. code-block:: python
+
             opt = tz.Modular(
                 model.parameters(),
                 tz.m.SR1(inner=tz.m.EMA(0.9)),

torchzero/modules/second_order/newton.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ class Newton(Module):
     """Exact newton's method via autograd.
 
     .. note::
-        In most cases Newton should be the first module in the chain because it relies on extra autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
+        In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
 
     .. note::
         This module requires the a closure passed to the optimizer step,

torchzero/modules/second_order/newton_cg.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ class NewtonCG(Module):
     differentiation or approximated using finite differences.
 
     .. note::
-        In most cases NewtonCG should be the first module in the chain because it relies on extra autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
+        In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply Newton preconditioning to another module's output.
 
     .. note::
         This module requires the a closure passed to the optimizer step,
