Following Manifolds.jl#856, CUDA overrides live in this package. This issue tracks progress across manifolds to coordinate contributions.
All overrides target `PowerManifold{ℝ, <:M, <:Tuple, ArrayPowerRepresentation}` with `CuArray{T,3}`, replacing ManifoldsBase's sequential `get_iterator` loop with batched GPU ops. For many operations the single-matrix CPU code already works on `CuMatrix` (e.g. `svd`, `/`), but the PowerManifold loop launches N separate kernels — overrides batch these into single strided calls.

## Shared helpers (`helpers.jl`)

- `_matrix_exp_gpu` — batched matrix exponential (Taylor + scaling-and-squaring)
- `_matrix_log_gpu` — batched matrix logarithm (inverse scaling-and-squaring + Denman-Beavers)
- `_batched_inv_gpu` — batched matrix inverse (LU)
- `_batched_sqrtm_gpu` — batched matrix square root (Denman-Beavers)
- `_cholesky_qr_gpu!` — batched Cholesky-QR (`A'A` + `potrfBatched!` + `trsm_batched!`) (PR #11)
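The Cholesky-QR trick behind `_cholesky_qr_gpu!` needs only a Gram matrix, one Cholesky factorization, and a triangular solve, which is why it batches so well. A minimal single-slice CPU sketch (the function name is illustrative, not the package's API):

```julia
using LinearAlgebra

# Cholesky-QR: for a tall full-rank A, A'A = R'R (Cholesky), and Q = A / R
# has orthonormal columns with A = Q * R. On GPU the same three steps map to
# batched GEMM (A'A), potrfBatched! (Cholesky), and trsm_batched! (solve).
function cholesky_qr(A::AbstractMatrix)
    R = cholesky(Symmetric(A' * A)).U   # upper triangular, R' * R == A' * A
    Q = A / R                           # right triangular solve
    return Q, R
end

A = randn(100, 4)
Q, R = cholesky_qr(A)
@assert Q' * Q ≈ I          # orthonormal columns
@assert Q * R ≈ A           # valid QR factorization
```

Compared with the `gesvdj!`-based polar path this has no batch size limit, at the cost of squaring the condition number through `A'A`.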
## GeneralUnitaryMatrices (PR #4)

Default retraction: `ExponentialRetraction()` — already overridden. Default vector transport: `ProjectionTransport()` — already overridden. Both default paths work on GPU.

- `exp!` — `_matrix_exp_gpu` + `gemm_strided_batched` (generic-n; covers n=2,3,4 specializations)
- `log!` (ℝ and ℂ) — `_matrix_log_gpu` + skew-symmetrization
- `inner`, `norm`
- `project!` (point, `AbsoluteDeterminantOneMatrixType`) — `gesvdj!`. Note: `Rotations` (`DeterminantOneMatrixType`) needs additional `det` correction — not covered.
- `project!` (tangent) — `skew(p'X)` via batched GEMM
- `retract_polar_fused!` — `gesvdj!` with CPU fallback for large matrices
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!`; fully batched, no size limit (PR #11)

Nice-to-have:

- `parallel_transport_to!` — building blocks exist but not wired for batched PowerManifold
- `rand!` — scalar indexing in CPU code; can initialize on CPU and transfer
- `inverse_retract_polar!` — `Rotations` version uses `lyap()` + try/catch; no GPU path
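The `skew(p'X)` tangent projection is one GEMM plus an elementwise antisymmetrization, so it batches trivially. A single-slice CPU sketch (illustrative name; assumes tangent vectors are represented as skew-symmetric matrices, as the skew-symmetrization in `log!` suggests):

```julia
using LinearAlgebra

# project!(tangent) sketch: skew(p'X) = (p'X - X'p) / 2.
# The p'X product is the batched-GEMM part; the rest is elementwise.
function project_tangent_skew(p::AbstractMatrix, X::AbstractMatrix)
    A = p' * X
    return (A - A') / 2
end

θ = 0.7
p = [cos(θ) -sin(θ); sin(θ) cos(θ)]          # a rotation matrix, p'p == I
Y = project_tangent_skew(p, randn(2, 2))
@assert Y == -Y'                              # exactly skew-symmetric
@assert project_tangent_skew(p, p * Y) ≈ Y    # idempotent on tangent vectors
```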
## Stiefel (PRs #12, #11)

Default retraction: `PolarRetraction()` — already overridden. Default vector transport: `DifferentiatedRetractionVectorTransport(PolarRetraction())` — does not use `log!` or `parallel_transport_to!`. Only ℝ field covered; complex Stiefel has no overrides yet.

- `exp!` (ℝ) — block-diagonal 2k×2k `_matrix_exp_gpu` + batched GEMM
- `inner`, `norm`
- `retract_polar_fused!` — `gesvdj!` with CPU fallback for large matrices
- `project!` (point) — `_polar_project_gpu!` via batched SVD (PR #12)
- `project!` (tangent) — `X - p*sym(p'X)` via batched GEMM (PR #12)
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!` (PR #11)

Nice-to-have:

- `log!` — delegates to iterative shooting via `StiefelSubmersionMetric`; hard to GPU-ify
- `parallel_transport_to!`
- `inverse_retract_polar!` — needs `lyap()`; no GPU equivalent
- `inverse_retract_qr!` — scalar loop with `@inbounds`
- `rand!` — can initialize on CPU and transfer
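The polar pieces share one primitive: an SVD of a small tall matrix per slice. A single-slice CPU sketch of the polar retraction and tangent projection (illustrative names; the GPU versions push all slices through a single `gesvdj!` or batched GEMM call):

```julia
using LinearAlgebra

# Polar retraction on Stiefel(n, k): q = U * V' where p + X = U * Diagonal(S) * V'.
retract_polar(p, X) = (F = svd(p + X); F.U * F.Vt)

# project!(tangent): remove the symmetric part of p'X, i.e. X - p * sym(p'X).
project_tangent(p, X) = X - p * ((p' * X + X' * p) / 2)

p = Matrix{Float64}(I, 5, 2)         # a point on Stiefel(5, 2)
X = project_tangent(p, randn(5, 2))
@assert p' * X ≈ -(p' * X)'          # tangent condition: p'X is skew
q = retract_polar(p, X)
@assert q' * q ≈ I                   # q has orthonormal columns again
```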
## Grassmann (PRs #6, #7, #8, #11)

Default retraction: `ExponentialRetraction()` — overridden in PR #6. Default vector transport: `ParallelTransport()` — overridden in PR #8. CG/L-BFGS optimizers on Grassmann now have full GPU support. Only ℝ field covered.

- `inner`, `norm`
- `project!` (point) — `_polar_project_gpu!` via batched SVD (PR #7)
- `project!` (tangent) — `X - p*(p'X)` via batched GEMM (PR #7)
- `retract_polar_fused!` — delegates to `_polar_project_gpu!` (PR #7)
- `exp!` — SVD + polar orthogonalization, replaces CPU's `qr(z).Q` round-trip (PR #6)
- `log!` — batched inverse (`_batched_inv_gpu`) + `gesvdj!` + `atan.(S)` (PR #8)
- `inverse_retract_polar!` — `q * inv(p'q) - p` via batched LU inverse + in-place GEMM (PR #8)
- `parallel_transport_direction!` — SVD of direction + sin/cos rotation + projection (PR #8)
- `distance` — inverse retract → batched SVD → `norm(atan.(S))` (PR #8)
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!` (PR #11)

Nice-to-have:

- `rand!` — can initialize on CPU and transfer
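The `distance` pipeline composes the same pieces as `inverse_retract_polar!` and `log!`. A single-slice CPU sketch (illustrative name; `atan.(S)` recovers the principal angles from the singular values of the polar inverse retraction):

```julia
using LinearAlgebra

# distance(p, q) on Grassmann: inverse retract, SVD, norm(atan.(S)).
function grassmann_distance(p::AbstractMatrix, q::AbstractMatrix)
    X = q / (p' * q) - p            # inverse_retract_polar: q * inv(p'q) - p
    return norm(atan.(svdvals(X)))  # singular values are tan of principal angles
end

α = 0.3
p = [1.0 0; 0 1; 0 0; 0 0]                 # span{e1, e2} in ℝ⁴
q = [cos(α) 0; 0 1; sin(α) 0; 0 0]         # e1 rotated toward e3 by α
@assert grassmann_distance(p, q) ≈ α        # one principal angle of size α
@assert grassmann_distance(p, p) < 1e-12
```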
## Sphere

- `inner`, `norm`

Most other ops are already GPU-safe in Manifolds.jl (broadcasting + `dot`). PowerManifold of Sphere ≈ `Oblique` — low priority.
## Known limitations

- `gesvdj!` size limit: cuSOLVER's batched Jacobi SVD fails beyond a supported matrix size. Existing overrides have try/catch CPU fallback.
- Real field only: Stiefel and Grassmann overrides dispatch on `T <: Real`. Complex manifolds have no batched overrides yet (GeneralUnitaryMatrices handles both via `T <: Number`).
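The try/catch fallback is a simple wrapper pattern. A hedged CPU sketch (`gpu_batched_svd!` is a stand-in name for the `gesvdj!`-based path and is deliberately left undefined here, so this sketch always takes the fallback branch):

```julia
using LinearAlgebra

# Try the batched GPU kernel first; if the library rejects the problem
# (e.g. beyond gesvdj!'s supported size), redo the batch on the CPU.
function batched_svd_with_fallback(A::AbstractArray{T,3}) where {T}
    try
        return gpu_batched_svd!(copy(A))   # stand-in for the gesvdj! path
    catch
        n, k, N = size(A)
        U = similar(A); S = similar(A, k, N); Vt = similar(A, k, k, N)
        for i in 1:N                       # sequential CPU fallback
            F = svd(A[:, :, i])
            U[:, :, i] = F.U; S[:, i] = F.S; Vt[:, :, i] = F.Vt
        end
        return U, S, Vt
    end
end

A = randn(5, 3, 4)
U, S, Vt = batched_svd_with_fallback(A)
@assert U[:, :, 2] * Diagonal(S[:, 2]) * Vt[:, :, 2] ≈ A[:, :, 2]
```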
## Not planned (fundamentally hard)

- Specialized paths calling `eigen()` multiple times with `try/catch` (generic-n override covers this)
- Hyperbolic: core ops are GPU-safe; `minkowski_metric` scalar indexing in some paths
## Proposed contribution order

1. Grassmann `exp!` — default retraction, effectively broken for GPU users ✓ PR #6
2. Grassmann `project!` (point + tangent) + `retract_polar_fused!` — enables polar retraction as alternative ✓ PR #7
3. Grassmann `log!` + `inverse_retract_polar!` — needed by `distance` and `parallel_transport_to!` ✓ PR #8
4. Grassmann `parallel_transport_direction!` + `distance` — needed for CG/L-BFGS and convergence checks ✓ PR #8
5. Stiefel `project!` (point + tangent) ✓ PR #12
6. `retract_qr_fused!` for GeneralUnitaryMatrices — Cholesky-QR approach ✓ PR #11 (also covers Stiefel + Grassmann)
7. Specialized overrides for Rotations n=2,3 — generic-n works but specialized CPU variants are much faster (per @mateuszbaran's feedback in this thread)
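For item 7, the appeal of a specialized n=2 path is that the matrix exponential collapses to cos/sin, so the whole update is elementwise over the batch. A single-slice CPU sketch of the math (illustrative name; assumes tangent vectors are stored as skew-symmetric matrices, matching the generic `exp!` = `_matrix_exp_gpu` + GEMM path):

```julia
using LinearAlgebra

# exp! sketch for Rotations(2): X[:,:,i] = [0 -θ; θ 0], so
# exp(p, X) = p * [cos θ -sin θ; sin θ cos θ], with no matrix exp or SVD.
function exp_rotations2!(q::AbstractArray{T,3}, p::AbstractArray{T,3},
                         X::AbstractArray{T,3}) where {T}
    for i in axes(p, 3)              # on GPU: broadcasts over the batch axis
        θ = X[2, 1, i]
        c, s = cos(θ), sin(θ)
        q[:, :, i] = p[:, :, i] * [c -s; s c]
    end
    return q
end

N = 3
p = repeat(Matrix{Float64}(I, 2, 2), 1, 1, N)   # N copies of the identity
X = zeros(2, 2, N); X[2, 1, :] .= 0.5; X[1, 2, :] .= -0.5
q = exp_rotations2!(similar(p), p, X)
@assert q[:, :, 1] ≈ exp([0.0 -0.5; 0.5 0.0])   # matches generic matrix exp
```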
Each as a separate PR with JLArray + CUDA tests. I'd love to hear your suggestions on priorities, approaches, or anything I may have missed or gotten wrong — any feedback from the experts here would be greatly appreciated!
## Related

- `retract_qr_fused!` for GeneralUnitaryMatrices, Stiefel, Grassmann PR
- `project!` (point + tangent) PR