GPU override coverage: tracking and roadmap #5

@zazabap

Following Manifolds.jl#856, CUDA overrides live in this package. This issue tracks progress across manifolds to coordinate contributions.

All overrides target PowerManifold{ℝ, <:M, <:Tuple, ArrayPowerRepresentation} with CuArray{T,3}, replacing ManifoldsBase's sequential get_iterator loop with batched GPU ops. For many operations the single-matrix CPU code already works on CuMatrix (e.g. svd, /), but the PowerManifold loop launches N separate kernels — overrides batch these into single strided calls.
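The difference between the sequential loop and a batched call can be sketched on plain CPU arrays. The snippet below is only an illustration, assuming the embedded Frobenius inner product and an `(n, k, N)` array layout; the function names `inner_loop` and `inner_batched` are hypothetical and are not taken from this package.

```julia
using LinearAlgebra

# Sequential pattern: one small operation per slice.
# On a GPU array each iteration would launch its own kernel.
function inner_loop(X::AbstractArray{T,3}, Y::AbstractArray{T,3}) where {T<:Real}
    s = zero(T)
    for i in axes(X, 3)
        s += dot(view(X, :, :, i), view(Y, :, :, i))
    end
    return s
end

# Batched pattern: a single broadcast-and-reduce over the whole stacked array.
function inner_batched(X::AbstractArray{T,3}, Y::AbstractArray{T,3}) where {T<:Real}
    return sum(X .* Y)
end

X = randn(3, 3, 8)
Y = randn(3, 3, 8)
println(inner_loop(X, Y) ≈ inner_batched(X, Y))
```

Both functions return the same value up to rounding; the second form touches the array once, which is the property the batched overrides aim for.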

Shared helpers (helpers.jl)

  • _matrix_exp_gpu — batched matrix exponential (Taylor + scaling-and-squaring)
  • _matrix_log_gpu — batched matrix logarithm (inverse scaling-and-squaring + Denman-Beavers)
  • _batched_inv_gpu — batched matrix inverse (LU)
  • _batched_sqrtm_gpu — batched matrix square root (Denman-Beavers)
  • _cholesky_qr_gpu! — batched Cholesky-QR (A'A + potrfBatched! + trsm_batched!) (PR #11, "add GPU retract_qr_fused! via Cholesky-QR")
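For readers unfamiliar with the algorithms named in this list, here is a minimal single-matrix CPU sketch of three of them using only `LinearAlgebra`. These are textbook formulations, not the package's batched GPU code; the function names, the Taylor order, the scaling threshold, and the iteration count are all assumptions chosen for illustration.

```julia
using LinearAlgebra

# Cholesky-QR: A = Q R with R taken from the Cholesky factor of the Gram matrix A'A.
# Assumes A has full column rank.
function cholesky_qr_reference(A::AbstractMatrix{T}) where {T<:AbstractFloat}
    G = Symmetric(A' * A)      # Gram matrix (k x k)
    R = cholesky(G).U          # G = R'R, R upper triangular
    Q = A / R                  # triangular solve, Q = A * inv(R)
    return Q, Matrix(R)
end

# Denman-Beavers iteration: Y converges to sqrt(A), Z converges to inv(sqrt(A)).
function denman_beavers_sqrt(A::AbstractMatrix{T}; iters::Int = 20) where {T<:AbstractFloat}
    Y = Matrix{T}(A)
    Z = Matrix{T}(I, size(A)...)
    for _ in 1:iters
        Ynext = (Y + inv(Z)) / 2
        Z = (Z + inv(Y)) / 2
        Y = Ynext
    end
    return Y
end

# Matrix exponential: scale A down, sum a truncated Taylor series, then square back up.
function expm_taylor(A::AbstractMatrix{T}; order::Int = 12) where {T<:AbstractFloat}
    nrm = opnorm(A, 1)
    s = nrm > 0.5 ? ceil(Int, log2(nrm / 0.5)) : 0   # choose s so that norm(A / 2^s) <= 0.5
    B = A / 2^s
    E = Matrix{T}(I, size(A)...)
    term = copy(E)
    for k in 1:order
        term = term * B / k    # term now holds B^k / k!
        E += term
    end
    for _ in 1:s
        E = E * E              # undo the scaling: exp(A) = exp(B)^(2^s)
    end
    return E
end
```

One design consideration worth keeping in mind: forming A'A squares the condition number of A, so Cholesky-QR trades some robustness on ill-conditioned inputs for operations that batch well.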

GeneralUnitaryMatrices (PR #4)

Default retraction: ExponentialRetraction() — already overridden. Default vector transport: ProjectionTransport() — already overridden. Both default paths work on GPU.

  • exp! — _matrix_exp_gpu + gemm_strided_batched (generic-n; covers the n=2,3,4 specializations)
  • log! (ℝ and ℂ) — _matrix_log_gpu + skew-symmetrization
  • inner, norm
  • project! (point, AbsoluteDeterminantOneMatrixType) — gesvdj!. Note: Rotations (DeterminantOneMatrixType) needs additional det correction — not covered.
  • project! (tangent) — skew(p'X) via batched GEMM
  • retract_polar_fused! — gesvdj! with CPU fallback for large matrices
  • retract_qr_fused! — Cholesky-QR via _cholesky_qr_gpu!; fully batched, no size limit (PR #11, "add GPU retract_qr_fused! via Cholesky-QR")

Nice-to-have
  • parallel_transport_to! — building blocks exist but not wired for batched PowerManifold
  • rand! — scalar indexing in CPU code; can initialize on CPU and transfer
  • inverse_retract_polar! — Rotations version uses lyap() + try/catch; no GPU path
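As a reference for what the two project! overrides compute per slice, here is a CPU-only sketch over a stacked `(n, n, N)` array. It assumes real matrices and uses hypothetical function names; the actual overrides use gesvdj! and batched GEMM rather than this loop, and the determinant correction needed for Rotations is not included here either.

```julia
using LinearAlgebra

# Skew-symmetric part of a square matrix.
skew(A) = (A - A') / 2

# Point projection: replace each slice by the orthogonal factor U * V' of its SVD.
function project_point_reference(P::AbstractArray{T,3}) where {T<:Real}
    Q = similar(P)
    for i in axes(P, 3)
        F = svd(P[:, :, i])
        Q[:, :, i] = F.U * F.Vt
    end
    return Q
end

# Tangent projection: skew(p'X) for each slice.
function project_tangent_reference(P::AbstractArray{T,3}, X::AbstractArray{T,3}) where {T<:Real}
    Y = similar(X)
    for i in axes(P, 3)
        Y[:, :, i] = skew(P[:, :, i]' * X[:, :, i])
    end
    return Y
end
```

A loop like this can double as the ground truth that a batched override is compared against in tests.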

Stiefel (PRs #12, #11)

Default retraction: PolarRetraction() — already overridden. Default vector transport: DifferentiatedRetractionVectorTransport(PolarRetraction()) — does not use log! or parallel_transport_to!. Only the real field (ℝ) is covered; complex Stiefel has no overrides yet.

Nice-to-have
  • log! — delegates to iterative shooting via StiefelSubmersionMetric; hard to GPU-ify
  • parallel_transport_to!
  • inverse_retract_polar! — needs lyap(); no GPU equivalent
  • inverse_retract_qr! — scalar loop with @inbounds
  • rand! — can initialize on CPU and transfer

Grassmann (PRs #6, #7, #8, #11)

Default retraction: ExponentialRetraction() — overridden in PR #6. Default vector transport: ParallelTransport() — overridden in PR #8. CG/L-BFGS optimizers on Grassmann now have full GPU support. Only the real field (ℝ) is covered.

Nice-to-have
  • rand! — can initialize on CPU and transfer

Sphere

  • inner, norm
  • Most other ops are already GPU-safe in Manifolds.jl (broadcasting + dot). PowerManifold of Sphere ≈ Oblique — low priority.

Known limitations

  • gesvdj! size limit: cuSOLVER's batched Jacobi SVD fails beyond a supported matrix size. Existing overrides have try/catch CPU fallback.
  • Real field only: Stiefel and Grassmann overrides dispatch on T <: Real. Complex manifolds have no batched overrides yet (GeneralUnitaryMatrices handles both via T <: Number).
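The try/catch CPU fallback mentioned in the first item can be sketched generically as follows. The helper name `with_cpu_fallback` is hypothetical and not part of this package, and the sketch assumes the result has the same shape and element type as the input.

```julia
# Try the batched path first; if it throws, redo the work on the host
# and copy the result back into an array of the original type.
function with_cpu_fallback(batched_op, cpu_op, A::AbstractArray)
    try
        return batched_op(A)
    catch err
        err isa InterruptException && rethrow()
        host = Array(A)                            # copy to host memory
        return copyto!(similar(A), cpu_op(host))   # copy the result back
    end
end

# Usage with plain arrays; the failing closure stands in for a batched call
# that rejects the input.
A = [1.0 2.0; 3.0 4.0]
failing(x) = error("matrix too large for the batched path")
B = with_cpu_fallback(failing, x -> 2 .* x, A)
println(B == 2 .* A)
```

Catching every exception is a blunt choice; a real override would probably want to match only the specific error raised by the batched routine.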

Not planned (fundamentally hard)

  • SPD: nearly all ops need eigen() multiple times
  • Rotations small-n (n=4): scalar indexing + try/catch (generic-n override covers this)
  • Hyperbolic: core ops are GPU-safe; minkowski_metric scalar indexing in some paths

Proposed contribution order

1. Grassmann exp! — default retraction, effectively broken for GPU users ✓ PR #6
2. Grassmann project! (point + tangent) + retract_polar_fused! — enables polar retraction as alternative ✓ PR #7
3. Grassmann log! + inverse_retract_polar! — needed by distance and parallel_transport_to! ✓ PR #8
4. Grassmann parallel_transport_direction! + distance — needed for CG/L-BFGS and convergence checks ✓ PR #8
5. Stiefel project! (point + tangent) ✓ PR #12
6. retract_qr_fused! for GeneralUnitaryMatrices — Cholesky-QR approach ✓ PR #11 (also covers Stiefel + Grassmann)
7. Specialized overrides for Rotations n=2,3 — generic-n works but specialized CPU variants are much faster (per @mateuszbaran's feedback in this thread)

Each as a separate PR with JLArray + CUDA tests. I'd love to hear your suggestions on priorities, approaches, or anything I may have missed or gotten wrong — any feedback from the experts here would be greatly appreciated!

Related

  • Manifolds.jl#856 — original PR and discussion
  • #4 — merged GeneralUnitaryMatrices PR
  • #6 — merged Grassmann exp! PR
  • #7 — merged Grassmann project! + retract_polar_fused! PR
  • #8 — merged Grassmann log!, inverse_retract_polar!, parallel_transport_direction!, distance PR
  • #11 — Cholesky-QR retract_qr_fused! for GeneralUnitaryMatrices, Stiefel, Grassmann PR
  • #12 — merged Stiefel project! (point + tangent) PR
  • #2 — closed Euclidean PR
  • 15da7e3 — Stiefel exp! sketch by @mateuszbaran
