Following Manifolds.jl#856, CUDA overrides live in this package. This issue tracks progress across manifolds to coordinate contributions.
All overrides target `PowerManifold{ℝ, <:M, <:Tuple, ArrayPowerRepresentation}` with `CuArray{T,3}`, replacing ManifoldsBase's sequential `get_iterator` loop with batched GPU ops. For many operations the single-matrix CPU code already works on `CuMatrix` (e.g. `svd`, `/`), but the PowerManifold loop launches N separate kernels — overrides batch these into single strided calls.

## Shared helpers (`helpers.jl`)

- `_matrix_exp_gpu` — batched matrix exponential (Taylor + scaling-and-squaring)
- `_matrix_log_gpu` — batched matrix logarithm (inverse scaling-and-squaring + Denman-Beavers)
- `_batched_inv_gpu` — batched matrix inverse (LU)
- `_batched_sqrtm_gpu` — batched matrix square root (Denman-Beavers)
- `_cholesky_qr_gpu!` — batched Cholesky-QR (`A'A` + `potrfBatched!` + `trsm_batched!`) (PR #11)
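The Cholesky-QR trick behind `_cholesky_qr_gpu!` needs only a Gram matrix, one Cholesky factorization, and a triangular solve, which is why it batches so well. A minimal single-slice CPU sketch (the function name is illustrative, not the package's API):

```julia
using LinearAlgebra

# Cholesky-QR: for a tall full-rank A, A'A = R'R (Cholesky), and Q = A / R
# has orthonormal columns with A = Q * R. On GPU the same three steps map to
# batched GEMM (A'A), potrfBatched! (Cholesky), and trsm_batched! (solve).
function cholesky_qr(A::AbstractMatrix)
    R = cholesky(Symmetric(A' * A)).U   # upper triangular, R' * R == A' * A
    Q = A / R                           # right triangular solve
    return Q, R
end

A = randn(100, 4)
Q, R = cholesky_qr(A)
@assert Q' * Q ≈ I          # orthonormal columns
@assert Q * R ≈ A           # valid QR factorization
```

Compared with the `gesvdj!`-based polar path this has no batch size limit, at the cost of squaring the condition number through `A'A`.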
## GeneralUnitaryMatrices (PR #4)

Default retraction: `ExponentialRetraction()` — already overridden. Default vector transport: `ProjectionTransport()` — already overridden. Both default paths work on GPU.

- `exp!` — `_matrix_exp_gpu` + `gemm_strided_batched` (generic-n; covers n=2,3,4 specializations)
- `log!` (ℝ and ℂ) — `_matrix_log_gpu` + skew-symmetrization
- `inner`, `norm`
- `project!` (point, `AbsoluteDeterminantOneMatrixType`) — `gesvdj!`. Note: `Rotations` (`DeterminantOneMatrixType`) needs additional `det` correction — not covered.
- `project!` (tangent) — `skew(p'X)` via batched GEMM
- `retract_polar_fused!` — `gesvdj!` with CPU fallback for large matrices
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!`; fully batched, no size limit (PR #11)

Nice-to-have:

- `parallel_transport_to!` — building blocks exist but not wired for batched PowerManifold
- `rand!` — scalar indexing in CPU code; can initialize on CPU and transfer
- `inverse_retract_polar!` — `Rotations` version uses `lyap()` + try/catch; no GPU path
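The `skew(p'X)` tangent projection is one GEMM plus an elementwise antisymmetrization, so it batches trivially. A single-slice CPU sketch (illustrative name; assumes tangent vectors are represented as skew-symmetric matrices, as the skew-symmetrization in `log!` suggests):

```julia
using LinearAlgebra

# project!(tangent) sketch: skew(p'X) = (p'X - X'p) / 2.
# The p'X product is the batched-GEMM part; the rest is elementwise.
function project_tangent_skew(p::AbstractMatrix, X::AbstractMatrix)
    A = p' * X
    return (A - A') / 2
end

θ = 0.7
p = [cos(θ) -sin(θ); sin(θ) cos(θ)]          # a rotation matrix, p'p == I
Y = project_tangent_skew(p, randn(2, 2))
@assert Y == -Y'                              # exactly skew-symmetric
@assert project_tangent_skew(p, p * Y) ≈ Y    # idempotent on tangent vectors
```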
## Stiefel (PRs #12, #11)

Default retraction: `PolarRetraction()` — already overridden. Default vector transport: `DifferentiatedRetractionVectorTransport(PolarRetraction())` — does not use `log!` or `parallel_transport_to!`. Only ℝ field covered; complex Stiefel has no overrides yet.

- `exp!` (ℝ) — block-diagonal 2k×2k `_matrix_exp_gpu` + batched GEMM
- `inner`, `norm`
- `retract_polar_fused!` — `gesvdj!` with CPU fallback for large matrices
- `project!` (point) — `_polar_project_gpu!` via batched SVD (PR #12)
- `project!` (tangent) — `X - p*sym(p'X)` via batched GEMM (PR #12)
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!` (PR #11)

Nice-to-have:

- `log!` — delegates to iterative shooting via `StiefelSubmersionMetric`; hard to GPU-ify
- `parallel_transport_to!`
- `inverse_retract_polar!` — needs `lyap()`; no GPU equivalent
- `inverse_retract_qr!` — scalar loop with `@inbounds`
- `rand!` — can initialize on CPU and transfer
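The polar pieces share one primitive: an SVD of a small tall matrix per slice. A single-slice CPU sketch of the polar retraction and tangent projection (illustrative names; the GPU versions push all slices through a single `gesvdj!` or batched GEMM call):

```julia
using LinearAlgebra

# Polar retraction on Stiefel(n, k): q = U * V' where p + X = U * Diagonal(S) * V'.
retract_polar(p, X) = (F = svd(p + X); F.U * F.Vt)

# project!(tangent): remove the symmetric part of p'X, i.e. X - p * sym(p'X).
project_tangent(p, X) = X - p * ((p' * X + X' * p) / 2)

p = Matrix{Float64}(I, 5, 2)         # a point on Stiefel(5, 2)
X = project_tangent(p, randn(5, 2))
@assert p' * X ≈ -(p' * X)'          # tangent condition: p'X is skew
q = retract_polar(p, X)
@assert q' * q ≈ I                   # q has orthonormal columns again
```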
## Grassmann (PRs #6, #7, #8, #11)

Default retraction: `ExponentialRetraction()` — overridden in PR #6. Default vector transport: `ParallelTransport()` — overridden in PR #8. CG/L-BFGS optimizers on Grassmann now have full GPU support. Only ℝ field covered.

- `inner`, `norm`
- `project!` (point) — `_polar_project_gpu!` via batched SVD (PR #7)
- `project!` (tangent) — `X - p*(p'X)` via batched GEMM (PR #7)
- `retract_polar_fused!` — delegates to `_polar_project_gpu!` (PR #7)
- `exp!` — SVD + polar orthogonalization, replaces CPU's `qr(z).Q` round-trip (PR #6)
- `log!` — batched inverse (`_batched_inv_gpu`) + `gesvdj!` + `atan.(S)` (PR #8)
- `inverse_retract_polar!` — `q * inv(p'q) - p` via batched LU inverse + in-place GEMM (PR #8)
- `parallel_transport_direction!` — SVD of direction + sin/cos rotation + projection (PR #8)
- `distance` — inverse retract → batched SVD → `norm(atan.(S))` (PR #8)
- `retract_qr_fused!` — Cholesky-QR via `_cholesky_qr_gpu!` (PR #11)

Nice-to-have:

- `rand!` — can initialize on CPU and transfer
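The `distance` pipeline composes the same pieces as `inverse_retract_polar!` and `log!`. A single-slice CPU sketch (illustrative name; `atan.(S)` recovers the principal angles from the singular values of the polar inverse retraction):

```julia
using LinearAlgebra

# distance(p, q) on Grassmann: inverse retract, SVD, norm(atan.(S)).
function grassmann_distance(p::AbstractMatrix, q::AbstractMatrix)
    X = q / (p' * q) - p            # inverse_retract_polar: q * inv(p'q) - p
    return norm(atan.(svdvals(X)))  # singular values are tan of principal angles
end

α = 0.3
p = [1.0 0; 0 1; 0 0; 0 0]                 # span{e1, e2} in ℝ⁴
q = [cos(α) 0; 0 1; sin(α) 0; 0 0]         # e1 rotated toward e3 by α
@assert grassmann_distance(p, q) ≈ α        # one principal angle of size α
@assert grassmann_distance(p, p) < 1e-12
```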
## Sphere

- `inner`, `norm`

Most other ops are already GPU-safe in Manifolds.jl (broadcasting + `dot`). PowerManifold of Sphere ≈ `Oblique` — low priority.
## Known limitations

- `gesvdj!` size limit: cuSOLVER's batched Jacobi SVD fails beyond a supported matrix size. Existing overrides have try/catch CPU fallback.
- Real field only: Stiefel and Grassmann overrides dispatch on `T <: Real`. Complex manifolds have no batched overrides yet (GeneralUnitaryMatrices handles both via `T <: Number`).
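The try/catch fallback is a simple wrapper pattern. A hedged CPU sketch (`gpu_batched_svd!` is a stand-in name for the `gesvdj!`-based path and is deliberately left undefined here, so this sketch always takes the fallback branch):

```julia
using LinearAlgebra

# Try the batched GPU kernel first; if the library rejects the problem
# (e.g. beyond gesvdj!'s supported size), redo the batch on the CPU.
function batched_svd_with_fallback(A::AbstractArray{T,3}) where {T}
    try
        return gpu_batched_svd!(copy(A))   # stand-in for the gesvdj! path
    catch
        n, k, N = size(A)
        U = similar(A); S = similar(A, k, N); Vt = similar(A, k, k, N)
        for i in 1:N                       # sequential CPU fallback
            F = svd(A[:, :, i])
            U[:, :, i] = F.U; S[:, i] = F.S; Vt[:, :, i] = F.Vt
        end
        return U, S, Vt
    end
end

A = randn(5, 3, 4)
U, S, Vt = batched_svd_with_fallback(A)
@assert U[:, :, 2] * Diagonal(S[:, 2]) * Vt[:, :, 2] ≈ A[:, :, 2]
```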
## Not planned (fundamentally hard)

- Specialized paths calling `eigen()` multiple times with `try/catch` (generic-n override covers this)
- Hyperbolic: core ops are GPU-safe; `minkowski_metric` scalar indexing in some paths
## Proposed contribution order

1. Grassmann `exp!` — default retraction, effectively broken for GPU users ✓ PR #6
2. Grassmann `project!` (point + tangent) + `retract_polar_fused!` — enables polar retraction as alternative ✓ PR #7
3. Grassmann `log!` + `inverse_retract_polar!` — needed by `distance` and `parallel_transport_to!` ✓ PR #8
4. Grassmann `parallel_transport_direction!` + `distance` — needed for CG/L-BFGS and convergence checks ✓ PR #8
5. Stiefel `project!` (point + tangent) ✓ PR #12
6. `retract_qr_fused!` for GeneralUnitaryMatrices — Cholesky-QR approach ✓ PR #11 (also covers Stiefel + Grassmann)
7. Specialized overrides for Rotations n=2,3 — generic-n works but specialized CPU variants are much faster (per @mateuszbaran's feedback in this thread)
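For item 7, the appeal of a specialized n=2 path is that the matrix exponential collapses to cos/sin, so the whole update is elementwise over the batch. A single-slice CPU sketch of the math (illustrative name; assumes tangent vectors are stored as skew-symmetric matrices, matching the generic `exp!` = `_matrix_exp_gpu` + GEMM path):

```julia
using LinearAlgebra

# exp! sketch for Rotations(2): X[:,:,i] = [0 -θ; θ 0], so
# exp(p, X) = p * [cos θ -sin θ; sin θ cos θ], with no matrix exp or SVD.
function exp_rotations2!(q::AbstractArray{T,3}, p::AbstractArray{T,3},
                         X::AbstractArray{T,3}) where {T}
    for i in axes(p, 3)              # on GPU: broadcasts over the batch axis
        θ = X[2, 1, i]
        c, s = cos(θ), sin(θ)
        q[:, :, i] = p[:, :, i] * [c -s; s c]
    end
    return q
end

N = 3
p = repeat(Matrix{Float64}(I, 2, 2), 1, 1, N)   # N copies of the identity
X = zeros(2, 2, N); X[2, 1, :] .= 0.5; X[1, 2, :] .= -0.5
q = exp_rotations2!(similar(p), p, X)
@assert q[:, :, 1] ≈ exp([0.0 -0.5; 0.5 0.0])   # matches generic matrix exp
```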
Each as a separate PR with JLArray + CUDA tests. I'd love to hear your suggestions on priorities, approaches, or anything I may have missed or gotten wrong — any feedback from the experts here would be greatly appreciated!
## Related

- `retract_qr_fused!` for GeneralUnitaryMatrices, Stiefel, Grassmann PR
- `project!` (point + tangent) PR