
Add Hopper FP8 grouped blockwise GEMM (sparse-groups) CuTeDSL example #3195

Draft

Johnsonms wants to merge 1 commit into NVIDIA:main from Johnsonms:cutedsl/hopper-fp8-grouped-blockwise-sparse-groups

Conversation

@Johnsonms Contributor

Summary

Adds a CuTeDSL port of CUTLASS Example 68's sparse-groups variant (68_..._grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu) at examples/python/CuTeDSL/hopper/dense_gemm_fp8_grouped_blockwise_sparse_groups.py.

It performs the same per-row SFA (ScaleGranularityM = 1) + blockwise SFB (ScaleGranularityN = 128) FP8 grouped GEMM as the dense Example 68 port; the host driver additionally supports problem distributions in which many groups have zero problem sizes.

Sparse-groups behaviour:

  • --problem_sizes accepts groups with any zero dim. Validation only requires multiples of 128 for non-empty groups.
  • Empty groups share a single stub GMEM allocation, so the metadata pointer table is always valid (the kernel never reads from the stubs).
  • The host zeros all dims of every empty group before handing the problem-size table to the kernel. The persistent group tile scheduler computes a group's tile count from M and N alone, so a group with M, N > 0 but K = 0 would otherwise still consume the linear tile slots of its M×N tile grid and offset every later group's tiles. Forcing all dims to zero makes the scheduler reserve zero linear tiles for empty groups uniformly. The original sizes are kept for the reference and bandwidth paths. (A sketch of this host-side flow follows the list.)
  • Reported GBPS uses the original (un-padded) sizes; empty groups contribute nothing to throughput.
  • --sparse_fraction (with --seed) randomly empties a fraction of groups; the all-empty case is short-circuited before kernel launch.
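
A minimal Python sketch of the host-side flow described above (illustrative only: the function name and structure are invented here, not taken from the example's actual driver code):

```python
import random

def prepare_problem_sizes(problem_sizes, sparse_fraction=0.0, seed=0):
    """problem_sizes: list of (M, N, K) tuples; any zero dim marks an empty group."""
    rng = random.Random(seed)

    # --sparse_fraction (with --seed) randomly empties a fraction of groups.
    original = []
    for m, n, k in problem_sizes:
        if sparse_fraction > 0.0 and rng.random() < sparse_fraction:
            m, n, k = 0, 0, 0
        original.append((m, n, k))

    scheduler_sizes = []
    for m, n, k in original:
        if m == 0 or n == 0 or k == 0:
            # Empty group: force *all* dims to zero so the persistent group tile
            # scheduler reserves zero linear tiles for it. A group with M, N > 0
            # but K = 0 would otherwise still claim tile slots and offset every
            # later group's tiles.
            scheduler_sizes.append((0, 0, 0))
        else:
            # Validation only applies to non-empty groups.
            assert m % 128 == 0 and n % 128 == 0 and k % 128 == 0
            scheduler_sizes.append((m, n, k))

    # scheduler_sizes feeds the kernel's problem-size table; original feeds the
    # reference check and the GBPS calculation, so empty groups contribute
    # nothing to throughput.
    return scheduler_sizes, original

# Same mixed configuration as the correctness table below.
sched, orig = prepare_problem_sizes(
    [(256, 256, 256), (0, 256, 256), (128, 256, 256), (256, 0, 256), (256, 256, 0)])
if all(sizes == (0, 0, 0) for sizes in sched):
    print("all groups empty: skip kernel launch")  # the all-empty short-circuit
```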

Schedule note: the C++ source uses KernelPtrArrayTmaWarpSpecializedPingpongFP8Blockwise. This Python port keeps the cooperative schedule from the dense Example 68 variant (atom_layout_mnk = (2, 1, 1)). The sparse-groups host-side semantics are independent of pingpong vs cooperative scheduling, and the per-WG tensormap workspace required for true pingpong with grouped GEMM is left as a follow-up.

Test plan

  • Bit-exact match for dense (no empty groups) across all four cluster shapes
  • Bit-exact match with explicit empty groups (M=0, N=0, K=0 in different positions, mixed)
  • Bit-exact match with --sparse_fraction 0.5 / 0.9 / 1.0
  • compute-sanitizer --tool memcheck clean (including with --use_cold_l2)
  • All-empty case short-circuits before kernel launch
  • Per-iteration JitArguments._keepalive ensures --use_cold_l2 pre-generated workspaces survive the full benchmark (see the stand-alone sketch after this list)
  • Performance comparison against the C++ Example 68 sparse-groups binary on H100
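
The keep-alive point above can be illustrated with a small stand-alone sketch. JitArguments and _keepalive are the names used in the example; the class and "workspace" below are stand-ins invented purely for illustration:

```python
class FakeJitArguments:                  # placeholder for the real JitArguments
    def __init__(self):
        self._keepalive = None           # anything stored here outlives the launch

per_iteration_args = []
for _ in range(10):                      # e.g. 10 pre-generated cold-L2 workspaces
    workspace = bytearray(1 << 20)       # placeholder for a device-side flush buffer
    args = FakeJitArguments()
    args._keepalive = workspace          # the Python reference keeps the buffer alive
    per_iteration_args.append(args)
# As long as per_iteration_args is alive, no workspace is freed mid-benchmark.
```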

Correctness

| Config | Result |
| --- | --- |
| Default 4 groups, dense, all four cluster shapes | PASS × 4 |
| --num_groups 5 --problem_sizes "(256,256,256),(0,256,256),(128,256,256),(256,0,256),(256,256,0)" | PASS (2 active groups) |
| --num_groups 8 --sparse_fraction 0.5 --seed 11 | PASS (4 active groups) |
| --num_groups 8 --sparse_fraction 0.9 --seed 3 | PASS (1 active group) |
| --num_groups 4 --sparse_fraction 1.0 | PASS (no work, short-circuited) |

compute-sanitizer --tool memcheck, sparse + --use_cold_l2: 0 errors.

Performance (H100 80GB HBM3, FP8 E4M3FN)

All non-empty groups M=N=K=2048, --sparse_fraction 0.5 --seed 11 --cluster_shape_mn 1,1, 200 iterations + 10 warmup. The C++ binary's --m=N --groups=K mode randomizes per-group sizes, so its TFLOPS reflect a different work distribution than CuTeDSL's "exactly half-empty" run; the C++ numbers are not directly comparable but are included as a sanity baseline:

| Groups (half empty) | C++ runtime (random sizes) | C++ TFLOPS | CuTeDSL runtime | CuTeDSL TFLOPS | CuTeDSL GBPS |
| --- | --- | --- | --- | --- | --- |
| 16 (8 active 2048³) | 0.829 ms | 332 | 0.167 ms | 822 | 609 |
| 64 (32 active 2048³) | 3.275 ms | 336 | 0.623 ms | 883 | 654 |

CuTeDSL numbers reflect actual non-empty work only and are the appropriate signal for "throughput on the active groups". For an apples-to-apples C++ comparison, the maintainers should generate a benchmark file (--benchmark=path.txt) listing exactly the same half-empty problem distribution.
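
As a back-of-the-envelope sanity check (not the example's own timing code), the reported TFLOPS are consistent with counting FLOPs over the active groups only:

```python
# TFLOPS computed from non-empty groups only: sum of 2*M*N*K divided by runtime.
def tflops(active_sizes, runtime_ms):
    flops = sum(2 * m * n * k for m, n, k in active_sizes)
    return flops / (runtime_ms * 1e-3) / 1e12

print(tflops([(2048, 2048, 2048)] * 8, 0.167))   # ~823, matching the 822 reported for 16 groups
print(tflops([(2048, 2048, 2048)] * 32, 0.623))  # ~882, matching the 883 reported for 64 groups
```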
