
Add Hopper FP8 grouped blockwise GEMM (sparse-groups) CuTeDSL example #3195

Draft

Johnsonms wants to merge 1 commit into NVIDIA:main from Johnsonms:cutedsl/hopper-fp8-grouped-blockwise-sparse-groups

Conversation

@Johnsonms Contributor

Summary

Adds a CuTeDSL port of CUTLASS Example 68's sparse-groups variant (68_..._grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu) at examples/python/CuTeDSL/hopper/dense_gemm_fp8_grouped_blockwise_sparse_groups.py.

It performs the same per-row SFA (ScaleGranularityM = 1) + blockwise SFB (ScaleGranularityN = 128) FP8 grouped GEMM as the dense Example 68 port; the host driver additionally supports problem distributions in which many groups have zero problem sizes.

Sparse-groups behaviour:

  • --problem_sizes accepts groups with any zero dim. Validation only requires multiples of 128 for non-empty groups.
  • Empty groups share a single stub GMEM allocation, so the metadata pointer table is always valid (the kernel never reads from the stubs).
  • The host zeros all dims of every empty group before handing the problem-size table to the kernel. The persistent group tile scheduler computes a group's tile count from M and N alone, so a group with M, N > 0 but K = 0 would otherwise still consume the linear tile slots of its M×N tile grid and offset every later group's tiles. Forcing all dims to zero makes the scheduler reserve zero linear tiles for empty groups uniformly. The original sizes are kept for the reference and bandwidth paths. (A sketch of this host-side flow follows the list.)
  • Reported GBPS uses the original (un-padded) sizes; empty groups contribute nothing to throughput.
  • --sparse_fraction (with --seed) randomly empties a fraction of groups; the all-empty case is short-circuited before kernel launch.
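
A minimal Python sketch of the host-side flow described above (illustrative only: the function name and structure are invented here, not taken from the example's actual driver code):

```python
import random

def prepare_problem_sizes(problem_sizes, sparse_fraction=0.0, seed=0):
    """problem_sizes: list of (M, N, K) tuples; any zero dim marks an empty group."""
    rng = random.Random(seed)

    # --sparse_fraction (with --seed) randomly empties a fraction of groups.
    original = []
    for m, n, k in problem_sizes:
        if sparse_fraction > 0.0 and rng.random() < sparse_fraction:
            m, n, k = 0, 0, 0
        original.append((m, n, k))

    scheduler_sizes = []
    for m, n, k in original:
        if m == 0 or n == 0 or k == 0:
            # Empty group: force *all* dims to zero so the persistent group tile
            # scheduler reserves zero linear tiles for it. A group with M, N > 0
            # but K = 0 would otherwise still claim tile slots and offset every
            # later group's tiles.
            scheduler_sizes.append((0, 0, 0))
        else:
            # Validation only applies to non-empty groups.
            assert m % 128 == 0 and n % 128 == 0 and k % 128 == 0
            scheduler_sizes.append((m, n, k))

    # scheduler_sizes feeds the kernel's problem-size table; original feeds the
    # reference check and the GBPS calculation, so empty groups contribute
    # nothing to throughput.
    return scheduler_sizes, original

# Same mixed configuration as the correctness table below.
sched, orig = prepare_problem_sizes(
    [(256, 256, 256), (0, 256, 256), (128, 256, 256), (256, 0, 256), (256, 256, 0)])
if all(sizes == (0, 0, 0) for sizes in sched):
    print("all groups empty: skip kernel launch")  # the all-empty short-circuit
```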

Schedule note: the C++ source uses KernelPtrArrayTmaWarpSpecializedPingpongFP8Blockwise. This Python port keeps the cooperative schedule from the dense Example 68 variant (atom_layout_mnk = (2, 1, 1)). The sparse-groups host-side semantics are independent of pingpong vs cooperative scheduling, and the per-WG tensormap workspace required for true pingpong with grouped GEMM is left as a follow-up.

Test plan

  • Bit-exact match for dense (no empty groups) across all four cluster shapes
  • Bit-exact match with explicit empty groups (M=0, N=0, K=0 in different positions, mixed)
  • Bit-exact match with --sparse_fraction 0.5 / 0.9 / 1.0
  • compute-sanitizer --tool memcheck clean (including with --use_cold_l2)
  • All-empty case short-circuits before kernel launch
  • Per-iteration JitArguments._keepalive ensures --use_cold_l2 pre-generated workspaces survive the full benchmark (see the stand-alone sketch after this list)
  • Performance comparison against the C++ Example 68 sparse-groups binary on H100
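
The keep-alive point above can be illustrated with a small stand-alone sketch. JitArguments and _keepalive are the names used in the example; the class and "workspace" below are stand-ins invented purely for illustration:

```python
class FakeJitArguments:                  # placeholder for the real JitArguments
    def __init__(self):
        self._keepalive = None           # anything stored here outlives the launch

per_iteration_args = []
for _ in range(10):                      # e.g. 10 pre-generated cold-L2 workspaces
    workspace = bytearray(1 << 20)       # placeholder for a device-side flush buffer
    args = FakeJitArguments()
    args._keepalive = workspace          # the Python reference keeps the buffer alive
    per_iteration_args.append(args)
# As long as per_iteration_args is alive, no workspace is freed mid-benchmark.
```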

Correctness

| Config | Result |
| --- | --- |
| Default 4 groups, dense, all four cluster shapes | PASS × 4 |
| --num_groups 5 --problem_sizes "(256,256,256),(0,256,256),(128,256,256),(256,0,256),(256,256,0)" | PASS (2 active groups) |
| --num_groups 8 --sparse_fraction 0.5 --seed 11 | PASS (4 active groups) |
| --num_groups 8 --sparse_fraction 0.9 --seed 3 | PASS (1 active group) |
| --num_groups 4 --sparse_fraction 1.0 | PASS (no work, short-circuited) |

compute-sanitizer --tool memcheck, sparse + --use_cold_l2: 0 errors.

Performance (H100 80GB HBM3, FP8 E4M3FN)

All non-empty groups M=N=K=2048, --sparse_fraction 0.5 --seed 11 --cluster_shape_mn 1,1, 200 iterations + 10 warmup. The C++ binary's --m=N --groups=K mode randomizes per-group sizes, so its TFLOPS reflect a different work distribution than CuTeDSL's "exactly half-empty" run; the C++ numbers are not directly comparable but are included as a sanity baseline:

| Groups (half empty) | C++ runtime (random sizes) | C++ TFLOPS | CuTeDSL runtime | CuTeDSL TFLOPS | CuTeDSL GBPS |
| --- | --- | --- | --- | --- | --- |
| 16 (8 active 2048³) | 0.829 ms | 332 | 0.167 ms | 822 | 609 |
| 64 (32 active 2048³) | 3.275 ms | 336 | 0.623 ms | 883 | 654 |

CuTeDSL numbers reflect actual non-empty work only and are the appropriate signal for "throughput on the active groups". For an apples-to-apples C++ comparison, the maintainers should generate a benchmark file (--benchmark=path.txt) listing exactly the same half-empty problem distribution.
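
As a back-of-the-envelope sanity check (not the example's own timing code), the reported TFLOPS are consistent with counting FLOPs over the active groups only:

```python
# TFLOPS computed from non-empty groups only: sum of 2*M*N*K divided by runtime.
def tflops(active_sizes, runtime_ms):
    flops = sum(2 * m * n * k for m, n, k in active_sizes)
    return flops / (runtime_ms * 1e-3) / 1e12

print(tflops([(2048, 2048, 2048)] * 8, 0.167))   # ~823, matching the 822 reported for 16 groups
print(tflops([(2048, 2048, 2048)] * 32, 0.623))  # ~882, matching the 883 reported for 64 groups
```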
