Add Hopper FP8 grouped blockwise GEMM (sparse-groups) CuTeDSL example #3195
Draft
Johnsonms wants to merge 1 commit into NVIDIA:main from
Conversation
CuTeDSL port of CUTLASS Example 68's sparse-groups variant
(68_..._grouped_gemm_with_blockwise_scaling_with_sparse_groups). Same
per-row SFA (ScaleGranularityM = 1) + blockwise SFB (ScaleGranularityN
= 128) FP8 grouped GEMM as the dense Example 68 port; the host driver
adds support for problem distributions where many groups have zero
problem sizes.
Sparse-groups behaviour:
- --problem_sizes accepts groups with any zero dim. Validation only
requires multiples of 128 for non-empty groups.
- Empty groups share a single stub GMEM allocation, so the metadata
pointer table is always valid (the kernel never reads from the
stubs).
- The host zeros all dims of every empty group before handing the
problem-size table to the kernel. The persistent group tile
scheduler computes a group's tile count from M*N alone, so a group
with M, N > 0 but K = 0 would otherwise consume M*N linear tile
slots and offset every later group's tiles. Forcing all dims to
zero makes the scheduler reserve zero linear tiles for empty
groups uniformly. The original sizes are kept for the reference
and bandwidth paths.
- Reported GBPS uses the original (un-padded) sizes; empty groups
contribute nothing to throughput.
- --sparse_fraction (with --seed) randomly empties a fraction of
groups; the all-empty case is short-circuited before kernel launch.
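The zeroing rule above can be sketched in plain Python (the helper name here is illustrative, not the example's actual API):

```python
def zero_empty_groups(problem_sizes):
    """Split the problem-size table into kernel-visible and original views.

    A group is empty when any of M, N, K is zero. The persistent group
    tile scheduler computes a group's tile count from M*N alone, so a
    group with M, N > 0 but K == 0 must have *all* dims forced to zero;
    otherwise it would still claim M*N linear tile slots and offset every
    later group's tiles. The original sizes are kept for the reference
    and bandwidth paths.
    """
    original = [tuple(p) for p in problem_sizes]
    kernel = [(0, 0, 0) if 0 in p else p for p in original]
    return kernel, original
```

For example, a group `(256, 0, 256)` is handed to the kernel as `(0, 0, 0)` while the reference path still sees `(256, 0, 256)`.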
Schedule note: the C++ source uses
KernelPtrArrayTmaWarpSpecializedPingpongFP8Blockwise. This Python port
keeps the cooperative schedule from the dense Example 68 variant
(atom_layout_mnk = (2, 1, 1)). The sparse-groups host-side semantics
are independent of pingpong vs cooperative scheduling, and the per-WG
tensormap workspace required for true pingpong with grouped GEMM is
left as a follow-up.
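For reference, the host-side validation and `--sparse_fraction` semantics described above can be sketched as follows (function names and the exact rounding of the emptied fraction are assumptions for illustration; see the example file for the real driver logic):

```python
import random

def apply_sparse_fraction(problem_sizes, fraction, seed):
    # Deterministically empty a fraction of groups (illustrative version
    # of --sparse_fraction/--seed; the real driver may round differently).
    rng = random.Random(seed)
    n = len(problem_sizes)
    empty_idx = set(rng.sample(range(n), int(round(fraction * n))))
    sizes = [(0, 0, 0) if i in empty_idx else tuple(p)
             for i, p in enumerate(problem_sizes)]
    # If every group ended up empty, the host short-circuits before launch.
    all_empty = all(s == (0, 0, 0) for s in sizes)
    return sizes, all_empty

def validate(problem_sizes):
    # Only non-empty groups are constrained: every dim must be a multiple
    # of 128. Groups with any zero dim are treated as empty and skipped.
    for m, n, k in problem_sizes:
        if 0 in (m, n, k):
            continue
        if any(d % 128 for d in (m, n, k)):
            raise ValueError(
                f"non-empty group {(m, n, k)}: dims must be multiples of 128")
```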
Summary
Adds a CuTeDSL port of CUTLASS Example 68's sparse-groups variant (68_..._grouped_gemm_with_blockwise_scaling_with_sparse_groups.cu) at examples/python/CuTeDSL/hopper/dense_gemm_fp8_grouped_blockwise_sparse_groups.py. The sparse-groups behaviour and schedule note are described above.
Test plan
- --sparse_fraction 0.5 / 0.9 / 1.0
- compute-sanitizer --tool memcheck clean (including with --use_cold_l2)
- JitArguments._keepalive ensures --use_cold_l2 pre-generated workspaces survive the full benchmark
Correctness
- --num_groups 5 --problem_sizes "(256,256,256),(0,256,256),(128,256,256),(256,0,256),(256,256,0)"
- --num_groups 8 --sparse_fraction 0.5 --seed 11
- --num_groups 8 --sparse_fraction 0.9 --seed 3
- --num_groups 4 --sparse_fraction 1.0
- compute-sanitizer --tool memcheck, sparse + --use_cold_l2: 0 errors.
Performance (H100 80GB HBM3, FP8 E4M3FN)
All non-empty groups M=N=K=2048, --sparse_fraction 0.5 --seed 11 --cluster_shape_mn 1,1, 200 iterations + 10 warmup. The C++ binary's --m=N --groups=K mode randomizes per-group sizes, so its GFLOPS reflect a different work distribution than CuTeDSL's "exactly half-empty"; they are not directly comparable but are included as a sanity baseline. CuTeDSL numbers reflect actual non-empty work only and are the appropriate signal for throughput on the active groups. For an apples-to-apples C++ comparison the maintainers should generate a benchmark file (--benchmark=path.txt) listing exactly the same half-empty problem distribution.
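One way to express the "active groups only" accounting is the sketch below (the example's actual timing/reporting code may differ; the function name is hypothetical):

```python
def active_gflops(original_sizes, elapsed_s):
    # 2*M*N*K flops per GEMM; any zero dim makes the product zero, so
    # empty groups naturally contribute nothing to reported throughput.
    flops = sum(2 * m * n * k for m, n, k in original_sizes)
    return flops / elapsed_s / 1e9
```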