
[cuda] fix half-size allocation of discretized gradient buffer#7254

Open
BelixRogner wants to merge 2 commits into lightgbm-org:master from BelixRogner:fix-cuda-discretizer-buffer-size

Conversation

@BelixRogner

Summary

`CUDAGradientDiscretizer::Init` resizes `discretized_gradients_and_hessians_` (a `CUDAVector<int8_t>`) to `num_data * 2` elements. `DiscretizeGradientsKernel` writes a pair of `int16` values per data row (gradient + hessian) at byte offsets `4*index` and `4*index + 2`, so it needs `num_data * 4` bytes, not `num_data * 2`.

The current allocation is half the required size; the kernel writes past the end of the buffer for the upper half of the data.

Fix

Call `Resize(num_data * 4)` so the buffer holds the full `int16` pairs without overrunning. This has no effect when `use_quantized_grad` is false. The fix is one line.

Reproducer

Any `device='cuda'` training run with `use_quantized_grad=True` on more than a few million rows. `compute-sanitizer` flags it as `Invalid __global__ write of size 2 bytes ... is N bytes after the nearest allocation` at `cuda_gradient_discretizer.cu:115`.

Test plan

  • Build with -DUSE_CUDA=1 on sm_120 (RTX 5090).
  • Run quantized-grad training on 6.7M rows; compute-sanitizer reports no writes past the end after the fix.

Signed-off-by: Felix Jonas Kroner <fksnake@gmail.com>
@jameslamb added the fix and gpu (CUDA) labels on May 3, 2026
@jameslamb assigned jameslamb and shiyu1994 and unassigned jameslamb on May 3, 2026
@jameslamb (Member) left a comment


@shiyu1994 could you review this one and see if it makes sense?

Could this help with #6703?


Labels

awaiting review, fix, gpu (CUDA)
