[cuda] fix half-size allocation of discretized gradient buffer#7254
BelixRogner wants to merge 2 commits into lightgbm-org:master from
Conversation
`CUDAGradientDiscretizer::Init` resizes `discretized_gradients_and_hessians_` (a `CUDAVector<int8_t>`) to `num_data * 2` elements. The `DiscretizeGradientsKernel` writes a *pair* of int16 values per data row (gradient + hessian) at byte offsets `4 * index` and `4 * index + 2`, so it needs `num_data * 4` bytes, not `num_data * 2`. The current allocation is half the required size, and the kernel writes past the end of the buffer for the upper half of the data. `compute-sanitizer` flags this as `Invalid __global__ write of size 2 bytes` for any `use_quantized_grad=true` run on more than roughly 3M rows on a 32-bit-aligned device. Resize to `num_data * 4` so the buffer holds the full int16 pairs without overrunning. No effect when `use_quantized_grad` is false. Signed-off-by: Felix Jonas Kroner <fksnake@gmail.com>
jameslamb (Member) reviewed on May 3, 2026 and left a comment:
@shiyu1994 could you review this one and see if it makes sense?
Could this help with #6703 ?
Summary
`CUDAGradientDiscretizer::Init` resizes `discretized_gradients_and_hessians_` (a `CUDAVector<int8_t>`) to `num_data * 2` elements. The `DiscretizeGradientsKernel` writes a pair of int16 values per data row (gradient + hessian) at byte offsets `4 * index` and `4 * index + 2`, so it needs `num_data * 4` bytes, not `num_data * 2`. The current allocation is half the required size; the kernel writes past the end of the buffer for the upper half of the data.
Fix
`Resize(num_data * 4)` so the buffer holds the full int16 pairs without overrunning. No effect when `use_quantized_grad` is false. One line.
Reproducer
Any `device='cuda'` training with `use_quantized_grad=True` on more than a few million rows. `compute-sanitizer` flags it as `Invalid __global__ write of size 2 bytes ... is N bytes after the nearest allocation` at `cuda_gradient_discretizer.cu:115`.
Test plan
Built with `-DUSE_CUDA=1` on sm_120 (RTX 5090). `compute-sanitizer` shows no writes-past-end after the fix.