fix: WeightedPercentileFun uses wrong CDF segment #7224
Open
JKDasondee wants to merge 2 commits into lightgbm-org:master from
WeightedPercentileFun in src/objective/regression_objective.hpp used the segment [pos, pos+1] for linear interpolation, but upper_bound guarantees weighted_cdf[pos-1] <= threshold < weighted_cdf[pos], so the threshold lies in segment [pos-1, pos]. The old numerator (threshold - weighted_cdf[pos]) was always negative and could produce weighted percentile values below min(y). For example, the weighted median of y=[2,3,4,5], w=[4,3,2,1] returned ~1.0 (or 2.0 in boundary cases) instead of the correct 2.333.

This affects regression_l1, quantile, and mape objectives with sample weights (plus mape without weights, since its internal label_weight_ always drives WeightedPercentileFun). Both BoostFromScore and RenewTreeOutput share the macro.

Align the weighted implementation with the unweighted PercentileFun and with the fix in lightgbm-org#5848 for the same class of bug.

Also:
- Change the small-gap fallback from v2 to v1. Since threshold is strictly below weighted_cdf[pos], snapping toward v1 keeps the result inside [min(y), max(y)].
- Use 1.0 (double) instead of 1.0f for the gap sentinel so the literal matches the surrounding weighted_cdf double arithmetic.
- Loosen the pred_mean lower bound in test_mape_for_specific_boosting_types from >8 to >5. The corrected MAPE training output for the synthetic regression dataset shifts from ~9 to ~6.8. The test's intent is to guard against MAPE being stuck in [0, 1] (see lightgbm-org#1579); >5 still serves that purpose.

Adds test_weighted_percentile_inside_label_range parametrized on regression_l1, quantile, and mape, using the example from the issue. Each variant asserts the resulting boosted score stays within [min(y), max(y)], and regression_l1 additionally asserts the exact weighted median 2 + 1/3.

Fixes lightgbm-org#7151
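The upper_bound invariant described in this commit message can be checked in isolation. The following is a minimal Python sketch, not the C++ macro: bisect.bisect_right stands in for std::upper_bound, the weights are taken from the issue example, and the alpha values are illustrative assumptions chosen so that pos is an interior index.

```python
from bisect import bisect_right
from itertools import accumulate

# Weights from the issue example, already ordered by label value.
weights = [4.0, 3.0, 2.0, 1.0]
weighted_cdf = list(accumulate(weights))  # cumulative weights: 4, 7, 9, 10

checked = []
for alpha in (0.45, 0.5, 0.6, 0.75, 0.85):  # illustrative percentiles (assumption)
    threshold = alpha * weighted_cdf[-1]
    # bisect_right returns the first index whose CDF value is strictly
    # greater than the threshold, like std::upper_bound.
    pos = bisect_right(weighted_cdf, threshold)
    # The bracketing that upper_bound guarantees for an interior pos.
    assert weighted_cdf[pos - 1] <= threshold < weighted_cdf[pos]
    old_numerator = threshold - weighted_cdf[pos]      # segment [pos, pos+1]
    new_numerator = threshold - weighted_cdf[pos - 1]  # segment [pos-1, pos]
    checked.append((alpha, pos, old_numerator, new_numerator))

for alpha, pos, old_numerator, new_numerator in checked:
    print(alpha, pos, old_numerator, new_numerator)
```

For every threshold in the sweep the numerator measured from weighted_cdf[pos] is negative, while the one measured from weighted_cdf[pos-1] is non-negative, which is why only the [pos-1, pos] segment yields an interpolation weight in [0, 1).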
Summary
Fixes #7151.

WeightedPercentileFun in src/objective/regression_objective.hpp interpolated on the wrong CDF segment and could return weighted percentile values outside [min(y), max(y)]. std::upper_bound(weighted_cdf, threshold) returns the first index where weighted_cdf[pos] > threshold, so the threshold lies in segment [weighted_cdf[pos-1], weighted_cdf[pos]). The old code interpolated on [pos, pos+1] with numerator (threshold - weighted_cdf[pos]), which is always negative and can drop the result below v1 = data[sorted_idx[pos-1]].

The fix
- Interpolation segment: [pos-1, pos], per upper_bound semantics and the CHECK_GE/CHECK_LT invariants right above the branch.
- Small-gap fallback: v1 instead of v2. Since threshold < weighted_cdf[pos], snapping to v1 keeps the result inside [min(y), max(y)].
- 1.0f → 1.0: match the surrounding double arithmetic for consistency.

This mirrors the pattern applied in #5848 for the unweighted PercentileFun.

Impact
Affects regression_l1, quantile, and mape objectives. mape uses WeightedPercentileFun for every training call because its internal label_weight_ vector drives the macro regardless of user-supplied sample weights. Both BoostFromScore and RenewTreeOutput share the fixed code path (callers at lines 242, 273, 279, 529, 559, 565, 637, 652, 658).

Example from the issue: weighted median of y=[2,3,4,5], w=[4,3,2,1]. Sorted cumulative weights [4,7,9,10], median threshold 5.0, pos=1.
- Old: [cdf[1], cdf[2]] = [7, 9] using v1=3, v2=4 → (5-7)/(9-7)*(4-3) + 3 = 2.0 (boundary), and below-range values for other weight distributions.
- New: [cdf[0], cdf[1]] = [4, 7] using v1=2, v2=3 → (5-4)/(7-4)*(3-2) + 2 = 2.333... ✓

Tests
Adds test_weighted_percentile_inside_label_range parametrized on regression_l1, quantile, and mape. Each variant trains a 1-iteration model on a constant feature matrix (so the BoostFromScore initialization dominates), then asserts:
- all variants: predictions lie within [min(y) - 1e-6, max(y) + 1e-6]
- regression_l1: predictions equal the exact weighted median 2 + 1/3 to rtol=1e-6

Also relaxes test_mape_for_specific_boosting_types[rf|dart] from pred_mean > 8 to pred_mean > 5. That assertion's stated intent (via the inline comment linking to #1579) is to guard against MAPE predictions being stuck inside [0, 1]. The fix shifts the MAPE training output on the synthetic regression dataset from ~9 to ~6.8; the new threshold still catches the [0, 1] regression. I added a comment documenting the reasoning.

Verification
Built from source with sh build-python.sh install --mingw on Windows + MinGW gcc 13.2.0.
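The arithmetic of the worked example from the issue can also be re-checked without a LightGBM build. The sketch below is a standalone Python approximation under stated assumptions: the helper name interpolate is hypothetical, bisect.bisect_right stands in for std::upper_bound, and the two calls reproduce the old and new expressions as written in the Impact section rather than the macro's actual source.

```python
from bisect import bisect_right
from itertools import accumulate


def interpolate(threshold, cdf_lo, cdf_hi, v_lo, v_hi):
    """Linear interpolation between v_lo and v_hi over [cdf_lo, cdf_hi].

    Hypothetical helper for illustration only; not a LightGBM function.
    """
    return (threshold - cdf_lo) / (cdf_hi - cdf_lo) * (v_hi - v_lo) + v_lo


# Example from the issue; y is already sorted, so no reordering is needed.
y = [2.0, 3.0, 4.0, 5.0]
w = [4.0, 3.0, 2.0, 1.0]

cdf = list(accumulate(w))           # cumulative weights: 4, 7, 9, 10
threshold = 0.5 * cdf[-1]           # median threshold: 5.0
pos = bisect_right(cdf, threshold)  # first index with cdf[pos] > threshold

# Old expression: segment [cdf[pos], cdf[pos+1]], i.e. one step too far right.
old_value = interpolate(threshold, cdf[pos], cdf[pos + 1], y[pos], y[pos + 1])

# New expression: segment [cdf[pos-1], cdf[pos]], which brackets the threshold.
new_value = interpolate(threshold, cdf[pos - 1], cdf[pos], y[pos - 1], y[pos])

print(pos, old_value, new_value)
```

With these inputs the old expression evaluates to 2.0 and the new one to 2 + 1/3, matching the figures quoted in the example.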