fix: WeightedPercentileFun uses wrong CDF segment by JKDasondee · Pull Request #7224 · lightgbm-org/LightGBM

JKDasondee · 2026-04-10T03:51:18Z

Summary

Fixes #7151. WeightedPercentileFun in src/objective/regression_objective.hpp interpolated on the wrong CDF segment and could return weighted percentile values outside [min(y), max(y)].

std::upper_bound(weighted_cdf, threshold) returns the first index where weighted_cdf[pos] > threshold, so the threshold lies in segment [weighted_cdf[pos-1], weighted_cdf[pos]). The old code interpolated on [pos, pos+1] with numerator (threshold - weighted_cdf[pos]), which is always negative and can drop the result below v1 = data[sorted_idx[pos-1]].

The fix

-    if (weighted_cdf[pos + 1] - weighted_cdf[pos] >= 1.0f) {
-      return static_cast<T>((threshold - weighted_cdf[pos]) /
-                                (weighted_cdf[pos + 1] - weighted_cdf[pos]) *
+    if (weighted_cdf[pos] - weighted_cdf[pos - 1] >= 1.0) {
+      return static_cast<T>((threshold - weighted_cdf[pos - 1]) /
+                                (weighted_cdf[pos] - weighted_cdf[pos - 1]) *
                                 (v2 - v1) +
                             v1);
     } else {
-      return static_cast<T>(v2);
+      return static_cast<T>(v1);
     }

Interpolation segment fixed: [pos-1, pos] per upper_bound semantics and CHECK_GE/CHECK_LT invariants right above the branch.
Small-gap fallback: v1 instead of v2. Since threshold < weighted_cdf[pos], snapping to v1 keeps the result inside [min(y), max(y)].
1.0f → 1.0: match the surrounding double arithmetic for consistency.

This mirrors the pattern applied in #5848 for the unweighted PercentileFun.

Impact

Affects regression_l1, quantile, and mape objectives. mape uses WeightedPercentileFun for every training call because its internal label_weight_ vector drives the macro regardless of user-supplied sample weights. Both BoostFromScore and RenewTreeOutput share the fixed code path (callers at lines 242, 273, 279, 529, 559, 565, 637, 652, 658).

Example from the issue: weighted median of y=[2,3,4,5], w=[4,3,2,1]. Sorted cumulative weights [4,7,9,10], median threshold 5.0, pos=1.

Old behavior: interpolates on [cdf[1], cdf[2]] = [7, 9] using v1=3, v2=4 → (5-7)/(9-7)*(4-3) + 3 = 2.0 (boundary) and below-range values for other weight distributions.
New behavior: interpolates on [cdf[0], cdf[1]] = [4, 7] using v1=2, v2=3 → (5-4)/(7-4)*(3-2) + 2 = 2.333... ✓

Tests

Adds test_weighted_percentile_inside_label_range parametrized on regression_l1, quantile, and mape. Each variant trains a 1-iteration model on a constant feature matrix (so the BoostFromScore initialization dominates), then asserts:

predictions stay within [min(y) - 1e-6, max(y) + 1e-6]
for regression_l1: predictions equal the exact weighted median 2 + 1/3 to rtol=1e-6

Also relaxes test_mape_for_specific_boosting_types[rf|dart] from pred_mean > 8 to pred_mean > 5. That assertion's stated intent (via the inline comment linking to #1579) is to guard against MAPE predictions being stuck inside [0, 1]. The fix shifts the MAPE training output on the synthetic regression dataset from ~9 to ~6.8; the new threshold still catches the [0, 1] regression. I added a comment documenting the reasoning.

Note: RenewTreeOutput weighted paths aren't exercised directly by the new test — they share the same macro, so the fix applies, but adding dedicated coverage there is out of scope for this PR.

Verification

$ pytest tests/python_package_test/test_engine.py::test_weighted_percentile_inside_label_range \
         tests/python_package_test/test_engine.py::test_mape_for_specific_boosting_types \
         tests/python_package_test/test_engine.py::test_regression -x

11 passed in 0.61s

Built from source with sh build-python.sh install --mingw on Windows + MinGW gcc 13.2.0.

WeightedPercentileFun in src/objective/regression_objective.hpp used the segment [pos, pos+1] for linear interpolation, but upper_bound guarantees weighted_cdf[pos-1] <= threshold < weighted_cdf[pos], so the threshold lies in segment [pos-1, pos]. The old numerator (threshold - weighted_cdf[pos]) was always negative and could produce weighted percentile values below min(y) — e.g. weighted median of y=[2,3,4,5], w=[4,3,2,1] returned ~1.0 (or 2.0 in boundary cases) instead of the correct 2.333. This affects regression_l1, quantile, and mape objectives with sample weights (plus mape without weights, since its internal label_weight_ always drives WeightedPercentileFun). Both BoostFromScore and RenewTreeOutput share the macro. Align the weighted implementation with the unweighted PercentileFun and with the fix in lightgbm-org#5848 for the same class of bug. Also: - Change the small-gap fallback from v2 to v1. Since threshold is strictly below weighted_cdf[pos], snapping toward v1 keeps the result inside [min(y), max(y)]. - Use 1.0 (double) instead of 1.0f for the gap sentinel so the literal matches the surrounding weighted_cdf double arithmetic. - Loosen the pred_mean lower bound in test_mape_for_specific_boosting_types from >8 to >5. The corrected MAPE training output for the synthetic regression dataset shifts from ~9 to ~6.8. The test's intent is to guard against MAPE being stuck in [0, 1] (see lightgbm-org#1579); >5 still serves that purpose. Adds test_weighted_percentile_inside_label_range parametrized on regression_l1, quantile, and mape, using the example from the issue. Each variant asserts the resulting boosted score stays within [min(y), max(y)], and regression_l1 additionally asserts the exact weighted median 2 + 1/3. Fixes lightgbm-org#7151

JKDasondee requested review from StrikerRUS, borchero, guolinke, jameslamb, jmoralez and shiyu1994 as code owners April 10, 2026 03:51

Merge branch 'master' into fix/7151-weighted-percentile

f427cea

jameslamb requested a review from mayer79 as a code owner April 24, 2026 01:36

jameslamb added in progress fix awaiting review and removed in progress labels Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: WeightedPercentileFun uses wrong CDF segment#7224

fix: WeightedPercentileFun uses wrong CDF segment#7224
JKDasondee wants to merge 2 commits intolightgbm-org:masterfrom
JKDasondee:fix/7151-weighted-percentile

JKDasondee commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JKDasondee commented Apr 10, 2026

Summary

The fix

Impact

Tests

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants