Skip to content

fix: WeightedPercentileFun uses wrong CDF segment#7224

Open
JKDasondee wants to merge 2 commits intolightgbm-org:masterfrom
JKDasondee:fix/7151-weighted-percentile
Open

fix: WeightedPercentileFun uses wrong CDF segment#7224
JKDasondee wants to merge 2 commits intolightgbm-org:masterfrom
JKDasondee:fix/7151-weighted-percentile

Conversation

@JKDasondee
Copy link
Copy Markdown

Summary

Fixes #7151. WeightedPercentileFun in src/objective/regression_objective.hpp interpolated on the wrong CDF segment and could return weighted percentile values outside [min(y), max(y)].

std::upper_bound(weighted_cdf, threshold) returns the first index where weighted_cdf[pos] > threshold, so the threshold lies in segment [weighted_cdf[pos-1], weighted_cdf[pos]). The old code interpolated on [pos, pos+1] with numerator (threshold - weighted_cdf[pos]), which is always negative and can drop the result below v1 = data[sorted_idx[pos-1]].

The fix

-    if (weighted_cdf[pos + 1] - weighted_cdf[pos] >= 1.0f) {
-      return static_cast<T>((threshold - weighted_cdf[pos]) /
-                                (weighted_cdf[pos + 1] - weighted_cdf[pos]) *
+    if (weighted_cdf[pos] - weighted_cdf[pos - 1] >= 1.0) {
+      return static_cast<T>((threshold - weighted_cdf[pos - 1]) /
+                                (weighted_cdf[pos] - weighted_cdf[pos - 1]) *
                                 (v2 - v1) +
                             v1);
     } else {
-      return static_cast<T>(v2);
+      return static_cast<T>(v1);
     }
  • Interpolation segment fixed: [pos-1, pos] per upper_bound semantics and CHECK_GE/CHECK_LT invariants right above the branch.
  • Small-gap fallback: v1 instead of v2. Since threshold < weighted_cdf[pos], snapping to v1 keeps the result inside [min(y), max(y)].
  • 1.0f1.0: match the surrounding double arithmetic for consistency.

This mirrors the pattern applied in #5848 for the unweighted PercentileFun.

Impact

Affects regression_l1, quantile, and mape objectives. mape uses WeightedPercentileFun for every training call because its internal label_weight_ vector drives the macro regardless of user-supplied sample weights. Both BoostFromScore and RenewTreeOutput share the fixed code path (callers at lines 242, 273, 279, 529, 559, 565, 637, 652, 658).

Example from the issue: weighted median of y=[2,3,4,5], w=[4,3,2,1]. Sorted cumulative weights [4,7,9,10], median threshold 5.0, pos=1.

  • Old behavior: interpolates on [cdf[1], cdf[2]] = [7, 9] using v1=3, v2=4(5-7)/(9-7)*(4-3) + 3 = 2.0 (boundary) and below-range values for other weight distributions.
  • New behavior: interpolates on [cdf[0], cdf[1]] = [4, 7] using v1=2, v2=3(5-4)/(7-4)*(3-2) + 2 = 2.333...

Tests

Adds test_weighted_percentile_inside_label_range parametrized on regression_l1, quantile, and mape. Each variant trains a 1-iteration model on a constant feature matrix (so the BoostFromScore initialization dominates), then asserts:

  • predictions stay within [min(y) - 1e-6, max(y) + 1e-6]
  • for regression_l1: predictions equal the exact weighted median 2 + 1/3 to rtol=1e-6

Also relaxes test_mape_for_specific_boosting_types[rf|dart] from pred_mean > 8 to pred_mean > 5. That assertion's stated intent (via the inline comment linking to #1579) is to guard against MAPE predictions being stuck inside [0, 1]. The fix shifts the MAPE training output on the synthetic regression dataset from ~9 to ~6.8; the new threshold still catches the [0, 1] regression. I added a comment documenting the reasoning.

Note: RenewTreeOutput weighted paths aren't exercised directly by the new test — they share the same macro, so the fix applies, but adding dedicated coverage there is out of scope for this PR.

Verification

$ pytest tests/python_package_test/test_engine.py::test_weighted_percentile_inside_label_range \
         tests/python_package_test/test_engine.py::test_mape_for_specific_boosting_types \
         tests/python_package_test/test_engine.py::test_regression -x

11 passed in 0.61s

Built from source with sh build-python.sh install --mingw on Windows + MinGW gcc 13.2.0.

WeightedPercentileFun in src/objective/regression_objective.hpp used
the segment [pos, pos+1] for linear interpolation, but upper_bound
guarantees weighted_cdf[pos-1] <= threshold < weighted_cdf[pos], so
the threshold lies in segment [pos-1, pos]. The old numerator
(threshold - weighted_cdf[pos]) was always negative and could produce
weighted percentile values below min(y) — e.g. weighted median of
y=[2,3,4,5], w=[4,3,2,1] returned ~1.0 (or 2.0 in boundary cases)
instead of the correct 2.333.

This affects regression_l1, quantile, and mape objectives with sample
weights (plus mape without weights, since its internal label_weight_
always drives WeightedPercentileFun). Both BoostFromScore and
RenewTreeOutput share the macro.

Align the weighted implementation with the unweighted PercentileFun
and with the fix in lightgbm-org#5848 for the same class of bug.

Also:
- Change the small-gap fallback from v2 to v1. Since threshold is
  strictly below weighted_cdf[pos], snapping toward v1 keeps the
  result inside [min(y), max(y)].
- Use 1.0 (double) instead of 1.0f for the gap sentinel so the
  literal matches the surrounding weighted_cdf double arithmetic.
- Loosen the pred_mean lower bound in test_mape_for_specific_boosting_types
  from >8 to >5. The corrected MAPE training output for the synthetic
  regression dataset shifts from ~9 to ~6.8. The test's intent is to
  guard against MAPE being stuck in [0, 1] (see lightgbm-org#1579); >5 still
  serves that purpose.

Adds test_weighted_percentile_inside_label_range parametrized on
regression_l1, quantile, and mape, using the example from the issue.
Each variant asserts the resulting boosted score stays within
[min(y), max(y)], and regression_l1 additionally asserts the exact
weighted median 2 + 1/3.

Fixes lightgbm-org#7151
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WeightedPercentileFun uses wrong segment and can return values outside [min(y), max(y)]

2 participants