Commit 164b1db

Update categorical feature encoding details
Clarified encoding requirements for categorical features and added a note about category handling in LightGBM. Related #2761 (comment)
1 parent 9545905 commit 164b1db

1 file changed

Lines changed: 9 additions & 1 deletion

docs/Advanced-Topics.rst

@@ -23,7 +23,7 @@ Categorical Feature Support
 - Use ``categorical_feature`` to specify the categorical features.
   Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

-- Categorical features will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package) so they must be encoded as non-negative integers (negative values will be treated as missing)
+- Categorical features will be cast to ``int32`` so they must be encoded as non-negative integers (negative values will be treated as missing)
   less than ``Int32.MaxValue`` (2147483647).
   It is best to use a contiguous range of integers started from zero.
   Floating point numbers in categorical features will be rounded towards 0.
@@ -34,6 +34,14 @@ Categorical Feature Support
   treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or
   by embedding the categories in a low-dimensional numeric space.

+.. note::
+
+   When using the Python package with a pandas ``DataFrame`` and columns of dtype ``category``,
+   LightGBM stores the category labels observed during training and re-aligns categories at
+   prediction time before converting them to integer codes. This ensures consistent encoding
+   even if category order or subsets differ between training and prediction data. Categories
+   not seen during training are treated as missing values.
+
 LambdaRank
 ----------
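The encoding rules and the added note can be illustrated with a small standard-library sketch. It is not LightGBM's implementation, and the helper names ``fit_category_codes`` and ``encode`` are hypothetical. It assumes labels are mapped to contiguous integers starting from zero, that the mapping learned on training data is reused at prediction time, and that unseen labels are given a negative code, which the documentation above says is treated as missing.

```python
def fit_category_codes(train_labels):
    """Map each distinct training label to a contiguous code starting at 0.

    Hypothetical helper: sorting makes the mapping independent of row order.
    """
    categories = sorted(set(train_labels))
    return {label: code for code, label in enumerate(categories)}


def encode(labels, mapping, missing_code=-1):
    """Encode labels with a stored mapping.

    Labels absent from the mapping get a negative code, which the
    documentation says is treated as a missing value.
    """
    return [mapping.get(label, missing_code) for label in labels]


# Training data: learn the mapping once and keep it with the model.
train_labels = ["red", "green", "blue", "green"]
mapping = fit_category_codes(train_labels)
train_codes = encode(train_labels, mapping)

# Prediction data: different order, a subset of the training labels,
# and one label ("purple") that was never seen during training.
predict_labels = ["blue", "purple", "red"]
predict_codes = encode(predict_labels, mapping)

print(mapping)
print(train_codes)
print(predict_codes)
```

Reusing the stored mapping is what keeps ``"blue"`` and ``"red"`` on the same codes in both data sets, even though the prediction data lists them in a different order and omits ``"green"``.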
