
Support polars #7251

Closed

maxzw wants to merge 1 commit into lightgbm-org:master from maxzw:polars

Conversation

@maxzw

@maxzw maxzw commented May 1, 2026

❗ This is a work in progress and not ready to review!

@maxzw maxzw closed this May 1, 2026
@maxzw maxzw reopened this May 1, 2026
@maxzw maxzw closed this May 1, 2026
@jameslamb
Member

Thanks for your interest in LightGBM.

Please don't re-open this... there has been significant prior discussion on this topic on #6204 that it appears you haven't seen, and anyway this PR was also very far from merge-able... for example, it does not even declare the polars dependency or add any tests.

We'd welcome you contributing to LightGBM in the future, but please:

  • PR titles should be specific and informative
  • PRs should have descriptions explaining the changes and explaining how they improve LightGBM
  • supporting context should be linked to (like prior discussions in GitHub issues)
  • any changes in behavior should be accompanied by unit tests

If you're interested in contributing, please start with something smaller. For example, you could try running pre-commit autoupdate and fixing anything that new versions of those hooks find.

@maxzw
Author

maxzw commented May 2, 2026

Thank you for the feedback, @jameslamb. To clarify - I opened that PR by accident while working on my local branch and immediately closed it when I realized. I apologize for the noise.

I've read through the entire discussion on #6204 and understand the considerations, particularly around the PyCapsule Interface to avoid explicit dependencies. However, given the community interest, I wanted to ask for feedback on the general approach before investing more time: Would a Polars implementation that mirrors the existing Pandas approach be acceptable? This would provide immediate value to users while the work on PyCapsule continues.

I've prepared a PoC:

  • Direct pl.DataFrame / pl.Series support (similar to how Pandas is handled)
  • Converts to Arrow via to_arrow() where possible (zero-copy for numeric columns)
  • Handles pl.Categorical by converting to codes (same pattern as Pandas)
  • No C++ changes required
  • Would add polars as optional dependency (like pandas)

A simple benchmark shows significant performance improvements in Dataset construction when using Polars (10 million rows, 15 features):

  • ~15% faster than Pandas (~30% without categoricals)
  • ~150% faster than using polars_df.to_pandas() (~100% without categoricals)
Benchmark code (M2 Pro Mac, 16 GB):

import polars as pl
import pandas as pd
import numpy as np
import time

import lightgbm as lgb


def create_raw_data(n_rows: int = 10_000_000, seed: int = 42):
    """Create raw data as a dict with numeric and categorical columns"""
    np.random.seed(seed)  # apply the seed so runs are reproducible
    data = {
        # Numeric columns (10 total)
        "num_float_1": np.random.rand(n_rows),
        "num_float_2": np.random.rand(n_rows) * 100,
        "num_float_3": np.random.rand(n_rows) * 1000,
        "num_int_1": np.random.randint(0, 100, size=n_rows),
        "num_int_2": np.random.randint(0, 1000, size=n_rows),
        "num_int_3": np.random.randint(-50, 50, size=n_rows),
        "num_bool_1": np.random.randint(0, 2, size=n_rows),
        "num_bool_2": np.random.randint(0, 2, size=n_rows),
        "num_norm_1": np.random.randn(n_rows),
        "num_norm_2": np.random.randn(n_rows) * 10,

        # Categorical columns (5 total, various cardinalities)
        "cat_low_1": np.random.choice(["A", "B", "C"], size=n_rows),  # 3 categories
        "cat_low_2": np.random.choice(["X", "Y", "Z", "W"], size=n_rows),  # 4 categories
        "cat_med_1": np.random.choice([f"cat_{i}" for i in range(20)], size=n_rows),  # 20 categories
        "cat_med_2": np.random.choice([f"type_{i}" for i in range(50)], size=n_rows),  # 50 categories
        "cat_high": np.random.choice([f"id_{i}" for i in range(200)], size=n_rows),  # 200 categories

        "target": np.random.randint(0, 2, size=n_rows),
    }
    return data


def create_pandas_data(n_rows: int = 10_000_000):
    """Create a pandas DataFrame from raw data dict"""
    raw_data = create_raw_data(n_rows)
    df = pd.DataFrame(raw_data)
    # Convert categorical columns to 'category' dtype
    cat_cols = [col for col in df.columns if col.startswith("cat_")]
    for col in cat_cols:
        df[col] = df[col].astype("category")
    return df.drop("target", axis=1), df["target"]


def create_polars_data(n_rows: int = 10_000_000):
    """Create a Polars DataFrame from raw data dict"""
    raw_data = create_raw_data(n_rows)
    df = pl.DataFrame(raw_data)
    # Convert categorical columns to Categorical dtype
    cat_cols = [col for col in df.columns if col.startswith("cat_")]
    df = df.with_columns([pl.col(col).cast(pl.Categorical) for col in cat_cols])
    return df.drop("target"), df["target"]


X_pandas, y_pandas = create_pandas_data()
X_polars, y_polars = create_polars_data()

def benchmark(X, y, n_runs: int = 20) -> float:
    """Average Dataset construction time in milliseconds over n_runs."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        dataset = lgb.Dataset(X, label=y)
        dataset.construct()
        times.append((time.perf_counter() - start) * 1000)
    return sum(times) / len(times)


avg_time_pandas = benchmark(X_pandas, y_pandas)
print(f"Pandas dataset construction: {avg_time_pandas:.2f} ms")

avg_time_polars = benchmark(X_polars, y_polars)
print(f"Polars dataset construction: {avg_time_polars:.2f} ms")

avg_time_polars_to_pandas = benchmark(X_polars.to_pandas(), y_polars.to_pandas())
print(f"Polars with .to_pandas() dataset construction: {avg_time_polars_to_pandas:.2f} ms")

# Repeat all three measurements without the categorical columns
cat_cols = [col for col in X_pandas.columns if col.startswith("cat_")]
X_pandas_nocats = X_pandas.drop(cat_cols, axis=1)
X_polars_nocats = X_polars.drop(cat_cols)

avg_time_pandas_nocats = benchmark(X_pandas_nocats, y_pandas)
print(f"Pandas dataset construction (no categoricals): {avg_time_pandas_nocats:.2f} ms")

avg_time_polars_nocats = benchmark(X_polars_nocats, y_polars)
print(f"Polars dataset construction (no categoricals): {avg_time_polars_nocats:.2f} ms")

avg_time_polars_nocats_to_pandas = benchmark(X_polars_nocats.to_pandas(), y_polars.to_pandas())
print(f"Polars (no categoricals) with .to_pandas() dataset construction: {avg_time_polars_nocats_to_pandas:.2f} ms")

print("\nBenchmark completed successfully!")
print(f"Polars vs Pandas: {((avg_time_pandas - avg_time_polars) / avg_time_pandas) * 100:.2f}% faster")
print(f"Polars vs Polars with .to_pandas(): {((avg_time_polars_to_pandas - avg_time_polars) / avg_time_polars) * 100:.2f}% faster")
print(f"Polars (no categoricals) vs Pandas (no categoricals): {((avg_time_pandas_nocats - avg_time_polars_nocats) / avg_time_pandas_nocats) * 100:.2f}% faster")
print(f"Polars (no categoricals) vs Polars with .to_pandas() (no categoricals): {((avg_time_polars_nocats_to_pandas - avg_time_polars_nocats) / avg_time_polars_nocats) * 100:.2f}% faster")

My output:

Pandas dataset construction: 690.00 ms
Polars dataset construction: 577.58 ms
Polars with .to_pandas() dataset construction: 1419.10 ms
Pandas dataset construction (no categoricals): 503.47 ms
Polars dataset construction (no categoricals): 355.62 ms
Polars (no categoricals) with .to_pandas() dataset construction: 699.53 ms

Benchmark completed successfully!
Polars vs Pandas: 16.29% faster
Polars vs Polars with .to_pandas(): 145.70% faster
Polars (no categoricals) vs Pandas (no categoricals): 29.37% faster
Polars (no categoricals) vs Polars with .to_pandas() (no categoricals): 96.71% faster

Alternatively, I'm happy to:

  • Wait for the PyCapsule work to be completed
  • Start with a smaller contribution to build familiarity with the project

What approach would you prefer? Thanks for your guidance!

@jameslamb
Member

Thanks for that!

I've read through the entire discussion on #6204 and understand the considerations, particularly around the PyCapsule Interface to avoid explicit dependencies.

Sorry if my assumption that you hadn't read the previous discussion seemed a bit harsh... you hadn't linked to it and the code was very incomplete. As you might expect, we've been receiving a lot of low-effort LLM-generated PRs here, so I'm much more skeptical these days of unexpected PRs from first-time contributors.

Start with a smaller contribution to build familiarity with the project

Would you be willing to help us support pyarrow data types in pandas columns?

See the discussion in #5739 (comment)

That'd be a smaller contribution that would get you very familiar with this same part of the Python codebase that you'd like to add polars support to, and would be very very much appreciated (it's been unaddressed for 2 years).

Would a Polars implementation that mirrors the existing Pandas approach be acceptable? This would provide immediate value to users while the work on PyCapsule continues.

I'd been opposed to adding a polars dependency (even an optional one) mostly out of fatigue from supporting so many different DataFrame input types.

But I do agree that here in May 2026, lightgbm really should have some sort of low-friction polars support that doesn't require users to call .to_arrow() or similar.

I'm open to adding something like what you've proposed as a "for now" solution until we can hopefully do the PyCapsule or some other lighter-weight approach. That's not a promise that it'd be merged, but just me saying "I'm open to it, let's see what it looks like".

But ultimately, this is @borchero 's area of expertise and something I know he wanted to work on, so I will defer to him.


To summarize, I'd support this:

  1. Put up a new PR helping us with pandas pyarrow types, if you have interest ([python-package] support pandas 2.0 #5739)
  2. Put up your PoC as a DRAFT PR, post a comment on [python-package] Adding support for polars for input data #6204 pointing to it, and ask @borchero for his thoughts on that approach

@maxzw
Author

maxzw commented May 4, 2026

@jameslamb Thanks! I'll try to polish this PR some more this week and mention it in #6204 for everyone's consideration. I'll also try to look at supporting pyarrow datatypes 👍🏻
