Support polars #7251
Conversation
Thanks for your interest in LightGBM. Please don't re-open this... there has been significant prior discussion on this topic in #6204 that it appears you haven't seen, and anyway this PR was also very far from merge-able (for example, it does not even declare the …). We'd welcome you contributing to LightGBM in the future, but please:
If you're interested in contributing, please start with something smaller. For example, you could try running …
Thank you for the feedback, @jameslamb. To clarify: I opened that PR by accident while working on my local branch and immediately closed it when I realized. I apologize for the noise. I've read through the entire discussion on #6204 and understand the considerations, particularly around the PyCapsule Interface to avoid explicit dependencies.

However, given the community interest, I wanted to ask for feedback on the general approach before investing more time: would a Polars implementation that mirrors the existing Pandas approach be acceptable? This would provide immediate value to users while the work on PyCapsule continues. I've prepared a PoC:
A simple benchmark shows significant performance improvements in `Dataset` construction:
**Benchmark code** (M2 Pro Mac, 16 GB):

```python
import polars as pl
import pandas as pd
import numpy as np
import time
import lightgbm as lgb


def create_raw_data(n_rows: int = 10_000_000, seed: int = 42):
    """Create raw data as a dict with multiple numeric and categorical columns."""
    np.random.seed(seed)  # seed was previously declared but never applied
    data = {
        # Numeric columns (10 total)
        "num_float_1": np.random.rand(n_rows),
        "num_float_2": np.random.rand(n_rows) * 100,
        "num_float_3": np.random.rand(n_rows) * 1000,
        "num_int_1": np.random.randint(0, 100, size=n_rows),
        "num_int_2": np.random.randint(0, 1000, size=n_rows),
        "num_int_3": np.random.randint(-50, 50, size=n_rows),
        "num_bool_1": np.random.randint(0, 2, size=n_rows),
        "num_bool_2": np.random.randint(0, 2, size=n_rows),
        "num_norm_1": np.random.randn(n_rows),
        "num_norm_2": np.random.randn(n_rows) * 10,
        # Categorical columns (5 total, various cardinalities)
        "cat_low_1": np.random.choice(["A", "B", "C"], size=n_rows),  # 3 categories
        "cat_low_2": np.random.choice(["X", "Y", "Z", "W"], size=n_rows),  # 4 categories
        "cat_med_1": np.random.choice([f"cat_{i}" for i in range(20)], size=n_rows),  # 20 categories
        "cat_med_2": np.random.choice([f"type_{i}" for i in range(50)], size=n_rows),  # 50 categories
        "cat_high": np.random.choice([f"id_{i}" for i in range(200)], size=n_rows),  # 200 categories
        "target": np.random.randint(0, 2, size=n_rows),
    }
    return data


def create_pandas_data(n_rows: int = 10_000_000):
    """Create a pandas DataFrame from the raw data dict."""
    raw_data = create_raw_data(n_rows)
    df = pd.DataFrame(raw_data)
    # Convert categorical columns to 'category' dtype
    cat_cols = [col for col in df.columns if col.startswith("cat_")]
    for col in cat_cols:
        df[col] = df[col].astype("category")
    return df.drop("target", axis=1), df["target"]


def create_polars_data(n_rows: int = 10_000_000):
    """Create a Polars DataFrame from the raw data dict."""
    raw_data = create_raw_data(n_rows)
    df = pl.DataFrame(raw_data)
    # Convert categorical columns to Categorical dtype
    cat_cols = [col for col in df.columns if col.startswith("cat_")]
    df = df.with_columns([pl.col(col).cast(pl.Categorical) for col in cat_cols])
    return df.drop("target"), df["target"]


X_pandas, y_pandas = create_pandas_data()
X_polars, y_polars = create_polars_data()

# First calculate pandas performance
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_pandas, label=y_pandas)
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_pandas = sum(times) / len(times)
print(f"Pandas dataset construction: {avg_time_pandas:.2f} ms")

# Now calculate Polars performance
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_polars, label=y_polars)
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_polars = sum(times) / len(times)
print(f"Polars dataset construction: {avg_time_polars:.2f} ms")

# Now calculate Polars performance with cast .to_pandas()
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_polars.to_pandas(), label=y_polars.to_pandas())
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_polars_to_pandas = sum(times) / len(times)
print(f"Polars with .to_pandas() dataset construction: {avg_time_polars_to_pandas:.2f} ms")

# Now Pandas performance without categoricals
cat_cols = [col for col in X_pandas.columns if col.startswith("cat_")]
X_pandas_nocats = X_pandas.drop(cat_cols, axis=1)
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_pandas_nocats, label=y_pandas)
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_pandas_nocats = sum(times) / len(times)
print(f"Pandas dataset construction (no categoricals): {avg_time_pandas_nocats:.2f} ms")

# Now Polars performance without categoricals
cat_cols = [col for col in X_polars.columns if col.startswith("cat_")]
X_polars_nocats = X_polars.drop(cat_cols)
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_polars_nocats, label=y_polars)
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_polars_nocats = sum(times) / len(times)
print(f"Polars dataset construction (no categoricals): {avg_time_polars_nocats:.2f} ms")

# Now calculate Polars performance with cast .to_pandas() without categoricals
times = []
for _ in range(20):
    start = time.perf_counter()
    dataset = lgb.Dataset(X_polars_nocats.to_pandas(), label=y_polars.to_pandas())
    dataset.construct()
    elapsed = (time.perf_counter() - start) * 1000
    times.append(elapsed)
avg_time_polars_nocats_to_pandas = sum(times) / len(times)
print(f"Polars (no categoricals) with .to_pandas() dataset construction: {avg_time_polars_nocats_to_pandas:.2f} ms")

print("\nBenchmark completed successfully!")
print(f"Polars vs Pandas: {((avg_time_pandas - avg_time_polars) / avg_time_pandas) * 100:.2f}% faster")
print(f"Polars vs Polars with .to_pandas(): {((avg_time_polars_to_pandas - avg_time_polars) / avg_time_polars) * 100:.2f}% faster")
print(f"Polars (no categoricals) vs Pandas (no categoricals): {((avg_time_pandas_nocats - avg_time_polars_nocats) / avg_time_pandas_nocats) * 100:.2f}% faster")
print(f"Polars (no categoricals) vs Polars with .to_pandas() (no categoricals): {((avg_time_polars_nocats_to_pandas - avg_time_polars_nocats) / avg_time_polars_nocats) * 100:.2f}% faster")
```

My output: …

Alternatively, I'm happy to:
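One note on the timing methodology above: averaging 20 runs with a plain mean is sensitive to one-off outliers (GC pauses, thermal throttling). A minimal, hedged variant using the stdlib `statistics` module reports the median and spread instead; `benchmark` is a hypothetical helper, and the lambda passed to it stands in for the `lgb.Dataset(...).construct()` call:

```python
import statistics
import time


def benchmark(fn, n_runs: int = 20):
    """Time fn() n_runs times; return (median_ms, stdev_ms).

    The median is more robust to one-off outliers than the mean
    used in the script above.
    """
    times_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()  # e.g. lambda: lgb.Dataset(X, label=y).construct()
        times_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(times_ms), statistics.stdev(times_ms)


# Example with a trivial stand-in workload:
median_ms, stdev_ms = benchmark(lambda: sum(range(10_000)))
print(f"{median_ms:.3f} ms (stdev {stdev_ms:.3f} ms)")
```

Reporting spread alongside the central value also makes small percentage differences between the pandas and Polars paths easier to judge.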
What approach would you prefer? Thanks for your guidance!
Thanks for that!
Sorry if my assumption that you hadn't read the previous discussion seemed a bit harsh... you hadn't linked to it and the code was very incomplete. As you might expect, we've been receiving a lot of low-effort LLM-generated PRs here, so I'm much more skeptical these days of unexpected PRs from first-time contributors.
Would you be willing to help us support …? See the discussion in #5739 (comment). That'd be a smaller contribution that would get you very familiar with this same part of the Python codebase that you'd like to add to.
I'd been opposed to adding a … But I do agree that here in May 2026, … I'm open to adding something like what you've proposed as a "for now" solution until we can hopefully do the PyCapsule or some other lighter-weight approach. That's not a promise that it'd be merged, just me saying "I'm open to it, let's see what it looks like". But ultimately, this is @borchero's area of expertise and something I know he wanted to work on, so I will defer to him. To summarize, I'd support this:
@jameslamb Thanks! I'll try to polish this PR some more this week and mention it in #6204 for everyone's consideration. I'll also try to look at supporting …
❗ This is a work in progress and not ready to review!