
feat: Metal GPU backend for Apple Silicon#7214

Closed
discordwell wants to merge 2 commits into lightgbm-org:master from discordwell:feature/metal-backend

Conversation

@discordwell

Summary

Adds a native Metal compute shader backend for GPU-accelerated histogram construction on Apple Silicon Macs. This fills the gap left by the deprecated OpenCL backend, which crashes on Apple Silicon (#6189).

  • New device_type="metal" option for training on Apple Silicon GPUs
  • Three Metal compute kernels ported from the OpenCL originals (histogram16/64/256)
  • MetalTreeLearner using Objective-C++ with unified memory (no pinned buffers needed)
  • Runtime .metal source compilation when build-time metallib is unavailable
  • 14-test suite covering binary/regression/multiclass, all 3 bin variants, scalability, bagging

Design

Follows the existing OpenCL GPU backend architecture:

  • MetalTreeLearner extends SerialTreeLearner — only histogram construction on GPU
  • Split finding, data partitioning, tree construction remain on CPU
  • Apple Silicon unified memory eliminates PCIe transfer overhead (simpler than OpenCL)
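To make the division of labor concrete, here is a minimal NumPy sketch of the per-feature work the Metal kernels offload: accumulating gradient and hessian sums per bin. (This is a CPU reference illustration of the histogram computation, not the actual kernel code; the function name is illustrative.)

```python
import numpy as np

def feature_histogram(bin_idx, grad, hess, n_bins):
    """CPU reference for the per-feature histogram the GPU builds:
    for each bin, sum the gradients and hessians of the rows in it."""
    hist_grad = np.zeros(n_bins)
    hist_hess = np.zeros(n_bins)
    # np.add.at handles repeated bin indices, like the kernel's atomic adds
    np.add.at(hist_grad, bin_idx, grad)
    np.add.at(hist_hess, bin_idx, hess)
    return hist_grad, hist_hess

bin_idx = np.array([0, 1, 1, 2])
grad = np.array([1.0, 2.0, 3.0, 4.0])
hess = np.ones(4)
hg, hh = feature_histogram(bin_idx, grad, hess, n_bins=3)
```

Split finding then scans these per-bin sums on the CPU, which is why only this accumulation step needs to live on the GPU.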

Test results (Apple M4 Max)

14 passed in 1.60s
- Binary classification, regression, multiclass ✓
- histogram16 (max_bin≤16), histogram64 (max_bin≤64), histogram256 (max_bin≤255) ✓
- Dataset sizes 100 to 10,000 ✓
- Bagging, constant hessian, gpu_use_dp override ✓
- Max prediction diff vs CPU: <5e-6 (for n≥1000)
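The CPU-vs-Metal parity figure above presumably reduces to an elementwise comparison like the following (a hedged sketch of the check, not the test suite's actual code):

```python
import numpy as np

def max_prediction_diff(cpu_preds, gpu_preds):
    """Maximum absolute elementwise difference between two prediction arrays."""
    return float(np.max(np.abs(np.asarray(cpu_preds) - np.asarray(gpu_preds))))

# Toy values: a 0.5 disagreement would fail a 5e-6 tolerance check.
d = max_prediction_diff(np.array([1.0, 2.0]), np.array([1.0, 2.5]))
```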

Known limitations

  • Single workgroup per feature (POWER=0) — limits GPU occupancy for large datasets. Multi-workgroup reduction kernels are included but have a layout mismatch that needs investigation. This will be addressed in a follow-up.
  • The Metal Toolchain must be installed for build-time kernel compilation; without it, the backend falls back to runtime compilation from the .metal source files.
  • macOS only, Apple Silicon only, FP32 only (no gpu_use_dp).
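The planned multi-workgroup fix is a two-pass reduction: each workgroup accumulates a partial histogram over its slice of rows, and a second dispatch sums the partials. A minimal NumPy sketch of that scheme (illustrative only; the real kernels are the bundled reduce_histogram{16,64,256} variants):

```python
import numpy as np

def two_pass_histogram(bin_idx, grad, n_bins, n_workgroups):
    """Pass 1: each 'workgroup' builds a partial histogram over its rows.
    Pass 2: a reduction step sums the partials into the final histogram."""
    partials = np.zeros((n_workgroups, n_bins))
    row_slices = np.array_split(np.arange(len(bin_idx)), n_workgroups)
    for wg, rows in enumerate(row_slices):           # pass 1
        np.add.at(partials[wg], bin_idx[rows], grad[rows])
    return partials.sum(axis=0)                      # pass 2

bin_idx = np.array([0, 2, 1, 2, 0])
grad = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
result = two_pass_histogram(bin_idx, grad, n_bins=3, n_workgroups=2)
```

Because the two dispatches are separate, no cross-threadgroup synchronization is needed within a single dispatch, which is exactly the capability Metal lacks.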

Build

cmake -DUSE_METAL=ON ..
make -j

Usage

params = {'device': 'metal', 'objective': 'binary', ...}
model = lgb.train(params, train_data)
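For portable scripts, a hedged variant of the usage above that only requests the Metal device on Apple Silicon macOS (the platform check and helper name are illustrative, not part of this PR):

```python
import platform

def training_params(objective="binary"):
    """Request device='metal' only on Apple Silicon macOS; otherwise CPU."""
    on_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    return {
        "device": "metal" if on_apple_silicon else "cpu",
        "objective": objective,
    }

params = training_params()
```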

Test plan

  • Binary classification correctness vs CPU
  • Regression correctness vs CPU
  • Multiclass correctness vs CPU
  • All 3 histogram kernel variants (max_bin 15/63/255)
  • Dataset sizes 100 to 10,000
  • Bagging mode
  • Constant hessian path
  • gpu_use_dp=true gracefully degrades to FP32
  • Benchmark on multiple Apple Silicon generations (M1/M2/M3/M4)
  • Multi-workgroup reduction for large datasets

🤖 Generated with Claude Code

Hermetian and others added 2 commits March 31, 2026 05:32
Add a native Metal compute shader backend for GPU-accelerated histogram
construction on Apple Silicon Macs. This addresses the gap left by the
deprecated OpenCL backend, which crashes on Apple Silicon (lightgbm-org#6189).

The Metal backend follows the same architecture as the existing OpenCL
GPU backend (histogram-only acceleration; split finding stays on CPU),
but is significantly simpler due to Apple Silicon's unified memory —
no pinned buffers, no async PCIe transfers needed.

New files:
- metal_tree_learner.h/.mm: MetalTreeLearner extending SerialTreeLearner
- metal/histogram{16,64,256}.metal: Metal compute kernels ported from OCL

Modified:
- CMakeLists.txt: USE_METAL option, Metal framework linking, metallib build
- config.h/config.cpp: device_type="metal" support
- tree_learner.cpp: factory method for Metal learner
- parallel/linear learner files: MetalTreeLearner template instantiations

Build: cmake -DUSE_METAL=ON .. (macOS only)
Usage: lgb.train({..., 'device': 'metal'})

Tested on Apple M4 Max with datasets up to 50K rows, all three kernel
variants (max_bin 16/64/256). Max prediction diff vs CPU: <5e-6.

Known limitation: currently uses single workgroup per feature (POWER=0)
due to Metal's lack of cross-threadgroup memory synchronization within
a single dispatch. Multi-workgroup support will follow via a two-pass
reduction kernel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add reduce_histogram{16,64,256} kernels for future multi-workgroup support
- Strip within-kernel cross-threadgroup sync from multi-WG path (unreliable on Metal)
- Keep single-workgroup (POWER=0) as default for guaranteed correctness
- Fix buffer index mismatch: hessians=6, const_hessian=7, output=8, sync=9, hist=10
- Add comprehensive test suite (test_metal.py): 14 tests covering binary, regression,
  multiclass, all 3 kernel variants, scalability to 10K rows, bagging, constant hessian
- All 14 tests pass on Apple M4 Max

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@discordwell
Author

Closing — opened prematurely, needs more work on multi-workgroup reduction before it's ready for review.

@jameslamb
Member

Replaced by #7215, locking this.

@lightgbm-org lightgbm-org locked as resolved and limited conversation to collaborators Apr 1, 2026