
feat: Metal GPU backend for Apple Silicon#7214

Closed
discordwell wants to merge 2 commits into lightgbm-org:master from discordwell:feature/metal-backend

Conversation

@discordwell

Summary

Adds a native Metal compute shader backend for GPU-accelerated histogram construction on Apple Silicon Macs. This fills the gap left by the deprecated OpenCL backend, which crashes on Apple Silicon (#6189).

  • New device_type="metal" option for training on Apple Silicon GPUs
  • Three Metal compute kernels ported from the OpenCL originals (histogram16/64/256)
  • MetalTreeLearner using Objective-C++ with unified memory (no pinned buffers needed)
  • Runtime .metal source compilation when build-time metallib is unavailable
  • 14-test suite covering binary/regression/multiclass, all 3 bin variants, scalability, bagging

Design

Follows the existing OpenCL GPU backend architecture:

  • MetalTreeLearner extends SerialTreeLearner — only histogram construction on GPU
  • Split finding, data partitioning, tree construction remain on CPU
  • Apple Silicon unified memory eliminates PCIe transfer overhead (simpler than OpenCL)
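To make the division of labor concrete, here is a minimal NumPy sketch of the per-feature work the Metal kernels offload: accumulating gradient and hessian sums per bin. (This is a CPU reference illustration of the histogram computation, not the actual kernel code; the function name is illustrative.)

```python
import numpy as np

def feature_histogram(bin_idx, grad, hess, n_bins):
    """CPU reference for the per-feature histogram the GPU builds:
    for each bin, sum the gradients and hessians of the rows in it."""
    hist_grad = np.zeros(n_bins)
    hist_hess = np.zeros(n_bins)
    # np.add.at handles repeated bin indices, like the kernel's atomic adds
    np.add.at(hist_grad, bin_idx, grad)
    np.add.at(hist_hess, bin_idx, hess)
    return hist_grad, hist_hess

bin_idx = np.array([0, 1, 1, 2])
grad = np.array([1.0, 2.0, 3.0, 4.0])
hess = np.ones(4)
hg, hh = feature_histogram(bin_idx, grad, hess, n_bins=3)
```

Split finding then scans these per-bin sums on the CPU, which is why only this accumulation step needs to live on the GPU.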

Test results (Apple M4 Max)

14 passed in 1.60s
- Binary classification, regression, multiclass ✓
- histogram16 (max_bin≤16), histogram64 (max_bin≤64), histogram256 (max_bin≤255) ✓
- Dataset sizes 100 to 10,000 ✓
- Bagging, constant hessian, gpu_use_dp override ✓
- Max prediction diff vs CPU: <5e-6 (for n≥1000)
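The CPU-vs-Metal parity figure above presumably reduces to an elementwise comparison like the following (a hedged sketch of the check, not the test suite's actual code):

```python
import numpy as np

def max_prediction_diff(cpu_preds, gpu_preds):
    """Maximum absolute elementwise difference between two prediction arrays."""
    return float(np.max(np.abs(np.asarray(cpu_preds) - np.asarray(gpu_preds))))

# Toy values: a 0.5 disagreement would fail a 5e-6 tolerance check.
d = max_prediction_diff(np.array([1.0, 2.0]), np.array([1.0, 2.5]))
```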

Known limitations

  • Single workgroup per feature (POWER=0) — limits GPU occupancy for large datasets. Multi-workgroup reduction kernels are included but have a layout mismatch that needs investigation. This will be addressed in a follow-up.
  • The Metal Toolchain must be installed for build-time kernel compilation; without it, the backend falls back to runtime compilation from the .metal source files.
  • macOS only, Apple Silicon only, FP32 only (no gpu_use_dp).
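The planned multi-workgroup fix is a two-pass reduction: each workgroup accumulates a partial histogram over its slice of rows, and a second dispatch sums the partials. A minimal NumPy sketch of that scheme (illustrative only; the real kernels are the bundled reduce_histogram{16,64,256} variants):

```python
import numpy as np

def two_pass_histogram(bin_idx, grad, n_bins, n_workgroups):
    """Pass 1: each 'workgroup' builds a partial histogram over its rows.
    Pass 2: a reduction step sums the partials into the final histogram."""
    partials = np.zeros((n_workgroups, n_bins))
    row_slices = np.array_split(np.arange(len(bin_idx)), n_workgroups)
    for wg, rows in enumerate(row_slices):           # pass 1
        np.add.at(partials[wg], bin_idx[rows], grad[rows])
    return partials.sum(axis=0)                      # pass 2

bin_idx = np.array([0, 2, 1, 2, 0])
grad = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
result = two_pass_histogram(bin_idx, grad, n_bins=3, n_workgroups=2)
```

Because the two dispatches are separate, no cross-threadgroup synchronization is needed within a single dispatch, which is exactly the capability Metal lacks.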

Build

cmake -DUSE_METAL=ON ..
make -j

Usage

params = {'device': 'metal', 'objective': 'binary', ...}
model = lgb.train(params, train_data)
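For portable scripts, a hedged variant of the usage above that only requests the Metal device on Apple Silicon macOS (the platform check and helper name are illustrative, not part of this PR):

```python
import platform

def training_params(objective="binary"):
    """Request device='metal' only on Apple Silicon macOS; otherwise CPU."""
    on_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    return {
        "device": "metal" if on_apple_silicon else "cpu",
        "objective": objective,
    }

params = training_params()
```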

Test plan

  • Binary classification correctness vs CPU
  • Regression correctness vs CPU
  • Multiclass correctness vs CPU
  • All 3 histogram kernel variants (max_bin 15/63/255)
  • Dataset sizes 100 to 10,000
  • Bagging mode
  • Constant hessian path
  • gpu_use_dp=true gracefully degrades to FP32
  • Benchmark on multiple Apple Silicon generations (M1/M2/M3/M4)
  • Multi-workgroup reduction for large datasets

🤖 Generated with Claude Code

Hermetian and others added 2 commits March 31, 2026 05:32
Add a native Metal compute shader backend for GPU-accelerated histogram
construction on Apple Silicon Macs. This addresses the gap left by the
deprecated OpenCL backend, which crashes on Apple Silicon (lightgbm-org#6189).

The Metal backend follows the same architecture as the existing OpenCL
GPU backend (histogram-only acceleration; split finding stays on CPU),
but is significantly simpler due to Apple Silicon's unified memory —
no pinned buffers, no async PCIe transfers needed.

New files:
- metal_tree_learner.h/.mm: MetalTreeLearner extending SerialTreeLearner
- metal/histogram{16,64,256}.metal: Metal compute kernels ported from OCL

Modified:
- CMakeLists.txt: USE_METAL option, Metal framework linking, metallib build
- config.h/config.cpp: device_type="metal" support
- tree_learner.cpp: factory method for Metal learner
- parallel/linear learner files: MetalTreeLearner template instantiations

Build: cmake -DUSE_METAL=ON .. (macOS only)
Usage: lgb.train({..., 'device': 'metal'})

Tested on Apple M4 Max with datasets up to 50K rows, all three kernel
variants (max_bin 16/64/256). Max prediction diff vs CPU: <5e-6.

Known limitation: currently uses single workgroup per feature (POWER=0)
due to Metal's lack of cross-threadgroup memory synchronization within
a single dispatch. Multi-workgroup support will follow via a two-pass
reduction kernel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add reduce_histogram{16,64,256} kernels for future multi-workgroup support
- Strip within-kernel cross-threadgroup sync from multi-WG path (unreliable on Metal)
- Keep single-workgroup (POWER=0) as default for guaranteed correctness
- Fix buffer index mismatch: hessians=6, const_hessian=7, output=8, sync=9, hist=10
- Add comprehensive test suite (test_metal.py): 14 tests covering binary, regression,
  multiclass, all 3 kernel variants, scalability to 10K rows, bagging, constant hessian
- All 14 tests pass on Apple M4 Max

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@discordwell
Author

Closing — opened prematurely, needs more work on multi-workgroup reduction before it's ready for review.

@jameslamb
Member

Replaced by #7215, locking this.

@lightgbm-org lightgbm-org locked as resolved and limited conversation to collaborators Apr 1, 2026