feat: Metal GPU backend for Apple Silicon #7214
Closed
discordwell wants to merge 2 commits into lightgbm-org:master from
Add a native Metal compute shader backend for GPU-accelerated histogram construction on Apple Silicon Macs. This addresses the gap left by the deprecated OpenCL backend, which crashes on Apple Silicon (lightgbm-org#6189).

The Metal backend follows the same architecture as the existing OpenCL GPU backend (histogram-only acceleration; split finding stays on CPU), but is significantly simpler due to Apple Silicon's unified memory: no pinned buffers, no async PCIe transfers needed.

New files:
- metal_tree_learner.h/.mm: MetalTreeLearner extending SerialTreeLearner
- metal/histogram{16,64,256}.metal: Metal compute kernels ported from OCL

Modified:
- CMakeLists.txt: USE_METAL option, Metal framework linking, metallib build
- config.h/config.cpp: device_type="metal" support
- tree_learner.cpp: factory method for Metal learner
- parallel/linear learner files: MetalTreeLearner template instantiations

Build: cmake -DUSE_METAL=ON .. (macOS only)
Usage: lgb.train({..., 'device': 'metal'})

Tested on Apple M4 Max with datasets up to 50K rows, all three kernel variants (max_bin 16/64/256). Max prediction diff vs CPU: <5e-6.

Known limitation: currently uses single workgroup per feature (POWER=0) due to Metal's lack of cross-threadgroup memory synchronization within a single dispatch. Multi-workgroup support will follow via a two-pass reduction kernel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
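To make the "histogram-only acceleration; split finding stays on CPU" division concrete, here is a minimal pure-Python sketch. It is not the PR's kernel or LightGBM's split code: the function names, the toy data, and the simplified L2-regularized gain formula are all assumptions chosen for illustration.

```python
# Illustrative sketch only. build_histogram stands in for the work the PR
# moves to the GPU; best_split stands in for the CPU-side split search.
# All names and the simplified gain formula are hypothetical.

def build_histogram(bins, gradients, hessians, num_bins):
    """Accumulate per-bin gradient and hessian sums for one feature."""
    grad_hist = [0.0] * num_bins
    hess_hist = [0.0] * num_bins
    for b, g, h in zip(bins, gradients, hessians):
        grad_hist[b] += g
        hess_hist[b] += h
    return grad_hist, hess_hist


def best_split(grad_hist, hess_hist, lambda_l2=1.0):
    """Scan bin boundaries and return (threshold_bin, gain) of the best split."""
    total_g = sum(grad_hist)
    total_h = sum(hess_hist)
    parent_score = total_g**2 / (total_h + lambda_l2)
    best = (None, 0.0)
    left_g = left_h = 0.0
    # Rows with bin <= b go left, the rest go right.
    for b in range(len(grad_hist) - 1):
        left_g += grad_hist[b]
        left_h += hess_hist[b]
        right_g = total_g - left_g
        right_h = total_h - left_h
        gain = (left_g**2 / (left_h + lambda_l2)
                + right_g**2 / (right_h + lambda_l2)
                - parent_score)
        if gain > best[1]:
            best = (b, gain)
    return best


# Toy data: six rows of one feature, already discretized into 4 bins.
bins = [0, 0, 1, 2, 2, 3]
gradients = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
hessians = [1.0] * 6

grad_hist, hess_hist = build_histogram(bins, gradients, hessians, num_bins=4)
split_bin, gain = best_split(grad_hist, hess_hist)
```

Only the histogram arrays cross from the accumulation step to the split search, which is why unified memory removes the need for explicit transfer staging in this design.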
- Add reduce_histogram{16,64,256} kernels for future multi-workgroup support
- Strip within-kernel cross-threadgroup sync from multi-WG path (unreliable on Metal)
- Keep single-workgroup (POWER=0) as default for guaranteed correctness
- Fix buffer index mismatch: hessians=6, const_hessian=7, output=8, sync=9, hist=10
- Add comprehensive test suite (test_metal.py): 14 tests covering binary, regression,
  multiclass, all 3 kernel variants, scalability to 10K rows, bagging, constant hessian
- All 14 tests pass on Apple M4 Max
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
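The two-pass reduction mentioned in these commits can be sketched in plain Python. This is an assumption about the general shape of such a scheme, not the code of the reduce_histogram kernels: each simulated workgroup writes a private partial histogram in the first pass, and a second, separate pass sums the partials, so no synchronization between workgroups is needed within a pass. The function names and the contiguous row slicing are hypothetical.

```python
# Illustrative sketch only: simulates a two-pass histogram reduction on the CPU.
# num_workgroups = 1 corresponds to the single-workgroup default (POWER=0).

def partial_histograms(bins, gradients, hessians, num_bins, num_workgroups):
    """Pass 1: each simulated workgroup histograms its own contiguous row slice."""
    n = len(bins)
    chunk = (n + num_workgroups - 1) // num_workgroups  # ceiling division
    partials = []
    for wg in range(num_workgroups):
        grad = [0.0] * num_bins
        hess = [0.0] * num_bins
        for i in range(wg * chunk, min((wg + 1) * chunk, n)):
            grad[bins[i]] += gradients[i]
            hess[bins[i]] += hessians[i]
        partials.append((grad, hess))
    return partials


def reduce_histograms(partials, num_bins):
    """Pass 2: sum the per-workgroup partial histograms bin by bin."""
    grad = [0.0] * num_bins
    hess = [0.0] * num_bins
    for p_grad, p_hess in partials:
        for b in range(num_bins):
            grad[b] += p_grad[b]
            hess[b] += p_hess[b]
    return grad, hess


# Toy data with exactly representable values, so summation order does not matter.
bins = [0, 1, 1, 2, 3, 3, 0, 2]
gradients = [0.5, -0.25, 0.75, 1.0, -1.5, 0.25, 2.0, -0.5]
hessians = [1.0] * 8

single = reduce_histograms(partial_histograms(bins, gradients, hessians, 4, 1), 4)
multi = reduce_histograms(partial_histograms(bins, gradients, hessians, 4, 4), 4)
```

With real float32 data the two results would generally differ in the last bits, because the summation order changes with the number of workgroups.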
Author
Closing: opened prematurely, needs more work on multi-workgroup reduction before it's ready for review.
Member
Replaced by #7215, locking this.
Summary
Adds a native Metal compute shader backend for GPU-accelerated histogram construction on Apple Silicon Macs. This fills the gap left by the deprecated OpenCL backend, which crashes on Apple Silicon (#6189).
- device_type="metal" option for training on Apple Silicon GPUs
Design
Follows the existing OpenCL GPU backend architecture:
- MetalTreeLearner extends SerialTreeLearner: only histogram construction runs on the GPU
Test results (Apple M4 Max)
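The results table is not reproduced in this copy; the commit message reports a maximum prediction difference versus CPU below 5e-6. A check of that kind might look like the sketch below. The helper name and the sample predictions are made up for illustration and are not taken from test_metal.py.

```python
# Illustrative sketch only: compares two prediction vectors element-wise.
# The values below are invented; only the 5e-6 bound comes from the commit message.

def max_abs_diff(cpu_preds, gpu_preds):
    """Return the largest absolute element-wise difference between two vectors."""
    if len(cpu_preds) != len(gpu_preds):
        raise ValueError("prediction vectors must have the same length")
    return max((abs(c - g) for c, g in zip(cpu_preds, gpu_preds)), default=0.0)


TOLERANCE = 5e-6  # bound quoted in the commit message

cpu_preds = [0.125, 0.5, 0.875]               # hypothetical CPU predictions
metal_preds = [0.125001, 0.499999, 0.875002]  # hypothetical Metal predictions

diff = max_abs_diff(cpu_preds, metal_preds)
within_tolerance = diff < TOLERANCE
```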
Known limitations
gpu_use_dp).
Build
Usage
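The usage snippet did not survive in this copy; the commit message gives it as lgb.train({..., 'device': 'metal'}). A minimal sketch along those lines follows, assuming a LightGBM build made with -DUSE_METAL=ON from this PR's branch. The helper names, the choice of max_bin=63, and the other parameter values are illustrative assumptions, and 'metal' is only a valid device value in such a build.

```python
# Illustrative sketch only: assumes a LightGBM build with the Metal backend
# from this PR. metal_params and train_on_metal are hypothetical helpers.

def metal_params(objective="binary", max_bin=63):
    """Build a training parameter dict that selects the Metal device."""
    return {
        "objective": objective,
        "device": "metal",   # device value introduced by this PR
        "max_bin": max_bin,  # arbitrary choice for this example
        "verbose": -1,
    }


def train_on_metal(X, y, num_boost_round=50):
    """Train a booster on the Metal backend; X and y are array-like."""
    # Imported lazily so the parameter helper works without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X, label=y)
    return lgb.train(metal_params(), train_set, num_boost_round=num_boost_round)


params = metal_params()
```

On a stock LightGBM build without this backend, passing device "metal" would be expected to fail at training time, so the training call is kept inside a function rather than run on import.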
Test plan
🤖 Generated with Claude Code