cpu: aarch64: enable ACL's inner-product for BF16 #5024

Merged
jondea merged 1 commit into uxlfoundation:main from fadara01:enable_bf16_ip
Apr 20, 2026
Conversation

@fadara01
Contributor

@fadara01 fadara01 commented Apr 15, 2026

Description

cpu: aarch64: enable ACL's inner-product for BF16

With ACL built from ARM-software/ComputeLibrary#1279, this benchdnn repro is now dispatched to ACL, which is more than 100x faster:

ONEDNN_VERBOSE=all ./tests/benchdnn/benchdnn --ip --mode=C --dir=FWD_I --bia-dt=bf16 --dt=bf16 mb1024ic1024oc1024

This PR + ARM-software/ComputeLibrary#1279 fix: pytorch/pytorch#180447
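
For context on the problem descriptor: mb1024ic1024oc1024 describes an inner product with minibatch 1024, 1024 input channels, and 1024 output channels. The following is a minimal reference sketch of that computation, not the ACL or oneDNN kernel. It assumes row-major src[mb][ic] and wei[oc][ic] layouts, inputs that are already BF16-representable, and f32 accumulation; round_to_bf16 and inner_product_fwd are hypothetical names introduced here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: round an f32 value to the nearest bf16 value
// (ties to even) and return it as f32. NaN handling is omitted.
float round_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u); // round to nearest, ties to even
    bits &= 0xFFFF0000u; // keep sign, exponent, and 7 mantissa bits
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// Reference forward inner product with bias:
//   dst[m][o] = bias[o] + sum_i src[m][i] * wei[o][i]
// Accumulation is in f32; the result is rounded to bf16 on store.
std::vector<float> inner_product_fwd(const std::vector<float> &src,
        const std::vector<float> &wei, const std::vector<float> &bias,
        int mb, int ic, int oc) {
    std::vector<float> dst(static_cast<std::size_t>(mb) * oc);
    for (int m = 0; m < mb; ++m) {
        for (int o = 0; o < oc; ++o) {
            float acc = bias[o];
            for (int i = 0; i < ic; ++i)
                acc += src[m * ic + i] * wei[o * ic + i];
            dst[m * oc + o] = round_to_bf16(acc);
        }
    }
    return dst;
}
```

For the shape in the repro, each call performs 1024 x 1024 x 1024 multiply-accumulates, which is why the choice of implementation matters so much for this case.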

Fixes # (github issue)

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

@fadara01 fadara01 requested a review from a team as a code owner April 15, 2026 14:21
@github-actions github-actions Bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Apr 15, 2026
@fadara01
Contributor Author

cc: @jondea @Sqvid

@fadara01 fadara01 changed the title from "cpu: aarch64: enable BF16 inner-product" to "cpu: aarch64: enable ACL's inner-product for BF16" Apr 15, 2026
@aditew01
Contributor

do you need to upgrade ACL version once this is merged? ARM-software/ComputeLibrary#1279

@fadara01
Contributor Author

fadara01 commented Apr 15, 2026

do you need to upgrade ACL version once this is merged? ARM-software/ComputeLibrary#1279

Nope, I think we can just merge this without updating ACL version - in this case oneDNN will just bail out a bit later when doing ACL validate.
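
The "bail out a bit later" behaviour can be pictured as a priority-ordered implementation list in which the first entry whose init succeeds wins. The sketch below is only a model of that idea under stated assumptions: Problem, Impl, pick_impl, and make_impl_list are hypothetical names, not oneDNN or ACL APIs, and the acl_validates_bf16 flag stands in for whether the linked ACL build accepts BF16 inner product.

```cpp
#include <functional>
#include <string>
#include <vector>

// All names below are hypothetical stand-ins, not oneDNN or ACL APIs.
struct Problem {
    bool is_bf16;
};

struct Impl {
    std::string name;
    // Returns true if this implementation accepts the problem.
    std::function<bool(const Problem &)> init;
};

// Walk the list in priority order and return the first implementation
// whose init succeeds; an empty string means nothing matched.
std::string pick_impl(const std::vector<Impl> &impls, const Problem &p) {
    for (const auto &impl : impls)
        if (impl.init(p)) return impl.name;
    return "";
}

// Build a two-entry list: an ACL-backed entry, then a generic fallback.
std::vector<Impl> make_impl_list(bool acl_validates_bf16) {
    return {
            {"acl",
                    [=](const Problem &p) {
                        // The oneDNN-side checks are modelled as passing;
                        // the ACL-side validate is modelled as rejecting
                        // bf16 when the ACL build lacks support for it.
                        return !p.is_bf16 || acl_validates_bf16;
                    }},
            {"fallback", [](const Problem &) { return true; }},
    };
}
```

In this model, pick_impl(make_impl_list(false), Problem{true}) returns "fallback": the BF16 problem passes the earlier checks but is rejected at the modelled validate step, so dispatch simply moves on to the next entry.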

@aditew01
Contributor

Nope, I think we can just merge this without updating ACL version - in this case oneDNN will just bail out a bit later when doing ACL validate.

Yeah, I was referring to this:

This benchdnn repro now goes to ACL which is > 100x faster

@fadara01
Contributor Author

@aditew01 - I see your point, I updated the description.

@fadara01
Contributor Author

I assume we have CI here to test oneDNN main vs ACL main? In which case let's wait for the ACL PR ARM-software/ComputeLibrary#1279 to hit main before we merge this.

        && attr()->has_default_values(
                smask_t::post_ops | smask_t::fpmath_mode, f32);
const bool is_bf16_ok = expect_data_types(bf16, bf16, bf16, bf16, undef)
        && attr()->has_default_values(smask_t::post_ops, bf16);
Contributor

I'm not sure if this extra check is needed here / is covered in ACL.

Suggested change
- && attr()->has_default_values(smask_t::post_ops, bf16);
+ && attr()->has_default_values(smask_t::post_ops, bf16) && platform::has_data_type_support(data_type::bf16);

Contributor Author

@fadara01 fadara01 Apr 16, 2026

I'm not sure tbh, I think most of these checks are redundant anyways because the same checks will happen by calls to validate / has_opt_impl across ACL interfaces from CpuFullyConnected all the way to arm_gemm.

We have c6g CI here, we can test this in CI once the ACL PR is merged
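
To make the trade-off in this thread concrete, here is a small self-contained model of the suggested check. It is a sketch under stated assumptions, not the code in the PR: DataType, Config, platform_supports, and is_bf16_ok are hypothetical stand-ins, and the cpu_has_bf16 flag replaces a real platform query such as the platform::has_data_type_support call in the suggestion.

```cpp
// All names below are hypothetical stand-ins for illustration only.
enum class DataType { undef, f32, bf16 };

struct Config {
    DataType src, wei, bia, dst;
    bool attrs_supported; // models the attr()->has_default_values(...) result
};

// Stand-in for a platform query: only bf16 depends on the CPU here.
bool platform_supports(DataType dt, bool cpu_has_bf16) {
    return dt != DataType::bf16 || cpu_has_bf16;
}

// Models the suggested check: all-bf16 data types, supported attributes,
// and an explicit platform check that rejects early on CPUs without bf16.
bool is_bf16_ok(const Config &c, bool cpu_has_bf16) {
    const bool all_bf16 = c.src == DataType::bf16 && c.wei == DataType::bf16
            && c.bia == DataType::bf16 && c.dst == DataType::bf16;
    return all_bf16 && c.attrs_supported
            && platform_supports(DataType::bf16, cpu_has_bf16);
}
```

With the explicit platform term, a BF16 configuration is rejected before any ACL call when the CPU lacks BF16 support. Without it, the same rejection is expected to come from the downstream validate calls, which is the redundancy discussed above.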

@jondea
Contributor

jondea commented Apr 16, 2026

I assume we have CI here to test oneDNN main vs ACL main

I don't think we have this currently

@fadara01
Contributor Author

I don't think we have this currently

okay, I'll run the nightly test suite locally with both the oneDNN and ACL PRs.

@fadara01
Contributor Author

OK, I built:

then I ran the full nightly suite (246 tests) with ctest on Neoverse-V2 and everything passed

100% tests passed, 0 tests failed out of 246

@jondea jondea merged commit 8318721 into uxlfoundation:main Apr 20, 2026
25 checks passed
@Sqvid
Contributor

Sqvid commented Apr 20, 2026

@fadara01 If you want this change in the oneDNN 3.12 release then please raise a backport PR.

@fadara01
Contributor Author

@Sqvid - I raised this backport PR against 3.12: #5061

Labels

platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64

Development

Successfully merging this pull request may close these issues.

Poor BF16 torch.compile + Freezing perf on AArch64 CPUs

4 participants