
[GPU] a non-blocking, profiler-based verbose mode for execution time tracking #4187

Closed
avmanerikar wants to merge 14 commits into main from amanerik/main/async-verbose-mode-new

Conversation

Contributor

@avmanerikar avmanerikar commented Oct 22, 2025

Description

This PR implements non-blocking verbose profiling to record and print primitive execution times without relying on blocking, host-side stream.wait() calls. The implementation aims to triage two issues associated with the previous blocking approach, highlighted in MFDNN-12088:
(i) timing inaccuracies arising from host-to-device synchronization overheads and
(ii) performance discrepancies related to blocking stream.wait() calls.

Following the discussions in the RFC (PR #3393) and the PoC implementation (PR #3055), this PR implements the non-blocking verbose mode by leveraging the stream_profiler functionality to obtain device-measured times for primitive execution and using asynchronous callbacks to print profiling info in a non-blocking manner. The implementation proposes the following changes to verbose profiling:

  • enable stream profiling capabilities when DNNL_VERBOSE=profile_exec for GPU streams.
  • calculate primitive execution times by aggregating the timing data obtained from the stream profiler at the end of primitive execution.
  • use asynchronous callbacks to print verbose profiling info upon output event completion.
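A minimal sketch of the pattern these changes describe (hypothetical code, not oneDNN's actual API): device timestamps from the stream profiler are aggregated into an exec time, and a completion callback prints the verbose line so the host thread never blocks just for logging.

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch (not oneDNN's implementation). The host enqueues a
// primitive, attaches a completion callback that formats the verbose line
// from device timestamps, and returns immediately; the runtime fires the
// callback when the output event completes, so no host-side stream.wait()
// is needed for logging.
struct profiler_entry_t {
    // Device timestamps in nanoseconds, as reported by the stream profiler.
    uint64_t start_ns = 0, end_ns = 0;
};

// Aggregate the profiler entries for one primitive into an exec time in ms.
inline double exec_time_ms(const std::vector<profiler_entry_t> &entries) {
    uint64_t total_ns = 0;
    for (const auto &e : entries)
        total_ns += e.end_ns - e.start_ns;
    return total_ns / 1e6;
}

// Stand-in for the runtime's event-callback registration mechanism
// (clSetEventCallback for OpenCL, host_task for SYCL in the actual PR).
// Here the "event" is simply a deferred invocation the caller triggers.
inline std::function<void()> make_verbose_callback(
        std::vector<profiler_entry_t> entries) {
    return [entries = std::move(entries)]() {
        std::printf("onednn_verbose,...,exec_time:%g\n", exec_time_ms(entries));
    };
}
```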

Addresses MFDNN-12088.

Considerations

  • The non-blocking verbose mode is enabled for OpenCL and SYCL GPU streams only, defaulting to a blocking implementation for CPU streams and other runtimes.
  • The non-blocking verbose mode is not supported for out-of-order queues and falls back to the blocking mode in such cases.
  • The asynchronous mode can be disabled by the user via DNNL_DISABLE_ASYNC_VERBOSE to default to a blocking implementation in all cases.
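The fallback rules above can be summarized in a single predicate. This is a sketch with a hypothetical signature; the real knob handling lives inside oneDNN's verbose/stream code.

```cpp
#include <cstring>

// Sketch of the fallback decision described above (illustrative only).
// Async verbose is used only when: the stream is a GPU stream, the queue is
// in-order, and the user has not forced the synchronous path through the
// environment (any non-"0" value of DNNL_DISABLE_ASYNC_VERBOSE is assumed
// here to mean "disable async").
inline bool use_async_verbose(bool is_gpu_stream, bool in_order_queue,
        const char *env_val /* value of DNNL_DISABLE_ASYNC_VERBOSE, may be null */) {
    bool user_disabled = env_val && std::strcmp(env_val, "0") != 0;
    return is_gpu_stream && in_order_queue && !user_disabled;
}
```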

Summary of Changes:

  • commit (0af87f0 ... a780588): Defines the API to enable the asynchronous mode for supported runtimes when verbose profiling is enabled (ONEDNN_VERBOSE=profile_exec or ONEDNN_VERBOSE=1).
    Adds a runtime knob ONEDNN_VERBOSE_USE_SYNC to force-disable the asynchronous mode during verbose logging.

  • commit (89b0693 ... e51ee1e): Adds support for asynchronous profiling for OpenCL GPU runtimes. Defines logic for execution tracking via OpenCL event callbacks and computing device measured exec times.

  • commit (c4b5c0d ... 7d6bd9b): Adds support for asynchronous profiling for SYCL GPU runtimes. Defines similar logic for execution tracking using host_task.

Thanks @echeresh for the suggested optimizations.

Reproducer:

DNNL_VERBOSE=profile_exec ./examples/primitives-matmul-cpp gpu

With ONEDNN_VERBOSE_USE_SYNC=0:

onednn_verbose,v1,info,oneDNN v3.13.0 (commit 0fadac14800ba792a9d991b93b4917509eada63b)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:224
onednn_verbose,v1,info,cpu,isa:Intel AVX10.1 and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2 
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f32::blocked:abc::f0 wei:f32::blocked:abc::f0 bia:f32::blocked:abc::f0_mask4 dst:f32::blocked:abc::f0,attr-post-ops:eltwise_relu,,3x128x256:3x256x512,0.01904
Example passed on GPU.

With ONEDNN_VERBOSE_USE_SYNC=1:

onednn_verbose,v1,info,oneDNN v3.13.0 (commit 0fadac14800ba792a9d991b93b4917509eada63b)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:224
onednn_verbose,v1,info,cpu,isa:Intel AVX10.1 and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2 
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f32::blocked:abc::f0 wei:f32::blocked:abc::f0 bia:f32::blocked:abc::f0_mask4 dst:f32::blocked:abc::f0,attr-post-ops:eltwise_relu,,3x128x256:3x256x512,0.473145
Example passed on GPU.
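To compare the two runs, the trailing exec_time field can be extracted from an exec line with a small helper (illustrative, not part of oneDNN): the async run above reports 0.01904 ms versus 0.473145 ms on the blocking path, showing the synchronization overhead the PR removes from the measurement.

```cpp
#include <string>

// Pull the trailing exec_time field (in ms) out of an onednn_verbose exec
// line, e.g. "...,3x128x256:3x256x512,0.01904". Returns -1.0 if the line
// has no comma-separated fields. Helper for illustration only.
inline double parse_exec_time_ms(const std::string &line) {
    auto pos = line.rfind(',');
    if (pos == std::string::npos) return -1.0;
    return std::stod(line.substr(pos + 1));
}
```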

Related RFC: [link]
Related PoC and discussion: [link]

RFC Checklist:

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

@avmanerikar avmanerikar requested review from a team as code owners October 22, 2025 17:58
@avmanerikar avmanerikar marked this pull request as draft October 22, 2025 17:58
@github-actions github-actions Bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Oct 22, 2025
@avmanerikar avmanerikar changed the title [WIP][GPU] common: verbose: a non-blocking, profiler-based verbose mode for execution time tracking [WIP][GPU] a non-blocking, profiler-based verbose mode for execution time tracking Oct 22, 2025
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch 2 times, most recently from 2f47ef5 to 6e25fb3 Compare October 22, 2025 19:47
@echeresh
Contributor

echeresh commented Oct 24, 2025

Because benchdnn also uses the stream_profiler functionality in perf mode, the verbose profiler is disabled when benchdnn is run in perf mode to record aggregate timing info.

There is a similar conflict with the profiling API: https://uxlfoundation.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling which makes the verbose functionality and the profiling API mutually exclusive.

I think this could be avoided by introducing a new stream_profiler_t object (dedicated only for verbose-related timing) and extending get_profiling_data() and similar APIs. This seems to be a better option to me as we will be able to keep benchmarking, verbose functionality and profiling API working independently, without extra cross-conditions.

@avmanerikar Do you see any issues with that?

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 6e25fb3 to 48a2890 Compare October 27, 2025 16:59
@github-actions github-actions Bot added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Oct 27, 2025
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 48a2890 to 5f783ba Compare October 27, 2025 17:32
@github-actions github-actions Bot removed component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch labels Oct 27, 2025
@avmanerikar
Contributor Author

@avmanerikar Do you see any issues with that?

The trouble with having multiple profiler objects is that they would track the same information and use the same APIs throughout the stream lifetime.
The conflicting behavior was due to the verbose profiling data being reset after primitive execution, leaving an empty slate for profiling queries post-execution.
I have adjusted the design to remove the resetting action, which allows it to be used with the benchdnn perf mode and the experimental profiler.

@avmanerikar avmanerikar marked this pull request as ready for review October 27, 2025 19:21
@avmanerikar avmanerikar requested a review from a team as a code owner October 27, 2025 19:21
@avmanerikar avmanerikar changed the title [WIP][GPU] a non-blocking, profiler-based verbose mode for execution time tracking [GPU] a non-blocking, profiler-based verbose mode for execution time tracking Oct 27, 2025
@echeresh
Contributor

I have adjusted the design to removing the resetting action which allows it to be used with the benchdnn perf mode and the experimental profiler.

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

@avmanerikar
Contributor Author

avmanerikar commented Oct 29, 2025

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

@echeresh
Contributor

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

That makes sense. Alas, such rare issues are usually the hardest to debug. Can we somehow make this scenario "never happen" instead of "unlikely to happen"?

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 5f783ba to d730219 Compare December 3, 2025 19:52
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from d730219 to 473b235 Compare December 17, 2025 00:30
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

Comment thread src/common/stream.cpp
Comment thread src/gpu/intel/ocl/stream.cpp Outdated
Comment thread src/xpu/ocl/stream_impl.hpp Outdated
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 23eb676 to ce0cb96 Compare March 16, 2026 23:59
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 43bb3ce to 75556e1 Compare April 6, 2026 17:59
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch 3 times, most recently from 09bdebb to 8396246 Compare April 22, 2026 21:46
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 8396246 to f247d05 Compare April 28, 2026 20:22
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from f247d05 to c99e5f3 Compare April 28, 2026 20:43
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

Comment thread src/xpu/ocl/stream_profiler.cpp
xpu::ocl::event_t::from(out_dep).events = {event};
}

ocl_stream->profiler().register_event(
Contributor

@echeresh echeresh Apr 28, 2026


Random place: I didn't find if events_ are ever cleared during the stream lifetime. It might be a problem for long runs with enabled verbose - we need some logic to clear up synced events to limit memory usage growth.

Contributor Author


It is possible to clear events_ as part of before_exec_hook(), since the callbacks operate on the event snapshot and are not affected by changes to events_. But doing so affects concurrent use of the profiler with the benchdnn perf mode and with the experimental profiler. In both cases, the profiling info is collected post-execution and requires all events_ to be retained.
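The cleanup being discussed could look roughly like the following (hypothetical types, not the PR's code): completed events are dropped so memory does not grow without bound during long verbose runs, at the cost of the post-execution queries mentioned above, which need all events retained.

```cpp
#include <algorithm>
#include <vector>

// Sketch of bounding memory growth by retiring synced events. In a real
// stream profiler the "completed" check would be a runtime query on the
// event's execution status.
struct event_entry_t {
    bool completed = false;
};

// Erase every entry whose event has completed, keeping pending ones.
inline void drop_synced_events(std::vector<event_entry_t> &events) {
    events.erase(std::remove_if(events.begin(), events.end(),
                         [](const event_entry_t &e) { return e.completed; }),
            events.end());
}
```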

Comment thread src/xpu/ocl/stream_profiler.cpp Outdated
Comment thread src/common/stream.hpp
Comment thread src/common/stream.hpp Outdated
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

@avmanerikar
Contributor Author

avmanerikar commented May 4, 2026

Closing in favor of the alternative polling-based approach in PR #5102, which relies on periodic event polling to record and log profiling info. The motivation for that approach is a unified implementation of the asynchronous mode that does not rely on runtime-specific callback APIs.

The implementation addresses points that the callback-based approach fails to triage:

  • Event polling does not rely on a runtime-specific callback API (required for supporting the L0 runtime).
  • The callback method does not triage MFDNN-14331 for SYCL graphs. Failing reproducer for SYCL Graph:
    DNNL_VERBOSE=1 .\tests\benchdnn\benchdnn.exe --eltwise --engine=gpu --execution-mode=graph 1
  • Extension to the CPU threadpool is difficult with the callback-based method.
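A minimal, runtime-agnostic sketch of the polling idea (hypothetical types, not PR #5102's actual code): a queue of pending records is scanned periodically, and records whose event reports completion are logged and retired, with no runtime-specific callback API involved.

```cpp
#include <deque>
#include <functional>

// One pending verbose record per enqueued primitive. The std::function
// members stand in for a runtime event-status query and for printing the
// verbose line.
struct pending_record_t {
    std::function<bool()> is_complete;
    std::function<void()> log;
};

// One polling pass; returns how many records were logged and retired.
// Events in an in-order queue complete front-to-back, so the scan stops at
// the first incomplete record.
inline int poll_once(std::deque<pending_record_t> &pending) {
    int logged = 0;
    while (!pending.empty() && pending.front().is_complete()) {
        pending.front().log();
        pending.pop_front();
        ++logged;
    }
    return logged;
}
```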


Labels

component:common documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel
