
[GPU] a non-blocking, profiler-based verbose mode for execution time tracking #4187

Closed
avmanerikar wants to merge 14 commits into main from amanerik/main/async-verbose-mode-new

Conversation

Contributor

@avmanerikar avmanerikar commented Oct 22, 2025

Description

This PR implements non-blocking verbose profiling to record and print primitive execution times without relying on blocking, host-side stream.wait() calls. The implementation aims to triage two issues associated with the previous blocking approach, highlighted in MFDNN-12088:
(i) timing inaccuracies arising from host-to-device synchronization overheads and
(ii) performance discrepancies related to blocking stream.wait() calls.

Following the discussions in the RFC (PR #3393) and the PoC implementation (PR #3055), this PR implements the non-blocking verbose mode by leveraging the stream_profiler functionality to obtain device-measured times for primitive execution and using asynchronous callbacks to print profiling info in a non-blocking manner. The implementation proposes the following changes to verbose profiling:

  • enable stream profiling capabilities when DNNL_VERBOSE=profile_exec for GPU streams.
  • calculate primitive execution times by aggregating the timing data obtained from the stream profiler at the end of primitive execution.
  • use asynchronous callbacks to print verbose profiling info upon output event completion.
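A minimal sketch of the pattern these changes describe (hypothetical code, not oneDNN's actual API): device timestamps from the stream profiler are aggregated into an exec time, and a completion callback prints the verbose line so the host thread never blocks just for logging.

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch (not oneDNN's implementation). The host enqueues a
// primitive, attaches a completion callback that formats the verbose line
// from device timestamps, and returns immediately; the runtime fires the
// callback when the output event completes, so no host-side stream.wait()
// is needed for logging.
struct profiler_entry_t {
    // Device timestamps in nanoseconds, as reported by the stream profiler.
    uint64_t start_ns = 0, end_ns = 0;
};

// Aggregate the profiler entries for one primitive into an exec time in ms.
inline double exec_time_ms(const std::vector<profiler_entry_t> &entries) {
    uint64_t total_ns = 0;
    for (const auto &e : entries)
        total_ns += e.end_ns - e.start_ns;
    return total_ns / 1e6;
}

// Stand-in for the runtime's event-callback registration mechanism
// (clSetEventCallback for OpenCL, host_task for SYCL in the actual PR).
// Here the "event" is simply a deferred invocation the caller triggers.
inline std::function<void()> make_verbose_callback(
        std::vector<profiler_entry_t> entries) {
    return [entries = std::move(entries)]() {
        std::printf("onednn_verbose,...,exec_time:%g\n", exec_time_ms(entries));
    };
}
```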

Addresses MFDNN-12088.

Considerations

  • The non-blocking verbose mode is enabled for OpenCL and SYCL GPU streams only, defaulting to a blocking implementation for CPU streams and other runtimes.
  • The non-blocking verbose mode is not supported for out-of-order queues and falls back to the blocking mode in such cases.
  • The asynchronous mode can be disabled by the user via DNNL_DISABLE_ASYNC_VERBOSE to default to a blocking implementation in all cases.
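The fallback rules above can be summarized in a single predicate. This is a sketch with a hypothetical signature; the real knob handling lives inside oneDNN's verbose/stream code.

```cpp
#include <cstring>

// Sketch of the fallback decision described above (illustrative only).
// Async verbose is used only when: the stream is a GPU stream, the queue is
// in-order, and the user has not forced the synchronous path through the
// environment (any non-"0" value of DNNL_DISABLE_ASYNC_VERBOSE is assumed
// here to mean "disable async").
inline bool use_async_verbose(bool is_gpu_stream, bool in_order_queue,
        const char *env_val /* value of DNNL_DISABLE_ASYNC_VERBOSE, may be null */) {
    bool user_disabled = env_val && std::strcmp(env_val, "0") != 0;
    return is_gpu_stream && in_order_queue && !user_disabled;
}
```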

Summary of Changes:

  • commit (0af87f0 ... a780588): Defines the API to enable the asynchronous mode for supported runtimes when verbose profiling is enabled (ONEDNN_VERBOSE=profile_exec or ONEDNN_VERBOSE=1).
    Adds a runtime knob ONEDNN_VERBOSE_USE_SYNC to force-disable the asynchronous mode during verbose logging.

  • commit (89b0693 ... e51ee1e): Adds support for asynchronous profiling for OpenCL GPU runtimes. Defines logic for execution tracking via OpenCL event callbacks and computing device measured exec times.

  • commit (c4b5c0d ... 7d6bd9b): Adds support for asynchronous profiling for SYCL GPU runtimes. Defines similar logic for execution tracking using host_task.

Thanks @echeresh for the suggested optimizations.

Reproducer:

DNNL_VERBOSE=profile_exec ./examples/primitives-matmul-cpp gpu

With ONEDNN_VERBOSE_USE_SYNC=0:

onednn_verbose,v1,info,oneDNN v3.13.0 (commit 0fadac14800ba792a9d991b93b4917509eada63b)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:224
onednn_verbose,v1,info,cpu,isa:Intel AVX10.1 and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2 
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f32::blocked:abc::f0 wei:f32::blocked:abc::f0 bia:f32::blocked:abc::f0_mask4 dst:f32::blocked:abc::f0,attr-post-ops:eltwise_relu,,3x128x256:3x256x512,0.01904
Example passed on GPU.

With ONEDNN_VERBOSE_USE_SYNC=1:

onednn_verbose,v1,info,oneDNN v3.13.0 (commit 0fadac14800ba792a9d991b93b4917509eada63b)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:224
onednn_verbose,v1,info,cpu,isa:Intel AVX10.1 and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,v1,info,gpu,runtime:OpenCL
onednn_verbose,v1,info,gpu,engine,opencl device count:2 
onednn_verbose,v1,info,gpu,engine,0,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,name:Intel(R) Data Center GPU Max 1100,driver_version:25.18.33578,binary_kernels:enabled
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f32::blocked:abc::f0 wei:f32::blocked:abc::f0 bia:f32::blocked:abc::f0_mask4 dst:f32::blocked:abc::f0,attr-post-ops:eltwise_relu,,3x128x256:3x256x512,0.473145
Example passed on GPU.
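To compare the two runs, the trailing exec_time field can be extracted from an exec line with a small helper (illustrative, not part of oneDNN): the async run above reports 0.01904 ms versus 0.473145 ms on the blocking path, showing the synchronization overhead the PR removes from the measurement.

```cpp
#include <string>

// Pull the trailing exec_time field (in ms) out of an onednn_verbose exec
// line, e.g. "...,3x128x256:3x256x512,0.01904". Returns -1.0 if the line
// has no comma-separated fields. Helper for illustration only.
inline double parse_exec_time_ms(const std::string &line) {
    auto pos = line.rfind(',');
    if (pos == std::string::npos) return -1.0;
    return std::stod(line.substr(pos + 1));
}
```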

Related RFC: [link]
Related PoC and discussion: [link]

RFC Checklist:

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

@avmanerikar avmanerikar requested review from a team as code owners October 22, 2025 17:58
@avmanerikar avmanerikar marked this pull request as draft October 22, 2025 17:58
@github-actions github-actions Bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Oct 22, 2025
@avmanerikar avmanerikar changed the title [WIP][GPU] common: verbose: a non-blocking, profiler-based verbose mode for execution time tracking [WIP][GPU] a non-blocking, profiler-based verbose mode for execution time tracking Oct 22, 2025
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch 2 times, most recently from 2f47ef5 to 6e25fb3 Compare October 22, 2025 19:47
@echeresh
Contributor

echeresh commented Oct 24, 2025

Because benchdnn also uses the stream_profiler functionality in perf mode, the verbose profiler is disabled when benchdnn is run in perf mode to record aggregate timing info.

There is a similar conflict with the profiling API: https://uxlfoundation.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling which makes the verbose functionality and the profiling API mutually exclusive.

I think this could be avoided by introducing a new stream_profiler_t object (dedicated only for verbose-related timing) and extending get_profiling_data() and similar APIs. This seems to be a better option to me as we will be able to keep benchmarking, verbose functionality and profiling API working independently, without extra cross-conditions.

@avmanerikar Do you see any issues with that?

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 6e25fb3 to 48a2890 Compare October 27, 2025 16:59
@github-actions github-actions Bot added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Oct 27, 2025
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 48a2890 to 5f783ba Compare October 27, 2025 17:32
@github-actions github-actions Bot removed component:graph-api Codeowner: @oneapi-src/onednn-graph component:tests Codeowner: @oneapi-src/onednn-arch labels Oct 27, 2025
@avmanerikar
Contributor Author

@avmanerikar Do you see any issues with that?

The trouble with having multiple profiler objects is that they would track the same information and use the same APIs throughout the stream lifetime.
The conflicting behavior was due to the verbose profiling data being reset after primitive execution, leaving an empty slate for profiling queries post-execution.
I have adjusted the design to remove the resetting action, which allows it to be used with the benchdnn perf mode and the experimental profiler.

@avmanerikar avmanerikar marked this pull request as ready for review October 27, 2025 19:21
@avmanerikar avmanerikar requested a review from a team as a code owner October 27, 2025 19:21
@avmanerikar avmanerikar changed the title [WIP][GPU] a non-blocking, profiler-based verbose mode for execution time tracking [GPU] a non-blocking, profiler-based verbose mode for execution time tracking Oct 27, 2025
@echeresh
Contributor

I have adjusted the design to removing the resetting action which allows it to be used with the benchdnn perf mode and the experimental profiler.

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

@avmanerikar
Contributor Author

avmanerikar commented Oct 29, 2025

@avmanerikar Thanks for the update. Is concurrent access from the event callback and from profiling API a concern here?

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

@echeresh
Contributor

This would be a concern if the profiler is reset by the experimental API while the verbose profiler is yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.

That makes sense. Alas, such rare issues are usually the hardest to debug. Can we somehow make this scenario "never happen" instead of "unlikely to happen"?

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 5f783ba to d730219 Compare December 3, 2025 19:52
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from d730219 to 473b235 Compare December 17, 2025 00:30
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

Comment thread src/common/stream.cpp
Comment thread src/gpu/intel/ocl/stream.cpp Outdated
Comment thread src/xpu/ocl/stream_impl.hpp Outdated
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 23eb676 to ce0cb96 Compare March 16, 2026 23:59
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 43bb3ce to 75556e1 Compare April 6, 2026 17:59
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch 3 times, most recently from 09bdebb to 8396246 Compare April 22, 2026 21:46
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from 8396246 to f247d05 Compare April 28, 2026 20:22
@avmanerikar avmanerikar force-pushed the amanerik/main/async-verbose-mode-new branch from f247d05 to c99e5f3 Compare April 28, 2026 20:43
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

Comment thread src/xpu/ocl/stream_profiler.cpp
xpu::ocl::event_t::from(out_dep).events = {event};
}

ocl_stream->profiler().register_event(
Contributor

@echeresh echeresh Apr 28, 2026


Random place: I didn't find if events_ are ever cleared during the stream lifetime. It might be a problem for long runs with enabled verbose - we need some logic to clear up synced events to limit memory usage growth.

Contributor Author


It is possible to clear events_ as part of before_exec_hook(), since the callbacks operate on the event snapshot and are not affected by changes to events_. But doing so affects concurrent use of the profiler with the benchdnn perf mode and with the experimental profiler. In both cases, the profiling info is collected post-execution and requires all events_ to be retained.
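The cleanup being discussed could look roughly like the following (hypothetical types, not the PR's code): completed events are dropped so memory does not grow without bound during long verbose runs, at the cost of the post-execution queries mentioned above, which need all events retained.

```cpp
#include <algorithm>
#include <vector>

// Sketch of bounding memory growth by retiring synced events. In a real
// stream profiler the "completed" check would be a runtime query on the
// event's execution status.
struct event_entry_t {
    bool completed = false;
};

// Erase every entry whose event has completed, keeping pending ones.
inline void drop_synced_events(std::vector<event_entry_t> &events) {
    events.erase(std::remove_if(events.begin(), events.end(),
                         [](const event_entry_t &e) { return e.completed; }),
            events.end());
}
```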

Comment thread src/xpu/ocl/stream_profiler.cpp Outdated
Comment thread src/common/stream.hpp
Comment thread src/common/stream.hpp Outdated
@avmanerikar
Contributor Author

make test
disable benchdnn_all
set test_scope=NIGHTLY
disable test_device_cpu
enable test_device_gpu

@avmanerikar
Contributor Author

avmanerikar commented May 4, 2026

Closing in favor of the alternative polling-based approach in PR #5102, which relies on periodic event polling to record and log profiling info. The motivation for that approach is a unified implementation of the asynchronous mode that does not rely on runtime-specific callback APIs.

The implementation addresses points that the callback-based approach fails to triage:

  • Event polling does not rely on a runtime-specific callback API (required for supporting the L0 runtime).
  • The callback method does not triage MFDNN-14331 for SYCL graphs. Failing reproducer for SYCL Graph:
    DNNL_VERBOSE=1 .\tests\benchdnn\benchdnn.exe --eltwise --engine=gpu --execution-mode=graph 1
  • Extension to the CPU threadpool is difficult with the callback-based method.
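A minimal, runtime-agnostic sketch of the polling idea (hypothetical types, not PR #5102's actual code): a queue of pending records is scanned periodically, and records whose event reports completion are logged and retired, with no runtime-specific callback API involved.

```cpp
#include <deque>
#include <functional>

// One pending verbose record per enqueued primitive. The std::function
// members stand in for a runtime event-status query and for printing the
// verbose line.
struct pending_record_t {
    std::function<bool()> is_complete;
    std::function<void()> log;
};

// One polling pass; returns how many records were logged and retired.
// Events in an in-order queue complete front-to-back, so the scan stops at
// the first incomplete record.
inline int poll_once(std::deque<pending_record_t> &pending) {
    int logged = 0;
    while (!pending.empty() && pending.front().is_complete()) {
        pending.front().log();
        pending.pop_front();
        ++logged;
    }
    return logged;
}
```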


Labels

component:common documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel
