
rfcs: proposal for an asynchronous verbose mode #3393

Open
avmanerikar wants to merge 1 commit into rfcs from amanerik/rfcs/async-verbose-mode


@avmanerikar commented Jun 4, 2025

Link to rendered document: [link]
Link to PoC implementation: [link]

Addresses MFDNN-13603 and MFDNN-12088.

@avmanerikar requested a review from a team as a code owner June 4, 2025 18:51
@github-actions bot added the RFC (A design document) label Jun 4, 2025
@avmanerikar force-pushed the amanerik/rfcs/async-verbose-mode branch 2 times, most recently from 2631f08 to 4967b8c, June 16, 2025 17:51
@avmanerikar force-pushed the amanerik/rfcs/async-verbose-mode branch from 4967b8c to ce5811f, June 25, 2025 18:40
@avmanerikar force-pushed the amanerik/rfcs/async-verbose-mode branch from ce5811f to e1f1c58, July 7, 2025 19:56
@avmanerikar force-pushed the amanerik/rfcs/async-verbose-mode branch from e1f1c58 to 5794dea, August 6, 2025 00:38
@avmanerikar commented (Contributor, Author):

The PoC implementation for asynchronous verbose profiling fails in the following scenario:

The asynchronous callbacks that calculate and print profiling information rely on stream_profiler_t to access event execution times. If a callback is delayed until after the stream has been destroyed, the profiler it needs no longer exists, and accessing it during timing computation can trigger a segmentation fault.

The following example reproduces this scenario for OpenCL GPU runtimes:

ONEDNN_VERBOSE_USE_SYNC=0 ONEDNN_VERBOSE=1 ./tests/benchdnn/benchdnn -v5 --engine=gpu --matmul --mode=R --repeats-per-prb=80000 128x128:128x128

The fix requires a non-blocking mechanism that extends the lifetime of the profiler and its registered events until the callbacks have finished verbose logging.

Candidate solutions:

  • Pass a snapshot of the profiler events in the callback payload. The caveat is that this approach involves mutex-locked operations.
  • Defer stream destruction until the callbacks have executed. This approach is blocking, but the blocking occurs during stream destruction rather than during primitive execution.
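
The two candidates can be illustrated with a minimal, self-contained C++ sketch. Everything in it is hypothetical: `event_record_t`, `toy_profiler_t`, and `toy_stream_t` are toy stand-ins and not oneDNN's `stream_profiler_t` or stream classes, and plain `std::thread` objects stand in for the runtime's asynchronous callbacks. It only shows the lifetime reasoning, not the PoC implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical profiled event (not a oneDNN type).
struct event_record_t {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Toy profiler that owns the event records the verbose callbacks need.
class toy_profiler_t {
public:
    void register_event(event_record_t e) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(e);
    }
    // Candidate 1: copy the events under the lock so that a callback can
    // carry its own data and never touch the profiler again.
    std::vector<event_record_t> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return events_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<event_record_t> events_;
};

// Timing computation that works on a snapshot only.
inline std::uint64_t total_ns(const std::vector<event_record_t> &events) {
    std::uint64_t sum = 0;
    for (const auto &e : events) sum += e.end_ns - e.start_ns;
    return sum;
}

// Toy stream that owns a profiler and runs callbacks on worker threads.
class toy_stream_t {
public:
    // Candidate 2: block in the destructor until every callback has run.
    // The destructor body executes before members are destroyed, so the
    // profiler is still alive while the callbacks finish.
    ~toy_stream_t() {
        for (auto &w : workers_)
            if (w.joinable()) w.join();
    }
    toy_profiler_t &profiler() { return profiler_; }
    void enqueue_callback(std::function<void()> cb) {
        workers_.emplace_back(std::move(cb));
    }

private:
    toy_profiler_t profiler_;
    std::vector<std::thread> workers_;
};

// Candidate 1 in action: the callback outlives the profiler safely
// because it only reads its own copy of the events.
inline std::uint64_t demo_snapshot() {
    std::uint64_t result = 0;
    std::thread late_callback;
    {
        toy_profiler_t profiler;
        profiler.register_event({100, 250});
        profiler.register_event({300, 400});
        auto events = profiler.snapshot();
        late_callback = std::thread([events = std::move(events), &result] {
            result = total_ns(events);
        });
    } // profiler is destroyed here; the callback may still be running
    late_callback.join();
    return result;
}

// Candidate 2 in action: the callback reads the live profiler, and the
// stream destructor waits for it before the profiler goes away.
inline std::uint64_t demo_deferred_destruction() {
    std::uint64_t result = 0;
    {
        toy_stream_t stream;
        stream.profiler().register_event({0, 40});
        stream.profiler().register_event({50, 60});
        toy_profiler_t *profiler = &stream.profiler();
        stream.enqueue_callback([profiler, &result] {
            result = total_ns(profiler->snapshot());
        });
    } // ~toy_stream_t joins the worker before destroying the profiler
    return result;
}
```

In this toy model, `demo_snapshot()` pays for a locked copy on the submitting thread, while `demo_deferred_destruction()` moves all waiting into the stream destructor. This mirrors the trade-off described in the two bullets above.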
