[GPU] a non-blocking, profiler-based verbose mode for execution time tracking #4187
avmanerikar wants to merge 14 commits into main
There is a similar conflict with the profiling API (https://uxlfoundation.github.io/oneDNN/dev_guide_experimental.html#onednn-experimental-profiling), which makes usage of the verbose functionality and the profiling API mutually exclusive. I think this could be avoided by introducing a new @avmanerikar Do you see any issues with that?
The trouble with having multiple profiler objects is that they will be tracking the same information and utilizing the same APIs throughout the stream lifetime.
@avmanerikar Thanks for the update. Is concurrent access from the event callback and from the profiling API a concern here?
This would be a concern if the profiler were reset by the experimental API while the verbose profiler has yet to print the timing data. But since the verbose profiler calculates and prints the timing data during the course of primitive execution, this is unlikely to happen.
That makes sense. Alas, such rare issues are usually the hardest to debug. Can we somehow make this scenario "never happen" instead of "unlikely to happen"?
    xpu::ocl::event_t::from(out_dep).events = {event};
}

ocl_stream->profiler().register_event(
Random place: I couldn't find whether events_ is ever cleared during the stream lifetime. This might be a problem for long runs with verbose enabled: we need some logic to clear synced events to limit memory usage growth.
It is possible to clear events_ as part of before_exec_hook(), since the callbacks operate on the event snapshot and are not affected by changes to events_. But doing so affects concurrent use of the profiler with the benchdnn perf mode and with the experimental profiler. In both cases, the profiling info is collected post-execution, which requires that all events_ be retained.
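The snapshot argument in the reply above can be illustrated with a minimal, standalone sketch. All names here (`toy_event_t`, `toy_profiler_t`, `total_ns`) are hypothetical stand-ins invented for illustration and are not the oneDNN types; the sketch only assumes that a callback receives a copy of the event list taken under a lock.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a profiled event; not the oneDNN type.
struct toy_event_t {
    std::uint64_t stamp;    // identifies one primitive execution
    std::uint64_t start_ns; // device start time, nanoseconds
    std::uint64_t end_ns;   // device end time, nanoseconds
};

// Hypothetical profiler that hands callbacks a private copy of its events.
class toy_profiler_t {
public:
    void register_event(const toy_event_t &e) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(e);
    }

    // Copy taken under the lock; later changes to events_ do not affect it.
    std::vector<toy_event_t> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return events_;
    }

    // Analogue of clearing events_ in a before-exec hook.
    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.clear();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return events_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<toy_event_t> events_;
};

// Sum of device times over a snapshot, as a callback might compute it.
inline std::uint64_t total_ns(const std::vector<toy_event_t> &snap) {
    std::uint64_t total = 0;
    for (const auto &e : snap) total += e.end_ns - e.start_ns;
    return total;
}
```

A callback holding the snapshot still sees both events after `clear()` runs. A post-execution consumer that reads the live list instead of a snapshot sees none, which is the conflict with the benchdnn perf mode and the experimental profiler described above.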
Closing in favor of an alternative polling-based approach in PR #5102. The method relies on periodic event polling to record and log profiling info. The motivation for using this approach is to have a unified implementation for the asynchronous mode that does not rely on runtime-specific callback APIs. The implementation addresses points that the callback-based approach fails to triage:
Failing reproducer for SYCL Graph:
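For readers unfamiliar with the polling idea mentioned in the closing comment, here is a rough, runtime-agnostic sketch. It is not the implementation in PR #5102; `polled_event_t` and `toy_poller_t` are hypothetical names, and the device is simulated by an atomic flag. The sketch assumes that tracking and polling happen on the same thread.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical device event: 'done' flips once the simulated device finishes.
struct polled_event_t {
    std::string info;              // prefix of the line to log
    std::atomic<bool> done {false}; // set by the (simulated) device
    std::uint64_t start_ns = 0;
    std::uint64_t end_ns = 0;
};

// Hypothetical poller; track() and poll_once() are assumed to be called
// from a single thread, so pending_ itself is not locked.
class toy_poller_t {
public:
    void track(std::shared_ptr<polled_event_t> e) {
        pending_.push_back(std::move(e));
    }

    // One polling pass: log every finished event and stop tracking it.
    // Unfinished events are skipped, never waited on.
    std::size_t poll_once() {
        std::size_t logged = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            if ((*it)->done.load(std::memory_order_acquire)) {
                const double ms = ((*it)->end_ns - (*it)->start_ns) / 1e6;
                log_.push_back((*it)->info + "," + std::to_string(ms));
                it = pending_.erase(it);
                ++logged;
            } else {
                ++it;
            }
        }
        return logged;
    }

    // Periodic polling: sleep between passes until nothing is pending.
    void poll_until_drained(std::chrono::milliseconds interval) {
        while (!pending_.empty()) {
            if (poll_once() == 0) std::this_thread::sleep_for(interval);
        }
    }

    std::size_t pending() const { return pending_.size(); }
    const std::vector<std::string> &log() const { return log_; }

private:
    std::vector<std::shared_ptr<polled_event_t>> pending_;
    std::vector<std::string> log_;
};
```

Because the poller only reads a completion flag and two timestamps, nothing in it depends on OpenCL event callbacks or SYCL `host_task`, which is the portability argument made in the comment.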
Description
The PR implements non-blocking verbose profiling to record and print primitive execution times without relying on blocking, host-side stream.wait() calls. The implementation is aimed at triaging two issues associated with the previous blocking approach and highlighted in MFDNN-12088:
(i) timing inaccuracies arising from host-to-device synchronization overheads and
(ii) performance discrepancies related to blocking stream.wait() calls.
Following the discussions in the RFC (PR #3393) and the PoC implementation (PR #3055), the PR implements the non-blocking verbose by leveraging the stream_profiler functionality to obtain device-measured times for primitive execution and using an asynchronous callback to print profiling info in a non-blocking manner. The implementation proposes the following changes to the verbose profiling:
DNNL_VERBOSE=profile_exec for GPU streams.
Addresses MFDNN-12088.
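The two ingredients described above (device-measured timestamps and a completion callback that prints them) can be sketched in plain C++ without any GPU runtime. This is an illustration under stated assumptions, not the PR's code: `device_times_t`, `toy_async_event_t`, and `submit_with_verbose` are hypothetical names, the event is completed by hand rather than by a device, and timestamps are assumed to be in nanoseconds, as they are for OpenCL event profiling.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Device timestamps in nanoseconds (assumed unit, as in OpenCL profiling).
struct device_times_t {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Execution time measured on the device, converted to milliseconds.
inline double exec_time_ms(const device_times_t &t) {
    assert(t.end_ns >= t.start_ns);
    return static_cast<double>(t.end_ns - t.start_ns) / 1e6;
}

// Hypothetical event that runs registered callbacks when it completes.
class toy_async_event_t {
public:
    using callback_t = std::function<void(const device_times_t &)>;

    void on_complete(callback_t cb) { callbacks_.push_back(std::move(cb)); }

    // Called by the simulated runtime once the device has finished.
    void complete(const device_times_t &t) {
        for (auto &cb : callbacks_) cb(t);
    }

private:
    std::vector<callback_t> callbacks_;
};

// Returns immediately; the verbose line is produced later by the callback,
// so the host never waits on the stream. The sink must outlive the event.
inline void submit_with_verbose(toy_async_event_t &ev, std::string info,
        std::vector<std::string> &sink) {
    ev.on_complete([info = std::move(info), &sink](const device_times_t &t) {
        sink.push_back(info + "," + std::to_string(exec_time_ms(t)));
    });
}
```

Calling `submit_with_verbose` leaves the sink empty until `complete()` is invoked, which mirrors the intent of avoiding a host-side stream.wait() between submission and logging.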
Considerations
DNNL_DISABLE_ASYNC_VERBOSE to default to a blocking implementation for all cases.
Summary of Changes:
commit (0af87f0 ... a780588): Defines API to enable asynchronous mode for supported runtimes when verbose profiling is enabled (ONEDNN_VERBOSE=profile_exec or ONEDNN_VERBOSE=1). Adds a runtime knob ONEDNN_VERBOSE_USE_SYNC to force disable the asynchronous mode during verbose logging.
commit (89b0693 ... e51ee1e): Adds support for asynchronous profiling for OpenCL GPU runtimes. Defines logic for execution tracking via OpenCL event callbacks and computing device-measured exec times.
commit (c4b5c0d ... 7d6bd9b): Adds support for asynchronous profiling for SYCL GPU runtimes. Defines similar logic for execution tracking using host_task.
Thanks @echeresh for the suggested optimizations.
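As a rough illustration of how a knob like ONEDNN_VERBOSE_USE_SYNC could select between the two modes, consider the sketch below. The parsing rule (only the literal value `1` forces the blocking mode; unset or any other value keeps the asynchronous mode) is an assumption made for this example, and `verbose_mode_t`, `parse_use_sync`, and `verbose_mode_from_env` are hypothetical names, not functions from the PR.

```cpp
#include <cstdlib>
#include <cstring>

enum class verbose_mode_t { async, sync };

// Assumed rule: only the literal "1" forces the blocking (sync) mode;
// an unset variable or any other value keeps the asynchronous mode.
inline verbose_mode_t parse_use_sync(const char *value) {
    if (value != nullptr && std::strcmp(value, "1") == 0)
        return verbose_mode_t::sync;
    return verbose_mode_t::async;
}

// Reads the knob from the process environment.
inline verbose_mode_t verbose_mode_from_env() {
    return parse_use_sync(std::getenv("ONEDNN_VERBOSE_USE_SYNC"));
}
```

Separating the parsing from the `std::getenv` call keeps the decision logic testable without touching the process environment.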
Reproducer:
With ONEDNN_VERBOSE_USE_SYNC=0:
With ONEDNN_VERBOSE_USE_SYNC=1:
Related RFC: [link]
Related PoC and discussion: [link]
RFC Checklist: