Filter trace events with corrupted CUPTI durations in _compress_df#325

Open
firoz1905 wants to merge 2 commits into facebookresearch:main from firoz1905:export-D98705186

Conversation


@firoz1905 firoz1905 commented Mar 30, 2026

Summary:
Some GPU traces have corrupted timestamps from CUPTI, producing nonsensical
event durations (e.g., 236 years). These corrupted values propagate through
HTA's DataFrame and skew all downstream analysis (GPU hours, iteration times,
comm/IO breakdowns).

This diff adds an acceptance check in `_compress_df()`, the single chokepoint
that all trace parser backends (json, ijson, ijson_batched) flow through.
Events with negative or excessively large durations are dropped at parse time,
before any downstream consumer sees them.

**Threshold: 7 days (604,800,000,000 us)**

Query over last 30 days on `ai_infra.gpu_trace_stats`:

| Duration Bucket | Actual Range (ms)                        | Trace Count |
|-----------------|------------------------------------------|-------------|
| 0-1s            | 0 - 1,000                                |  13,404,751 |
| 1s-10s          | 1,000 - 10,000                           |  67,420,645 |
| 10s-1min        | 10,000 - 60,000                          |   7,364,988 |
| 1min-1hr        | 60,000 - 3,503,332                       |     792,591 |
| 1hr-1day        | 4,007,737 - 53,174,381                   |           5 |
| 1day-7days      | 201,830,019 - 286,093,774                |           2 |
| 7days-1year     | 610,791,801 - 29,085,041,461             |          14 |
| 1year+          | 55,971,542,156 - 9,223,372,033,022       |          86 |

99.999% of traces are under 1 hour (max ~3.5M ms). Everything above 7 days is
clearly corrupted: the maximum is 9,223,372,033,022 ms, i.e. ~9.2e15 us ≈
INT64_MAX/1000, a classic CUPTI timestamp overflow.

Related: D96421400 (downstream fix in Durin post-processing)

Differential Revision: D98705186

@meta-cla meta-cla bot added the **CLA Signed** label (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed) on Mar 30, 2026

meta-codesync bot commented Mar 30, 2026

@firoz1905 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98705186.

Summary:
Pull Request resolved: facebookresearch#324

OSS HTA uses ijson with the yajl C backend for memory-efficient streaming
of large traces. However, yajl has a signed int64 limit and strict character
validation that causes parsing failures on traces with:
- Large integers (e.g. Comms Id values exceeding int64 max)
- Invalid UTF-8 bytes
- Unescaped quotes from old Kineto versions

Internal HTA does not hit these issues because it uses Python json.loads
(arbitrary precision integers) with a strip_invalid_bytes fallback.

This diff adds two error handling mechanisms to OSS HTA:

1. parse_trace_dict: logs a clear error and raises on UnicodeDecodeError,
   telling the user the trace file is corrupted.

2. parse_trace_dataframe: wraps ijson parsing in try/except and falls
   back to the JSON backend when yajl fails. This preserves ijson
   memory efficiency for normal traces while gracefully handling
   traces with valid JSON that yajl cannot parse (e.g. large integers).
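The fallback in item 2 can be sketched as follows. This is a simplified,
self-contained illustration, not the real `parse_trace_dataframe`: the
`_streaming_parse` stub stands in for the ijson/yajl backend and merely mimics
yajl's signed-int64 limit, and the function names are hypothetical.

```python
import json

INT64_MAX = 2**63 - 1


def _streaming_parse(text: str) -> dict:
    """Stand-in for the ijson/yajl streaming backend (hypothetical).

    Mimics yajl's signed-int64 limit by rejecting larger integers.
    """
    obj = json.loads(text)

    def check(v):
        if isinstance(v, int) and abs(v) > INT64_MAX:
            raise ValueError("integer exceeds yajl int64 limit")
        if isinstance(v, dict):
            for x in v.values():
                check(x)
        if isinstance(v, list):
            for x in v:
                check(x)

    check(obj)
    return obj


def parse_trace(text: str) -> dict:
    """Try the streaming backend; fall back to Python's json on failure.

    Python's json.loads handles arbitrary-precision integers, so valid
    JSON that the streaming backend rejects still parses.
    """
    try:
        return _streaming_parse(text)
    except ValueError:
        return json.loads(text)
```

Normal traces keep the memory-efficient streaming path; only traces that the
streaming backend rejects pay the cost of a full in-memory parse.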

Differential Revision: D98693551