Optimize ttrs_percentiles query by huydhn · Pull Request #8093 · pytorch/test-infra

huydhn · 2026-05-16T06:20:47Z

Summary

/api/clickhouse/ttrs_percentiles is another HUD endpoint that has been hitting ClickHouse's 32 GiB total-memory limit (MEMORY_LIMIT_EXCEEDED). The KPI view runs this against a 180-day window, and the metrics view runs it against 7 to 90 days with one_bucket=true.

The root cause turned out to be different from #8088 / #8092: the dominant cost is the workflow_job FINAL JOIN workflow_run FINAL inside the pr_shas CTE. At 180 days it reads ~630 M workflow_job rows just to recover (pr_number, sha) tuples.

workflow_run.head_sha is equal to workflow_job.head_sha for the runs we care about (verified empirically: 0 mismatches over 218 K rows of a 1-day window). That makes the workflow_job side of the join in pr_shas redundant. Dropping it cuts the largest scan out of the query.
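The equality check can be pictured with a small in-memory sketch. The row shapes, the `run_id` link between the two tables, and the `count_head_sha_mismatches` helper are assumptions made for illustration only; the actual validation ran as a query against production ClickHouse.

```python
from typing import Iterable, Mapping


def count_head_sha_mismatches(
    runs: Iterable[Mapping[str, object]],
    jobs: Iterable[Mapping[str, object]],
) -> int:
    """Count jobs whose head_sha differs from the head_sha of their parent run."""
    # Index runs by id so each job lookup is a single dict access.
    run_sha = {run["id"]: run["head_sha"] for run in runs}
    mismatches = 0
    for job in jobs:
        parent_sha = run_sha.get(job["run_id"])
        # Jobs without a matching run are skipped, mirroring an inner join.
        if parent_sha is not None and parent_sha != job["head_sha"]:
            mismatches += 1
    return mismatches


# Hypothetical sample rows (not production data).
runs = [{"id": 1, "head_sha": "aaa"}, {"id": 2, "head_sha": "bbb"}]
jobs = [
    {"id": 10, "run_id": 1, "head_sha": "aaa"},
    {"id": 11, "run_id": 1, "head_sha": "aaa"},
    {"id": 12, "run_id": 2, "head_sha": "bbb"},
]
print(count_head_sha_mismatches(runs, jobs))
```

A count of zero over a representative sample is what justifies reading (pr_number, sha) from workflow_run alone.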

commit_job_durations still needs workflow_job, but its candidate set is bounded by merged-PR shas via workflow_job_by_head_sha, which is much smaller than the started-at window, so FINAL stays there. A LIMIT 1 BY id ORDER BY _inserted_at DESC manual-dedup variant was prototyped and rejected: it shrank memory further but added ~20 s of wall time at 180 d due to the sort.
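For readers unfamiliar with the manual-dedup idea, here is a rough Python analogue of "keep the newest row per id". It is only a sketch of the semantics, not the rejected query itself; the `latest_per_id` helper and the sample rows are hypothetical.

```python
def latest_per_id(rows: list[dict]) -> list[dict]:
    """Keep one row per id: the one with the greatest _inserted_at."""
    # Sort newest-first. This global sort is the step the summary blames
    # for the extra wall time at large windows.
    ordered = sorted(rows, key=lambda r: r["_inserted_at"], reverse=True)
    seen = set()
    kept = []
    for row in ordered:
        # The first row seen for an id is its newest version; drop the rest.
        if row["id"] not in seen:
            seen.add(row["id"])
            kept.append(row)
    return kept


# Hypothetical rows: id 1 was inserted twice, the second version is newer.
rows = [
    {"id": 1, "_inserted_at": 100, "status": "queued"},
    {"id": 1, "_inserted_at": 200, "status": "completed"},
    {"id": 2, "_inserted_at": 150, "status": "completed"},
]
deduped = latest_per_id(rows)
print(sorted((r["id"], r["status"]) for r in deduped))
```

The trade-off matches the one described above: the dedup avoids FINAL but has to order every candidate row before it can discard the stale versions.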

pull_request (no _inserted_at column) keeps FINAL but is pre-filtered by pr.number IN (SELECT pr_number FROM pr_shas).

Measurements (production ClickHouse)

Scenario	wall (OLD → NEW)	mem (OLD → NEW)	read_rows (OLD → NEW)
7 d weekly (kpi)	2.9 s → 2.8 s	3.64 → 3.39 GB	55.5 M → 41.1 M
30 d weekly (kpi)	8.6 s → 6.8 s	3.01 → 2.87 GB	190.7 M → 137.9 M
180 d weekly (kpi)	28.3 s → 21.8 s	4.31 → 3.54 GB	632.5 M → 404.4 M
90 d one_bucket p90 (metrics)	18.2 s → 16.3 s	3.65 → 3.18 GB	418.7 M → 281.2 M
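The relative savings implied by the table above can be recomputed with a few lines of Python. The numbers are copied from the table; the dictionary layout and the `reduction_pct` helper are just an illustrative convenience.

```python
# (old, new) pairs copied from the measurements table.
measurements = {
    "7 d weekly (kpi)": {"wall_s": (2.9, 2.8), "mem_gb": (3.64, 3.39), "read_rows_m": (55.5, 41.1)},
    "30 d weekly (kpi)": {"wall_s": (8.6, 6.8), "mem_gb": (3.01, 2.87), "read_rows_m": (190.7, 137.9)},
    "180 d weekly (kpi)": {"wall_s": (28.3, 21.8), "mem_gb": (4.31, 3.54), "read_rows_m": (632.5, 404.4)},
    "90 d one_bucket p90 (metrics)": {"wall_s": (18.2, 16.3), "mem_gb": (3.65, 3.18), "read_rows_m": (418.7, 281.2)},
}


def reduction_pct(old: float, new: float) -> float:
    """Relative reduction from old to new, in percent, rounded to one decimal."""
    return round((old - new) / old * 100, 1)


for scenario, metrics in measurements.items():
    parts = [f"{name}: -{reduction_pct(old, new)}%" for name, (old, new) in metrics.items()]
    print(f"{scenario}: " + ", ".join(parts))
```

At 180 d this works out to roughly 23% less wall time, 18% less peak memory, and 36% fewer rows read, with read_rows showing the largest relative drop in every scenario.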

Result rows are bit-identical across all four scenarios (r_old.result_rows == r_new.result_rows).
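A comparison along these lines is enough to confirm the two variants agree. This is a minimal sketch that assumes both results are available as plain lists of row tuples; the `first_difference` helper and the sample rows are hypothetical.

```python
from typing import Optional, Sequence


def first_difference(old_rows: Sequence[tuple], new_rows: Sequence[tuple]) -> Optional[int]:
    """Return the index of the first differing row, or None if the results match."""
    for index, (old, new) in enumerate(zip(old_rows, new_rows)):
        if old != new:
            return index
    # All shared rows match; a length mismatch means one side has extra rows.
    if len(old_rows) != len(new_rows):
        return min(len(old_rows), len(new_rows))
    return None


# Hypothetical result rows: (bucket, percentile value).
r_old_rows = [("2026-05-04", 1.5), ("2026-05-11", 2.25)]
r_new_rows = [("2026-05-04", 1.5), ("2026-05-11", 2.25)]
print(first_difference(r_old_rows, r_new_rows))
```

Returning the index rather than a bare boolean makes it easier to see which bucket diverged if the check ever fails. Note that Python compares floats by value, so a NaN in either result would be reported as a difference.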

Test plan

OLD vs NEW return bit-identical percentile rows at 7 d / 30 d / 180 d weekly and 90 d one_bucket p90.
Validate workflow_run.head_sha == workflow_job.head_sha assumption (0 mismatches in 218 K-row sample).
Wall time and peak memory both improved across all scenarios.
Monitor hud.pytorch.org/api/clickhouse/ttrs_percentiles after rollout; error rate should drop to ~0%.

vercel · 2026-05-16T06:20:52Z

Project	Deployment	Actions	Updated (UTC)
torchci	Ready	Preview	May 16, 2026 6:22am

Optimize ttrs_percentiles query

b502610

meta-cla Bot added the CLA Signed label May 16, 2026

vercel Bot deployed to Preview May 16, 2026 06:22

huydhn wants to merge 1 commit into main from optimize-ttrs-percentiles
