
Optimize ttrs_percentiles query #8093

Open: huydhn wants to merge 1 commit into main from optimize-ttrs-percentiles

@huydhn huydhn commented May 16, 2026

Summary

/api/clickhouse/ttrs_percentiles is another HUD endpoint that has been hitting ClickHouse's 32 GiB total-memory limit (MEMORY_LIMIT_EXCEEDED). The KPI view runs this against a 180-day window, and the metrics view runs it against 7–90 days with one_bucket=true.

Root cause turned out to be different from #8088 / #8092: the dominant cost is the workflow_job FINAL JOIN workflow_run FINAL inside the pr_shas CTE — at 180 days it reads ~630 M workflow_job rows just to recover (pr_number, sha) tuples.

workflow_run.head_sha is equal to workflow_job.head_sha for the runs we care about (verified empirically: 0 mismatches over 218 K rows of a 1-day window). That makes the workflow_job side of the join in pr_shas redundant. Dropping it cuts the largest scan out of the query.
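The equality check described above can be illustrated with a small in-memory sketch. This is not the query that was run against production; the row types and the `run_id` to `id` join key are assumptions made for illustration only.

```python
from typing import Iterable, NamedTuple


class RunRow(NamedTuple):
    id: int        # hypothetical stand-in for workflow_run.id (assumed join key)
    head_sha: str  # stand-in for workflow_run.head_sha


class JobRow(NamedTuple):
    run_id: int    # hypothetical foreign key to the parent run (assumption)
    head_sha: str  # stand-in for workflow_job.head_sha


def count_head_sha_mismatches(runs: Iterable[RunRow], jobs: Iterable[JobRow]) -> int:
    """Count jobs whose head_sha differs from the head_sha of their parent run."""
    sha_by_run = {run.id: run.head_sha for run in runs}
    return sum(
        1
        for job in jobs
        if job.run_id in sha_by_run and sha_by_run[job.run_id] != job.head_sha
    )


runs = [RunRow(1, "aaa"), RunRow(2, "bbb")]
jobs = [JobRow(1, "aaa"), JobRow(1, "aaa"), JobRow(2, "bbb")]
print(count_head_sha_mismatches(runs, jobs))
```

When the count is zero, every sha reachable through the job side is already present on the run side, which is the condition under which the join adds no new (pr_number, sha) tuples.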

commit_job_durations still needs workflow_job, but its candidate set is bounded by merged-PR shas via workflow_job_by_head_sha — much smaller than the started-at window — so FINAL stays there. A LIMIT 1 BY id ORDER BY _inserted_at DESC manual-dedup variant was prototyped and rejected: it shrank memory further but added ~20 s of wall time at 180 d due to the sort.
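For readers unfamiliar with the rejected variant, here is a minimal Python sketch of what "newest row per id" dedup does. It only models the semantics. The row shape and the use of integer timestamps are assumptions, and whether this matches FINAL exactly depends on how the table engine picks the surviving row.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def latest_per_id(rows: List[Row]) -> List[Row]:
    """Keep one row per id: the one with the greatest _inserted_at."""
    # The full sort is the step that costs wall time on large inputs.
    ordered = sorted(rows, key=lambda r: r["_inserted_at"], reverse=True)
    seen = set()
    kept: List[Row] = []
    for row in ordered:
        if row["id"] not in seen:  # first hit per id is the newest one
            seen.add(row["id"])
            kept.append(row)
    return kept


rows = [
    {"id": 1, "_inserted_at": 10, "status": "queued"},
    {"id": 1, "_inserted_at": 20, "status": "completed"},
    {"id": 2, "_inserted_at": 15, "status": "completed"},
]
deduped = latest_per_id(rows)
print(sorted((r["id"], r["status"]) for r in deduped))
```

The sketch makes the trade-off visible: no merge-time state is needed, but every candidate row has to pass through a sort before the per-id limit can be applied.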

pull_request (no _inserted_at column) keeps FINAL but is pre-filtered by pr.number IN (SELECT pr_number FROM pr_shas).

Measurements (production ClickHouse)

| Scenario | Wall time (OLD → NEW) | Peak memory (OLD → NEW) | read_rows (OLD → NEW) |
| --- | --- | --- | --- |
| 7 d weekly (kpi) | 2.9 s → 2.8 s | 3.64 → 3.39 GB | 55.5 M → 41.1 M |
| 30 d weekly (kpi) | 8.6 s → 6.8 s | 3.01 → 2.87 GB | 190.7 M → 137.9 M |
| 180 d weekly (kpi) | 28.3 s → 21.8 s | 4.31 → 3.54 GB | 632.5 M → 404.4 M |
| 90 d one_bucket p90 (metrics) | 18.2 s → 16.3 s | 3.65 → 3.18 GB | 418.7 M → 281.2 M |
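To put the read_rows figures in relative terms, the short sketch below computes the percentage reduction per scenario. The numbers are copied from the measurements above; the helper name `pct_reduction` is made up for this example.

```python
# read_rows in millions, (OLD, NEW), copied from the measurements
read_rows_millions = {
    "7 d weekly (kpi)": (55.5, 41.1),
    "30 d weekly (kpi)": (190.7, 137.9),
    "180 d weekly (kpi)": (632.5, 404.4),
    "90 d one_bucket p90 (metrics)": (418.7, 281.2),
}


def pct_reduction(old: float, new: float) -> float:
    """Relative reduction from old to new, in percent, rounded to one decimal."""
    return round((old - new) / old * 100, 1)


for scenario, (old, new) in read_rows_millions.items():
    print(f"{scenario}: {pct_reduction(old, new)}% fewer rows read")
```

By this measure the reduction grows with the window length, from roughly a quarter of rows at 7 days to over a third at 180 days.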

Result rows are bit-identical across all four scenarios (r_old.result_rows == r_new.result_rows).

Test plan

  • OLD vs NEW return bit-identical percentile rows at 7 d / 30 d / 180 d weekly and 90 d one_bucket p90.
  • Validate workflow_run.head_sha == workflow_job.head_sha assumption (0 mismatches in 218 K-row sample).
  • Wall time and peak memory both improved across all scenarios.
  • Monitor hud.pytorch.org/api/clickhouse/ttrs_percentiles after rollout — error rate should drop to ~0%.
