Optimize ttrs_percentiles query #8093
Open
huydhn wants to merge 1 commit into
Summary
/api/clickhouse/ttrs_percentiles is another HUD endpoint that has been hitting ClickHouse's 32 GiB total-memory limit (MEMORY_LIMIT_EXCEEDED). The KPI view runs this against a 180-day window, and the metrics view runs it against 7–90 days with one_bucket=true.
Root cause turned out to be different from #8088 / #8092: the dominant cost is the workflow_job FINAL JOIN workflow_run FINAL inside the pr_shas CTE. At 180 days it reads ~630 M workflow_job rows just to recover (pr_number, sha) tuples.
workflow_run.head_sha is equal to workflow_job.head_sha for the runs we care about (verified empirically: 0 mismatches over 218 K rows of a 1-day window). That makes the workflow_job side of the join in pr_shas redundant. Dropping it cuts the largest scan out of the query.
commit_job_durations still needs workflow_job, but its candidate set is bounded by merged-PR shas via workflow_job_by_head_sha, which is much smaller than the started-at window, so FINAL stays there. A LIMIT 1 BY id ORDER BY _inserted_at DESC manual-dedup variant was prototyped and rejected: it shrank memory further but added ~20 s of wall time at 180 d due to the sort.
pull_request (no _inserted_at column) keeps FINAL but is pre-filtered by pr.number IN (SELECT pr_number FROM pr_shas).
Measurements (production ClickHouse)
Result rows are bit-identical across all four scenarios (r_old.result_rows == r_new.result_rows).
Test plan
hud.pytorch.org/api/clickhouse/ttrs_percentiles after rollout: error rate should drop to ~0%.
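For reviewers who want to reproduce the two correctness checks mentioned in the summary on a small sample, here is a minimal Python sketch. It is not the code used for this PR: the function names, the dictionary row shapes, and the run_id join key are assumptions made for illustration, and the sample rows are made up. The first function counts jobs whose head_sha differs from that of their parent run (the "0 mismatches" check), and the second compares the rows returned by the old and new queries.

```python
from typing import Iterable, Mapping, Sequence


def count_head_sha_mismatches(
    runs: Iterable[Mapping[str, object]],
    jobs: Iterable[Mapping[str, object]],
) -> int:
    """Count jobs whose head_sha differs from the head_sha of their parent run.

    The "id" / "run_id" key names are assumptions for this sketch.
    Jobs with no matching run are ignored, as an inner join would ignore them.
    """
    run_sha = {run["id"]: run["head_sha"] for run in runs}
    return sum(
        1
        for job in jobs
        if job["run_id"] in run_sha and job["head_sha"] != run_sha[job["run_id"]]
    )


def results_identical(
    old_rows: Iterable[Sequence[object]],
    new_rows: Iterable[Sequence[object]],
) -> bool:
    """Return True when both queries produced the same rows in the same order."""
    return [tuple(row) for row in old_rows] == [tuple(row) for row in new_rows]


# Made-up sample data, only to show the shape of the check.
runs = [
    {"id": 1, "head_sha": "aaa"},
    {"id": 2, "head_sha": "bbb"},
]
jobs = [
    {"run_id": 1, "head_sha": "aaa"},
    {"run_id": 1, "head_sha": "aaa"},
    {"run_id": 2, "head_sha": "bbb"},
]

mismatches = count_head_sha_mismatches(runs, jobs)
print(f"head_sha mismatches: {mismatches}")
```

A mismatch count of zero on a representative window is what justifies reading head_sha from workflow_run alone. In practice the same check would more likely be written as a query against ClickHouse than run in Python over exported rows.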