Add database-parallel flat search for few-query workloads #5000
ivan-digital wants to merge 2 commits into facebookresearch:main
When the number of queries (nx) is small relative to the thread count, the existing query-parallel approach underutilizes CPU cores: each thread processes one query against the entire database.

This change adds a database-parallel code path that instead gives each thread a disjoint slice of the database to scan, using single-threaded BLAS and per-thread heaps that are merged after the parallel region.

The new path activates automatically when:
- nx < omp_get_max_threads()
- ny >= max(10000, nthreads * blas_database_bs)
- No IDSelector is active

Benchmarks (12 threads, d=128):
- IndexFlatIP, ny=1M, nx=4: 56.7ms → 8.4ms (6.7x)
- IndexFlatIP, ny=1M, nx=1: 14.1ms → 5.0ms (2.8x)
- IndexFlatL2, ny=1M, nx=4: 56.7ms → 14.0ms (4.0x)
- IndexFlatL2, ny=1M, nx=1: 14.1ms → 9.5ms (1.5x)

Addresses facebookresearch#4121.
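The three activation conditions in the commit message can be read as a single predicate. The sketch below is only an illustration in Python: the function name, the argument names, and the block size of 1024 used in the example calls are assumptions made here, not the faiss API or a confirmed default.

```python
def use_database_parallel(nx, ny, nthreads, blas_database_bs, has_id_selector):
    """Return True when the database-parallel path would be chosen.

    Mirrors the three conditions from the commit message. All names
    here are illustrative, not identifiers from the faiss source.
    """
    if has_id_selector:
        # An active IDSelector keeps the existing path.
        return False
    if nx >= nthreads:
        # Enough queries to keep every thread busy: stay query-parallel.
        return False
    # The database must be large enough for the split to pay off.
    return ny >= max(10000, nthreads * blas_database_bs)


# Example calls with 12 threads and an assumed block size of 1024,
# so the size threshold is max(10000, 12 * 1024) = 12288 vectors.
print(use_database_parallel(4, 1_000_000, 12, 1024, False))   # few queries, large database
print(use_database_parallel(12, 1_000_000, 12, 1024, False))  # as many queries as threads
print(use_database_parallel(4, 5_000, 12, 1024, False))       # database too small
```

Under these assumptions only the first call selects the new path; the other two keep the query-parallel path.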
Hi @ivan-digital! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
- Move heap initialization inside the parallel region so each thread only initializes its own heaps
- Parallelize the merge phase over queries with omp parallel for
- Trim verbose comments
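To make the structure described in these commits concrete, here is a minimal Python sketch of a database-parallel top-k inner-product search. It is not the PR's C++ code: the function names are hypothetical, a thread pool stands in for the OpenMP parallel region, and a plain dot product stands in for the BLAS call. Python threads will not give a real speedup here, so the sketch only illustrates the data flow: each worker creates its own heaps, scans one disjoint slice, and the per-worker heaps are merged per query afterwards.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def scan_slice(queries, database, start, stop, k):
    """Scan database[start:stop] and return one top-k heap per query.

    The heaps are created inside the worker, so each worker only
    initializes and touches its own heaps. Each heap is a min-heap of
    (score, id): its root is the weakest of the current top-k.
    """
    heaps = [[] for _ in queries]
    for idx in range(start, stop):
        vec = database[idx]
        for q, heap in zip(queries, heaps):
            item = (inner_product(q, vec), idx)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return heaps


def database_parallel_search(queries, database, k, nthreads):
    """Top-k inner-product search that splits the database across workers."""
    ny = len(database)
    # Disjoint, contiguous slices that together cover the whole database.
    bounds = [ny * t // nthreads for t in range(nthreads + 1)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [
            pool.submit(scan_slice, queries, database, bounds[t], bounds[t + 1], k)
            for t in range(nthreads)
        ]
        per_thread = [f.result() for f in futures]
    # Merge phase: each query is independent, so this loop could itself
    # be parallelized over queries.
    results = []
    for qi in range(len(queries)):
        candidates = [item for heaps in per_thread for item in heaps[qi]]
        results.append(heapq.nlargest(k, candidates))
    return results


database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
print(database_parallel_search([[1.0, 1.0]], database, k=2, nthreads=2))
# → [[(3.0, 4), (2.0, 3)]]
```

Because every slice keeps its own best k candidates, the merged result matches a single scan over the full database regardless of how many slices are used; ties are broken here by the larger id, which is a choice of this sketch.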
Summary

When the number of queries (`nx`) is smaller than the available thread count, the current flat search underutilizes CPU cores: each thread scans the entire database for a single query. This PR adds a database-parallel code path that divides the database across threads instead, with per-thread BLAS calls and heap merging.

- Activates when `nx < omp_get_max_threads()` and the database is large enough
- Falls back to the existing path when an `IDSelector` is active or the database is too small to benefit
- Supports `METRIC_INNER_PRODUCT` (CMin) and `METRIC_L2` (CMax)

Benchmarks

Configuration: 12 threads, `d=128`, `k=10`.

No regression for `nx >= nthreads` (existing path is unchanged).

Motivation
Addresses #4121. The original issue demonstrated that for few-query workloads over large databases, parallelizing over database segments instead of queries can yield multi-fold speedups. This is especially relevant for serving workloads where queries arrive one at a time.
Test plan
- `TestIndexFlat` tests pass (15/15)
- `test_index.py` passes (47/47 including 9 new tests)
- `test_search_params.py` + `test_index_composite.py` pass (81/81)
- `TestDbParallelSearch` covers: IP, L2, k=1, k=200, single query, few queries, thread scaling consistency
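The "thread scaling consistency" item suggests checking that results do not depend on how the database is partitioned. The sketch below shows one way such a property could be checked for the L2 case in plain Python. It is a hypothetical illustration, not the contents of `TestDbParallelSearch`: the function names, the synthetic data, and the chosen values of k and slice counts are all assumptions.

```python
import heapq


def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sliced_l2_search(query, database, k, nslices):
    """Top-k smallest squared L2 distances, scanning the database in slices.

    Each slice keeps at most k (distance, id) pairs and the partial
    results are merged at the end. The slices run sequentially here:
    only the partitioning is modelled, not the threading.
    """
    ny = len(database)
    bounds = [ny * s // nslices for s in range(nslices + 1)]
    partial = []
    for s in range(nslices):
        rows = ((l2_sq(query, database[i]), i) for i in range(bounds[s], bounds[s + 1]))
        partial.extend(heapq.nsmallest(k, rows))
    return heapq.nsmallest(k, partial)


def check_slice_consistency(query, database, ks=(1, 10, 200), slice_counts=(1, 2, 4, 12)):
    """Return True if every slice count reproduces the single-slice result."""
    for k in ks:
        reference = sliced_l2_search(query, database, k, 1)
        for n in slice_counts:
            if sliced_l2_search(query, database, k, n) != reference:
                return False
    return True


# Small synthetic database with repeated vectors, so ties are exercised.
database = [[float(i % 17), float((i * 5) % 13)] for i in range(300)]
print(check_slice_consistency([3.0, 4.0], database))
```

With k=200 and 12 slices of 25 vectors each, every slice contributes all of its vectors, which exercises the case where k exceeds the slice size.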