
Add database-parallel flat search for few-query workloads #5000

Open
ivan-digital wants to merge 2 commits into facebookresearch:main from ivan-digital:fix/flat-search-parallelization

@ivan-digital

Summary

When the number of queries (nx) is smaller than the available thread count, the current flat search underutilizes the CPU: each query is handled by one thread that scans the entire database, so at most nx threads do any work. This PR adds a database-parallel code path that instead divides the database across threads, with per-thread BLAS calls and per-thread heaps that are merged at the end.

  • Activates automatically when nx < omp_get_max_threads() and the database is large enough
  • Falls back to the existing path when an IDSelector is active or the database is too small to benefit
  • Works for both METRIC_INNER_PRODUCT (CMin) and METRIC_L2 (CMax)
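
To make the database-parallel idea concrete, here is a minimal sketch for a single inner-product query. It is not the code in this PR: the names `Hit`, `search_slice`, and `db_parallel_search` are hypothetical, plain loops stand in for the per-thread BLAS calls, and `std::partial_sort` stands in for the per-thread heaps. Each loop iteration handles one disjoint slice of the database, and the per-slice results are merged afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// (score, database id); a larger score is better for inner product.
using Hit = std::pair<float, long>;

// Top-k inner-product hits for one query over rows [begin, end) of a
// row-major database xb whose vectors have dimension d.
static std::vector<Hit> search_slice(
        const std::vector<float>& xb,
        const std::vector<float>& q,
        std::size_t d,
        std::size_t begin,
        std::size_t end,
        std::size_t k) {
    std::vector<Hit> hits;
    hits.reserve(end - begin);
    for (std::size_t i = begin; i < end; i++) {
        float ip = 0.0f;
        for (std::size_t j = 0; j < d; j++) {
            ip += q[j] * xb[i * d + j];
        }
        hits.emplace_back(ip, static_cast<long>(i));
    }
    // Keep only the k best hits of this slice, best first.
    const std::size_t keep = std::min(k, hits.size());
    std::partial_sort(
            hits.begin(), hits.begin() + keep, hits.end(), std::greater<Hit>());
    hits.resize(keep);
    return hits;
}

// Hypothetical driver: split the database into nslices disjoint slices,
// search each slice independently, then merge the partial results.
std::vector<Hit> db_parallel_search(
        const std::vector<float>& xb,
        const std::vector<float>& q,
        std::size_t d,
        std::size_t k,
        int nslices) {
    assert(d > 0 && k > 0 && nslices > 0);
    const std::size_t ny = xb.size() / d;
    const std::size_t ns = static_cast<std::size_t>(nslices);
    std::vector<std::vector<Hit>> partial(ns);

    // With OpenMP enabled each slice runs on its own thread; without it
    // the pragma is ignored and the loop runs sequentially with the same
    // result, because every iteration writes only to partial[s].
#pragma omp parallel for schedule(static)
    for (int s = 0; s < nslices; s++) {
        const std::size_t si = static_cast<std::size_t>(s);
        const std::size_t begin = ny * si / ns;
        const std::size_t end = ny * (si + 1) / ns;
        partial[si] = search_slice(xb, q, d, begin, end, k);
    }

    // Merge phase: the global top-k is contained in the union of the
    // per-slice top-k lists.
    std::vector<Hit> merged;
    for (const auto& p : partial) {
        merged.insert(merged.end(), p.begin(), p.end());
    }
    const std::size_t keep = std::min(k, merged.size());
    std::partial_sort(
            merged.begin(),
            merged.begin() + keep,
            merged.end(),
            std::greater<Hit>());
    merged.resize(keep);
    return merged;
}
```

Because each slice keeps its own top k, the merge step only has to look at `nslices * k` candidates, and the result should not depend on how many slices are used.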

Benchmarks

12 threads, d=128, k=10:

| Index | ny | nx | Before | After | Speedup |
|---|---|---|---|---|---|
| FlatIP | 1,000,000 | 4 | 56.7 ms | 8.4 ms | 6.7x |
| FlatIP | 1,000,000 | 1 | 14.1 ms | 5.0 ms | 2.8x |
| FlatL2 | 1,000,000 | 4 | 56.7 ms | 14.0 ms | 4.0x |
| FlatL2 | 1,000,000 | 1 | 14.1 ms | 9.5 ms | 1.5x |
| FlatIP | 200,000 | 4 | 11.5 ms | 1.6 ms | 7.0x |
| FlatL2 | 200,000 | 4 | 11.4 ms | 2.8 ms | 4.2x |

No regression for nx >= nthreads (existing path is unchanged).

Motivation

Addresses #4121. The original issue demonstrated that for few-query workloads over large databases, parallelizing over database segments instead of queries can yield multi-fold speedups. This is especially relevant for serving workloads where queries arrive one at a time.

Test plan

  • All existing TestIndexFlat tests pass (15/15)
  • Full test_index.py passes (47/47 including 9 new tests)
  • test_search_params.py + test_index_composite.py pass (81/81)
  • New TestDbParallelSearch covers: IP, L2, k=1, k=200, single query, few queries, thread scaling consistency
  • Cross-validated against NumPy brute-force for IP and L2

When the number of queries (nx) is small relative to the thread count,
the existing query-parallel approach underutilizes CPU cores — each
thread processes one query against the entire database. This change adds
a database-parallel code path that instead gives each thread a disjoint
slice of the database to scan, using single-threaded BLAS and per-thread
heaps that are merged after the parallel region.

The new path activates automatically when:
- nx < omp_get_max_threads()
- ny >= max(10000, nthreads * blas_database_bs)
- No IDSelector is active
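
The three conditions above can be read as a single predicate. The sketch below is one way to write it, assuming a hypothetical helper `use_db_parallel` whose parameters mirror the quantities named in the list; it is not the function used in the PR, and the block size is passed in rather than read from any Faiss global.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when the database-parallel path
// would be chosen under the three conditions listed above.
bool use_db_parallel(
        std::size_t nx,               // number of queries
        std::size_t ny,               // number of database vectors
        int nthreads,                 // e.g. omp_get_max_threads()
        std::size_t blas_database_bs, // database block size used for BLAS
        bool has_id_selector) {       // true if an IDSelector is active
    assert(nthreads > 0);
    const std::size_t nt = static_cast<std::size_t>(nthreads);
    if (has_id_selector) {
        return false; // fall back to the existing path
    }
    if (nx >= nt) {
        return false; // enough queries to keep every thread busy
    }
    // The database must be large enough for each thread to get a useful slice.
    const std::size_t min_ny = std::max<std::size_t>(10000, nt * blas_database_bs);
    return ny >= min_ny;
}
```

For example, with 12 threads and an illustrative block size of 1024, the size threshold works out to max(10000, 12288) = 12288 vectors, so a 1M-vector index with nx = 4 and no selector would take the new path.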

Benchmarks (12 threads, d=128):

  IndexFlatIP, ny=1M, nx=4: 56.7ms → 8.4ms  (6.7x)
  IndexFlatIP, ny=1M, nx=1: 14.1ms → 5.0ms  (2.8x)
  IndexFlatL2, ny=1M, nx=4: 56.7ms → 14.0ms (4.0x)
  IndexFlatL2, ny=1M, nx=1: 14.1ms → 9.5ms  (1.5x)

Addresses facebookresearch#4121.
@meta-cla

meta-cla bot commented Mar 27, 2026

Hi @ivan-digital!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@ivan-digital ivan-digital marked this pull request as draft March 27, 2026 21:08
@ivan-digital ivan-digital marked this pull request as ready for review March 27, 2026 21:10
@ivan-digital ivan-digital marked this pull request as draft March 27, 2026 21:13
@meta-cla

meta-cla bot commented Mar 27, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed label Mar 27, 2026
- Move heap initialization inside the parallel region so each thread
  only initializes its own heaps
- Parallelize the merge phase over queries with omp parallel for
- Trim verbose comments
@ivan-digital ivan-digital marked this pull request as ready for review March 28, 2026 05:42