Skip to content

Short-circuit HNSW search for similarity-based vector queries? #15869

@kaivalnp

Description

@kaivalnp

Description

Spinoff from #15836

A KNN query short-circuits the HNSW search if the "expected" number of nodes visited is >= number of filtered nodes.

A similarity-based vector query (i.e. [Byte|Float]VectorSimilarityQuery) attempts to find all vectors with a score above a threshold (for Euclidean similarity, this can be imagined as all vectors within a radius of the query vector).

Assuming document vectors are evenly spread out across the n-dimensional space, should vector similarity scores form a normal distribution?

If so, can we estimate the proportion of nodes visited using area under the curve (from resultSimilarity -> ) of a normal distribution? (and apply the same short circuit logic)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions