Fix buffered MinHashLSH query aggregation across storage backends #307

Open
dipeshbabu wants to merge 4 commits into ekzhu:master from dipeshbabu:fix/lsh-query-buffer-union

dipeshbabu (Contributor) commented Mar 31, 2026

Summary

Fix MinHashLSH.collect_query_buffer() so buffered queries aggregate candidates the same way as repeated calls to query(), including when
using the Cassandra storage backend.

Problem

The buffered query path was intersecting per-band result sets directly. That is stricter than normal LSH query behavior, which unions
candidates across bands for a query and only then intersects across multiple buffered queries.

This caused valid candidates to be dropped when using buffered queries.
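
The difference between the two aggregation strategies can be sketched in plain Python. This is only an illustration, not the datasketch implementation: `hits_per_query`, `intersect_all_bands`, and `union_then_intersect` are hypothetical names, and the bucket contents are made up.

```python
# Hypothetical per-band bucket hits for two buffered queries.
# Each inner list holds one set of candidate keys per band.
hits_per_query = [
    [{0, 1}, {1}, set()],   # query A: each band hits a different bucket
    [{1, 2}, {0, 1}, {1}],  # query B
]


def intersect_all_bands(hits_per_query):
    """Stricter behavior as described above: intersect every per-band set."""
    all_sets = [s for bands in hits_per_query for s in bands]
    return set.intersection(*all_sets) if all_sets else set()


def union_then_intersect(hits_per_query):
    """Union across bands per query, then intersect across buffered queries."""
    per_query = [set().union(*bands) for bands in hits_per_query]
    return set.intersection(*per_query) if per_query else set()


strict = intersect_all_bands(hits_per_query)
fixed = union_then_intersect(hits_per_query)
print(sorted(strict), sorted(fixed))
```

In this made-up data a single empty band is enough to wipe out every candidate under the strict approach, while the union-first approach keeps the keys that both queries reach through at least one band.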

The Cassandra backend also exposed a related issue in buffered selects: repeated buffered lookups with the same hash key could be collapsed
instead of preserving one result list per buffered query. That breaks per-query aggregation logic.
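
One way such a collapse can happen is sketched below, assuming results are keyed by hash key somewhere in the lookup path. The PR text does not show the backend code, so `collapsed_select`, `ordered_select`, and the in-memory `store` are hypothetical stand-ins rather than the Cassandra storage implementation.

```python
def collapsed_select(store, keys):
    """Keying results by hash key merges duplicate lookups into one entry."""
    results = {k: store.get(k, []) for k in keys}
    return list(results.values())


def ordered_select(store, keys):
    """Fetch each distinct key once, then expand back to one list per lookup."""
    fetched = {k: store.get(k, []) for k in set(keys)}
    return [list(fetched[k]) for k in keys]


# Hypothetical bucket contents; the first and third lookups share a hash key.
store = {b"h1": ["a", "b"], b"h2": ["c"]}
keys = [b"h1", b"h2", b"h1"]

collapsed = collapsed_select(store, keys)  # only 2 result lists for 3 lookups
ordered = ordered_select(store, keys)      # 3 result lists, in lookup order
```

With the collapsed form, any caller that pairs results with buffered queries by position ends up misaligned or short by one entry.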

Fix

  • union bucket hits across bands for each buffered query
  • intersect only the per-query candidate sets across the buffer
  • preserve existing prepickle behavior
  • make Cassandra buffered selects preserve query order and count, including duplicate hash-key lookups
  • replace a broken LSH Forest documentation link with a stable reference

Test

  • add a regression test showing collect_query_buffer() returns the same candidates as query() for a case where the old implementation
    dropped a valid match
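
The shape of such a regression test might look like the sketch below. It is not the test added in this PR: `TinyBandIndex` is a hypothetical in-memory stand-in that only borrows the method names `query`, `add_to_query_buffer`, and `collect_query_buffer`, so the example runs without datasketch installed.

```python
class TinyBandIndex:
    """Hypothetical stand-in for a banded LSH index (not datasketch code)."""

    def __init__(self, num_bands):
        self.tables = [dict() for _ in range(num_bands)]
        self.buffer = []

    def insert(self, key, band_hashes):
        for table, h in zip(self.tables, band_hashes):
            table.setdefault(h, set()).add(key)

    def query(self, band_hashes):
        # Union bucket hits across bands for a single query.
        found = set()
        for table, h in zip(self.tables, band_hashes):
            found |= table.get(h, set())
        return sorted(found)

    def add_to_query_buffer(self, band_hashes):
        self.buffer.append(band_hashes)

    def collect_query_buffer(self):
        # Intersect only the per-query candidate sets, then clear the buffer.
        per_query = [set(self.query(q)) for q in self.buffer]
        self.buffer = []
        return sorted(set.intersection(*per_query)) if per_query else []


def test_buffer_matches_query():
    index = TinyBandIndex(num_bands=2)
    index.insert(0, ("x", "y"))
    index.insert(1, ("x", "z"))  # shares only the first band with the query
    query = ("x", "y")
    direct = index.query(query)
    index.add_to_query_buffer(query)
    buffered = index.collect_query_buffer()
    assert buffered == direct
    return buffered
```

Key 1 matches the query in only one band, so an implementation that intersects per-band sets would drop it, while the union-first path returns both keys.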

Verification

  • confirmed with a direct local repro that buffered and non-buffered query paths now both return [0, 1]
  • ran uvx ruff check .
  • ran the README test command uv run pytest in Linux/WSL: 158 passed, 76 skipped
  • verified the docs link check fix locally with lychee

gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the collect_query_buffer method in datasketch/lsh.py to correctly process buffered queries by unioning candidates across bands for each query before intersecting across the buffer. It also adds a test case to verify that buffered results match direct query results. Feedback identifies a potential bug where the use of zip could truncate results when using the Cassandra storage backend and suggests a more efficient implementation for the set().union() call.
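
The `zip` concern can be illustrated in isolation. The sketch below is generic Python, not the code under review: `pair_results` is a hypothetical helper, and the review's exact suggestion for `set().union()` is not shown here, so the last lines only demonstrate that `union` accepts several iterables in one call.

```python
def pair_results(queries, results):
    """Pair each buffered query with its result list; reject length mismatches."""
    if len(queries) != len(results):
        raise ValueError(
            f"expected {len(queries)} result lists, got {len(results)}"
        )
    return list(zip(queries, results))


queries = ["q1", "q2", "q3"]
short_results = [["a"], ["b"]]  # one result list is missing

# Plain zip stops at the shorter input, so q3 disappears without an error.
silently_truncated = list(zip(queries, short_results))

# set().union takes any number of iterables, so per-band lists can be
# merged in a single call without converting each one to a set first.
bands = [["a", "b"], ["b", "c"], []]
candidates = set().union(*bands)
```

An explicit length check turns a silent loss of buffered queries into an immediate error, which is easier to diagnose than a missing candidate.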

codecov-commenter commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@cc635ed). Learn more about missing BASE report.

Files with missing lines    Patch %    Lines
datasketch/lsh.py           87.50%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master     #307   +/-   ##
=========================================
  Coverage          ?   77.52%           
=========================================
  Files             ?       15           
  Lines             ?     2060           
  Branches          ?        0           
=========================================
  Hits              ?     1597           
  Misses            ?      463           
  Partials          ?        0           
Flag         Coverage Δ
unittests    77.52% <88.88%> (?)

dipeshbabu changed the title from "Fix MinHashLSH collect_query_buffer candidate aggregation" to "Fix buffered MinHashLSH query aggregation across storage backends" on Mar 31, 2026