feat: Making feast vector store with open ai search api compatible#6121
feat: Making feast vector store with open ai search api compatible#6121patelchaitany wants to merge 2 commits intofeast-dev:masterfrom
Conversation
e45f167 to
c8392a9
Compare
7e8adfb to
3f541ad
Compare
cd692b9 to
086bed5
Compare
086bed5 to
9778002
Compare
68c843c to
e29e557
Compare
f0552c7 to
dafe9fd
Compare
8189391 to
c904afb
Compare
2aee728 to
f3ccb68
Compare
f3ccb68 to
f005d2f
Compare
f005d2f to
551f680
Compare
|
Hey @franciscojavierarceo, @ntkathole can you pls review this PR. |
551f680 to
ba4f838
Compare
ba4f838 to
b94ff91
Compare
b94ff91 to
d394de4
Compare
7165fc0 to
67064ba
Compare
39665e0 to
424b4ae
Compare
Signed-off-by: Chaitany patel <[email protected]>
424b4ae to
269929a
Compare
269929a to
f7cddbf
Compare
Signed-off-by: Chaitany patel <[email protected]>
f7cddbf to
43d663e
Compare
| t1.created_ts | ||
| FROM {table_name} t1 | ||
| INNER JOIN vector_matches t2 ON t1.entity_key = t2.entity_key | ||
| WHERE t1.feature_name = ANY(%s) | ||
| WHERE {outer_where} | ||
| ORDER BY t2.distance | ||
| """ | ||
| ).format( | ||
| distance_metric_sql=sql.SQL(distance_metric_sql), | ||
| table_name=sql.Identifier(table_name), | ||
| outer_where=outer_where, | ||
| top_k=sql.Literal(top_k), | ||
| ) | ||
| params = (embedding, requested_features) | ||
| base_params = [embedding, requested_features] | ||
| if has_filters: | ||
| base_params.extend(filter_params) | ||
| params = tuple(base_params) |
There was a problem hiding this comment.
🔴 Postgres vector search applies metadata filters as post-filtering after top-k LIMIT, silently returning fewer results than requested
In retrieve_online_documents_v2, the vector_matches CTE computes the top-k closest entities by vector distance without considering metadata filters. Filters are only applied in the outer WHERE clause after the CTE's LIMIT. This means if a user requests top_k=10 with a filter, the CTE finds the 10 closest entities ignoring filters, then the outer query discards entities that don't match — potentially returning far fewer than 10 results even when many matching entities exist in the store.
In contrast, both the Elasticsearch backend (sdk/python/feast/infra/online_stores/elasticsearch_online_store/elasticsearch.py:591-599) and the Milvus backend (sdk/python/feast/infra/online_stores/milvus_online_store/milvus.py:798-805) correctly apply filters inside the vector search (before the top-k cutoff), so they return up to top-k results that satisfy the filter.
Postgres CTE that ignores filters
The CTE at lines 912-921 limits to top_k without any filter:
WITH vector_matches AS (
SELECT entity_key,
MIN(vector_value <-> %s::vector) as distance
FROM {table_name}
WHERE vector_value IS NOT NULL -- no metadata filter here
GROUP BY entity_key
ORDER BY distance
LIMIT {top_k} -- limit applied before filter
)The filter is only applied in the outer query at line 934.
(Refers to lines 912-946)
Prompt for agents
The metadata filter needs to be applied INSIDE the vector_matches CTE (before the LIMIT), not just in the outer WHERE clause. Currently the CTE at lines 912-921 computes the top-k closest entities without considering filters, then the outer WHERE at line 934 discards non-matching entities, returning fewer than top_k results.
To fix: translate the filter clause so it can be embedded inside the CTE WHERE clause. The filter subquery pattern (entity_key IN SELECT entity_key FROM table WHERE feature_name = ... AND value_text ...) should be added to the CTE WHERE alongside the vector_value IS NOT NULL condition. The same fix is needed for all three search cases (vector-only at line 912, hybrid at line 823, and text-only at line 957).
For example, the vector-only CTE should become:
WITH vector_matches AS (
SELECT entity_key,
MIN(vector_value <distance_op> %s::vector) as distance
FROM {table_name}
WHERE vector_value IS NOT NULL
AND <filter_clause_without_alias>
GROUP BY entity_key
ORDER BY distance
LIMIT {top_k}
)
Note: the filter clause generation currently uses alias='t1' which won't work inside the CTE. You'll need to generate a version of the filter clause without an alias (or with a different alias) for use inside the CTE.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
I know about this and why i chose this way because when we do meta data filtering first then we need to retrieve more than top K results then after we need to apply vector similarity search on that which is going to do filter of top K but the thing is that in that process we might loose the closest search according to the vector similarity.
and in the Elastic search and Milvus the metadata filtering happens during the vector search so it is combined step over there not like here which is step by step process and this is limitation of the SQL and POSTGRES.
| order by distance | ||
| limit ? | ||
| ) f | ||
| left join {table_name} fv | ||
| left join {_quote_id(table_name)} fv | ||
| on f.rowid = fv.rowid | ||
| left join {table_name} fv2 | ||
| left join {_quote_id(table_name)} fv2 | ||
| on fv.entity_key = fv2.entity_key | ||
| where fv2.feature_name != "{vector_field}" | ||
| where {where_sql} | ||
| """, | ||
| ( | ||
| query_embedding_bin, | ||
| top_k, | ||
| ), | ||
| [query_embedding_bin, top_k] + vector_params + filter_params, |
There was a problem hiding this comment.
🔴 SQLite vector search applies metadata filters as post-filtering after top-k LIMIT, same issue as Postgres
Same issue as in Postgres. The vec_table match subquery at line 793 finds the top-k closest vectors without considering metadata filters. The filter clause is only applied in the outer WHERE at line 801, potentially returning fewer results than top_k even when many matching documents exist.
Elasticsearch and Milvus correctly pre-filter before the top-k cutoff, making this an inconsistency across backends.
SQLite query structure showing post-filtering
from (
select rowid, vector_value, distance
from vec_table
where vector_value match ? -- no metadata filter
order by distance
limit ? -- top-k applied before filter
) f
left join {table} fv on f.rowid = fv.rowid
left join {table} fv2 on fv.entity_key = fv2.entity_key
where fv2.feature_name != ? AND <filter_clause> -- filter applied after limit(Refers to lines 777-803)
Prompt for agents
The metadata filter needs to be applied BEFORE the top-k limit in the vec_table subquery, not in the outer WHERE clause. Currently the vec_table match at line 793 limits to top_k without filters, and the filter is applied at line 801.
SQLite vec0 doesn't support arbitrary WHERE clauses inside the match query, so the fix requires a different approach: first filter the entity_keys that match the metadata filter, then build the vec_table from only those entity_keys' vectors, then run the match. Alternatively, increase the vec_table limit significantly and apply the filter afterward, but re-rank/re-limit. The simplest correct fix might be to skip the vec_table virtual table approach when filters are present and instead compute distances manually for the filtered entity set.
Was this helpful? React with 👍 or 👎 to provide feedback.
What this PR does / why we need it:
This PR making the feast vector store api with open ai search api compatible so.
This are the changes are made.
POST /v1/vector_stores/{vector_store_id}/searchthat matches the OpenAI vector store search APIvector_store_idjust maps to a feature view nameretrieve_online_documents_v2, and returns results in OpenAI'svector_store.search_results.pageformatfeature_store.yamlunder a newembedding_modelsection (model, api_key, api_base, api_version, dimensionsfilter_models.pywith two Pydantic models:ComparisonFilter(eq, ne, gt, gte, lt, lte, in, nin) andCompoundFilter(and/or, nestable)field == 'value'entity_key IN (SELECT ...)enable_openai_compatible_storeconfig flag on every store backendvalue_numcolumn that stores int, float, double, and bool values natively alongside the existing `value_textvalueinstead ofvalue_textqueryparam renamed toembeddingsince that's what it actually isWhich issue(s) this PR fixes:
#5615
Misc