[FEATURE] Warning about Mismatch Between similarity function of Embedding Model and Index space_type #2356

Open
YeonghyeonKO opened this issue Dec 26, 2024 · 3 comments

Comments

@YeonghyeonKO

Is your feature request related to a problem?

  • A problem can arise when embedding vectors (e.g. from msmarco-distilbert-base-tas-b; say its similarity function is cosine similarity) are indexed into a knn_vector field that is mapped with a different space_type (e.g. L2).
  • The similarity function the embedding model is built around and the vector distance used by the HNSW graph can then differ, leading to inaccurate search scores.
  • Since OpenSearch stores an HNSW graph for each segment, built by Faiss/NMSLIB/Lucene, search results from the graph can vary depending on the space_type.
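To make the concern concrete, here is a minimal sketch in plain Python (no OpenSearch involved) of how the nearest neighbor can change between L2 distance and cosine similarity when vectors are not normalized. The vectors and document names are made up for illustration; they do not come from any real model.

```python
import math

def l2_distance(a, b):
    # Euclidean distance: smaller means closer.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: larger means closer.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 2-d "embeddings", chosen only to show the effect.
query = [1.0, 0.0]
docs = {
    "same_direction_far": [10.0, 0.0],
    "other_direction_near": [1.0, 1.0],
}

nearest_by_l2 = min(docs, key=lambda name: l2_distance(query, docs[name]))
nearest_by_cosine = max(docs, key=lambda name: cosine_similarity(query, docs[name]))

print(nearest_by_l2)      # other_direction_near
print(nearest_by_cosine)  # same_direction_far
```

If all vectors are normalized to unit length, the two rankings agree, since the squared L2 distance then equals 2 - 2 * cosine. So the mismatch should matter mainly for models whose output vectors are not normalized.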

What solution would you like?

  • Is there any benefit to using a space_type that differs from the embedding model's similarity function?
  • I suggest displaying warning messages in the above scenario to alert users to potential inaccuracies.
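As a rough sketch of what such a warning could look like, the check below compares a model's declared similarity function with the space_type of the index mapping. Everything here is an assumption for illustration: `check_space_type`, the `EXPECTED_SPACE_TYPE` table, and the similarity labels are hypothetical and not part of OpenSearch or the k-NN plugin; only the space_type values (`l2`, `cosinesimil`, `innerproduct`) are existing k-NN settings.

```python
import warnings

# Assumed mapping from a model's similarity function to the matching
# k-NN space_type. The keys are hypothetical labels, not an OpenSearch API.
EXPECTED_SPACE_TYPE = {
    "cosine": "cosinesimil",
    "dot_product": "innerproduct",
    "euclidean": "l2",
}

def check_space_type(model_similarity, space_type):
    """Return True if the pair matches (or is unknown); warn and return False otherwise."""
    expected = EXPECTED_SPACE_TYPE.get(model_similarity)
    if expected is None:
        # Unknown similarity label: nothing to compare against.
        return True
    if expected == space_type:
        return True
    warnings.warn(
        f"Model similarity '{model_similarity}' suggests space_type "
        f"'{expected}', but the index uses '{space_type}'; "
        "search scores may be inaccurate."
    )
    return False

check_space_type("cosine", "l2")           # emits a UserWarning
check_space_type("cosine", "cosinesimil")  # no warning
```

A check like this needs the model's similarity function to be declared somewhere, for example in model metadata or by the user.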
navneet1v added the question label and removed the untriaged and enhancement labels on Dec 27, 2024
@navneet1v
Collaborator

@YeonghyeonKO this is an interesting ask, but since OpenSearch can run in a wide variety of environments, I don't see how OpenSearch could know which model is being used to ingest the vectors or which space type that model uses.

navneet1v removed the question label on Dec 27, 2024
@YeonghyeonKO
Author

@navneet1v Oh, I see. Given the high degree of freedom you described, what I asked about depends on the user side, not on OpenSearch. Also, since OpenSearch allows users to deploy custom ML models, the mismatch problem I was worried about should be controlled and solved by the users themselves. It's up to us, not OpenSearch haha

@heemin32
Collaborator

heemin32 commented Dec 30, 2024

@YeonghyeonKO I believe your request can be addressed through this GitHub issue. Please consider giving it a +1 if you'd like to have the feature.
