-
Notifications
You must be signed in to change notification settings - Fork 15.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
partners, elasticsearch: Enable ElasticsearchStore
to retrieve with the pure BM25 algorithm without vector search
#19314
partners, elasticsearch: Enable ElasticsearchStore
to retrieve with the pure BM25 algorithm without vector search
#19314
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
cc @joemcelroy |
"settings": { | ||
"similarity": { | ||
"custom_bm25": { | ||
"type": "BM25", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is being able to specify similarity options a requirement for you?
k1: Optional. Default is 2.0. This corresponds to the BM25 parameter, k1. | ||
b: Optional. Default is 0.75. This corresponds to the BM25 parameter, b. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning behind these defaults? It might make more sense to stick to Elasticsearch's defaults:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not know if these defaults can change in the future, so it's best to not apply any hardcoded defaults in here, the ES service should pick its own defaults, whatever they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my previous comment, I agree that the default values should be None
. I have made this modification in this commit: 31f7909.
|
||
@staticmethod | ||
def BM25RetrievalStrategy( | ||
k1: float = 2.0, b: float = 0.75 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might consider setting the defaults on the init function, so that BM25RetrievalStrategy()
is simple to use as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest making these two arguments None
by default, and only include them in the mapping definition when the user provided custom values. That would allow the ES service to apply its own defaults for the general case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The direct reason for setting these default values is because the ElasticsearchBM25Retriever
utilizes them. However, I completely agree with @miguelgrinberg 's viewpoint that the default values should be set to None
, allowing Elasticsearch's default values to be used unless explicitly specified by the user. This approach will ensure that the behavior of ElasticsearchStore
aligns more consistently with Elasticsearch itself.
@@ -70,6 +70,7 @@ def index( | |||
self, | |||
dims_length: Union[int, None], | |||
vector_query_field: str, | |||
text_field: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: this could be str | str[]
, and when a list is provided a multi_match
query is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine as a single field as developers are not able to search on multiple fields within a VectorStore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @joemcelroy mentioned, this is because of the limitations of the VectorStore
and BaseRetrievalStrategy
. I'm also hoping for the ElasticsearchStore
to support multi-match searches in the future. But, let's keep that outside this PR for now and think about it as a future work.
class BM25RetrievalStrategy(BaseRetrievalStrategy): | ||
"""Retrieval strategy using the native BM25 algorithm of Elasticsearch.""" | ||
|
||
def __init__(self, k1: float, b: float): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The k1
and b
arguments should default to None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I addressed in this commit: a2eb03a
"custom_bm25": { | ||
"type": "BM25", | ||
"k1": self.k1, | ||
"b": self.b, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The k1
and b
settings should be given in the mapping only when custom values have been passed. If they are None
then they should not be included, to allow Elasticsearch to use its own defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing that out. I have addressed it in this commit: 31f7909
thanks so much for your contribution. The last part is could you document this too https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/elasticsearch.ipynb |
e92779a
to
e1ab8f8
Compare
@joemcelroy I've added the documentation. PTAL. commit: e1ab8f8 |
Hm, @joemcelroy @miguelgrinberg @maxjakob @baskaryan @efriis |
yep moving to langchain-elastic makes sense! |
Thank you for your comment. I have recreated the PR in the langchain-elastic repository: langchain-ai/langchain-elastic#6 I will now close this PR in the LangChain main repository. |
…#20098) This pull request follows up on #19314 and langchain-ai/langchain-elastic#6, adding documentation for the `ElasticsearchStore.BM25RetrievalStrategy`. Like other retrieval strategies, we are now introducing BM25RetrievalStrategy. ### Background - The `BM25RetrievalStrategy` has been introduced to `langchain-elastic` via the pull request langchain-ai/langchain-elastic#6. - This PR was initially created in the main `langchain` repository but was moved to `langchain-elastic` during the review process due to the migration of the partner package. - The original PR can be found at #19314. - As [commented](#19314 (comment)) by @joemcelroy, documenting the new retrieval strategy is part of the requirements for its introduction. Although the `BM25RetrievalStrategy` has been merged into `langchain-elastic`, its documentation is still to be maintained in the main `langchain` repository. Therefore, this pull request adds the documentation portion of `BM25RetrievalStrategy`. The content of the documentation remains the same as that included in the original PR, #19314. --------- Co-authored-by: Max Jakob <[email protected]>
…#20098) This pull request follows up on #19314 and langchain-ai/langchain-elastic#6, adding documentation for the `ElasticsearchStore.BM25RetrievalStrategy`. Like other retrieval strategies, we are now introducing BM25RetrievalStrategy. ### Background - The `BM25RetrievalStrategy` has been introduced to `langchain-elastic` via the pull request langchain-ai/langchain-elastic#6. - This PR was initially created in the main `langchain` repository but was moved to `langchain-elastic` during the review process due to the migration of the partner package. - The original PR can be found at #19314. - As [commented](#19314 (comment)) by @joemcelroy, documenting the new retrieval strategy is part of the requirements for its introduction. Although the `BM25RetrievalStrategy` has been merged into `langchain-elastic`, its documentation is still to be maintained in the main `langchain` repository. Therefore, this pull request adds the documentation portion of `BM25RetrievalStrategy`. The content of the documentation remains the same as that included in the original PR, #19314. --------- Co-authored-by: Max Jakob <[email protected]>
Description
This pull request proposes the implementation of the
BM25RetrievalStrategy
forElasticsearchStore
. This retrieval strategy enables searches purely based on BM25 without involving vector search. Below, the usage example, motivation, and details of the changes are discussed.Usage Example of Introduced Feature
By specifying the
BM25RetrievalStrategy
as a constructor argument forElasticsearchStore
, users can perform searches using pure BM25 without vector search. Note that in the example below, the embedding option is not specified, indicating that the search is conducted without using embeddings.The example above outputs:
Motivation
There is a considerable demand for using Elasticsearch as a pure BM25 retriever without vector search. Although hybrid searches combining vector search and BM25 are supported alongside
ApproxRetrievalStrategy
, there are cases where pure BM25 searches are needed, for example, when seeking speed performance in search or prioritizing exact matches over semantic searches.Pure BM25 retrievers, such as
ElasticsearchBM25Retriever
andBM25Retriever
, have already been implemented inlangchain_community
. However, these classes do not offer the rich and flexible Elasticsearch features supported byElasticsearchStore
, such as various authentication options and flexible querying withcustom_query
. Additionally, being a subclass ofVectorStore
, it benefits from components likeRecordManager
, which are advantageous during operational phases. Therefore, supporting pure BM25 searches inElasticsearchStore
presents significant benefits.To achieve this, an easy-to-use interface is necessary. The abstraction of
ElasticsearchStore
itself is sophisticated enough to support pure BM25 without vector search, as it allows inputting a strategy class inheriting fromBaseRetrievalStrategy
. However, implementing a BM25 search by inheriting fromBaseRetrievalStrategy
can be challenging for general users (as it took me several days). Without native support from the library, it's difficult for users to arrive at this solution.Therefore, this PR suggests implementing a new retrieval strategy,
BM25RetrievalStrategy
, to enableElasticsearchStore
to support pure BM25 searches.Change Details
BM25RetrievalStrategy
class.text_field
, to theindex
method ofBaseRetrievalStrategy
.BaseRetrievalStrategy
in their projects is relatively small. Consequently, the overall impact of this change should be minimal. Nonetheless, I am open to and would appreciate any feedback on this matter.BM25RetrievalStrategy
, namedtest_similarity_search_bm25_search*
. Following the example of other retrieval strategy classes, scenarios with and without using the filter option were tested.