MB-58901: Introduce support for BM25 scoring #2113

Open
wants to merge 18 commits into
base: master

Member

@Thejas-bhat Thejas-bhat commented Dec 6, 2024

Introducing support for BM25 scoring

Key stats necessary for the scoring

  • fieldLength - the number of terms in a field within a doc.
  • avgDocLength - the average number of terms in a field across all the docs in the index.
  • totalDocs - total number of docs in an index.
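To make the role of these stats concrete, here is a minimal sketch of a per-term BM25 score. The PR description does not state the exact formula or constants, so the textbook form is assumed: `k1 = 1.2`, `b = 0.75`, and the Lucene-style IDF variant are assumptions, and `idf` and `bm25TermScore` are hypothetical names rather than functions from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed free parameters: commonly used BM25 defaults, not taken from the PR.
const (
	k1 = 1.2
	b  = 0.75
)

// idf is the Lucene-style inverse document frequency (an assumed variant).
// totalDocs is the number of docs in the index, docFreq the number of docs
// containing the term.
func idf(totalDocs, docFreq float64) float64 {
	return math.Log(1 + (totalDocs-docFreq+0.5)/(docFreq+0.5))
}

// bm25TermScore scores one term against one field of one doc.
// fieldLength is the number of terms in that field for this doc, and
// avgDocLength is the average number of terms in the field across all docs.
func bm25TermScore(termFreq, docFreq, fieldLength, avgDocLength, totalDocs float64) float64 {
	// Length normalization: fields longer than average are penalized.
	norm := 1 - b + b*(fieldLength/avgDocLength)
	return idf(totalDocs, docFreq) * (termFreq * (k1 + 1)) / (termFreq + k1*norm)
}

func main() {
	short := bm25TermScore(2, 10, 50, 100, 1000)
	long := bm25TermScore(2, 10, 200, 100, 1000)
	fmt.Printf("short field: %.4f, long field: %.4f\n", short, long)
}
```

With the same term frequency, the doc whose field is shorter than avgDocLength scores higher than the one whose field is longer, which is why avgDocLength and totalDocs have to agree across partitions for scores to be comparable.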

Introduces a mechanism to keep scoring consistent when the index is partitioned behind a bleve.IndexAlias. This builds on the existing preSearch mechanism: the first phase of the search fetches the stats listed above from each bleve index, aggregates them, and redistributes the aggregated values to the bleve indexes, which then use them when calculating the score for a hit.
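The aggregation step can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `fieldStats`, `globalStats`, and `aggregate` are hypothetical names, and it assumes each partition reports a doc count and a total term count for the field.

```go
package main

import "fmt"

// fieldStats is a hypothetical per-partition summary for a single field.
type fieldStats struct {
	DocCount  uint64 // docs in this partition
	TermCount uint64 // total number of terms in the field across those docs
}

// globalStats holds the values every partition would score with.
type globalStats struct {
	TotalDocs    uint64
	AvgDocLength float64
}

// aggregate sums the per-partition counts and derives one global average,
// so that all partitions normalize field length against the same value.
func aggregate(parts []fieldStats) globalStats {
	var docs, terms uint64
	for _, p := range parts {
		docs += p.DocCount
		terms += p.TermCount
	}
	if docs == 0 {
		return globalStats{} // avoid dividing by zero on an empty index
	}
	return globalStats{
		TotalDocs:    docs,
		AvgDocLength: float64(terms) / float64(docs),
	}
}

func main() {
	parts := []fieldStats{
		{DocCount: 100, TermCount: 5000},  // local average: 50 terms per doc
		{DocCount: 300, TermCount: 45000}, // local average: 150 terms per doc
	}
	g := aggregate(parts)
	fmt.Printf("totalDocs=%d avgDocLength=%.1f\n", g.TotalDocs, g.AvgDocLength)
}
```

Summing raw counts before dividing weights each partition by its size; averaging the per-partition averages would skew the result whenever partitions are unevenly sized.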

Implementation-wise, the user must explicitly select BM25 as the scoring mechanism, at either the indexMapping.DefaultSimilarity or the fieldMapping.Similarity level, for it to be used.
The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute avgDocLength. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is enabled. The stats are then fetched either from the local bleve index or, when consistent scoring is in use, from a context, and are used to compute the actual score.

Note: The score depends heavily on the size of an individual bleve index's termDictionary (specific to a field), so scores can differ between partitions, especially since each index is itself composed of multiple 'segments'. In large-scale use cases, however, these discrepancies can be small enough not to affect the order of the doc hits, in which case the user may choose to skip consistent scoring altogether.

@Thejas-bhat Thejas-bhat force-pushed the bm25-refactor branch 5 times, most recently from 4b626d0 to 45efde1 Compare December 12, 2024 10:39
Base automatically changed from presearchRefactor to master December 17, 2024 08:52
@Thejas-bhat Thejas-bhat changed the title WIP: BM25 scoring MB-58901: Introduce support for BM25 scoring Jan 6, 2025
@Thejas-bhat Thejas-bhat marked this pull request as ready for review January 6, 2025 11:44
@abhinavdangeti abhinavdangeti added this to the v2.5.0 milestone Jan 6, 2025
3 participants