MB-58901: Introduce support for BM25 scoring #2113

Thejas-bhat · 2024-12-06T06:50:35Z

Introducing support for BM25 scoring

Key stats necessary for the scoring:

- `fieldLength` - the number of terms in a field within a doc.
- `avgDocLength` - the average number of terms in a field across all the docs in the index.
- `totalDocs` - the total number of docs in an index.

Introduces a mechanism to keep scoring consistent when the index is partitioned behind a bleve.IndexAlias. It reuses the existing preSearch mechanism: the first phase of the search fetches the stats mentioned above from each bleve index, aggregates them, and redistributes the aggregated values to the bleve indexes, which then use them when calculating the score for a hit.
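
The aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: the types `partitionStats` and `globalStats`, the function `aggregate`, and the idea that each partition reports a per-field term total alongside its doc count are all hypothetical.

```go
package main

import "fmt"

// partitionStats is a hypothetical shape for what one partition reports
// during the preSearch phase.
type partitionStats struct {
	DocCount       uint64
	FieldTermCount map[string]uint64 // field name -> total terms in that field across the partition
}

// globalStats is a hypothetical shape for the aggregated values that are
// redistributed to every partition.
type globalStats struct {
	TotalDocs    uint64
	AvgDocLength map[string]float64 // field name -> average terms per doc
}

// aggregate sums raw counts first and divides once at the end, so large
// partitions carry proportionally more weight than small ones.
func aggregate(parts []partitionStats) globalStats {
	out := globalStats{AvgDocLength: map[string]float64{}}
	totals := map[string]uint64{}
	for _, p := range parts {
		out.TotalDocs += p.DocCount
		for field, count := range p.FieldTermCount {
			totals[field] += count
		}
	}
	if out.TotalDocs == 0 {
		return out // nothing indexed yet, so no averages to report
	}
	for field, count := range totals {
		out.AvgDocLength[field] = float64(count) / float64(out.TotalDocs)
	}
	return out
}

func main() {
	parts := []partitionStats{
		{DocCount: 100, FieldTermCount: map[string]uint64{"body": 1000}},
		{DocCount: 300, FieldTermCount: map[string]uint64{"body": 9000}},
	}
	g := aggregate(parts)
	fmt.Println(g.TotalDocs, g.AvgDocLength["body"])
}
```

Summing counts before dividing matters here: averaging the two per-partition averages (10 and 30) would give a different value than dividing the combined term count by the combined doc count.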

Implementation-wise, the user must explicitly set BM25 as the scoring mechanism, at either the indexMapping.DefaultSimilarity or the fieldMapping.Similarity level, in order to use it.
The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute avgDocLength. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is enabled. The stats are then fetched either from the local bleve index or, when consistent scoring is enabled, from a context, and used to compute the actual score.

Note: The scoring depends heavily on the size of an individual bleve index's termDictionary (specific to a field), so scores can differ across partitions, especially given that each index is further composed of multiple 'segments'. In large-scale use cases, however, these discrepancies can be quite small and may not affect the order of the doc hits, in which case the user may choose to skip the consistent-scoring phase altogether.

Thejas-bhat force-pushed the bm25-refactor branch from 2b54a8d to 738dfe1 Compare December 6, 2024 06:51

Thejas-bhat force-pushed the presearchRefactor branch from 8b10cdf to d58474f Compare December 6, 2024 06:54

Thejas-bhat force-pushed the bm25-refactor branch 5 times, most recently from 4b626d0 to 45efde1 Compare December 12, 2024 10:39

Base automatically changed from presearchRefactor to master December 17, 2024 08:52

metonymic-smokey and others added 16 commits January 2, 2025 11:00

hacky start

bbe4ae7

use ctx in term srch

a679009

field cardinality temp save

2d8a43d

average doc length stat for a field

52b1768

bm25 scoring first implementation

42082f8

notes and keep the default tf-idf stuff

a52bd49

bug fixes and BM25 UT pass

36159b6

making bm25 presearch (i.e. global scoring) optional

f3424b5

field mapping to capture type of scoring; bm25 by default

d393616

bug fixes, unit test fixes

55e63fd

cleanup/refactor

04e1e72

bug fixes

ab58975

fix scatter-gather path

dbed957

bug fixes after merge conflict resolution

52e318d

score explanation

36db386

default similarity config for an index

e83cca0

Thejas-bhat force-pushed the bm25-refactor branch from f385ba6 to e83cca0 Compare January 6, 2025 07:16

cleanup

a643a3b

Thejas-bhat changed the title ~~WIP: BM25 scoring~~ MB-58901: Introduce support for BM25 scoring Jan 6, 2025

Thejas-bhat marked this pull request as ready for review January 6, 2025 11:44

Thejas-bhat requested review from abhinavdangeti and metonymic-smokey January 6, 2025 11:45

Thejas-bhat requested review from CascadingRadium and Likith101 January 6, 2025 11:45

abhinavdangeti added this to the v2.5.0 milestone Jan 6, 2025

keeping scoring as an index level config for consistency

b5a7c9b
