[ENH] Score Documents - enable matching n-grams #935

PrimozGodec · 2023-02-03T10:06:14Z

Issue

Currently, Score Documents matches only single words from the input. When a multi-word term is given as input, scoring fails: only the first word of the term is scored.

Description of changes

Apply the same preprocessing to the input words as is used on the corpus
Match n-grams from the corpus

With these modifications, the corpus's preprocessing defines how words are preprocessed and matched. E.g., if the user selects bigrams (only) on the corpus, terms (from the words input) are matched as bigrams. In this case, terms that are single words are ignored, and those that are longer than bigrams are transformed into bigrams.
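
To illustrate the matching behavior described above, here is a minimal standalone sketch in plain Python. It is not the widget's code: the helper names (`ngrams`, `preprocess`, `score_documents`), the whitespace tokenizer, the space-joined n-gram form, and the count-based score are all assumptions made for this example. The only point it shows is that terms and documents go through the same n-gram preprocessing before they are matched.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max, sep=" "):
    """Return all n-grams with lengths n_min..n_max, joined with sep."""
    return [
        sep.join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def preprocess(text, n_min, n_max):
    """Hypothetical stand-in for the corpus preprocessing pipeline."""
    tokens = text.lower().split()
    return ngrams(tokens, n_min, n_max)


def score_documents(documents, terms, n_min=2, n_max=2):
    """Count how often the terms' n-grams occur in each document."""
    # Terms are preprocessed exactly like the documents, so a term
    # shorter than n_min yields no n-grams and is effectively ignored.
    term_ngrams = set()
    for term in terms:
        term_ngrams.update(preprocess(term, n_min, n_max))

    scores = []
    for doc in documents:
        counts = Counter(preprocess(doc, n_min, n_max))
        scores.append(sum(counts[t] for t in term_ngrams))
    return scores


documents = ["Machine learning model training", "Data about the weather"]
terms = ["machine learning model", "data"]
scores = score_documents(documents, terms)  # bigrams only
```

With bigrams only, the three-word term is split into "machine learning" and "learning model", both of which occur in the first document, while the single-word term "data" produces no bigrams and so contributes nothing to either document's score.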

Includes

Code changes
Tests
Documentation

codecov-commenter · 2023-02-03T11:02:38Z

Codecov Report

Merging #935 (c8ecdf8) into master (f644a27) will not change coverage.
The diff coverage is 100.00%.

❗ Current head c8ecdf8 differs from pull request most recent head 84873b7. Consider uploading reports for the commit 84873b7 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #935   +/-   ##
=======================================
  Coverage   77.62%   77.62%           
=======================================
  Files          86       86           
  Lines       12291    12291           
  Branches     1609     1608    -1     
=======================================
  Hits         9541     9541           
  Misses       2452     2452           
  Partials      298      298

PrimozGodec assigned PrimozGodec and VesnaT and unassigned PrimozGodec Feb 3, 2023

PrimozGodec force-pushed the create-corpus-fix-tests branch from c8ecdf8 to a873fe0 Compare February 3, 2023 11:01

PrimozGodec force-pushed the create-corpus-fix-tests branch from a873fe0 to 84873b7 Compare February 3, 2023 11:04

Score Documents - enable matching n-grams

784eebf

PrimozGodec force-pushed the create-corpus-fix-tests branch from 84873b7 to 784eebf Compare February 3, 2023 11:31

VesnaT merged commit d36e906 into biolab:master Feb 3, 2023

PrimozGodec deleted the create-corpus-fix-tests branch February 3, 2023 12:21
