[ENH] Score Documents - enable matching n-grams #935

Merged (1 commit) on Feb 3, 2023


PrimozGodec
Collaborator

Issue

Currently, Score Documents matches only single words from the input. When the input contains multi-word terms, scoring is wrong: only the first word of each term is scored.

Description of changes
  • Apply the same preprocessing to the input words as is applied to the corpus
  • Match n-grams from the corpus

With these modifications, the corpus's preprocessing defines how input words are preprocessed and matched. For example, if the user chooses to use only bigrams on the corpus, terms from the words input are matched as bigrams. In that case, single-word terms are ignored, and terms longer than two words are split into bigrams.
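The matching behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the widget's actual code: `ngrams`, `preprocess`, and `score_documents` are hypothetical helpers, the "preprocessing" is assumed to be just lowercasing and whitespace tokenization, n-grams are assumed to be joined with a space, and the score is a plain count of matched tokens.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max, sep=" "):
    """Return all n-grams of `tokens` with n_min <= n <= n_max."""
    return [
        sep.join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)  # empty when tokens is shorter than n
    ]


def preprocess(text, ngram_range):
    """Hypothetical stand-in for the corpus preprocessing pipeline."""
    tokens = text.lower().split()
    return ngrams(tokens, *ngram_range)


def score_documents(documents, terms, ngram_range):
    """Count, per document, how many tokens match the preprocessed terms."""
    # Terms go through the same preprocessing as the documents, so with
    # bigrams only, single words vanish and longer terms become bigrams.
    term_tokens = {t for term in terms for t in preprocess(term, ngram_range)}
    scores = []
    for doc in documents:
        counts = Counter(preprocess(doc, ngram_range))
        scores.append(sum(counts[t] for t in term_tokens))
    return scores


docs = ["Machine learning model beats simple model", "a model of learning"]
terms = ["model", "machine learning model"]

print(score_documents(docs, terms, (2, 2)))  # bigrams only -> [2, 0]
print(score_documents(docs, terms, (1, 1)))  # unigrams only -> [4, 2]
```

With bigrams only, the term "model" produces no tokens and is ignored, while "machine learning model" becomes "machine learning" and "learning model", both of which occur in the first document.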

Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec assigned PrimozGodec and VesnaT and unassigned PrimozGodec Feb 3, 2023
@PrimozGodec PrimozGodec force-pushed the create-corpus-fix-tests branch from c8ecdf8 to a873fe0 Compare February 3, 2023 11:01

codecov-commenter commented Feb 3, 2023

Codecov Report

Merging #935 (c8ecdf8) into master (f644a27) will not change coverage.
The diff coverage is 100.00%.

❗ Current head c8ecdf8 differs from pull request most recent head 84873b7. Consider uploading reports for the commit 84873b7 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #935   +/-   ##
=======================================
  Coverage   77.62%   77.62%           
=======================================
  Files          86       86           
  Lines       12291    12291           
  Branches     1609     1608    -1     
=======================================
  Hits         9541     9541           
  Misses       2452     2452           
  Partials      298      298           

@PrimozGodec PrimozGodec force-pushed the create-corpus-fix-tests branch from a873fe0 to 84873b7 Compare February 3, 2023 11:04
@PrimozGodec PrimozGodec force-pushed the create-corpus-fix-tests branch from 84873b7 to 784eebf Compare February 3, 2023 11:31
@VesnaT VesnaT merged commit d36e906 into biolab:master Feb 3, 2023
@PrimozGodec PrimozGodec deleted the create-corpus-fix-tests branch February 3, 2023 12:21