Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Score documents: fix word preprocessing #707

Merged
merged 1 commit into from
Aug 27, 2021

Conversation

PrimozGodec
Copy link
Collaborator

Issue

The implementation of word preprocessing use protected method and does not work for all preprocessors

Description of changes

Implements word preprocessing via converting word list to Corpus. We should enhance preprocessors with a method that works on a list of strings in the feature, but we need more use cases like this. With only one use case it does not make sense to make such a big change in preprocessors.

Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec force-pushed the score-documents-preprocessors branch from ffc33fe to 7318613 Compare August 13, 2021 10:13
callback((i + 1) / len(pps))
return words
return [w[0] for w in words_c.tokens]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that there is a word in tokens.
E.g. using Remove urls may result in an empty list of tokens.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@PrimozGodec PrimozGodec force-pushed the score-documents-preprocessors branch from 7318613 to e691a16 Compare August 13, 2021 13:33
callback((i + 1) / len(pps))
return words
return [w[0] for w in words_c.tokens if len(w) > 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may result in an empty list, which is could probably cause some unexpected behaviour.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(w) > 0 can be just len(w)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but I usually use len(w) > 0 because it is more readable. I can change to len(w) anyway

@PrimozGodec PrimozGodec force-pushed the score-documents-preprocessors branch from e691a16 to 90a054a Compare August 20, 2021 12:19
@codecov-commenter
Copy link

Codecov Report

Merging #707 (7318613) into master (96c295c) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 7318613 differs from pull request most recent head 90a054a. Consider uploading reports for the commit 90a054a to get more accurate results

@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
- Coverage   73.90%   73.88%   -0.02%     
==========================================
  Files          72       72              
  Lines        9427     9429       +2     
  Branches     1286     1287       +1     
==========================================
  Hits         6967     6967              
- Misses       2218     2219       +1     
- Partials      242      243       +1     

@PrimozGodec PrimozGodec force-pushed the score-documents-preprocessors branch from 90a054a to df9e115 Compare August 26, 2021 10:00
@VesnaT VesnaT merged commit 7a5f5fe into biolab:master Aug 27, 2021
@PrimozGodec PrimozGodec deleted the score-documents-preprocessors branch March 29, 2023 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants