normalize: speedup preprocessing with caching #709

Merged (1 commit) on Aug 23, 2021


PrimozGodec (Collaborator) commented Aug 16, 2021

Issue

Normalisation is the slowest step in preprocessing, especially with UDPipe.

Description of changes

Speedups:
Udpipe:

  • predlogi vladi with 1000 documents: 16s -> 2.4s
  • zakoni o registrih: 1066s -> 8s
  • 20 newsgroups: 182s -> 7s

WordNet:

  • 13s -> 4.7s*

Porter:

  • 46s -> 4s*

Snowball:

  • 55s -> 3.5s*

* on 20 newsgroups dataset
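
The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the general idea, not the code merged in this PR. It assumes a per-token normalizer and memoizes its results in a plain dictionary, so each distinct token is normalized once however often it appears across documents. `CachedNormalizer` and `toy_stem` are hypothetical names used for illustration; `toy_stem` stands in for a real stemmer or lemmatizer.

```python
from typing import Callable, Dict, List


class CachedNormalizer:
    """Wrap a per-token normalizer and remember results for tokens already seen.

    Hypothetical helper for illustration; not the class used in the PR.
    """

    def __init__(self, normalizer: Callable[[str], str]) -> None:
        self._normalizer = normalizer
        self._cache: Dict[str, str] = {}
        self.calls = 0  # how many times the wrapped normalizer actually ran

    def normalize_token(self, token: str) -> str:
        try:
            # Fast path: the token was normalized before.
            return self._cache[token]
        except KeyError:
            # Slow path: run the normalizer once and store the result.
            self.calls += 1
            result = self._cache[token] = self._normalizer(token)
            return result

    def normalize_document(self, tokens: List[str]) -> List[str]:
        return [self.normalize_token(t) for t in tokens]


def toy_stem(token: str) -> str:
    # Stand-in for an expensive stemmer or lemmatizer (assumption).
    return token.lower().rstrip("s")


norm = CachedNormalizer(toy_stem)
docs = [["Cats", "chase", "cats"], ["Dogs", "chase", "Cats"]]
normalized = [norm.normalize_document(d) for d in docs]
print(normalized)
print(norm.calls)  # the six tokens above need only four normalizer calls
```

The benefit grows with the share of repeated tokens in the corpus, which would be consistent with the larger gains reported for the bigger datasets above.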

Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec changed the title from "normalize: speedup preprocessing with lru_cache" to "normalize: speedup preprocessing with caching" on Aug 17, 2021

codecov-commenter commented Aug 17, 2021

Codecov Report

Merging #709 (cedb7d6) into master (994ff6a) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
- Coverage   74.01%   74.00%   -0.01%     
==========================================
  Files          72       72              
  Lines        9477     9484       +7     
  Branches     1291     1292       +1     
==========================================
+ Hits         7014     7019       +5     
- Misses       2218     2219       +1     
- Partials      245      246       +1     

@@ -1,3 +1,4 @@
from functools import lru_cache
Collaborator

Is this ever used?

Collaborator Author

I forgot to remove it after I changed my implementation. It is now removed. Thanks.

@PrimozGodec force-pushed the preprocess-speedup branch 2 times, most recently from e5e35d1 to 4feabd1 on August 23, 2021 12:22
@lanzagar merged commit e8a41fb into biolab:master on Aug 23, 2021
@PrimozGodec deleted the preprocess-speedup branch on June 3, 2022 10:52
4 participants