Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Keywords - replace embedding with MBERT #932

Merged
merged 2 commits into from
Feb 28, 2023

Conversation

PrimozGodec
Copy link
Collaborator

Issue

Embedding keyword has poor performance and is slow.

Description of changes

Replace Embedding keyword extractor with MBERT with excellent performance. This keyword extractor extracts keywords with higher F1 scores than all existing extractors. The downside of this extractor is that it, on average extracts less than 5 keywords per document and also 0 many times.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-commenter
Copy link

Codecov Report

Merging #932 (3859ad1) into master (db8f8c1) will increase coverage by 0.08%.
The diff coverage is 93.93%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #932      +/-   ##
==========================================
+ Coverage   77.54%   77.63%   +0.08%     
==========================================
  Files          86       86              
  Lines       12280    12266      -14     
  Branches     1609     1602       -7     
==========================================
  Hits         9523     9523              
+ Misses       2458     2446      -12     
+ Partials      299      297       -2     

from .decorators import *
from .widgets import *
from .concurrent import asynchronous


def enum2int(enum: Union[Enum, IntEnum]) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this imported from Orange.widgets.utils?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is actually already merged. I forgot to rebase this PR. :) Anyway enum2int from Orange is not released yet. I will add a test that will start to fail after some time and warn about removing it from text after some time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it was written here before it existed in Orange. I propose a time bomb that will start to fail and warn about removing it. #951

return b64encode(zlib.compress(instance.encode("utf-8", "replace"), level=9))


if __name__ == "__main__":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run this, it crashes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@VesnaT
Copy link
Contributor

VesnaT commented Feb 28, 2023

Event with the two comments I'm happy to merge.

@PrimozGodec
Copy link
Collaborator Author

@VesnaT since you approved it I am merging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants