[ENH] Score Documents - Use SBERT embedding instead of FastText #930

Merged 2 commits on Mar 6, 2023

PrimozGodec
Collaborator

Issue

SBERT embedding is (in our opinion) more suitable for measuring distances between embeddings of documents and words. It embeds the complete text rather than each word separately, which gives a better representation of the whole context of the document.
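To make the idea of "distance between a document embedding and word embeddings" concrete, here is a minimal sketch. It assumes cosine similarity averaged over the keywords; the PR text does not state which measure or aggregation the widget uses, so `cosine_similarity`, `document_similarity`, and the toy vectors are illustrative names and data, not the widget's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(
    doc_vec: Sequence[float], word_vecs: Sequence[Sequence[float]]
) -> float:
    """Mean cosine similarity between one document vector and all keyword vectors.

    Averaging is an assumption made for this sketch.
    """
    if not word_vecs:
        return 0.0
    return sum(cosine_similarity(doc_vec, w) for w in word_vecs) / len(word_vecs)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
doc_about_bikes = [0.9, 0.1, 0.0]
doc_about_cooking = [0.0, 0.2, 0.9]
keyword_bicycle = [1.0, 0.0, 0.0]

bike_score = document_similarity(doc_about_bikes, [keyword_bicycle])
cooking_score = document_similarity(doc_about_cooking, [keyword_bicycle])
```

With a whole-text embedder, each document contributes a single vector like `doc_about_bikes`, so the score reflects the document as a whole rather than an aggregate of per-word vectors.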

Description of changes

This PR replaces the FastText embedding with SBERT in Score Documents. It also addresses some weaknesses of the widget:

  • It implements the option to send a list of documents to the server. The same option is used when sending words to the server, so each term is no longer sent in a separate request.
  • When embedding fails, the widget now shows a warning that similarity cannot be computed because some embeddings were unsuccessful. Before, it failed and didn't show any scores; now, in case of failure, it shows all other scores except similarity.
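
The two behaviours above can be sketched roughly as follows. This is not the code from the PR: `score_documents`, the `BatchEmbedder` type, the "Word count" scorer, and the warning text are hypothetical stand-ins, and the embedder is assumed to return `None` for any text it could not embed.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]
# Hypothetical batch embedder: receives all texts in one call and returns one
# entry per text, with None marking a text whose embedding failed.
BatchEmbedder = Callable[[Sequence[str]], List[Optional[Vector]]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_documents(
    documents: Sequence[str], words: Sequence[str], embed_batch: BatchEmbedder
) -> Tuple[Dict[str, List[float]], Optional[str]]:
    """Return (scores, warning); similarity is skipped if any embedding failed."""
    # A score that needs no embedding, so it survives an embedding failure.
    scores: Dict[str, List[float]] = {
        "Word count": [
            float(sum(doc.lower().split().count(w.lower()) for w in words))
            for doc in documents
        ]
    }

    # One call for all documents and one for all words,
    # instead of one request per term.
    doc_vecs = embed_batch(documents)
    word_vecs = embed_batch(words)

    if any(v is None for v in doc_vecs + word_vecs):
        warning = "Similarity cannot be computed: some embeddings were unsuccessful."
        return scores, warning

    # Mean dot product per document; the real aggregation may differ.
    n_words = max(len(word_vecs), 1)
    scores["Similarity"] = [
        sum(dot(d, w) for w in word_vecs) / n_words for d in doc_vecs
    ]
    return scores, None
```

The design choice illustrated here is that a failed embedding degrades only the similarity column, while the remaining scores are still returned alongside a warning.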

Includes

  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec force-pushed the score-documents-bert branch 2 times, most recently from c26aaca to d6e2f75 Compare January 18, 2023 16:14
@PrimozGodec
Collaborator Author

Comparison of fastText (first image) and sBERT (second image) similarity for the keyword "bicycle". The two look similar, but sBERT seems to focus more on the documents that actually contain the searched content: on the right side of the image, fewer dots are bright green and yellow.

similarity-fasttext-bike
similarity-bert-bike

@PrimozGodec
Collaborator Author

/rebase
