[FIX] Ontology - remove cache and other fixes #896

Merged
merged 2 commits into from
Oct 5, 2022

Conversation

PrimozGodec
Collaborator

@PrimozGodec PrimozGodec commented Aug 30, 2022

Issue
Description of changes
  • Since SBERT now sends single documents (words, in the case of the ontology) to the server, and the SBERT embedder already uses an embedding cache, a separate ontology cache is no longer required. This PR removes the ontology embedding cache. Sending single words to the embedder is still not optimal and is slow for terms, so in the future I will change the SBERT embedder to allow sending several words in a single request (the cache will be handled there).
  • This PR also removes the similarity cache. It is not required, since computing pairwise cosine similarity for the whole matrix at once turned out to be faster than looping over the cache to find cached results.
  • Handle cases when some embeddings cannot be retrieved (most likely due to a connection error). When generating, non-embedded words are excluded, and the user gets a warning in the widget. To discuss: is this a suitable solution, or should the computation fail?
  • Fixes incompatibilities with PyQt6 in the Ontology widget
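
To illustrate the second and third points, here is a minimal sketch, not the code from this PR. It assumes the embedder returns `None` for a word whose embedding could not be retrieved; the helper names `drop_missing` and `pairwise_cosine` are hypothetical. Plain Python is used only to keep the sketch dependency-free; a vectorized matrix product would be the natural choice in practice.

```python
import math
from typing import List, Optional, Sequence, Tuple


def drop_missing(
    words: Sequence[str], embeddings: Sequence[Optional[Sequence[float]]]
) -> Tuple[List[str], List[Sequence[float]], List[str]]:
    """Split words into embedded ones and skipped ones (embedding is None)."""
    kept_words, kept_embs, skipped = [], [], []
    for word, emb in zip(words, embeddings):
        if emb is None:
            skipped.append(word)
        else:
            kept_words.append(word)
            kept_embs.append(emb)
    return kept_words, kept_embs, skipped


def pairwise_cosine(embs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Cosine similarity of every row with every other row, no cache lookups."""
    # Normalize each row once; similarity is then a plain dot product.
    unit = []
    for row in embs:
        norm = math.sqrt(sum(v * v for v in row))
        unit.append([v / norm for v in row] if norm else list(row))
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in unit] for r1 in unit]


words = ["cat", "dog", "car"]
embeddings = [[1.0, 0.0], None, [0.0, 2.0]]  # None: embedding not retrieved
kept, embs, skipped = drop_missing(words, embeddings)
similarity = pairwise_cosine(embs)
if skipped:
    # Stand-in for the widget warning mentioned above.
    print(f"Warning: no embedding for {skipped}; these words were excluded.")
```

In this sketch the skipped words are reported rather than raising an error, which corresponds to the first of the two options raised for discussion.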
Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec marked this pull request as draft August 30, 2022 11:14
@codecov-commenter

Codecov Report

Merging #896 (78e7010) into master (339ad59) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #896      +/-   ##
==========================================
+ Coverage   76.94%   76.99%   +0.04%     
==========================================
  Files          86       86              
  Lines       12151    12096      -55     
  Branches     1905     1896       -9     
==========================================
- Hits         9350     9313      -37     
+ Misses       2485     2468      -17     
+ Partials      316      315       -1     

@PrimozGodec PrimozGodec marked this pull request as ready for review September 2, 2022 10:46
@PrimozGodec
Collaborator Author

Rebase after #900 is merged

@PrimozGodec
Collaborator Author

/rebase

@ajdapretnar
Collaborator

For reference, I did get an error mid-process:

---------------------------- IndexError Exception -----------------------------
Traceback (most recent call last):
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/widgets/owontology.py", line 796, in __on_ontology_data_changed
    self.__update_score()
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/widgets/owontology.py", line 859, in __update_score
    score = round(self.__onto_handler.score(tree), 2) \
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/ontology.py", line 293, in score
    embeddings = self.embedder(tree.labels, wrap_callback(callback, end=0.7))
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/vectorization/sbert.py", line 50, in __call__
    sorted_texts = sorted(
  File "/Users/ajda/orange/orange3-text/orangecontrib/text/vectorization/sbert.py", line 52, in <lambda>
    key=lambda x: len(x[1][0]) if x[1] is not None else 0,
IndexError: string index out of range
-------------------------------------------------------------------------------

However, it was not the widget that raised it; I just saw it in the background. So I cannot say when it happened or how to reproduce it. I tried, but now (of course) everything works.

So I am ignoring this error for now, but at least we are aware it might happen.
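
For what it's worth, the message in the traceback is what Python raises when an empty string is indexed, so one input that produces the same error with the sort key from `sbert.py` line 52 is an empty text. The snippet below is only a sketch of that: the `texts` list is made up, and the guarded key is an assumption, not the fix that was actually applied. Whether an empty string was the real trigger here is not confirmed in this thread.

```python
texts = ["cat", "", None]  # hypothetical input containing an empty string

# Sort key copied from the traceback (sbert.py, line 52).
key_from_traceback = lambda x: len(x[1][0]) if x[1] is not None else 0

try:
    sorted(enumerate(texts), key=key_from_traceback, reverse=True)
    error = None
except IndexError as exc:
    # "" is not None, so ""[0] is evaluated and fails.
    error = str(exc)

# Assumed guard: treat empty strings like missing texts.
safe_key = lambda x: len(x[1][0]) if x[1] else 0
order = [i for i, _ in sorted(enumerate(texts), key=safe_key, reverse=True)]
```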

@ajdapretnar ajdapretnar merged commit 6a055b3 into biolab:master Oct 5, 2022
@ajdapretnar
Collaborator

I know when the issue happens. It is described in #883.

@PrimozGodec PrimozGodec deleted the fix-ontology-cache branch October 5, 2022 08:58
Development

Successfully merging this pull request may close these issues.

ontology.py: saving unsanitized words
3 participants