[ENH] Document embedding #504

Merged 2 commits on Apr 22, 2020

@djukicn
Collaborator

djukicn commented Mar 6, 2020

Issue

Bag of words is currently the only method for vector representation of documents in the text add-on. This pull request adds another method, based on the pretrained fastText models described in:
E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov,
Learning Word Vectors for 157 Languages.
Proceedings of the International Conference on Language Resources and Evaluation, 2018.

Description of changes
  • A new widget and script for obtaining document embeddings from a given corpus are added.
  • Embeddings are calculated on a server, which removes the need to load large models locally.
  • Currently supported languages are English, Slovenian, and German.
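
To illustrate what a document embedding built from word vectors looks like, here is a minimal sketch that averages the vectors of a document's tokens. The averaging step, the function name `embed_document`, and the toy vectors are assumptions made for illustration; this description does not state how the server combines word vectors, and real fastText models can also produce vectors for unseen words, which this sketch simply skips.

```python
from typing import Dict, List, Sequence


def embed_document(
    tokens: Sequence[str], word_vectors: Dict[str, List[float]]
) -> List[float]:
    """Average the vectors of all tokens that have a known word vector.

    Assumption: mean aggregation. Tokens without a vector are ignored.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy two-dimensional vectors, purely illustrative.
toy_vectors = {"orange": [1.0, 0.0], "text": [0.0, 1.0]}
doc_vector = embed_document(["orange", "text", "unknown"], toy_vectors)
print(doc_vector)
```

Every document ends up with a vector of the same length regardless of how many tokens it contains, which is what lets downstream widgets treat documents as rows of a dense table.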

How to test?

Until the cluster gets a permanent address, the server URL is not hardcoded in the script. To test, run Orange with the environment variable set:

ORANGE_EMBEDDING_API_URL=https://apiv2.garaza.io orange-canvas

The environment variable can also be added to the run configuration in PyCharm.
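
For reference, a script could pick up this variable along the following lines. This is only a sketch: the helper name `get_embedding_api_url` and the choice to return `None` when the variable is unset are assumptions, not the code in this PR.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "ORANGE_EMBEDDING_API_URL"


def get_embedding_api_url(
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the server URL from the environment, or None if it is unset.

    `environ` defaults to os.environ; passing a dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    url = env.get(ENV_VAR, "").strip()
    # Drop a trailing slash so paths can be appended consistently.
    return url.rstrip("/") or None


print(get_embedding_api_url({ENV_VAR: "https://apiv2.garaza.io/"}))
```

Returning `None` rather than a built-in default makes it obvious when the variable was forgotten, which fits the situation where no permanent address exists yet.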

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io

codecov-io commented Mar 19, 2020

Codecov Report

Merging #504 into master will increase coverage by 1.15%.
The diff coverage is 95.95%.

@@            Coverage Diff             @@
##           master     #504      +/-   ##
==========================================
+ Coverage   63.76%   64.91%   +1.15%     
==========================================
  Files          59       62       +3     
  Lines        6306     6539     +233     
  Branches      828      851      +23     
==========================================
+ Hits         4021     4245     +224     
- Misses       2151     2157       +6     
- Partials      134      137       +3     

@djukicn djukicn force-pushed the dense-embeddings branch 9 times, most recently from f5d2ea5 to 41c11a4, April 13, 2020 15:23
@djukicn djukicn force-pushed the dense-embeddings branch 2 times, most recently from a48d225 to c775a4a, April 15, 2020 12:14
@djukicn djukicn force-pushed the dense-embeddings branch 3 times, most recently from a73d351 to 9f2919b, April 17, 2020 08:39
@ajdapretnar
Collaborator

I like this widget, it works well. I think we should merge this as soon as it is remotely operational and then address some minor details in future PRs.

Something that could probably be improved is what happens when the internet is off. It works quite well in Image Embedding, but here it still takes a while for the error to appear. Perhaps it can't be fixed, perhaps it can.

Also, why is the auto_commit box so big? There's a weird extra space above the button, but not because of this PR: it seems to be a problem with the inherited box itself. Any ideas how to fix this?

@djukicn
Collaborator Author

djukicn commented Apr 20, 2020

@ajdapretnar
I haven't been able to reproduce the case where it takes a while for the error to appear. It appears almost instantly for me.
As for the size of the auto commit box, I'll explore it further and hopefully find a way to fix it.

@djukicn djukicn force-pushed the dense-embeddings branch 2 times, most recently from 07261b6 to 256de32, April 20, 2020 11:44
@PrimozGodec
Collaborator

When the internet is off there are two possible cases:

  • When the internet is already off when a request is sent, the error is raised immediately.
  • When the connection is lost during embedding, the embedder keeps waiting for responses to requests that were already sent and times out after a minute. In this case, it takes longer to notice that the connection is off.

The second case is less likely, and Image Embedding behaves the same way. I think this behavior is OK from the user's point of view: when the internet is off and you request a website in a browser, it also keeps trying for some time before reporting that the site is not reachable.
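
The two cases can be told apart by the kind of exception the request raises. The sketch below uses only the standard library and simulates both failures with stand-in functions. The names `classify_request`, `offline_from_start`, and `lost_mid_request` are hypothetical and do not come from the embedder code; only the one-minute timeout is taken from the description above.

```python
import socket
from typing import Callable


def classify_request(send: Callable[[float], object], timeout_s: float = 60.0) -> str:
    """Run one request and report how it ended.

    `send` is a hypothetical callable that performs the request and
    receives the timeout in seconds.
    """
    try:
        send(timeout_s)
    except (socket.timeout, TimeoutError):
        # Connection lost mid-request: detected only once the timeout expires.
        return "timed out"
    except OSError:
        # No connection when the request is sent: fails right away.
        return "unreachable"
    return "ok"


def offline_from_start(timeout_s: float) -> None:
    # Simulates a request sent while the network is already down.
    raise ConnectionRefusedError("network is down")


def lost_mid_request(timeout_s: float) -> None:
    # Simulates a response that never arrives.
    raise socket.timeout(f"no response within {timeout_s} s")


print(classify_request(offline_from_start))
print(classify_request(lost_mid_request))
```

The timeout handler has to come first because timeout exceptions are themselves subclasses of `OSError`.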

@PrimozGodec
Collaborator

PrimozGodec commented Apr 21, 2020

@ajda for me the widget does not have any extra space around the auto-apply box
Screenshot 2020-04-21 12 42 57

@djukicn djukicn force-pushed the dense-embeddings branch 2 times, most recently from 1ceea31 to 01c92de, April 21, 2020 11:25
@djukicn djukicn changed the title from "[WIP] Document embedding" to "[ENH] Document embedding" Apr 21, 2020
@PrimozGodec
Collaborator

Now I see what @ajdapretnar meant by the extra space. It seems to appear only on Ubuntu (and maybe Windows). @djukicn, do you also have this extra space above the auto-commit button in the other widgets?

The URL is now set, and Matjaz said that the server can go into production for the document embedder. So from my side, it can be merged. @ajda I suggest that you merge it if everything is OK from your side.

@PrimozGodec PrimozGodec self-requested a review April 21, 2020 13:43
@ajdapretnar
Collaborator

Perfect, I will have a final look and merge. I can't wait for this one! 🎉

@djukicn
Collaborator Author

djukicn commented Apr 21, 2020

@PrimozGodec @ajdapretnar I've managed to get rid of the empty space by passing box=False to auto_commit.

@ajdapretnar
Collaborator

@djukicn could you please rebase the branch so that I can merge asap? :) Thanks!

@djukicn
Collaborator Author

djukicn commented Apr 22, 2020

@ajdapretnar Done.

@ajdapretnar ajdapretnar merged commit 3677f8e into biolab:master Apr 22, 2020