Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Import documents: Import from URL #637

Merged
merged 3 commits into from
Apr 28, 2021

Conversation

VesnaT
Copy link
Contributor

@VesnaT VesnaT commented Mar 30, 2021

Issue
Description of changes
  • introduce URL reader
  • add URL option to Import Documents widget
Includes
  • Code changes
  • Tests
  • Documentation

@VesnaT VesnaT force-pushed the import_documents_url branch from 08e93cf to 6ca6394 Compare March 30, 2021 12:00
@codecov-io
Copy link

Codecov Report

Merging #637 (6ca6394) into master (4c4069b) will increase coverage by 0.31%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master     #637      +/-   ##
==========================================
+ Coverage   71.29%   71.60%   +0.31%     
==========================================
  Files          66       66              
  Lines        7806     7939     +133     
  Branches     1027     1037      +10     
==========================================
+ Hits         5565     5685     +120     
- Misses       2038     2048      +10     
- Partials      203      206       +3     

@PrimozGodec
Copy link
Collaborator

One more comment that we do not forget: this implementation is using add_column on a table. We need to raise the minimum version of Orange before it is merged to 3.28 (if I am right it is the first version supporting add_column).

@VesnaT
Copy link
Contributor Author

VesnaT commented Apr 6, 2021

True, I'll raise the version.

@VesnaT VesnaT force-pushed the import_documents_url branch 2 times, most recently from 737cdb4 to d58b7e0 Compare April 8, 2021 08:58
from Orange.util import Registry

from orangecontrib.text.corpus import Corpus


DefaultFormats = ("docx", "odt", "txt", "pdf", "xml")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should .yaml also be specified here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it should not. These formats are 'documents' formats. Each file of these format represents a document. Metadata is considered differently, since it appends data to each document. Its formats are hardcoded in _read_meta_data().

@ajdapretnar
Copy link
Collaborator

I noticed there's a YAML reader and a function that should recognize metadata. But I don't see how this is reflected in the widget.

@VesnaT
Copy link
Contributor Author

VesnaT commented Apr 9, 2021

It is not. It just reads the metadata, if there is any. Do you think we should add something to the GUI? What do you propose?
Maybe the name of the KEY column (the one that connects the metadata and the document).

@VesnaT VesnaT force-pushed the import_documents_url branch 2 times, most recently from 776982f to 72ed961 Compare April 14, 2021 11:13
@PrimozGodec PrimozGodec mentioned this pull request Apr 16, 2021
3 tasks
@VesnaT VesnaT force-pushed the import_documents_url branch from 72ed961 to c23a605 Compare April 28, 2021 07:55
@VesnaT VesnaT force-pushed the import_documents_url branch from c23a605 to f1d0dcf Compare April 28, 2021 12:47
@PrimozGodec PrimozGodec merged commit f2d7b44 into biolab:master Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants