Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import Documents: Add conllu reader #675

Merged
merged 6 commits into from
Jul 23, 2021
Merged

Conversation

ajdapretnar
Copy link
Collaborator

@ajdapretnar ajdapretnar commented Jun 22, 2021

Issue

Conllu is a very standard format for text analysis and Orange didn't support it yet.

Description of changes
  • Add conllu reader to Import Documents. (It reads files and returns a data table with utterance per row (sentences are joined).)
  • Support metadata in .tsv format.
Includes
  • Code changes
  • Tests
  • Documentation
TODO
  • Interface for adding additional conllu attributes (lemmas, POS tags, NER)
  • Tests
  • Check how readers behave with different conllu formats
  • Check how conllu works with mixed file types
  • Fix error when checking boxes before selecting the folder

@ajdapretnar ajdapretnar requested a review from PrimozGodec June 22, 2021 11:54
@ajdapretnar ajdapretnar force-pushed the conllu branch 3 times, most recently from 3c08413 to da6b89a Compare June 30, 2021 09:40
@ajdapretnar ajdapretnar changed the title Import Documents: Add conllu reader [WIP] Import Documents: Add conllu reader Jul 6, 2021
@ajdapretnar
Copy link
Collaborator Author

The PR has been tested on several different conllu files:
https://www.clarin.si/repository/xmlui/handle/11356/1434
https://www.clarin.si/repository/xmlui/handle/11356/1441
https://www.clarin.si/repository/xmlui/handle/11356/1241
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3687 (Afrikaans, Cantonese, Turkish)

It can handle mixed data types (i.e. conllu & txt), but in this case, there will be no preprocessing (it will ignore lemmas, tags and NER).

It would be nice to add a Conllu reader to Corpus, to read single conllu files. But this is out of the scope of this PR.

@ajdapretnar
Copy link
Collaborator Author

Major issue still to be solved: how to handle named entities? Currently, they appear as a comma-separated string, which is less than ideal. The user then has to once again tokenize it to count the appearances. Should we create a new Corpus property and do something with it? 🤔

@codecov-commenter
Copy link

codecov-commenter commented Jul 16, 2021

Codecov Report

Merging #675 (68b4408) into master (eddfa45) will increase coverage by 0.23%.
The diff coverage is 90.06%.

@@            Coverage Diff             @@
##           master     #675      +/-   ##
==========================================
+ Coverage   74.09%   74.33%   +0.23%     
==========================================
  Files          72       72              
  Lines        9230     9362     +132     
  Branches     1253     1276      +23     
==========================================
+ Hits         6839     6959     +120     
- Misses       2149     2153       +4     
- Partials      242      250       +8     

orangecontrib/text/import_documents.py Outdated Show resolved Hide resolved
orangecontrib/text/import_documents.py Outdated Show resolved Hide resolved
orangecontrib/text/widgets/owimportdocuments.py Outdated Show resolved Hide resolved
@ajdapretnar ajdapretnar changed the title [WIP] Import Documents: Add conllu reader Import Documents: Add conllu reader Jul 19, 2021
@@ -635,10 +661,27 @@ def __onReportProgress(self, arg):
self.pathlabel.setText(prettifypath(arg.lastpath))
self.progress_widget.setValue(int(100 * arg.progress))

def add_features(self):
self.Warning.clear()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is clearing the warnings here intentional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, it as probably left here from when there was a warning about missing lemmas (if the user selected only POS tags and not lemmas). Now we removed the warning and POS tags are loaded without lemmas - whatever that may mean for downstream preprocessing (leave it to the user). I'll remove the line.

@VesnaT VesnaT merged commit b093ab5 into biolab:master Jul 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants