Import Documents: Add conllu reader #675

ajdapretnar · 2021-06-22T11:54:44Z

Issue

Conllu (CoNLL-U) is a standard format for annotated text, and Orange does not support it yet.

Description of changes

Add conllu reader to Import Documents. It reads the files and returns a data table with one utterance per row (the sentences of an utterance are joined).
Support metadata in .tsv format.
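
To illustrate the "one utterance per row" behaviour described above, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `read_utterances`, the sample data, and the choice of `# newdoc id` comments as utterance boundaries are all assumptions made for the example.

```python
# Hypothetical sketch, not the reader added in this PR.
# Assumption: each utterance starts with a "# newdoc id = ..." comment.

SAMPLE = """\
# newdoc id = doc1
# sent_id = 1
# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# sent_id = 2
# text = Cats meow.
1\tCats\tcat\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tmeow\tmeow\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# newdoc id = doc2
# sent_id = 3
# text = Birds sing.
1\tBirds\tbird\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tsing\tsing\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_utterances(conllu_text):
    """Return one dict per utterance, with its sentences joined."""
    rows = []
    current = None
    for line in conllu_text.splitlines():
        if line.startswith("# newdoc id"):
            # A new utterance begins; start a fresh row.
            current = {
                "id": line.split("=", 1)[1].strip(),
                "sentences": [],
                "lemmas": [],
                "pos": [],
            }
            rows.append(current)
        elif current is None:
            continue  # ignore anything before the first utterance
        elif line.startswith("# text ="):
            current["sentences"].append(line.split("=", 1)[1].strip())
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Plain integer IDs only: skip multiword ranges (1-2)
            # and empty nodes (1.1).
            if fields[0].isdigit():
                current["lemmas"].append(fields[2])  # LEMMA column
                current["pos"].append(fields[3])     # UPOS column
    return [
        {
            "id": r["id"],
            "text": " ".join(r["sentences"]),
            "lemmas": r["lemmas"],
            "pos": r["pos"],
        }
        for r in rows
    ]


rows = read_utterances(SAMPLE)
```

Each element of `rows` corresponds to one table row; the lemma and POS lists stand in for the optional attributes mentioned in the TODO list.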

Includes

Code changes
Tests
Documentation

TODO

Interface for adding additional conllu attributes (lemmas, POS tags, NER)
Tests
Check how readers behave with different conllu formats
Check how conllu works with mixed file types
Fix error when checking boxes before selecting the folder

ajdapretnar · 2021-07-13T13:16:03Z

The PR has been tested on several different conllu files:
https://www.clarin.si/repository/xmlui/handle/11356/1434
https://www.clarin.si/repository/xmlui/handle/11356/1441
https://www.clarin.si/repository/xmlui/handle/11356/1241
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3687 (Afrikaans, Cantonese, Turkish)

It can handle mixed data types (i.e. conllu & txt), but in this case, there will be no preprocessing (it will ignore lemmas, tags and NER).

It would be nice to add a Conllu reader to Corpus, to read single conllu files. But this is out of the scope of this PR.

ajdapretnar · 2021-07-13T14:11:40Z

Major issue still to be solved: how to handle named entities? Currently, they appear as a comma-separated string, which is less than ideal. The user then has to once again tokenize it to count the appearances. Should we create a new Corpus property and do something with it? 🤔
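
For illustration, this is roughly the extra step a user currently has to take to count entities from the comma-separated column. The function name `count_entities` and the sample values are hypothetical and not part of this PR.

```python
from collections import Counter


def count_entities(ner_column):
    """Count entities across rows of comma-separated entity strings."""
    counts = Counter()
    for cell in ner_column:
        # Split each cell again and drop empty pieces.
        entities = [e.strip() for e in cell.split(",") if e.strip()]
        counts.update(entities)
    return counts


# Hypothetical column values, one string per document.
ner_column = ["Ljubljana, Slovenia", "Slovenia, Orange", ""]
entity_counts = count_entities(ner_column)
```

Note that this splitting breaks for entity names that themselves contain a comma, which is one more reason the string representation is less than ideal.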

codecov-commenter · 2021-07-16T12:41:05Z

Codecov Report

Merging #675 (68b4408) into master (eddfa45) will increase coverage by 0.23%.
The diff coverage is 90.06%.

@@            Coverage Diff             @@
##           master     #675      +/-   ##
==========================================
+ Coverage   74.09%   74.33%   +0.23%     
==========================================
  Files          72       72              
  Lines        9230     9362     +132     
  Branches     1253     1276      +23     
==========================================
+ Hits         6839     6959     +120     
- Misses       2149     2153       +4     
- Partials      242      250       +8

VesnaT · 2021-07-22T10:43:40Z

orangecontrib/text/widgets/owimportdocuments.py

@@ -635,10 +661,27 @@ def __onReportProgress(self, arg):
            self.pathlabel.setText(prettifypath(arg.lastpath))
            self.progress_widget.setValue(int(100 * arg.progress))

+    def add_features(self):
+        self.Warning.clear()


Is clearing the warnings here intentional?

Uh, it was probably left here from when there was a warning about missing lemmas (if the user selected only POS tags and not lemmas). We have since removed the warning, and POS tags are now loaded without lemmas, whatever that may mean for downstream preprocessing (we leave it to the user). I'll remove the line.

ajdapretnar requested a review from PrimozGodec June 22, 2021 11:54

ajdapretnar force-pushed the conllu branch from 730873b to ffacabb Compare June 22, 2021 14:02

ajdapretnar force-pushed the conllu branch 3 times, most recently from 3c08413 to da6b89a Compare June 30, 2021 09:40

ajdapretnar changed the title ~~Import Documents: Add conllu reader~~ [WIP] Import Documents: Add conllu reader Jul 6, 2021

ajdapretnar force-pushed the conllu branch from 35ca5c9 to 7760c8f Compare July 13, 2021 13:09

ajdapretnar force-pushed the conllu branch from 7760c8f to 650ebac Compare July 13, 2021 14:10

PrimozGodec requested changes Jul 19, 2021

orangecontrib/text/import_documents.py (outdated, resolved)

orangecontrib/text/import_documents.py (outdated, resolved)

orangecontrib/text/widgets/owimportdocuments.py (outdated, resolved)

ajdapretnar force-pushed the conllu branch from 650ebac to 63ef366 Compare July 19, 2021 09:13

ajdapretnar mentioned this pull request Jul 19, 2021

Preprocess: Reset tags on tokenization #686

Merged

3 tasks

PrimozGodec reviewed Jul 19, 2021

orangecontrib/text/import_documents.py (outdated, resolved)

ajdapretnar force-pushed the conllu branch from 63ef366 to e44a7f3 Compare July 19, 2021 11:30

ajdapretnar changed the title ~~[WIP] Import Documents: Add conllu reader~~ Import Documents: Add conllu reader Jul 19, 2021

PrimozGodec approved these changes Jul 19, 2021

VesnaT approved these changes Jul 22, 2021

ajdapretnar added 6 commits July 22, 2021 14:39

Import Documents: Add conllu reader

5cdd52e

Add metadata reader for conllu

d836ebf

Import Documents: Interface for conllu

f235c2f

Conllu reader: tests

8c44855

Don't crash on empty path

7d5bf90

[DOC] Import Documents: Support conllu

68b4408

ajdapretnar force-pushed the conllu branch from b95c571 to 68b4408 Compare July 22, 2021 12:39

VesnaT merged commit b093ab5 into biolab:master Jul 23, 2021


VesnaT Jul 22, 2021

ajdapretnar Jul 22, 2021
