Import Documents: Add conllu reader #675
3c08413 to da6b89a
The PR has been tested on several different conllu files. It can handle mixed data types (i.e. conllu & txt), but in that case there will be no preprocessing (lemmas, tags and NER will be ignored). It would be nice to add a Conllu reader to Corpus to read single conllu files, but that is out of the scope of this PR.
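For readers unfamiliar with the format, here is a minimal sketch of pulling lemmas and POS tags out of CoNLL-U text using only the standard library. It is not the reader added in this PR: the helper name `read_conllu_tokens` and the sample sentence are made up for illustration. It relies only on the basics of the format: `#` comment lines, blank lines between sentences, and ten tab-separated columns per token.

```python
from typing import Dict, List

# Column names defined by the CoNLL-U format (ten tab-separated fields).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu_tokens(text: str) -> List[List[Dict[str, str]]]:
    """Hypothetical helper: return sentences as lists of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue  # skip malformed lines instead of guessing
        token = dict(zip(FIELDS, columns))
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if not token["id"].isdigit():
            continue
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Made-up sample, not taken from the PR's test files.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = read_conllu_tokens(SAMPLE)
lemmas = [tok["lemma"] for tok in sentences[0]]
pos_tags = [tok["upos"] for tok in sentences[0]]
```

A plain txt file has none of these columns, which is why mixed imports fall back to raw text without lemmas or tags.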
Major issue still to be solved: how to handle named entities? Currently they appear as a comma-separated string, which is less than ideal, because the user has to tokenize it again to count occurrences. Should we create a new Corpus property and do something with it? 🤔
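To make the extra step concrete, this is roughly what a user would have to do today to count entities, assuming one comma-separated string per document. The helper `count_entities` and the example values are hypothetical and not part of the PR.

```python
from collections import Counter
from typing import Iterable


def count_entities(ner_values: Iterable[str], sep: str = ",") -> Counter:
    """Hypothetical helper: count entities across comma-separated NER strings."""
    counts: Counter = Counter()
    for value in ner_values:
        for entity in value.split(sep):
            entity = entity.strip()
            if entity:  # ignore empty pieces from documents without entities
                counts[entity] += 1
    return counts


# Made-up example values, one string per document.
docs_ner = ["Ljubljana, Slovenia", "Slovenia", ""]
entity_counts = count_entities(docs_ner)
```

Note that splitting on commas breaks any entity that itself contains a comma, which is one more argument for storing entities in a structured form rather than a joined string.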
Codecov Report
@@            Coverage Diff             @@
##           master     #675      +/-   ##
==========================================
+ Coverage   74.09%   74.33%   +0.23%
==========================================
  Files          72       72
  Lines        9230     9362     +132
  Branches     1253     1276      +23
==========================================
+ Hits         6839     6959     +120
- Misses       2149     2153       +4
- Partials      242      250       +8
@@ -635,10 +661,27 @@ def __onReportProgress(self, arg):
        self.pathlabel.setText(prettifypath(arg.lastpath))
        self.progress_widget.setValue(int(100 * arg.progress))

    def add_features(self):
        self.Warning.clear()
Is clearing the warnings here intentional?
Uh, it was probably left here from when there was a warning about missing lemmas (if the user selected only POS tags and not lemmas). We have since removed the warning, and POS tags are now loaded without lemmas; whatever that may mean for downstream preprocessing is left to the user. I'll remove the line.
Issue
Conllu is a standard format for text analysis, and Orange did not support it yet.
Description of changes
Includes
TODO