[ENH][WIP] Add UDPipe Lemmatizer #367
Codecov Report
@@            Coverage Diff             @@
##           master     #367     +/-    ##
==========================================
+ Coverage   85.32%   85.44%   +0.11%
==========================================
  Files          34       34
  Lines        1881     1951     +70
  Branches      337      344      +7
==========================================
+ Hits         1605     1667     +62
- Misses        237      243      +6
- Partials       39       41      +2
This looks really good so far. Once the pipeline is agreed upon, I would also introduce the other languages in this batch. I really like that resources are downloaded only when a language is selected, which means that if I never use, say, Coptic, its model won't be downloaded locally.
Please check document 7 of slo-opinion-lexicon with UDPipe tokenization. I don't understand why this tokenization treats a full stop at the end as part of the token, for example "narediti.". In document 4 it doesn't do that. Could it be that sentence parsing isn't working well?
Before merging it would be nice to:
Branch updated: ab07afe to 6efd69c
In addition to the code changes, all UDPipe models were added to the server.
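For readers following the thread, the download-on-selection behavior mentioned in the review could look roughly like the sketch below. This is not the code from this PR: `LazyModelStore`, `fake_fetch`, the `download_count` bookkeeping, and the `<language>.udpipe` file naming are all hypothetical names chosen for illustration, and the downloader is injected so the example runs without a network connection.

```python
from pathlib import Path
from typing import Callable, Dict


class LazyModelStore:
    """Fetch a language model only the first time that language is requested.

    Hypothetical helper for illustration; not part of this PR.
    """

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        # Assumed downloader interface: language name -> raw model bytes.
        self.fetch = fetch
        # Counts real downloads per language, to show that caching works.
        self.download_count: Dict[str, int] = {}

    def model_path(self, language: str) -> Path:
        # Assumed naming scheme: one file per language in the cache directory.
        path = self.cache_dir / f"{language.lower()}.udpipe"
        if not path.exists():
            # Nothing is created or downloaded until a language is selected.
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_bytes(self.fetch(language))
            self.download_count[language] = self.download_count.get(language, 0) + 1
        return path


def fake_fetch(language: str) -> bytes:
    # Stand-in for a real HTTP download from the model server.
    return f"model for {language}".encode("utf-8")
```

Calling `model_path("Slovenian")` twice on the same store triggers a single fetch, and a language that is never requested never gets a file in the cache directory.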
Branch updated: 6efd69c to d64ed16
@PrimozGodec This works well for me. Would you mind taking a look at the code and, if everything looks fine, merging?
The code looks fine to me.
Issue
Partially implements #298.
Description of changes
Add UDPipe Lemmatizer to preprocessing.
Includes