Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH][WIP] Add UDPipe Lemmatizer #367

Merged
merged 7 commits into from
Sep 12, 2018

Conversation

robertcv
Copy link
Collaborator

@robertcv robertcv commented Aug 3, 2018

Issue

Partially implements #298.

Description of changes

Add UDPipe Lemmatizer to preprocessing.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io
Copy link

codecov-io commented Aug 3, 2018

Codecov Report

Merging #367 into master will increase coverage by 0.11%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master     #367      +/-   ##
==========================================
+ Coverage   85.32%   85.44%   +0.11%     
==========================================
  Files          34       34              
  Lines        1881     1951      +70     
  Branches      337      344       +7     
==========================================
+ Hits         1605     1667      +62     
- Misses        237      243       +6     
- Partials       39       41       +2

@ajdapretnar
Copy link
Collaborator

This looks really good so far. Once the pipeline is agreed upon, I would in this batch also introduce other languages. I really like the fact that resources are downloaded upon the selection of a language, which means if I don't ever use, say Coptic, it won't be downloaded locally.

Please check slo-opinion-lexicon document 7 with UDPipe tokenization. I don't understand why this tokenization considers fullstops at the end as a part of the token. Npr. "narediti.". In document 4, for example, it doesn't do that. Could it be that sentence parsing isn't working well?

Before merging it would be nice to:

  • add tests (I am sure this was already planned)
  • update documentation! (I will do this part)
  • check saving the report (tokenizer is not overridden when UDPipe is selected)

@ajdapretnar ajdapretnar requested a review from nikicc August 6, 2018 08:39
@robertcv robertcv force-pushed the enh/udpipe_lemmatizer branch 2 times, most recently from ab07afe to 6efd69c Compare September 4, 2018 05:21
@robertcv robertcv force-pushed the enh/udpipe_lemmatizer branch from 6efd69c to d64ed16 Compare September 4, 2018 06:27
@ajdapretnar
Copy link
Collaborator

@PrimozGodec This works well for me. Would you mind giving a look at the code and if everything looks fine, merge?

Copy link
Collaborator

@PrimozGodec PrimozGodec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine to me.

@ajdapretnar ajdapretnar merged commit 400859d into biolab:master Sep 12, 2018
@robertcv robertcv deleted the enh/udpipe_lemmatizer branch November 19, 2019 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants