[ENH][WIP] Add UDPipe Lemmatizer #367

robertcv · 2018-08-03T18:03:09Z

Issue

Partially implements #298.

Description of changes

Add UDPipe Lemmatizer to preprocessing.

Includes

Code changes
Tests
Documentation

codecov-io · 2018-08-03T18:14:26Z

Codecov Report

Merging #367 into master will increase coverage by 0.11%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master     #367      +/-   ##
==========================================
+ Coverage   85.32%   85.44%   +0.11%     
==========================================
  Files          34       34              
  Lines        1881     1951      +70     
  Branches      337      344       +7     
==========================================
+ Hits         1605     1667      +62     
- Misses        237      243       +6     
- Partials       39       41       +2

ajdapretnar · 2018-08-06T08:38:02Z

This looks really good so far. Once the pipeline is agreed upon, I would also introduce other languages in this batch. I really like the fact that resources are downloaded only when a language is selected, which means that if I never use, say, Coptic, it won't be downloaded locally.
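
The download-on-selection behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: `LazyModelStore`, `fake_fetch`, and the `.udpipe` file naming are hypothetical names chosen for the example, and the real downloader is replaced by a stub so the sketch runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


class LazyModelStore:
    """Fetch a language model only the first time that language is requested."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: Path):
        # `fetch` is a hypothetical downloader: language name -> model bytes.
        self._fetch = fetch
        self._cache_dir = cache_dir

    def path_for(self, language: str) -> Path:
        # Assumed naming scheme for cached files; not taken from the PR.
        target = self._cache_dir / f"{language.lower()}.udpipe"
        if not target.exists():
            # First selection of this language: download and cache the model.
            self._cache_dir.mkdir(parents=True, exist_ok=True)
            target.write_bytes(self._fetch(language))
        return target


# Stub downloader that records which languages were actually fetched.
calls: List[str] = []


def fake_fetch(language: str) -> bytes:
    calls.append(language)
    return b"model-bytes"


store = LazyModelStore(fake_fetch, Path(tempfile.mkdtemp()))
first = store.path_for("Slovenian")
second = store.path_for("Slovenian")  # served from the local cache
```

With this stub, selecting the same language twice triggers a single fetch, and a language that is never selected is never fetched.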

Please check document 7 of slo-opinion-lexicon with UDPipe tokenization. I don't understand why this tokenization treats the full stop at the end as part of the token, e.g. "narediti.". In document 4, for example, it doesn't do that. Could it be that sentence parsing isn't working well?
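
One quick way to spot this in tokenizer output is to list the tokens that still carry a trailing full stop. The helper below is a hypothetical check written for this comment, not part of the PR; it works on any list of token strings and will also flag genuine abbreviations such as "itd.", so its output needs a manual look.

```python
from typing import List


def tokens_with_trailing_stop(tokens: List[str]) -> List[str]:
    """Return tokens longer than one character that end with a full stop."""
    return [t for t in tokens if len(t) > 1 and t.endswith(".")]


# Hand-written token list for illustration; not real UDPipe output.
sample = ["treba", "je", "narediti.", ".", "itd."]
suspicious = tokens_with_trailing_stop(sample)
```

A standalone "." is ignored because that is how a correctly split sentence-final full stop would appear.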

Before merging it would be nice to:

add tests (I am sure this was already planned)
update documentation! (I will do this part)
check saving the report (tokenizer is not overridden when UDPipe is selected)

In addition to the code changes, all UDPipe models were added to the server.

ajdapretnar · 2018-09-06T10:01:12Z

@PrimozGodec This works well for me. Would you mind taking a look at the code and, if everything looks fine, merging?

PrimozGodec

The code looks fine to me.

ajdapretnar requested a review from nikicc August 6, 2018 08:39

robertcv force-pushed the enh/udpipe_lemmatizer branch 2 times, most recently from ab07afe to 6efd69c Compare September 4, 2018 05:21

robertcv added 6 commits September 4, 2018 08:25

preprocess: add UDPipe lemmatizer

776233a

owpreprocess: add UDPipe lemmatizer to widget

fd6574c

owpreprocess: load saved normalization settings

18e1e86

preprocess: add support for all UDPipe models

66aa493

In addition to the code changes, all UDPipe models were added to the server.

tests: add udpipe lemmatizer tests

a4e2daf

preprocess: add udpipe tokenizer to report

d64ed16

robertcv force-pushed the enh/udpipe_lemmatizer branch from 6efd69c to d64ed16 Compare September 4, 2018 06:27

preprocess: load UDPipe model on normalization

87a164b

ajdapretnar requested a review from PrimozGodec September 6, 2018 08:05

ajdapretnar approved these changes Sep 6, 2018

PrimozGodec approved these changes Sep 12, 2018

ajdapretnar merged commit 400859d into biolab:master Sep 12, 2018

robertcv deleted the enh/udpipe_lemmatizer branch November 19, 2019 11:37
