
[FIX] preprocess: Use default tokenizer when None #294

Merged (2 commits) on Aug 2, 2017

Conversation

lanzagar (Contributor)
Issue

A preprocessor can be constructed with tokenizer=None (this is actually the default!).
In that case, preprocessing produces a single token containing the complete text. This can cause many problems and is almost certainly never desired or useful.

Description of changes
  1. If tokenizer=None, fall back to the same default tokenizer used by base_preprocessor, which is applied when Corpus.tokens is accessed without prior explicit preprocessing.
  2. Move base_preprocessor into the preprocess.py module, right after the Preprocess class (of which it is an instance).
Includes
  • Code changes
  • Tests
  • Documentation
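The fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Orange3-Text code: the Preprocessor class and the stand-in WordPunctTokenizer (which imitates the behavior of NLTK's tokenizer of the same name with a simple regex) are assumptions made for the example.

```python
import re


class WordPunctTokenizer:
    """Stand-in default tokenizer: splits text into word and punctuation runs."""
    def tokenize(self, text):
        return re.findall(r"\w+|[^\w\s]+", text)


# Shared default instance, analogous to the PR's base_preprocessor tokenizer.
BASE_TOKENIZER = WordPunctTokenizer()


class Preprocessor:
    def __init__(self, tokenizer=None):
        # The fix: fall back to the shared default instead of keeping None,
        # which previously yielded a single token holding the whole text.
        self.tokenizer = tokenizer if tokenizer is not None else BASE_TOKENIZER

    def __call__(self, text):
        return self.tokenizer.tokenize(text)


p = Preprocessor()  # constructed with tokenizer=None
print(p("Hello, world!"))  # ['Hello', ',', 'world', '!']
```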

codecov-io commented Jul 31, 2017

Codecov Report

Merging #294 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
+ Coverage   85.02%   85.04%   +0.01%     
==========================================
  Files          34       34              
  Lines        1823     1825       +2     
  Branches      333      333              
==========================================
+ Hits         1550     1552       +2     
  Misses        238      238              
  Partials       35       35

nikicc (Contributor) left a comment


Should we put this into the documentation also? Something like tokenizer (BaseTokenizer or None): tokenizes string, uses WordPunctTokenizer when None is given?
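A hypothetical sketch of what such a docstring entry could look like, following the reviewer's suggested wording (the class name and docstring layout are assumptions based on this thread, not the merged code):

```python
class Preprocessor:
    """Holds and applies a sequence of text preprocessing steps.

    Attributes:
        tokenizer (BaseTokenizer or None): Tokenizes strings;
            uses WordPunctTokenizer when None is given.
    """
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer
```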

@nikicc nikicc merged commit 1639182 into biolab:master Aug 2, 2017