
[FIX] preprocess: Use default tokenizer when None #294

Merged (2 commits) on Aug 2, 2017

Conversation

lanzagar (Contributor)
Issue

A preprocessor can be constructed with tokenizer=None (this is actually the default!).
In that case, preprocessing produces a single token containing the complete text. This can cause many problems and is almost certainly never desired or useful.

Description of changes
  1. If tokenizer=None, fall back to the same default tokenizer used by base_preprocessor, which is applied when Corpus.tokens is accessed without prior explicit preprocessing.
  2. Move base_preprocessor into the preprocess.py module, right after the Preprocess class (of which it is an instance).
Includes
  • Code changes
  • Tests
  • Documentation
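The fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Orange3-Text code: the Preprocessor class and the stand-in WordPunctTokenizer (which imitates the behavior of NLTK's tokenizer of the same name with a simple regex) are assumptions made for the example.

```python
import re


class WordPunctTokenizer:
    """Stand-in default tokenizer: splits text into word and punctuation runs."""
    def tokenize(self, text):
        return re.findall(r"\w+|[^\w\s]+", text)


# Shared default instance, analogous to the PR's base_preprocessor tokenizer.
BASE_TOKENIZER = WordPunctTokenizer()


class Preprocessor:
    def __init__(self, tokenizer=None):
        # The fix: fall back to the shared default instead of keeping None,
        # which previously yielded a single token holding the whole text.
        self.tokenizer = tokenizer if tokenizer is not None else BASE_TOKENIZER

    def __call__(self, text):
        return self.tokenizer.tokenize(text)


p = Preprocessor()  # constructed with tokenizer=None
print(p("Hello, world!"))  # ['Hello', ',', 'world', '!']
```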

codecov-io commented Jul 31, 2017

Codecov Report

Merging #294 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
+ Coverage   85.02%   85.04%   +0.01%     
==========================================
  Files          34       34              
  Lines        1823     1825       +2     
  Branches      333      333              
==========================================
+ Hits         1550     1552       +2     
  Misses        238      238              
  Partials       35       35

nikicc (Contributor) left a comment


Should we put this into the documentation also? Something like tokenizer (BaseTokenizer or None): tokenizes string, uses WordPunctTokenizer when None is given?
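A hypothetical sketch of what such a docstring entry could look like, following the reviewer's suggested wording (the class name and docstring layout are assumptions based on this thread, not the merged code):

```python
class Preprocessor:
    """Holds and applies a sequence of text preprocessing steps.

    Attributes:
        tokenizer (BaseTokenizer or None): Tokenizes strings;
            uses WordPunctTokenizer when None is given.
    """
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer
```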

@nikicc nikicc merged commit 1639182 into biolab:master Aug 2, 2017