-
-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Refactor preprocessors #506
Conversation
This looks real good! 🎉 I have two comments.
Stack trace:
The rest - so far, so good. I haven't tried anything super exotic, but the basic functionality is well-supported. |
Oh, one more thing. The unchecked 'auto apply' box says Send Selection. Probably not a good option. 😆 Perhaps Apply or something? |
Should it? It probably should. The preview displays tokens. Maybe we should consider a different implementation of n-grams - to store them within tokens. |
They are visible in Corpus Viewer (Show tokens and tags). So I guess preview should show them, too? |
Agree. |
Scrap that bug, it is present in master, too. |
Regrading n-grams I agree that now when preprocessing works different (custom preprocessor order) n-grams should be stored as tokens. We discussed this option while ago already. I think it could be changed but we need to check a widget that uses |
After some painstaking checks, I noticed that with this PR compute values are not considered correctly. NGrams and POStags don't work either, but these didn't work on master anyway. Both I believe for obvious reasons - they are not in Dictionary, which is the key to building compute_value. I think it is super important to handle domain transformations properly, but this is still a bit confusing to me. We should also check, not only vectorizers (which should implement some version of compute_value anyway), but also other 'embedders', i.e. Sentiment Analysis, Topic Modeling, Tweet Profiler??). |
Thanks for the review. I will look into the compute values in more detail. |
Perhaps (most likely), this is beyond the scope of this PR. This PR (or perhaps a later update) should consider a different Dictionary. We have to separately decide how to handle all the options for predictions in Text. |
The auto apply box will be fixed in biolab/orange3#4605 |
a528d3f
to
3ca8db1
Compare
Codecov Report
@@ Coverage Diff @@
## master #506 +/- ##
==========================================
+ Coverage 63.98% 64.56% +0.58%
==========================================
Files 60 60
Lines 6341 6463 +122
Branches 830 857 +27
==========================================
+ Hits 4057 4173 +116
Misses 2149 2149
- Partials 135 141 +6 |
@VesnaT Don't rebase just yet. I am trying to bring in other PRs before this one, so we can make a release with the easier widgets, then merge this one, since it is a huge change. |
fcd5426
to
0651721
Compare
Is this one finished? Is this already implemented?
|
I think so. I had a couple of minor comments, but nothing major so far. I was testing a bit with the changes and it seemed to work. :) |
StanfordPOSTagger has not been implemented. |
Issue
Implements #210
Description of changes
TODO - implement StanfordPOSTagger
Includes