[ENH] Refactor preprocessors #506

VesnaT · 2020-03-19T19:21:04Z

Issue

Implements #210

Description of changes

TODO - implement StanfordPOSTagger

Includes

Code changes
Tests
Documentation

ajdapretnar · 2020-03-30T08:53:51Z

This looks real good! 🎉

I have two comments.

Bug in Heat Map. Corpus (bookexcerpts) - Preprocess (default) - Bag of Words - Heat Map (use Clustering).

Stack trace:

Traceback (most recent call last): File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 281, in _ self.set_row_clustering(cb.itemData(idx, ClusteringRole)) File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 434, in set_row_clustering self.__update_row_clustering() File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 927, in __update_row_clustering self.update_heatmaps() File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 588, in update_heatmaps parts = self.construct_heatmaps(self.data, self.split_by_var) File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 742, in construct_heatmaps ordered=self.row_clustering == Clustering.OrderedClustering File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 642, in cluster_rows subset = data[row.indices] File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 444, in __getitem__ Corpus.retain_preprocessing(self, c, key) File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 508, in retain_preprocessing raise TypeError('Indexing by type {} not supported.'.format(type(key)))TypeError: Indexing by type <class 'range'> not supported.

Preview, which is such a fantastic feature btw, doesn't update on NGrams or POSTags. I assume it should?

The rest - so far, so good. I haven't tried anything super exotic, but the basic functionality is well-supported.

ajdapretnar · 2020-03-30T08:58:09Z

Oh, one more thing. The unchecked 'auto apply' box says Send Selection. Probably not a good option. 😆 Perhaps Apply or something?

VesnaT · 2020-03-30T09:02:39Z

Preview, which is such a fantastic feature btw, doesn't update on NGrams or POSTags. I assume it should?

Should it? It probably should. The preview displays tokens. Maybe we should consider a different implementation of n-grams - to store them within tokens.

ajdapretnar · 2020-03-30T09:06:18Z

They are visible in Corpus Viewer (Show tokens and tags). So I guess preview should show them, too?

VesnaT · 2020-03-30T09:11:43Z

Agree.
But there are several problems with the current implementation of n-grams. For instance, if one places the n-grams preprocessor before filtering, he expects to filter n-grams. The current implementation would filter tokens and apply n-grams on top of it.

ajdapretnar · 2020-03-30T09:47:58Z

Scrap that bug, it is present in master, too.

PrimozGodec · 2020-03-30T10:40:53Z

Regrading n-grams I agree that now when preprocessing works different (custom preprocessor order) n-grams should be stored as tokens. We discussed this option while ago already. I think it could be changed but we need to check a widget that uses tokens whether they can work with n-grams.
This, in my opinion, should not be part of this PR.

ajdapretnar · 2020-03-30T13:55:33Z

After some painstaking checks, I noticed that with this PR compute values are not considered correctly.
on master: If I remove stopwords, they are removed in the 'prediction data' as well.
on branch: Stopwords are not removed.

NGrams and POStags don't work either, but these didn't work on master anyway. Both I believe for obvious reasons - they are not in Dictionary, which is the key to building compute_value.

I think it is super important to handle domain transformations properly, but this is still a bit confusing to me. We should also check, not only vectorizers (which should implement some version of compute_value anyway), but also other 'embedders', i.e. Sentiment Analysis, Topic Modeling, Tweet Profiler??).

VesnaT · 2020-03-30T14:28:37Z

Thanks for the review. I will look into the compute values in more detail.

ajdapretnar · 2020-03-30T14:41:38Z

Perhaps (most likely), this is beyond the scope of this PR. This PR (or perhaps a later update) should consider a different Dictionary. We have to separately decide how to handle all the options for predictions in Text.

VesnaT · 2020-04-01T06:36:23Z

The auto apply box will be fixed in biolab/orange3#4605

codecov-io · 2020-04-20T11:47:43Z

Codecov Report

Merging #506 into master will increase coverage by 0.58%.
The diff coverage is 88.53%.

@@            Coverage Diff             @@
##           master     #506      +/-   ##
==========================================
+ Coverage   63.98%   64.56%   +0.58%     
==========================================
  Files          60       60              
  Lines        6341     6463     +122     
  Branches      830      857      +27     
==========================================
+ Hits         4057     4173     +116     
  Misses       2149     2149              
- Partials      135      141       +6

ajdapretnar · 2020-04-22T08:49:05Z

@VesnaT Don't rebase just yet. I am trying to bring in other PRs before this one, so we can make a release with the easier widgets, then merge this one, since it is a huge change.

PrimozGodec · 2020-05-19T12:55:56Z

Is this one finished?

Is this already implemented?

TODO - implement StanfordPOSTagger

ajdapretnar · 2020-05-19T14:12:29Z

I think so. I had a couple of minor comments, but nothing major so far. I was testing a bit with the changes and it seemed to work. :)

VesnaT · 2020-05-28T06:52:16Z

StanfordPOSTagger has not been implemented.
We agreed to upload files to a server (instead of downloading them locally - if i remember correctly), so this needs to be done first.

VesnaT force-pushed the owpp branch from d410a53 to 7bd2dc7 Compare March 23, 2020 08:12

VesnaT changed the title ~~[WIP] Refactor preprocessors~~ [ENH] Refactor preprocessors Mar 23, 2020

VesnaT force-pushed the owpp branch 2 times, most recently from a528d3f to 3ca8db1 Compare April 20, 2020 11:40

VesnaT force-pushed the owpp branch 2 times, most recently from fcd5426 to 0651721 Compare May 4, 2020 08:37

VesnaT added 3 commits May 4, 2020 11:26

Refactor preprocessors

695c4d4

Preprocess: Show n-grams in preview

6103102

Preprocess: Save used preprocessors for compute_value

d1755b0

VesnaT force-pushed the owpp branch from 0651721 to d1755b0 Compare May 4, 2020 09:27

PrimozGodec mentioned this pull request May 4, 2020

text processing module with slow internet conection #519

Closed

ajdapretnar merged commit da59f58 into biolab:master May 28, 2020

ajdapretnar mentioned this pull request May 28, 2020

Preprocess Text: Refactor #210

Closed

PrimozGodec mentioned this pull request Nov 6, 2023

POS - remove unsupported and unused StanfordPOSTagger #1021

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Refactor preprocessors #506

[ENH] Refactor preprocessors #506

VesnaT commented Mar 19, 2020

ajdapretnar commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020 •

edited

Loading

ajdapretnar commented Mar 30, 2020

PrimozGodec commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Apr 1, 2020 •

edited

Loading

codecov-io commented Apr 20, 2020

ajdapretnar commented Apr 22, 2020

PrimozGodec commented May 19, 2020 •

edited

Loading

ajdapretnar commented May 19, 2020

VesnaT commented May 28, 2020

[ENH] Refactor preprocessors #506

[ENH] Refactor preprocessors #506

Conversation

VesnaT commented Mar 19, 2020

Issue

Description of changes

Includes

ajdapretnar commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020 • edited Loading

ajdapretnar commented Mar 30, 2020

PrimozGodec commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Mar 30, 2020

ajdapretnar commented Mar 30, 2020

VesnaT commented Apr 1, 2020 • edited Loading

codecov-io commented Apr 20, 2020

Codecov Report

ajdapretnar commented Apr 22, 2020

PrimozGodec commented May 19, 2020 • edited Loading

ajdapretnar commented May 19, 2020

VesnaT commented May 28, 2020

VesnaT commented Mar 30, 2020 •

edited

Loading

VesnaT commented Apr 1, 2020 •

edited

Loading

PrimozGodec commented May 19, 2020 •

edited

Loading