Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Refactor preprocessors #506

Merged
merged 3 commits into from
May 28, 2020
Merged

[ENH] Refactor preprocessors #506

merged 3 commits into from
May 28, 2020

Conversation

VesnaT
Copy link
Contributor

@VesnaT VesnaT commented Mar 19, 2020

Issue

Implements #210

Description of changes

TODO - implement StanfordPOSTagger

Includes
  • Code changes
  • Tests
  • Documentation

@VesnaT VesnaT changed the title [WIP] Refactor preprocessors [ENH] Refactor preprocessors Mar 23, 2020
@ajdapretnar
Copy link
Collaborator

This looks real good! 🎉

I have two comments.

  1. Bug in Heat Map. Corpus (bookexcerpts) - Preprocess (default) - Bag of Words - Heat Map (use Clustering).

Stack trace:

Traceback (most recent call last):  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 281, in _    self.set_row_clustering(cb.itemData(idx, ClusteringRole))  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 434, in set_row_clustering    self.__update_row_clustering()  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 927, in __update_row_clustering    self.update_heatmaps()  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 588, in update_heatmaps    parts = self.construct_heatmaps(self.data, self.split_by_var)  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 742, in construct_heatmaps    ordered=self.row_clustering == Clustering.OrderedClustering  File "/Users/ajda/orange/orange3/Orange/widgets/visualize/owheatmap.py", line 642, in cluster_rows    subset = data[row.indices]  File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 444, in __getitem__    Corpus.retain_preprocessing(self, c, key)  File "/Users/ajda/orange/orange3-text/orangecontrib/text/corpus.py", line 508, in retain_preprocessing    raise TypeError('Indexing by type {} not supported.'.format(type(key)))TypeError: Indexing by type <class 'range'> not supported.

  1. Preview, which is such a fantastic feature btw, doesn't update on NGrams or POSTags. I assume it should?

The rest - so far, so good. I haven't tried anything super exotic, but the basic functionality is well-supported.

@ajdapretnar
Copy link
Collaborator

Oh, one more thing. The unchecked 'auto apply' box says Send Selection. Probably not a good option. 😆 Perhaps Apply or something?

@VesnaT
Copy link
Contributor Author

VesnaT commented Mar 30, 2020

Preview, which is such a fantastic feature btw, doesn't update on NGrams or POSTags. I assume it should?

Should it? It probably should. The preview displays tokens. Maybe we should consider a different implementation of n-grams - to store them within tokens.

@ajdapretnar
Copy link
Collaborator

They are visible in Corpus Viewer (Show tokens and tags). So I guess preview should show them, too?

@VesnaT
Copy link
Contributor Author

VesnaT commented Mar 30, 2020

Agree.
But there are several problems with the current implementation of n-grams. For instance, if one places the n-grams preprocessor before filtering, he expects to filter n-grams. The current implementation would filter tokens and apply n-grams on top of it.

@ajdapretnar
Copy link
Collaborator

Scrap that bug, it is present in master, too.

@PrimozGodec
Copy link
Collaborator

Regrading n-grams I agree that now when preprocessing works different (custom preprocessor order) n-grams should be stored as tokens. We discussed this option while ago already. I think it could be changed but we need to check a widget that uses tokens whether they can work with n-grams.
This, in my opinion, should not be part of this PR.

@ajdapretnar
Copy link
Collaborator

After some painstaking checks, I noticed that with this PR compute values are not considered correctly.
on master: If I remove stopwords, they are removed in the 'prediction data' as well.
on branch: Stopwords are not removed.

NGrams and POStags don't work either, but these didn't work on master anyway. Both I believe for obvious reasons - they are not in Dictionary, which is the key to building compute_value.

I think it is super important to handle domain transformations properly, but this is still a bit confusing to me. We should also check, not only vectorizers (which should implement some version of compute_value anyway), but also other 'embedders', i.e. Sentiment Analysis, Topic Modeling, Tweet Profiler??).

@VesnaT
Copy link
Contributor Author

VesnaT commented Mar 30, 2020

Thanks for the review. I will look into the compute values in more detail.

@ajdapretnar
Copy link
Collaborator

Perhaps (most likely), this is beyond the scope of this PR. This PR (or perhaps a later update) should consider a different Dictionary. We have to separately decide how to handle all the options for predictions in Text.

@VesnaT
Copy link
Contributor Author

VesnaT commented Apr 1, 2020

The auto apply box will be fixed in biolab/orange3#4605

@VesnaT VesnaT force-pushed the owpp branch 2 times, most recently from a528d3f to 3ca8db1 Compare April 20, 2020 11:40
@codecov-io
Copy link

Codecov Report

Merging #506 into master will increase coverage by 0.58%.
The diff coverage is 88.53%.

@@            Coverage Diff             @@
##           master     #506      +/-   ##
==========================================
+ Coverage   63.98%   64.56%   +0.58%     
==========================================
  Files          60       60              
  Lines        6341     6463     +122     
  Branches      830      857      +27     
==========================================
+ Hits         4057     4173     +116     
  Misses       2149     2149              
- Partials      135      141       +6     

@ajdapretnar
Copy link
Collaborator

@VesnaT Don't rebase just yet. I am trying to bring in other PRs before this one, so we can make a release with the easier widgets, then merge this one, since it is a huge change.

@VesnaT VesnaT force-pushed the owpp branch 2 times, most recently from fcd5426 to 0651721 Compare May 4, 2020 08:37
@PrimozGodec
Copy link
Collaborator

PrimozGodec commented May 19, 2020

Is this one finished?

Is this already implemented?

TODO - implement StanfordPOSTagger

@ajdapretnar
Copy link
Collaborator

I think so. I had a couple of minor comments, but nothing major so far. I was testing a bit with the changes and it seemed to work. :)

@VesnaT
Copy link
Contributor Author

VesnaT commented May 28, 2020

StanfordPOSTagger has not been implemented.
We agreed to upload files to a server (instead of downloading them locally - if i remember correctly), so this needs to be done first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants