[ENH] Statistics widget #503

Merged
merged 3 commits into from
Apr 22, 2020
2 changes: 1 addition & 1 deletion .travis.yml
@@ -28,7 +28,7 @@ matrix:
env: ORANGE="master"
- &orange3-21-0
python: '3.7'
-env: ORANGE="3.21.0"
+env: ORANGE="3.24.0"

env:
global:
1 change: 1 addition & 0 deletions doc/index.rst
@@ -26,6 +26,7 @@ Widgets
widgets/docmap
widgets/wordenrichment
widgets/duplicatedetection
widgets/statistics

Scripting
---------
7 changes: 7 additions & 0 deletions doc/widgets.json
@@ -146,6 +146,13 @@
"icon": "../orangecontrib/text/widgets/icons/Duplicates.svg",
"background": "light-blue",
"keywords": []
},
{
"text": "Statistics",
"doc": "widgets/statistics.md",
"icon": "../orangecontrib/text/widgets/icons/Statistics.svg",
"background": "light-blue",
"keywords": []
}
]
]
Binary file added doc/widgets/images/statistics-example.png
Binary file added doc/widgets/images/statistics-stamped.png
44 changes: 44 additions & 0 deletions doc/widgets/statistics.md
@@ -0,0 +1,44 @@
Statistics
==========

Create new statistic variables for documents.

**Inputs**

- Corpus: A collection of documents.

**Outputs**

- Corpus: Corpus with additional attributes.

**Statistics** is a feature constructor widget that adds simple document statistics to a corpus. It supports both standard statistical measures and user-defined variables.

![](images/statistics-stamped.png)

1. Add or remove features. Features can be added with the + sign below. They can be removed with the x sign on the left side. Feature options are:
- Words count: number of words in the document.
- Characters count: number of characters in the document.
- N-grams count: number of n-grams. Define n-grams in [Preprocess Text](preprocesstext.md); otherwise, only unigrams are counted.
- Average word length: ratio of the character count to the number of words.
- Punctuations count: number of punctuation characters.
- Capitals count: number of capital letters.
- Vowels count: number of vowels. The default is 'a, e, i, o, u', but users can specify their own.
- Consonants count: number of consonants. A default set is provided, but users can adjust it.
- Per cent unique words: ratio of unique words to all words (types/tokens).
- Starts with: number of times a token begins with the specified sequence.
- Ends with: number of times a token ends with the specified sequence.
- Contains: number of times a specified sequence is in the token.
- Regex: number of times the provided regular expression matches the token.
- POS tag: number of occurrences of the specified POS tags. Requires POS-tagged tokens from [Preprocess Text](preprocesstext.md). A list of Penn Treebank POS tags for English can be found [here](https://courses.washington.edu/hypertxt/csar-v02/penntable.html).

2. Press Apply to output the corpus with the new features.
3. Status line with help on the left and input and output on the right.
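
The feature options above can be approximated in plain Python. The following is a minimal sketch, not the widget's implementation: the function names `document_statistics` and `token_pattern_counts` are hypothetical, tokens are obtained by naive whitespace splitting (the widget works on the corpus tokens instead), the character count ignores whitespace, and the pattern features count tokens with at least one match. All of these are assumptions made for illustration.

```python
import re
import string


def document_statistics(text, vowels="aeiou"):
    """Compute a few per-document statistics (illustrative sketch only)."""
    # Assumption: naive whitespace tokenization stands in for corpus tokens.
    tokens = text.split()
    n_words = len(tokens)
    # Assumption: characters are counted inside tokens, excluding whitespace.
    n_chars = sum(len(token) for token in tokens)
    n_unique = len({token.lower() for token in tokens})
    return {
        "word_count": n_words,
        "character_count": n_chars,
        "average_word_length": n_chars / n_words if n_words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "capital_count": sum(ch.isupper() for ch in text),
        "vowel_count": sum(ch in vowels for ch in text.lower()),
        "percent_unique_words": n_unique / n_words if n_words else 0.0,
    }


def token_pattern_counts(tokens, prefix, suffix, substring, pattern):
    """Count tokens matching each pattern-based feature (case-sensitive)."""
    regex = re.compile(pattern)
    return {
        "starts_with": sum(token.startswith(prefix) for token in tokens),
        "ends_with": sum(token.endswith(suffix) for token in tokens),
        # Assumption: a token counts once even if the sequence repeats in it.
        "contains": sum(substring in token for token in tokens),
        "regex": sum(bool(regex.search(token)) for token in tokens),
    }
```

For example, `document_statistics("Orange is an orange fruit.")` reports five words, of which four are unique once lowercased.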

Example
-------

Here is a simple example of how the **Statistics** widget works. As it is a basic feature construction widget, it can be used directly after [Corpus](corpus-widget.md). We have added a couple of features, namely word count, character count, percent unique words, and the number of words containing 'oran'. We can inspect the resulting table with the additional columns in a [Data Table](https://orange-visual-programming.readthedocs.io/widgets/data/datatable.html).

We can also use the output of Statistics for predictive modeling with [Test and Score](https://orange-visual-programming.readthedocs.io/widgets/evaluate/testandscore.html). Normally, however, we would use Statistics only to enhance features from the [Bag of Words](bagofwords-widget.md) widget. Some features require POS-tagged tokens, which can be created with the [Preprocess Text](preprocesstext.md) widget.

![](images/statistics-example.png)
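
The idea of enhancing bag-of-words features with document statistics can be sketched in plain Python. This is only an illustration under stated assumptions: `bag_of_words` and `add_statistics` are hypothetical helpers, tokenization is a lowercase whitespace split, and none of this reflects how the Orange widgets build their tables.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, rows); each row holds per-document term counts."""
    # Assumption: lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def add_statistics(documents, vocabulary, rows):
    """Append word count and character count columns to the term counts."""
    names = vocabulary + ["word_count", "character_count"]
    enhanced = [
        row + [len(doc.split()), len(doc)]
        for doc, row in zip(documents, rows)
    ]
    return names, enhanced
```

Each enhanced row keeps the term counts and gains two extra columns, so a learner sees both kinds of features side by side.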
99 changes: 99 additions & 0 deletions orangecontrib/text/widgets/icons/Statistics.svg