Statistics: documentation

biolab · Apr 17, 2020 · 69d696c · 69d696c
1 parent 5221714
commit 69d696c
Show file tree

Hide file tree

Showing 5 changed files with 52 additions and 0 deletions.
diff --git a/doc/index.rst b/doc/index.rst
@@ -26,6 +26,7 @@ Widgets
    widgets/docmap
    widgets/wordenrichment
    widgets/duplicatedetection
+   widgets/statistics
 
 Scripting
 ---------

diff --git a/doc/widgets.json b/doc/widgets.json
@@ -146,6 +146,13 @@
     "icon": "../orangecontrib/text/widgets/icons/Duplicates.svg",
     "background": "light-blue",
     "keywords": []
+   },
+   {
+    "text": "Statistics",
+    "doc": "widgets/statistics.md",
+    "icon": "../orangecontrib/text/widgets/icons/Statistics.svg",
+    "background": "light-blue",
+    "keywords": []
    }
   ]
  ]

diff --git a/doc/widgets/images/statistics-example.png b/doc/widgets/images/statistics-example.png
diff --git a/doc/widgets/images/statistics-stamped.png b/doc/widgets/images/statistics-stamped.png
diff --git a/doc/widgets/statistics.md b/doc/widgets/statistics.md
@@ -0,0 +1,44 @@
+Statistics
+==========
+
+Create new statistic variables for documents.
+
+**Inputs**
+
+- Corpus: A collection of documents.
+
+**Outputs**
+
+- Corpus: Corpus with additional attributes.
+
+**Statistics** is a feature constructor widget that adds simple document statistics to a corpus. It supports both standard statistical measures and user-defined variables.
+
+![](images/statistics-stamped.png)
+
+1. Add or remove features. Features can be added with the + sign below. They can be removed with the x sign on the left side. Feature options are:
+   - Words count: number of words in the document.
+   - Characters count: number of characters in the document.
+   - N-grams count: number of n-grams. Define n-grams in [Preprocess Text], otherwise only unigrams will be reported.
+   - Average word length: ratio between character count and the number of words
+   - Punctuations count: number of punctuations
+   - Capitals count: number of capital letters
+   - Vowels count: number of vowels. The default is 'a, e, i, o, u', but the user can add her own.
+   - Consonants count: number of consonants. Default is given, but the user can adjust it.
+   - Per cent unique words: ratio of unique words to all the words (types/tokens).
+   - Starts with: number of times a token begins with the specified sequence.
+   - Ends with: number of times a token ends with the specified sequence.
+   - Contains: number of times a specified sequence is in the token.
+   - Regex: number of times the provided regular expression matches the token.
+   - POS tag: count specified POS tags. Requires POS tagged tokens from [Preprocess Text](preprocesstext.md). List of Tree POS tags for English can be found [here](https://courses.washington.edu/hypertxt/csar-v02/penntable.html).
+
+2. Press Apply to output corpus with new features.
+3. Status line with help on the left and input and output on the right.
+
+Example
+-------
+
+Here is a simple example how **Statistics** widget works. As it is a basic feature construction widget, it can be used directly after [Corpus](corpus-widget.md). We have added a couple of features, namely word count, character count, percent unique words and number of words containing 'oran'. We can observe the table with additional columns in a [Data Table](https://orange-visual-programming.readthedocs.io/widgets/data/datatable.html).
+
+We can also use the output of Statistics for predictive modeling with [Test and Score](https://orange-visual-programming.readthedocs.io/widgets/evaluate/testandscore.html). Normally, however, we would use Statistics only to enhance features from the [Bag of Words](bagofwords-widget.md) widget. Some features require POS tagged tokens, which can be created with [Preprocess Text](preprocesstext.md) widget.
+
+![](images/statistics-example.png)