Commit
Merge pull request #828 from ajdapretnar/new-widget-doc
Documentation for several widgets
ajdapretnar authored Apr 25, 2022
2 parents 1d9bbce + 9535acf commit f0e8c60
Showing 12 changed files with 120 additions and 5 deletions.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -33,6 +33,7 @@ Widgets
widgets/keywords
widgets/score-documents
widgets/semanticviewer
widgets/wordlist

Scripting
---------
12 changes: 11 additions & 1 deletion doc/widgets.json
@@ -42,6 +42,16 @@
"doc": "widgets/twitter-widget.md",
"icon": "../orangecontrib/text/widgets/icons/Twitter.svg",
"background": "light-blue",
"keywords": [
"twitter",
"tweet"
]
},
{
"text": "Wikipedia",
"doc": "widgets/wikipedia-widget.md",
"icon": "../orangecontrib/text/widgets/icons/Wikipedia.svg",
"background": "light-blue",
"keywords": []
},
{
@@ -176,7 +186,7 @@
},
{
"text": "Word List",
"doc": "widgets/wordlist.md",
"icon": "../orangecontrib/text/widgets/icons/WordList.svg",
"background": "light-blue",
"keywords": []
Binary file added doc/widgets/images/Extract-Keywords.png
Binary file added doc/widgets/images/Score-Documents-Example.png
Binary file modified doc/widgets/images/Score-Documents.png
Binary file added doc/widgets/images/Semantic-Viewer-Example.png
Binary file added doc/widgets/images/Semantic-Viewer.png
Binary file added doc/widgets/images/Word-List-Union.png
28 changes: 28 additions & 0 deletions doc/widgets/keywords.md
@@ -13,3 +13,31 @@ Infers characteristic words from the input corpus.
- Words: A table of words.

**Extract Keywords** infers characteristic words from the corpus.

![](images/Extract-Keywords.png)

1. Scoring methods for extracting words:
- TF-IDF, a method that scores by term frequency weighted by inverse document frequency. A word that is characteristic of a small number of documents will have a high TF-IDF score, while words that appear throughout the corpus will have a low score.
- [YAKE!](http://yake.inesctec.pt/), an unsupervised state-of-the-art method that works with texts of different sizes.
- [Rake](https://github.com/zelandiya/RAKE-tutorial), an unsupervised domain-independent method based around stopword delimiters.
- Embedding, a proprietary method that gives higher scores to words whose embeddings have the smallest cosine distance to most documents. The distance is computed on SBERT word embeddings.
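The TF-IDF method above can be sketched in a few lines; this is a minimal, illustrative implementation over a pre-tokenized corpus, not the widget's actual code, and `tfidf_keywords` is a hypothetical name.

```python
# Minimal sketch of TF-IDF keyword scoring over a tokenized corpus
# (list of token lists). Illustrative only, not the widget's code.
import math
from collections import Counter

def tfidf_keywords(corpus, top_n=5):
    n_docs = len(corpus)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in corpus for word in set(doc))
    scores = Counter()
    for doc in corpus:
        tf = Counter(doc)
        for word, count in tf.items():
            # Term frequency weighted by inverse document frequency.
            scores[word] += (count / len(doc)) * math.log(n_docs / df[word])
    return [word for word, _ in scores.most_common(top_n)]

docs = [["king", "queen", "castle"],
        ["said", "go", "said"],
        ["queen", "said", "go"]]
print(tfidf_keywords(docs, top_n=3))
```

Words confined to one document ("king", "castle") score high, while words spread across the corpus ("queen", "go") score low.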

Example
-------

In the example below, we use the *book-excerpts* corpus, which is available in the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget using the TF-IDF method. The widget returns a list of words, and we can select the top-ranked words to send to the output.

We can use these words in [Word List](wordlist.md), where we can edit them, add to them or remove them. Alternatively, we can send the candidate words directly to Semantic Viewer or Score Documents.

![](images/Semantic-Viewer-Example.png)

References
----------

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. *Information Sciences*, Elsevier, Vol. 509, pp. 257-289.

Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In *Text Mining* (eds M.W. Berry and J. Kogan). https://doi.org/10.1002/9780470689646.ch1
24 changes: 21 additions & 3 deletions doc/widgets/score-documents.md
@@ -20,6 +20,24 @@ Scores documents based on word appearance.
- **Word frequency**: The number of times the word appears in the document.
- **Word ratio**: Indicates whether the word appears in the document.
- **Similarity**: The cosine similarity between document embedding and word embedding.
2. Select the aggregation function that combines word scores into document scores (mean, median, min or max).
3. Select documents:
   - None: no documents are on the output.
   - All: the entire corpus is on the output.
   - Manual: manually select the documents from the table.
   - Top documents: the n top-scored documents are sent to the output.
4. If *Send Automatically* is checked, changes are communicated automatically. Alternatively, press *Send*.
5. Filter documents based on the document title in the first column. Below is the table with the document titles in the first column and scores in the other columns.
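The score-then-aggregate step can be sketched as follows; `score_documents`, the token-list input format and the default mean aggregation are illustrative assumptions, not the widget's actual API.

```python
# Sketch of word-frequency document scoring with mean aggregation,
# mirroring the options described above (illustrative, not the widget's code).
from statistics import mean

def score_documents(docs, words, aggregate=mean):
    """Score each tokenized document by how often each input word
    appears, then aggregate per-word counts into one document score."""
    scores = []
    for tokens in docs:
        word_counts = [tokens.count(w) for w in words]  # word frequency
        scores.append(aggregate(word_counts))
    return scores

docs = [["the", "queen", "and", "the", "king"],
        ["a", "man", "said", "go"]]
print(score_documents(docs, ["queen", "king"]))  # → [1, 0]
```

Swapping `aggregate` for `median`, `min` or `max` reproduces the other aggregation choices.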

Example
-------

Score Documents is used to find documents that are semantically similar to the input word list. In the example below, we are using the *book-excerpts* corpus from the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget and send them to [Word List](wordlist.md). There, we add some words of our own, such as princess, prince, king and queen.

Finally, we pass the preprocessed corpus from Preprocess Text and the word list from the Word List widget to Score Documents. Score Documents scores each document based on how frequently the input words appear in it.

![](images/Score-Documents-Example.png)
21 changes: 20 additions & 1 deletion doc/widgets/semanticviewer.md
@@ -14,4 +14,23 @@ Displays corpus semantics.
- Other Docs: Other documents.
- Corpus: A collection of documents.

**Semantic Viewer** is meant for viewing corpus semantics. The widget takes input words and finds documents or document passages containing these words.

![](images/Semantic-Viewer.png)

1. Filtering. Set the threshold above which text passages are colored. The threshold applies to the sentence-level score, computed as the maximum cosine similarity between the SBERT embedding of the sentence and the embeddings of the input words. Only sentences with a score above the threshold are colored.
2. Display either the entire document, the relevant section of the text or only the relevant sentence(s).
3. A list of matches, scores and documents. A match is the number of input words found in the document. The score is computed at the sentence level by taking the maximum cosine similarity between the SBERT embeddings of the sentence and the input keywords; it is then aggregated into the document score displayed in the list.
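The sentence-then-document scoring described above can be sketched like this; the random-looking unit vectors stand in for SBERT embeddings, and `document_score` plus the choice of `max` as the document-level aggregation are illustrative assumptions, not the widget's actual code.

```python
# Sketch of sentence-level scoring (max cosine similarity to any input
# word embedding) aggregated to a document score. Illustrative only.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def document_score(sentence_embeddings, word_embeddings):
    # Sentence score: max cosine similarity to any input word embedding.
    sentence_scores = [max(cosine(s, w) for w in word_embeddings)
                       for s in sentence_embeddings]
    # Aggregate sentence scores into one document score (max, as an
    # assumption; the widget may aggregate differently).
    return max(sentence_scores)
```

A sentence whose embedding points in the same direction as an input word's embedding gets a score of 1, and that sentence alone is enough to give the document a high score.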

Example
-------

In the example below, we use the *book-excerpts* corpus, which is available in the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget and send them to [Word List](wordlist.md). There, we add some words of our own, such as princess, prince, king and queen.

Finally, we pass the entire list of words to Semantic Viewer along with the corpus from Preprocess Text. The widget uses the input word list to find matching passages in each document. We can now see the parts of the text talking about princesses, queens, and so on.

![](images/Semantic-Viewer-Example.png)
39 changes: 39 additions & 0 deletions doc/widgets/wordlist.md
@@ -0,0 +1,39 @@
Word List
=========

Create a list of words.

**Inputs**

- Words: A table of words.

**Outputs**

- Selected Words: Words selected from the table.
- Words: A table of words.

**Word List** is meant for creating and joining lists of words for semantic analysis. The user can manually enter words into the widget or import them from other widgets, for example [Extract Keywords](keywords.md).

![](images/Word-List-Union.png)

1. Library of existing word lists. Add a new list with "+" or remove one with "-". Use "Update" to save the current list on the right as a word list in the widget. With "More" you can load an existing list in .txt format using the *Import Words from File* option or save the list locally using the *Save Words to File* option.
2. Input options:
- Word variable: set which string variable to use as a list of words.
- Update: define how to combine an existing list with the input Words. *Intersection* keeps only overlapping words, *Union* uses all the words, *Only input* ignores the list from the widget and uses only the input list, and *Ignore input* uses only the list from the widget.
3. Use *Filter* to find a word in the list. The list shows the words on the output. One can select a subset of words from the list. Use "+" to add a new word and "-" to remove it from the list. *Sort* sorts the list alphabetically.
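The four update modes amount to simple set operations on ordered, de-duplicated word lists; `update_words` below is an illustrative sketch, not the widget's actual API.

```python
# Sketch of the four update modes described above, treating word
# lists as ordered lists without duplicates. Illustrative only.
def update_words(library_words, input_words, mode="Union"):
    if mode == "Intersection":
        # Keep only words that appear in both lists.
        merged = [w for w in library_words if w in set(input_words)]
    elif mode == "Union":
        # Keep all words, library first, without duplicates.
        merged = library_words + [w for w in input_words
                                  if w not in set(library_words)]
    elif mode == "Only input":
        merged = list(input_words)
    elif mode == "Ignore input":
        merged = list(library_words)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return merged

print(update_words(["princess", "king"], ["king", "said"], "Union"))
# → ['princess', 'king', 'said']
```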

Example
-------

In this example we are using the pre-loaded *book-excerpts* corpus from the [Corpus](corpus-widget.md) widget. [Preprocess Text](preprocesstext.md) creates tokens by transforming the text to lowercase, splitting it into words, normalizing the words with Lemmagen lemmatizer and finally removing stopwords.

Then we pass the preprocessed data to [Extract Keywords](keywords.md), a widget that finds characteristic words in the corpus. We have used the default TF-IDF setting and passed the top 7 words (said, one, go, sara, man, look, little) to Word List.

In the Word List, we have previously defined some words that we would like to find in the text, namely princess, prince, king and queen. We used *Union* to keep both the list we defined manually and the one input from Extract Keywords.

Finally, we send the entire word list to [Semantic Viewer](semanticviewer.md) and add the *Corpus* output from Preprocess Text as well. Semantic Viewer now scores documents based on the input word list. The higher the score, the more matches the document has.

This is a nice way to find content of interest (say, princes and princesses) in a collection of texts.

![](images/Semantic-Viewer-Example.png)
