Commit
Merge pull request #828 from ajdapretnar/new-widget-doc
Documentation for several widgets
ajdapretnar authored Apr 25, 2022
2 parents 1d9bbce + 9535acf commit f0e8c60
Showing 12 changed files with 120 additions and 5 deletions.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -33,6 +33,7 @@ Widgets
widgets/keywords
widgets/score-documents
widgets/semanticviewer
widgets/wordlist

Scripting
---------
12 changes: 11 additions & 1 deletion doc/widgets.json
@@ -42,6 +42,16 @@
"doc": "widgets/twitter-widget.md",
"icon": "../orangecontrib/text/widgets/icons/Twitter.svg",
"background": "light-blue",
"keywords": [
"twitter",
"tweet"
]
},
{
"text": "Wikipedia",
"doc": "widgets/wikipedia-widget.md",
"icon": "../orangecontrib/text/widgets/icons/Wikipedia.svg",
"background": "light-blue",
"keywords": []
},
{
@@ -176,7 +186,7 @@
},
{
"text": "Word List",
"doc": "widgets/wordlist.md",
"icon": "../orangecontrib/text/widgets/icons/WordList.svg",
"background": "light-blue",
"keywords": []
Binary file added doc/widgets/images/Extract-Keywords.png
Binary file added doc/widgets/images/Score-Documents-Example.png
Binary file modified doc/widgets/images/Score-Documents.png
Binary file added doc/widgets/images/Semantic-Viewer-Example.png
Binary file added doc/widgets/images/Semantic-Viewer.png
Binary file added doc/widgets/images/Word-List-Union.png
28 changes: 28 additions & 0 deletions doc/widgets/keywords.md
@@ -13,3 +13,31 @@ Infers characteristic words from the input corpus.
- Words: A table of words.

**Extract Keywords** infers characteristic words from the corpus.

![](images/Extract-Keywords.png)

1. Scoring methods for extracting words:
- TF-IDF, a method that scores by term frequency weighted by inverse document frequency. A word that is characteristic of a small number of documents will have a high TF-IDF score, while words that appear throughout the corpus will have a low score.
- [YAKE!](http://yake.inesctec.pt/), an unsupervised state-of-the-art method that works with texts of different sizes.
- [Rake](https://github.com/zelandiya/RAKE-tutorial), an unsupervised domain-independent method based around stopword delimiters.
- Embedding, a proprietary method that gives higher scores to words whose embeddings have the smallest cosine distance to most documents. The distance is computed on SBERT word embeddings.
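The TF-IDF method above can be sketched in a few lines; this is a minimal, illustrative implementation over a pre-tokenized corpus, not the widget's actual code, and `tfidf_keywords` is a hypothetical name.

```python
# Minimal sketch of TF-IDF keyword scoring over a tokenized corpus
# (list of token lists). Illustrative only, not the widget's code.
import math
from collections import Counter

def tfidf_keywords(corpus, top_n=5):
    n_docs = len(corpus)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in corpus for word in set(doc))
    scores = Counter()
    for doc in corpus:
        tf = Counter(doc)
        for word, count in tf.items():
            # Term frequency weighted by inverse document frequency.
            scores[word] += (count / len(doc)) * math.log(n_docs / df[word])
    return [word for word, _ in scores.most_common(top_n)]

docs = [["king", "queen", "castle"],
        ["said", "go", "said"],
        ["queen", "said", "go"]]
print(tfidf_keywords(docs, top_n=3))
```

Words confined to one document ("king", "castle") score high, while words spread across the corpus ("queen", "go") score low.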

Example
-------

In the example below, we use the *book-excerpts* corpus, which is available in the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget using the TF-IDF method. The widget returns a list of words, and we can select the top-ranked words to send to the output.

We can use these words in [Word List](wordlist.md), where we can edit them, add to them or remove them. Alternatively, we can send the candidate words directly to Semantic Viewer or Score Documents.

![](images/Semantic-Viewer-Example.png)

References
----------

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. *Information Sciences*, Elsevier, Vol. 509, pp. 257-289.

Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In *Text Mining* (eds M.W. Berry and J. Kogan). https://doi.org/10.1002/9780470689646.ch1
24 changes: 21 additions & 3 deletions doc/widgets/score-documents.md
@@ -20,6 +20,24 @@ Scores documents based on word appearance.
- **Word frequency**: The number of times the word appears in the document.
- **Word ratio**: Indicates whether the word appears in the document.
- **Similarity**: The cosine similarity between document embedding and word embedding.
2. Select the aggregation function that combines word scores into document scores (mean, median, min or max).
3. Select documents:
   - None: no documents are on the output.
   - All: the entire corpus is on the output.
   - Manual: manually select the documents from the table.
   - Top documents: the n top-scored documents are sent to the output.
4. If *Send Automatically* is checked, changes are communicated automatically. Alternatively, press *Send*.
5. Filter documents based on the document title in the first column. Below is the table with the document titles in the first column and scores in the other columns.
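The score-then-aggregate step can be sketched as follows; `score_documents`, the token-list input format and the default mean aggregation are illustrative assumptions, not the widget's actual API.

```python
# Sketch of word-frequency document scoring with mean aggregation,
# mirroring the options described above (illustrative, not the widget's code).
from statistics import mean

def score_documents(docs, words, aggregate=mean):
    """Score each tokenized document by how often each input word
    appears, then aggregate per-word counts into one document score."""
    scores = []
    for tokens in docs:
        word_counts = [tokens.count(w) for w in words]  # word frequency
        scores.append(aggregate(word_counts))
    return scores

docs = [["the", "queen", "and", "the", "king"],
        ["a", "man", "said", "go"]]
print(score_documents(docs, ["queen", "king"]))  # → [1, 0]
```

Swapping `aggregate` for `median`, `min` or `max` reproduces the other aggregation choices.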

Example
-------

Score Documents is used to find documents that are semantically similar to the input word list. In the example below, we are using the *book-excerpts* corpus from the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget and send them to [Word List](wordlist.md). There, we add some words of our own, such as princess, prince, king and queen.

Finally, we pass the preprocessed corpus from Preprocess Text and the word list from the Word List widget to Score Documents. Score Documents scores each document based on how frequently the input words appear in it.

![](images/Score-Documents-Example.png)
21 changes: 20 additions & 1 deletion doc/widgets/semanticviewer.md
@@ -14,4 +14,23 @@ Displays corpus semantics.
- Other Docs: Other documents.
- Corpus: A collection of documents.

**Semantic Viewer** is meant for viewing corpus semantics. The widget takes input words and finds documents or document passages containing these words.

![](images/Semantic-Viewer.png)

1. Filtering. Set the threshold above which text passages are colored. The threshold applies to the sentence-level score, computed as the maximum cosine similarity between the SBERT embedding of the sentence and the embeddings of the input words. Only sentences with a score above the threshold are colored.
2. Display either the entire document, the relevant section of the text or only the relevant sentence(s).
3. A list of matches, scores and documents. A match is the number of input words found in the document. The score is computed at the sentence level by taking the maximum cosine similarity between the SBERT embeddings of the sentence and the input keywords; it is then aggregated into the document score displayed in the list.
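The sentence-then-document scoring described above can be sketched like this; the random-looking unit vectors stand in for SBERT embeddings, and `document_score` plus the choice of `max` as the document-level aggregation are illustrative assumptions, not the widget's actual code.

```python
# Sketch of sentence-level scoring (max cosine similarity to any input
# word embedding) aggregated to a document score. Illustrative only.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def document_score(sentence_embeddings, word_embeddings):
    # Sentence score: max cosine similarity to any input word embedding.
    sentence_scores = [max(cosine(s, w) for w in word_embeddings)
                       for s in sentence_embeddings]
    # Aggregate sentence scores into one document score (max, as an
    # assumption; the widget may aggregate differently).
    return max(sentence_scores)
```

A sentence whose embedding points in the same direction as an input word's embedding gets a score of 1, and that sentence alone is enough to give the document a high score.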

Example
-------

In the example below, we use the *book-excerpts* corpus, which is available in the [Corpus](corpus-widget.md) widget.

We pass the corpus to [Preprocess Text](preprocesstext.md), where we lowercase the text, split it into words with tokenization, use the Lemmagen lemmatizer to convert tokens to their base form and finally remove stopwords.

Next, we find characteristic words with the [Extract Keywords](keywords.md) widget and send them to [Word List](wordlist.md). There, we add some words of our own, such as princess, prince, king and queen.

Finally, we pass the entire list of words to Semantic Viewer along with the corpus from Preprocess Text. The widget uses the input word list to find matching passages in each document. We can now see the parts of the text talking about princesses, queens, and so on.

![](images/Semantic-Viewer-Example.png)
39 changes: 39 additions & 0 deletions doc/widgets/wordlist.md
@@ -0,0 +1,39 @@
Word List
=========

Create a list of words.

**Inputs**

- Words: A table of words.

**Outputs**

- Selected Words: Words selected from the table.
- Words: A table of words.

**Word List** is meant for creating and joining lists of words for semantic analysis. The user can manually enter words into the widget or import them from other widgets, for example [Extract Keywords](keywords.md).

![](images/Word-List-Union.png)

1. Library of existing word lists. Add a new list with "+" or remove one with "-". Use "Update" to save the current list on the right as a word list in the widget. With "More" you can load an existing list in .txt format using the *Import Words from File* option or save the list locally using the *Save Words to File* option.
2. Input options:
- Word variable: set which string variable to use as a list of words.
- Update: define how to combine an existing list with the input Words. *Intersection* keeps only overlapping words, *Union* uses all the words, *Only input* ignores the list from the widget and uses only the input list, and *Ignore input* uses only the list from the widget.
3. Use *Filter* to find a word in the list. The list shows the words on the output. One can select a subset of words from the list. Use "+" to add a new word and "-" to remove it from the list. *Sort* sorts the list alphabetically.
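The four update modes amount to simple set operations on ordered, de-duplicated word lists; `update_words` below is an illustrative sketch, not the widget's actual API.

```python
# Sketch of the four update modes described above, treating word
# lists as ordered lists without duplicates. Illustrative only.
def update_words(library_words, input_words, mode="Union"):
    if mode == "Intersection":
        # Keep only words that appear in both lists.
        merged = [w for w in library_words if w in set(input_words)]
    elif mode == "Union":
        # Keep all words, library first, without duplicates.
        merged = library_words + [w for w in input_words
                                  if w not in set(library_words)]
    elif mode == "Only input":
        merged = list(input_words)
    elif mode == "Ignore input":
        merged = list(library_words)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return merged

print(update_words(["princess", "king"], ["king", "said"], "Union"))
# → ['princess', 'king', 'said']
```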

Example
-------

In this example we are using the pre-loaded *book-excerpts* corpus from the [Corpus](corpus-widget.md) widget. [Preprocess Text](preprocesstext.md) creates tokens by transforming the text to lowercase, splitting it into words, normalizing the words with Lemmagen lemmatizer and finally removing stopwords.

Then we pass the preprocessed data to [Extract Keywords](keywords.md), a widget that finds characteristic words in the corpus. We have used the default TF-IDF setting and passed the top 7 words (said, one, go, sara, man, look, little) to Word List.

In the Word List, we have previously defined some words that we would like to find in the text, namely princess, prince, king and queen. We used *Union* to keep both the list we defined manually and the one input from Extract Keywords.

Finally, we send the entire word list to [Semantic Viewer](semanticviewer.md) and add the *Corpus* output from Preprocess Text as well. Semantic Viewer now scores documents based on the input word list. The higher the score, the more matches the document has.

This is a nice way to find content of interest (say, princes and princesses) in a collection of texts.

![](images/Semantic-Viewer-Example.png)
