OWCollocations: widget for observing collocations #782

Merged
merged 3 commits into from
Aug 24, 2022
1 change: 1 addition & 0 deletions doc/index.rst
@@ -34,6 +34,7 @@ Widgets
widgets/keywords
widgets/score-documents
widgets/semanticviewer
widgets/collocations
widgets/wordlist
widgets/ontology

9 changes: 9 additions & 0 deletions doc/widgets.json
@@ -233,6 +233,15 @@
"search"
]
},
{
"text": "Collocations",
"doc": "widgets/collocations.md",
"icon": "../orangecontrib/text/widgets/icons/Collocations.svg",
"background": "light-blue",
"keywords": [
"PMI"
]
},
{
"text": "Statistics",
"doc": "widgets/statistics.md",
44 changes: 44 additions & 0 deletions doc/widgets/collocations.md
@@ -0,0 +1,44 @@
Collocations
============

Compute significant bigrams and trigrams.

**Inputs**

- Corpus: A collection of documents.

**Outputs**

- Table: A list of bigrams or trigrams.

**Collocations** finds frequently co-occurring words in a corpus. It displays bigrams or trigrams ranked by their score.

![](images/Collocations.png)

1. Settings: choose whether to observe bigrams (two co-occurring words) or trigrams (three co-occurring words), and set the frequency threshold; n-grams that occur less often than the threshold are removed.
2. Scoring method:
- [Pointwise Mutual Information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) (PMI)
- [Chi Square](https://en.wikipedia.org/wiki/Chi-squared_test)
- [Dice](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
- [Fisher](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)
- [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)
- [Likelihood ratio](https://en.wikipedia.org/wiki/Likelihood-ratio_test)
- Mi Like
- [Phi Square](https://en.wikipedia.org/wiki/Phi_coefficient)
- Poisson Stirling
- Raw Frequency
- [Student's T](https://en.wikipedia.org/wiki/Student%27s_t-test)
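
As a rough illustration of what a few of these measures compute, the sketch below gives common textbook definitions of PMI, Dice and Jaccard in terms of raw counts. The function and parameter names are made up for this example, and the widget's own implementation may follow different conventions (for instance, a different logarithm base).

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) of the bigram (x, y).

    n_xy: how often x is followed by y; n_x, n_y: counts of each word;
    n_total: number of tokens. All names are illustrative assumptions.
    """
    return math.log2(n_xy * n_total / (n_x * n_y))


def dice(n_xy, n_x, n_y):
    """Dice coefficient: twice the joint count over the sum of word counts."""
    return 2 * n_xy / (n_x + n_y)


def jaccard(n_xy, n_x, n_y):
    """Jaccard index: joint count over the count of either word occurring."""
    return n_xy / (n_x + n_y - n_xy)


# A pair seen 8 times, with the words seen 16 and 32 times among 1024 tokens.
print(pmi(8, 16, 32, 1024))  # 4.0
print(dice(8, 16, 32))
print(jaccard(8, 16, 32))
```

PMI rewards pairs that co-occur more often than their individual frequencies would predict, which is why it tends to rank rare but tightly bound pairs highly and is usually combined with a frequency threshold.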

Example
-------

**Collocations** is mostly intended for data exploration. Here, we show how to observe bigrams that occur more than five times in the corpus. Bigrams are scored using the Pointwise Mutual Information statistic.

We load the *grimm-tales-selected* data in the [Corpus](corpus-widget.md) widget and send it to Collocations.

![](images/Collocations-Example.png)
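
The same kind of exploration can be approximated outside the widget. The following is a minimal sketch in plain Python, assuming a single pre-tokenized text and treating the threshold as a minimum count; `score_bigrams` and `min_freq` are hypothetical names, not part of the add-on, and the toy sentence stands in for a real corpus.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_freq=1):
    """Rank adjacent word pairs by PMI, keeping pairs seen at least min_freq times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_total = len(tokens)
    scored = []
    for (w1, w2), n_xy in bigrams.items():
        if n_xy < min_freq:
            continue  # drop n-grams below the frequency threshold
        score = math.log2(n_xy * n_total / (unigrams[w1] * unigrams[w2]))
        scored.append(((w1, w2), score))
    # Highest-scoring pairs first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


tokens = "the old king and the old queen met the young king".split()
for bigram, score in score_bigrams(tokens, min_freq=2):
    print(bigram, round(score, 2))
```

On a real corpus one would typically score each document's tokens without pairing words across document boundaries; this sketch skips that detail for brevity.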

References
----------

Manning, Christopher, and Hinrich Schütze. 1999. Collocations. Available at: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Binary file added doc/widgets/images/Collocations-Example.png
Binary file added doc/widgets/images/Collocations.png
133 changes: 133 additions & 0 deletions orangecontrib/text/widgets/icons/Collocations.svg