[DOC] Collocations
ajdapretnar committed Aug 22, 2022
1 parent 4a0d549 commit 3ee3b17
Showing 5 changed files with 54 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -34,6 +34,7 @@ Widgets
widgets/keywords
widgets/score-documents
widgets/semanticviewer
widgets/collocations
widgets/wordlist
widgets/ontology

9 changes: 9 additions & 0 deletions doc/widgets.json
@@ -233,6 +233,15 @@
"search"
]
},
{
"text": "Collocations",
"doc": "widgets/collocations.md",
"icon": "../orangecontrib/text/widgets/icons/Collocations.svg",
"background": "light-blue",
"keywords": [
"PMI"
]
},
{
"text": "Statistics",
"doc": "widgets/statistics.md",
44 changes: 44 additions & 0 deletions doc/widgets/collocations.md
@@ -0,0 +1,44 @@
Collocations
============

Compute significant bigrams and trigrams.

**Inputs**

- Corpus: A collection of documents.

**Outputs**

- Table: A list of bigrams or trigrams.

**Collocations** finds words that frequently co-occur in a corpus. It displays bigrams or trigrams ranked by the selected score.

![](images/Collocations.png)

1. Settings: choose between bigrams (sets of two co-occurring words) and trigrams (sets of three co-occurring words), and set the frequency threshold. N-grams that occur less often than the threshold are removed.
2. Scoring method:
- [Pointwise Mutual Information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) (PMI)
- [Chi Square](https://en.wikipedia.org/wiki/Chi-squared_test)
- [Dice](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
- [Fisher](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)
- [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)
- [Likelihood ratio](https://en.wikipedia.org/wiki/Likelihood-ratio_test)
- Mi Like
- [Phi Square](https://en.wikipedia.org/wiki/Phi_coefficient)
- Poisson Stirling
- Raw Frequency
- [Student's T](https://en.wikipedia.org/wiki/Student%27s_t-test)
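
To make a few of these scores concrete, the sketch below computes PMI, Dice and Jaccard for a single bigram from raw counts. It uses the textbook definitions; the function name `bigram_scores` and its parameters are hypothetical, and the widget's own implementation may differ in details such as how the marginal counts are taken.

```python
import math


def bigram_scores(n_xy, n_x, n_y, n_total):
    """Score one bigram (x, y) from raw counts (illustrative helper).

    n_xy:    number of times the bigram (x, y) occurs
    n_x:     number of bigrams whose first word is x
    n_y:     number of bigrams whose second word is y
    n_total: total number of bigrams in the corpus
    """
    # PMI: log2 of observed joint probability over the probability
    # expected if x and y were independent.
    pmi = math.log2(n_xy * n_total / (n_x * n_y))
    # Dice: twice the joint count over the sum of the marginal counts.
    dice = 2 * n_xy / (n_x + n_y)
    # Jaccard: joint count over the count of bigrams containing x or y.
    jaccard = n_xy / (n_x + n_y - n_xy)
    return {"pmi": pmi, "dice": dice, "jaccard": jaccard}


scores = bigram_scores(n_xy=2, n_x=4, n_y=4, n_total=16)
```

A positive PMI means the two words appear together more often than independence would predict; Dice and Jaccard stay between 0 and 1 and reach 1 only when the two words never occur apart.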

Example
-------

**Collocations** is mostly intended for data exploration. Here, we show how to observe bigrams that occur more than five times in the corpus. The bigrams are scored with the Pointwise Mutual Information statistic.

We load the *grimm-tales-selected* data in the [Corpus](corpus-widget.md) widget and send it to Collocations.

![](images/Collocations-Example.png)
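
The same kind of exploration can be approximated outside the widget. The following is a minimal sketch in plain Python, assuming pre-tokenized documents and bigrams formed from adjacent words; the function `rank_bigrams`, the `min_freq` parameter and the toy sentences are all made up for illustration and are not the widget's code or the Grimm tales data.

```python
import math
from collections import Counter


def rank_bigrams(documents, min_freq=1):
    """Return (bigram, frequency, PMI) tuples sorted by PMI, highest first.

    documents: list of token lists; bigrams never span two documents.
    min_freq:  bigrams occurring fewer than min_freq times are dropped.
    """
    bigram_counts = Counter()
    first_counts = Counter()   # how often a word starts a bigram
    second_counts = Counter()  # how often a word ends a bigram
    for tokens in documents:
        for x, y in zip(tokens, tokens[1:]):
            bigram_counts[(x, y)] += 1
            first_counts[x] += 1
            second_counts[y] += 1

    total = sum(bigram_counts.values())
    scored = []
    for (x, y), n_xy in bigram_counts.items():
        if n_xy < min_freq:
            continue  # frequency threshold
        pmi = math.log2(n_xy * total / (first_counts[x] * second_counts[y]))
        scored.append(((x, y), n_xy, pmi))

    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Toy corpus, invented for this example.
docs = [
    "the old king went to the forest".split(),
    "the old king saw the wolf".split(),
    "the wolf ran to the forest".split(),
]

for bigram, freq, pmi in rank_bigrams(docs, min_freq=2):
    print(bigram, freq, round(pmi, 3))
```

Running it once with `min_freq=1` and once with `min_freq=2` shows why the threshold matters: PMI tends to reward pairs of rare words, so without a frequency filter bigrams seen only once can outrank recurring ones.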

References
----------

Manning, Christopher, and Hinrich Schütze. 1999. Collocations. In Foundations of Statistical Natural Language Processing. MIT Press. Available at: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Binary file added doc/widgets/images/Collocations-Example.png
Binary file added doc/widgets/images/Collocations.png
