Merge pull request #361 from HLasse/docs-information-theory
docs: fix grammar in information theory docs
HLasse authored Aug 30, 2024
2 parents 15b7670 + d3ae92c commit 88a922a
10 changes: 5 additions & 5 deletions docs/information_theory.rst
@@ -6,14 +6,14 @@ from the text. These include:

- `{doc/span}._.entropy`: the Shannon entropy of the text using the `token.prob` as the probability
  of each token. Entropy is defined as :math:`H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)`, where :math:`p(x_i)` is the probability of the token :math:`x_i`.
- `{doc/span}._.perplexity`: the perplexity of the text. Perplexity is a measurement of how well a
probability distribution or probability model predicts a sample. Perplexity is defined as :math:`PPL(X) = e^{-H(X)}`, where :math:`H(X)` is the entropy of the text.
- `{doc/span}._.per_word_perplexity`: The perplexity of the text, divided by the number of words. Can be considered the length-normalized perplexity.
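The three definitions above can be sketched in plain Python. This is a minimal illustration of the formulas as stated in this document, not the library's actual implementation, and the token probabilities are made up:

```python
import math

def entropy(probs):
    """Shannon entropy: H(X) = -sum_i p(x_i) * ln p(x_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity as defined above: PPL(X) = e^{-H(X)}."""
    return math.exp(-entropy(probs))

def per_word_perplexity(probs, n_words):
    """Perplexity divided by the number of words."""
    return perplexity(probs) / n_words

# Four equally likely tokens: H = ln(4), so PPL = e^{-ln 4} = 0.25.
probs = [0.25, 0.25, 0.25, 0.25]
print(entropy(probs))                 # ≈ 1.386
print(perplexity(probs))              # 0.25
print(per_word_perplexity(probs, 4))  # 0.0625
```

The uniform distribution maximizes entropy for a given number of tokens, which is why it is a convenient sanity check here.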

These information-theoretic measures are often used, for example, to describe the complexity of a text.
The higher the entropy, the more complex the text is.
Similarly, one could imagine filtering text based on the per-word perplexity, given the assumption that
highly surprising texts are in fact non-coherent text pieces.
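The filtering idea could be sketched with a small helper. Everything here is hypothetical: the function name, the score values, and the bounds are illustrative only, and useful cut-offs depend on the corpus and on how the scores were produced:

```python
def filter_by_per_word_perplexity(texts, scores, lower, upper):
    """Keep texts whose per-word perplexity lies within [lower, upper].

    `scores` are assumed to be precomputed per-word perplexities,
    e.g. collected from `doc._.per_word_perplexity`.
    """
    return [t for t, s in zip(texts, scores) if lower <= s <= upper]

texts = ["a fairly ordinary sentence", "zxq vbn plk qwrt"]
scores = [0.12, 0.97]  # made-up per-word perplexity values
kept = filter_by_per_word_perplexity(texts, scores, lower=0.0, upper=0.5)
# kept == ["a fairly ordinary sentence"]
```

Using an explicit range rather than a single threshold leaves open which direction counts as "surprising", since that depends on the perplexity convention in use.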

.. note:: The information theory components require an available lexeme prob table from spaCy, which is not available for all languages. A warning will be raised and the values set to ``np.nan`` if the table cannot be found for the language.
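Because a missing table yields ``np.nan`` rather than an exception, downstream code may want an explicit guard. A small sketch (not library code; the function name and default are made up):

```python
import math

def nan_safe(value, default=0.0):
    """Fall back to a default when the value is NaN.

    Useful when a lexeme table was unavailable and a measure
    was set to np.nan instead of a number.
    """
    return default if math.isnan(value) else value

print(nan_safe(float("nan")))  # 0.0
print(nan_safe(1.386))         # 1.386
```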
