Merge pull request #3 from x-tabdeveloping/docs

Docs
x-tabdeveloping · Feb 3, 2024 · a7b7071 · a7b7071
2 parents 8683fa0 + 9ccd5c2
commit a7b7071
Showing 76 changed files with 14,953 additions and 0 deletions.
diff --git a/docs/GMM.md b/docs/GMM.md
@@ -0,0 +1,81 @@
+# GMM
+
+GMM is a generative probabilistic model over the contextual embeddings.
+The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
+These Gaussian components are assumed to be the topics.
+
+<figure>
+  <img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_gmm_pdf_001.png" width="80%" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Components of a Gaussian Mixture Model <br>(figure from scikit-learn documentation)</figcaption>
+</figure>
+
+## The Model
+
+### 1. Generative Modeling
+
+GMM assumes that the embeddings are generated according to the following stochastic process:
+
+1. Select global topic weights: $\Theta$
+2. For each component, select a mean $\mu_z$ and a covariance matrix $\Sigma_z$.
+3. For each document:
+    - Draw topic label: $z \sim Categorical(\Theta)$
+    - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
+
+Priors are optionally imposed on the model parameters.
+The model is fitted either using expectation maximization or variational inference.
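The generative process above can be sketched as a small simulation. This is an illustrative sketch only: the Dirichlet draw for the topic weights, the toy dimensions, and all variable names are assumptions made for the example, not part of Turftopic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, dim, n_docs = 3, 2, 500

# 1. Global topic weights (Theta); the Dirichlet draw is only for illustration.
theta = rng.dirichlet(np.ones(n_topics))

# 2. One mean and one covariance matrix per component.
means = rng.normal(scale=5.0, size=(n_topics, dim))
covariances = np.stack([np.eye(dim) * s for s in (0.5, 1.0, 2.0)])

# 3. For each document: draw a topic label, then a document vector.
labels = rng.choice(n_topics, size=n_docs, p=theta)
embeddings = np.stack(
    [rng.multivariate_normal(means[z], covariances[z]) for z in labels]
)
```

Fitting a GMM amounts to going the other way: recovering `theta`, `means` and `covariances` from `embeddings` alone.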
+
+### 2. Topic Inference over Documents
+
+After the model is fitted, soft topic labels are inferred for each document.
+A document-topic-matrix ($T$) is built from the likelihood of each document encoding under each Gaussian component.
+
+In other words, for document $i$ and topic $z$ the matrix entry is $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$.
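As a rough sketch of how such a matrix could be computed, the snippet below evaluates the Gaussian density of every document vector under every component with plain NumPy. The function names and the toy parameters are hypothetical; a real implementation would more likely work with log-densities to avoid underflow in high dimensions.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Density of N(mean, cov) evaluated at each row of x."""
    dim = mean.shape[0]
    diff = x - mean
    # Squared Mahalanobis distance of each row to the component mean.
    mahalanobis = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis) / norm

def document_topic_matrix(embeddings, means, covariances):
    """T[i, z] = p(rho_i | mu_z, Sigma_z)."""
    return np.stack(
        [gaussian_density(embeddings, m, c) for m, c in zip(means, covariances)],
        axis=1,
    )

# Two well-separated toy components in 2D (illustrative values).
means = np.array([[0.0, 0.0], [10.0, 10.0]])
covariances = np.stack([np.eye(2), np.eye(2)])
embeddings = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])

T = document_topic_matrix(embeddings, means, covariances)
```

The third toy document sits halfway between the two means, so its two entries in `T` are equal, which is the kind of label uncertainty the soft labels are meant to express.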
+
+### 3. Soft c-TF-IDF
+
+Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
+an extension of __c-TF-IDF__ that can be used with continuous labels.
+
+Let $X$ be the document-term matrix, where each element ($X_{ij}$) is the number of times word $j$ occurs in document $i$.
+Soft c-TF-IDF scores for terms in a topic are then calculated in the following manner:
+
+- Estimate weight of term $j$ for topic $z$: <br>
+$tf_{zj} = \frac{t_{zj}}{w_z}$, where 
+$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and 
+$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
+- Estimate inverse document/topic frequency for term $j$:  
+$idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where
+$N$ is the total number of documents.
+- Calculate importance of term $j$ for topic $z$:   
+$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
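The three steps above translate almost directly into matrix operations. The following is a minimal NumPy sketch of the formulas as written here, not Turftopic's actual code; the function name and toy data are assumptions, and every term is assumed to occur at least once so that the logarithm is finite.

```python
import numpy as np

def soft_ctf_idf(T, X):
    """T: (n_docs, n_topics) document-topic matrix; X: (n_docs, n_terms) counts."""
    n_docs = X.shape[0]
    t = T.T @ X                          # t[z, j] = sum_i T[i, z] * X[i, j]
    w = np.abs(T).T @ X.sum(axis=1)      # w[z] = sum_i |T[i, z]| * sum_j X[i, j]
    tf = t / w[:, None]
    idf = np.log(n_docs / np.abs(t).sum(axis=0))  # one value per term
    return tf * idf

# Toy data: 4 documents, 2 topics, 3 terms (illustrative values).
T = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])

importance = soft_ctf_idf(T, X)
top_terms = importance.argmax(axis=1)  # most important term per topic
```

With these toy values, term 0 only occurs in documents of topic 0 and term 1 only in documents of topic 1, so each ends up as the top term of its topic.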
+
+### Similarities with Clustering Models
+
+Gaussian Mixtures can in some sense be considered a fuzzy clustering model.
+
+Since we assume the existence of a ground truth label for each document, the model technically cannot capture multiple topics in a document,
+only uncertainty around the topic label.
+
+Modeling this uncertainty makes GMM better than hard clustering at accounting for documents that lie at the intersection of two or more semantically close topics.
+
+Another important distinction is that clustering topic models are typically transductive, while GMM is inductive.
+This means that in the case of GMM we are inferring some underlying semantic structure, from which the different documents are generated,
+instead of just describing the corpus at hand.
+In practical terms this means that GMM can, by default, infer topic labels for new documents, while (some) clustering models cannot.
+
+## Considerations
+
+### Strengths
+
+ - Efficiency, Stability: GMM relies on a rock-solid implementation in scikit-learn, so you can rest assured that the model will be fast and reliable.
+ - High Quality: GMM typically finds very high quality and easily interpretable topics.
+ - Coverage of Ingroup Variance: The model is very efficient at describing the extracted topics in all their detail.
+ This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
+ - Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
+
+### Weaknesses
+
+ - Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
+ - Assumption of Gaussianity: The model assumes that topics are Gaussian components, which might very well not be the case.
+ Fortunately, this rarely affects the perceived real-world performance of models, and typically does not present an issue in practical settings.
+ - Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
diff --git a/docs/KeyNMF.md b/docs/KeyNMF.md
@@ -0,0 +1,44 @@
+# KeyNMF
+
+KeyNMF is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
+while taking inspiration from classical matrix-decomposition approaches for extracting topics.
+
+## The Model
+
+<figure>
+  <img src="/images/keynmf.png" width="90%" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Schematic overview of KeyNMF</figcaption>
+</figure>
+
+### 1. Keyword Extraction
+
+The first step of the process is gaining enhanced representations of documents by using contextual embeddings.
+Both the documents and the vocabulary get encoded with the same sentence encoder.
+
+Keywords are assigned to each document based on the cosine similarity of the document embedding to the embedded words in the document.
+Only the top K words with positive cosine similarity to the document are kept.
+
+These keywords are then arranged into a document-term importance matrix where each column represents a keyword that was encountered in at least one document,
+and each row is a document.
+The entries in the matrix are the cosine similarities of the given keyword to the document in semantic space.
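A compact way to picture this step is the NumPy sketch below. It assumes the document and word embeddings have already been produced by the same encoder; the function name, the `top_k` default and the toy vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def keyword_matrix(doc_embeddings, word_embeddings, doc_term_counts, top_k=2):
    """Document-term importance matrix of positive cosine similarities."""
    # Normalise rows so that dot products are cosine similarities.
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    words = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    similarity = docs @ words.T
    # Candidates: words that occur in the document and have positive similarity.
    similarity = np.where((doc_term_counts > 0) & (similarity > 0), similarity, 0.0)
    importance = np.zeros_like(similarity)
    for i, row in enumerate(similarity):
        top = np.argsort(-row)[:top_k]  # indices of the top_k highest similarities
        importance[i, top] = row[top]
    # Keep only keywords that were selected for at least one document.
    kept = importance.sum(axis=0) > 0
    return importance[:, kept], np.flatnonzero(kept)

# Toy data: 2 documents, 4 vocabulary items, 2-dimensional embeddings.
doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
word_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
doc_term_counts = np.array([[1, 1, 0, 1], [0, 1, 1, 0]])

matrix, kept_terms = keyword_matrix(doc_embeddings, word_embeddings, doc_term_counts)
```

The fourth word occurs in the first document but has negative similarity to it, so it never becomes a keyword and its column is dropped.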
+
+### 2. Topic Discovery
+
+Topics in this matrix are then discovered using Non-negative Matrix Factorization.
+Essentially the model tries to discover underlying dimensions/factors along which most of the variance in term importance
+can be explained.
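To make the decomposition concrete, here is a minimal sketch that uses the classic multiplicative update rule for NMF. This solver is swapped in for illustration because it fits in a few lines; it is not necessarily the one Turftopic uses, and the function name and toy matrix are assumptions.

```python
import numpy as np

def nmf(X, n_topics, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative X (docs x terms) into W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_topics))
    H = rng.random((n_topics, X.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy importance matrix with two blocks of co-occurring keywords.
X = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.6],
    [0.0, 0.0, 0.9, 0.7],
])
W, H = nmf(X, n_topics=2)
```

Rows of `H` can be read as topics (keyword weights), and rows of `W` as the topic mixture of each document.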
+
+## Considerations
+
+### Strengths
+
+ - Stability, Robustness and Quality: KeyNMF extracts very clean topics even when a lot of noise is present in the corpus, and the model's performance remains relatively stable across domains.
+ - Scalability: The model can be fitted in an online fashion, and we recommend that you choose KeyNMF when the number of documents is large (over 100 000).
+ - Fail Safe and Adjustable: Since the modelling process consists of multiple easily separable steps, it is easy to repeat one if something goes wrong. This also makes it an ideal choice for production usage.
+ - Can capture multiple topics in a document.
+
+### Weaknesses
+
+ - Lack of Multilingual Capabilities: KeyNMF as it is currently implemented cannot be used in a multilingual context. Changes to the model that allow this are possible, and will likely be implemented in the future.
+ - Lack of Nuance: Since only the top K keywords are considered and used for topic extraction, some of the nuances, especially in long texts, might get lost. We therefore recommend that you scale K with the average length of the texts you're working with. For tweets it might be worth scaling it down to 5, while for longer documents a larger number (say, 50) might be advisable.
+ - Practitioners have to choose the number of topics a priori.
diff --git a/docs/clustering.md b/docs/clustering.md
@@ -0,0 +1,169 @@
+# Clustering Topic Models
+
+Clustering topic models conceptualize topic modeling as a clustering task.
+Essentially a topic for these models is a tightly packed group of documents in semantic space.
+
+The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.
+
+Turftopic contains flexible implementations of these models where you have control over each of the steps in the process,
+while sticking to a minimal amount of extra dependencies.
+While the models themselves can be equivalent to BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features
+that the other libraries boast.
+
+## The Model
+
+### 1. Dimensionality Reduction
+
+It is common practice in clustering topic modeling literature to reduce the dimensionality of the embeddings before clustering them.
+This is to avoid the curse of dimensionality, an issue that affects many clustering models.
+
+By default, Turftopic performs dimensionality reduction with scikit-learn's TSNE implementation,
+but users are free to specify the model that will be used for dimensionality reduction.
+
+The impact of the choice of dimensionality reduction method is poorly understood, and has not yet been explored in the literature.
+Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
+
+### 2. Clustering
+
+After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
+As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.
+
+Some clustering models are capable of discovering the number of clusters in the data.
+This is a useful and yet-to-be-challenged property of clustering topic models.
+
+Practice suggests, however, that in large corpora, this frequently results in a very large number of topics, which is impractical for interpretation.
+Models' hyperparameters can be adjusted to account for this behaviour, but the impact of the choice of hyperparameters on topic quality is more or less unknown.
+
+### 3a. Term Importance: Proximity to Cluster Centroids
+
+Clustering topic models rely on post-hoc term importance estimation.
+Currently there are two methods used for this.
+
+The solution introduced in Top2Vec (Angelov, 2020) is that of estimating terms' importances for a given topic from their
+embeddings' cosine similarity to the centroid of the embeddings in a cluster.
+
+<figure>
+  <img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true" width="600px" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Terms Close to the Topic Vector <br>(figure from Top2Vec documentation)</figcaption>
+</figure>
+
+This has three implications:
+
+1. Topic descriptions are very specific. As the closest terms to the topic vector are selected, they tend to also be very close to each other.
+ The issue with this is that many of the documents in a topic might not get proper coverage.
+2. It is assumed that the clusters are convex and spherical. This might not at all be the case, and especially when clusters are concave, 
+ the closest terms to the centroid might end up describing a different, or nonexistent topic.
+ In other words: The mean might not be a representative datapoint of the population.
+3. Noise rarely gets into topic descriptions. Since function words or contaminating terms are not very likely to be closest to the topic vector,
+ descriptions are typically clean.
+
+<figure>
+  <img src="../images/cluster_centroid_problem.png" width="80%" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Centroids of Non-Convex Clusters</figcaption>
+</figure>
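The centroid-based estimate described in this section can be sketched in a few lines of NumPy. The function name and toy data are hypothetical, and the sketch assumes that outlier documents (for instance those labelled `-1` by HDBSCAN or OPTICS) have been filtered out beforehand.

```python
import numpy as np

def centroid_term_importance(doc_embeddings, labels, word_embeddings):
    """Cosine similarity of every word embedding to every cluster centroid."""
    topics = np.unique(labels)
    # The topic vector is the mean of the document embeddings in the cluster.
    centroids = np.stack([doc_embeddings[labels == z].mean(axis=0) for z in topics])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    words = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    return centroids @ words.T  # shape: (n_topics, n_terms)

# Toy data: two clusters of two documents each, three vocabulary items.
doc_embeddings = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
word_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

importance = centroid_term_importance(doc_embeddings, labels, word_embeddings)
top_terms = importance.argmax(axis=1)
```

Because only the mean of each cluster enters the computation, a concave cluster whose mean falls outside the cluster would be described by terms that are close to none of its documents.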
+
+### 3b. Term Importance: c-TF-IDF
+
+The solution to this issue, suggested by Grootendorst (2022), was c-TF-IDF.
+
+c-TF-IDF is a weighting scheme based on the number of occurrences of terms in each cluster.
+Terms which frequently occur in other clusters are inversely weighted so that words which are specific to a topic gain larger importance.
+
+Let $X$ be the document-term matrix, where each element ($X_{ij}$) is the number of times word $j$ occurs in document $i$.
+Turftopic uses a modified version of c-TF-IDF, which is calculated in the following manner:
+
+- Estimate weight of term $j$ for topic $z$: <br>
+$tf_{zj} = \frac{t_{zj}}{w_z}$, where 
+$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of word $j$ in topic $z$ and 
+$w_{z}= \sum_{j} t_{zj}$ is the total number of words in the topic <br>
+- Estimate inverse document/topic frequency for term $j$:  
+$idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where
+$N$ is the total number of documents.
+- Calculate importance of term $j$ for topic $z$:   
+$\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
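A minimal NumPy sketch of the formulas above, for hard cluster labels, might look as follows. It follows the equations as written in this section rather than Turftopic's source; the function name and toy data are assumptions, and each term is assumed to occur at least once in the corpus.

```python
import numpy as np

def ctf_idf(labels, X):
    """labels: (n_docs,) cluster label per document; X: (n_docs, n_terms) counts."""
    n_docs = X.shape[0]
    topics = np.unique(labels)
    # t[z, j]: number of occurrences of term j in the documents of cluster z.
    t = np.stack([X[labels == z].sum(axis=0) for z in topics])
    w = t.sum(axis=1)  # total number of words in each cluster
    tf = t / w[:, None]
    idf = np.log(n_docs / np.abs(t).sum(axis=0))
    return tf * idf

# Toy data: 4 documents in 2 clusters, 3 terms (illustrative values).
labels = np.array([0, 0, 1, 1])
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])

importance = ctf_idf(labels, X)
top_terms = importance.argmax(axis=1)  # most important term per cluster
```

Term 2 occurs equally often in both clusters, so it receives the same, lower score in each of them, while the cluster-specific terms 0 and 1 come out on top.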
+
+This solution is generally preferable to centroid-based term importance, and is the default in Turftopic, as it is more likely to give correct results.
+On the other hand, c-TF-IDF can be sensitive to words with atypical statistical properties (stop words), and can result in low diversity between topics when clusters are joined post-hoc.
+
+## Comparison to BERTopic and Top2Vec
+
+Turftopic's implementation differs from BERTopic and Top2Vec in several respects.
+You can, however, construct models in Turftopic that imitate the behaviour of these other packages.
+
+The main differences from these packages are:
+ - The c-TF-IDF formulae are not identical. BERTopic's version might be added in the future for compatibility.
+ - Dimensionality reduction in BERTopic and Top2Vec is done with UMAP.
+ - Clustering in BERTopic and Top2Vec is done with HDBSCAN.
+ - Turftopic does not include many of the visualization and model-specific utilities that BERTopic does.
+
+To get closest to the functionality of the two other packages, you can manually set the clustering and dimensionality reduction models when creating the model.
+
+You will need UMAP and scikit-learn>=1.3.0:
+
+```bash
+pip install umap-learn "scikit-learn>=1.3.0"
+```
+
+This is how you build a BERTopic-like model in Turftopic:
+
+```python
+from turftopic import ClusteringTopicModel
+from sklearn.cluster import HDBSCAN
+import umap
+
+# The default parameters of BERTopic are included so that the behaviour is as
+# close as possible
+bertopic = ClusteringTopicModel(
+    dimensionality_reduction=umap.UMAP(
+        n_neighbors=15,
+        n_components=5,
+        min_dist=0.0,
+        metric="cosine",
+    ),
+    clustering=HDBSCAN(
+        min_cluster_size=10,
+        metric="euclidean",
+        cluster_selection_method="eom",
+    ),
+    feature_importance="ctfidf",
+)
+```
+
+This is how you build a Top2Vec-like model in Turftopic:
+
+```python
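+# Imports repeated so that this block runs on its own
+from turftopic import ClusteringTopicModel
+from sklearn.cluster import HDBSCAN
+import umap
+
+top2vec = ClusteringTopicModel(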
+    dimensionality_reduction=umap.UMAP(
+        n_neighbors=15,
+        n_components=5,
+        metric="cosine"
+    ),
+    clustering=HDBSCAN(
+        min_cluster_size=15,
+        metric="euclidean",
+        cluster_selection_method="eom",
+    ),
+    feature_importance="centroid",
+)
+```
+
+## Considerations
+
+### Strengths
+
+ - Automatic Discovery of Number of Topics: Clustering models can find the number of topics by themselves. This is a useful quality of these models, as practitioners can rarely make an informed decision about the number of topics a priori.
+ - No Assumptions of Normality: With clustering models you can avoid making assumptions about cluster shapes. This is in contrast with GMMs, which assume topics to be Gaussian components.
+ - Outlier Detection: OPTICS, HDBSCAN and DBSCAN include outlier detection. This way, outliers do not influence topic representations.
+ - Not Affected by Embedding Size: Since the models include dimensionality reduction, they are not as influenced by the curse of dimensionality as other methods.
+
+### Weaknesses
+
+ - Scalability: Clustering models typically cannot be fitted in an online fashion, and manifold learning is usually inefficient in large corpora. When the number of texts is huge, the number of topics also gets inflated, which is impractical for interpretation.
+ - Lack of Nuance: The models are unable to capture multiple topics in a document or capture uncertainty around topic labels. This makes the models impractical for longer texts as well.
+ - Sensitivity to Hyperparameters: While you do not have to set the number of topics directly, the hyperparameters you choose have a huge impact on the number of topics you will end up getting. (see figure)
+ - Transductivity: Some clustering methods are transductive, meaning you can't predict topical content for new documents, as they would change the cluster structure.
+
+<figure>
+  <img src="../images/umap_hdbscan_stability.png" width="80%" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Effect of UMAP's and HDBSCAN's Hyperparameters on the Number of Topics in 20 Newsgroups</figcaption>
+</figure>