
Commit

Merge branch 'llm_naming' of https://github.com/x-tabdeveloping/turftopic into llm_naming
x-tabdeveloping committed Nov 4, 2024
2 parents 75e5302 + 619c796 commit 0951473
Showing 37 changed files with 2,527 additions and 766 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/documentation.yml
@@ -23,12 +23,12 @@ jobs:
- name: Dependencies
run: |
python -m pip install --upgrade pip
pip install "turftopic[pyro-ppl,docs]"
pip install "turftopic[pyro-ppl]" "griffe" "mkdocstrings[python]" "mkdocs" "mkdocs-material"
- name: Build and Deploy
if: github.event_name == 'push'
run: mkdocs gh-deploy --force

- name: Build
if: github.event_name == 'pull_request'
run: mkdocs build
run: mkdocs build
74 changes: 46 additions & 28 deletions README.md
@@ -7,54 +7,71 @@
## Features
- Novel transformer-based topic models:
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑 (paper in progress ⏳)
- KeyNMF 🔑
- GMM :gem: (paper soon)
- Implementations of existing transformer-based topic models
- Implementations of other transformer-based topic models
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.
### New in version 0.4.0
### New in version 0.7.0

#### Online KeyNMF
#### Component re-estimation, refitting and topic merging

You can now online fit and finetune KeyNMF as you wish!
Some models can now be modified efficiently after training,
without having to recompute all attributes from scratch.
This is especially significant for clustering models and $S^3$.

```python
from itertools import batched
from turftopic import KeyNMF
from turftopic import SemanticSignalSeparation, ClusteringTopicModel

s3_model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
# Re-estimating term importances
s3_model.estimate_components(feature_importance="angular")
# Refitting S^3 with a different number of topics (very fast)
s3_model.refit(n_components=10, random_seed=42)

clustering_model = ClusteringTopicModel().fit(corpus)
# Reduces number of topics automatically with a given method
clustering_model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
# Merge topics manually
clustering_model.join_topics([0,3,4,5])
# Resets original topics
clustering_model.reset_topics()
# Re-estimates term importances based on a different method
clustering_model.estimate_components(feature_importance="centroid")
```

model = KeyNMF(10, top_n=5)
#### Manual topic naming

corpus = ["some string", "etc", ...]
for batch in batched(corpus, 200):
batch = list(batch)
model.partial_fit(batch)
You can now manually label topics in all models in Turftopic.

```python
# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
```

#### $S^3$ Concept Compasses
#### Saving, loading and publishing to HF Hub

You can now produce a compass of concepts along two semantic axes using $S^3$.
You can now load, save and publish models with dedicated functionality.

<table>
<tr>
<td>

```python
model = SemanticSignalSeparation(10).fit(corpus)
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```
from turftopic import load_model

model.to_disk("out_folder/")
model = load_model("out_folder/")

</td>
<td><img src="./docs/images/arxiv_ml_compass.png" width="350" style="margin-left: auto;margin-right: auto;"></td>
</tr>
</table>
model.push_to_hub("your_user/model_name")
model = load_model("your_user/model_name")
```


## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
@@ -180,8 +197,9 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to

## References
- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
- Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020, August 19). Top2Vec: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European
- Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
- Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791
15 changes: 15 additions & 0 deletions docs/FASTopic.md
@@ -0,0 +1,15 @@
# FASTopic

FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.

> Turftopic contains an implementation adapted to our API; the code is mostly from the [original FASTopic package](https://github.com/BobXWu/FASTopic).
:warning: This part of the documentation is still under construction :warning:
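
A minimal usage sketch (the constructor argument, `fit_transform()` and `print_topics()` follow the scikit-learn-style API used by other Turftopic models; the example corpus is made up):

```python
from turftopic import FASTopic

# A tiny illustrative corpus; in practice you would pass your own documents
corpus = [
    "The government should invest more in renewable energy.",
    "The new graphics card renders scenes much faster.",
    "Solar panels have become a lot cheaper over the last decade.",
]

model = FASTopic(n_components=3)
doc_topic_matrix = model.fit_transform(corpus)  # soft topic proportions per document
model.print_topics()  # highest-ranking terms for each topic
```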

## References

Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.

## API Reference

::: turftopic.models.fastopic.FASTopic
75 changes: 30 additions & 45 deletions docs/GMM.md
@@ -1,4 +1,4 @@
# GMM
# GMM (Gaussian Mixture Model)

GMM is a generative probabilistic model over the contextual embeddings.
The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -9,47 +9,51 @@ These Gaussian components are assumed to be the topics.
<figcaption>Components of a Gaussian Mixture Model <br>(figure from scikit-learn documentation)</figcaption>
</figure>

## The Model
## How does GMM work?

### 1. Generative Modeling

GMM assumes that the embeddings are generated according to the following stochastic process:

1. Select global topic weights: $\Theta$
2. For each component select mean $\mu_z$ and covariance matrix $\Sigma_z$ .
3. For each document:
- Draw topic label: $z \sim Categorical(\Theta)$
- Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
### Generative Modeling

GMM assumes that the embeddings are generated from a number of Gaussian components according to the stochastic process below (a toy simulation of this process follows the formula).
Priors can optionally be imposed on the model parameters.
The model is fitted using either expectation maximization or variational inference.

### 2. Topic Inference over Documents
??? info "Click to see formula"
1. Select global topic weights: $\Theta$
2. For each component, select a mean $\mu_z$ and a covariance matrix $\Sigma_z$.
3. For each document:
- Draw topic label: $z \sim Categorical(\Theta)$
- Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$

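For intuition, the assumed generative process can be simulated in a few lines of NumPy (an illustrative toy sketch only; dimensions and variable names are made up, and this is not how the model is fitted):

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics, n_dims, n_docs = 3, 16, 100

# 1. Global topic weights
theta = rng.dirichlet(np.ones(n_topics))
# 2. Mean and covariance matrix for each Gaussian component
means = rng.normal(size=(n_topics, n_dims))
covs = np.stack([np.eye(n_dims) * rng.uniform(0.5, 1.5) for _ in range(n_topics)])
# 3. For each document: draw a topic label, then a document embedding
labels = rng.choice(n_topics, size=n_docs, p=theta)
embeddings = np.stack([rng.multivariate_normal(means[z], covs[z]) for z in labels])
```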

### Calculate Topic Probabilities

After the model is fitted, soft topic labels are inferred for each document.
A document-topic matrix ($T$) is built from the likelihood of each component given the document embeddings (a short sketch follows the formula below).

Or in other words for document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
??? info "Click to see formula"
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$

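Continuing the toy example above, the matrix $T$ could be computed like this (a sketch of the formula, not the library's internal code):

```python
import numpy as np
from scipy.stats import multivariate_normal

# T[i, z] = p(rho_i | mu_z, Sigma_z), using the toy means, covs and embeddings from above
T = np.stack(
    [multivariate_normal(means[z], covs[z]).pdf(embeddings) for z in range(n_topics)],
    axis=1,
)  # shape: (n_docs, n_topics)
```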
### 3. Soft c-TF-IDF
### Soft c-TF-IDF

Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
an extension of __c-TF-IDF__ that can be used with continuous labels (a NumPy sketch follows the formula below).

Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
??? info "Click to see formula"

Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:

- Estimate weight of term $j$ for topic $z$: <br>
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
- Estimate inverse document/topic frequency for term $j$:
$idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where
$N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
$Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$
- Estimate weight of term $j$ for topic $z$: <br>
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
$w_{z} = \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
- Estimate inverse document/topic frequency for term $j$:
$idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where
$N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
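
A compact NumPy sketch of these formulas (illustrative only; it assumes a dense document-term matrix `X` and the document-topic matrix `T` from above, while the library may work with sparse matrices):

```python
import numpy as np

def soft_ctf_idf(T: np.ndarray, X: np.ndarray) -> np.ndarray:
    """T: (n_docs, n_topics) soft topic labels, X: (n_docs, n_terms) term counts."""
    t = T.T @ X                                        # t[z, j] = sum_i T[i, z] * X[i, j]
    w = np.abs(T).T @ X.sum(axis=1)                    # w[z] = sum_i |T[i, z]| * sum_j X[i, j]
    tf = t / w[:, None]
    idf = np.log(X.shape[0] / np.abs(t).sum(axis=0))   # log(N / sum_z |t[z, j]|)
    return tf * idf[None, :]                           # Soft-c-TF-IDF[z, j]
```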

### _(Optional)_ 4. Dynamic Modeling
### Dynamic Modeling

GMM is also capable of dynamic topic modeling. This is done by fitting one underlying mixture model over the entire corpus, as we expect that a single semantic model generates the documents.
To gain temporal representations of topics, the corpus is divided into equal or arbitrarily chosen time slices, and term importances are then estimated with Soft c-TF-IDF for each time slice separately.
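
A hedged usage sketch (assuming Turftopic's dynamic modeling interface with a `fit_transform_dynamic()` method and per-document timestamps; the corpus and timestamps here are placeholders):

```python
from datetime import datetime
from turftopic import GMM

corpus: list[str] = ["some string", "etc", ...]           # placeholder documents
timestamps: list[datetime] = [datetime(2020, 1, 1), ...]  # one timestamp per document

model = GMM(10)
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)
model.print_topics_over_time()
```
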
@@ -90,25 +94,6 @@ from sklearn.decomposition import IncrementalPCA
model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
```

## Considerations

### Strengths

- Efficiency, Stability: GMM relies on a rock solid implementation in scikit-learn, you can rest assured that the model will be fast and reliable.
- Coverage of Ingroup Variance: The model is very efficient at describing the extracted topics in all their detail.
This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
- Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
- Dynamic Modeling: You can model changes in topics over time using GMM.

### Weaknesses

- Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
- Assumption of Gaussianity: The model assumes that topics are Gaussian components, it might very well be that this is not the case.
Fortunately enough this rarely effects real-world perceived performance of models, and typically does not present an issue in practical settings.
- Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
- Moderate Robustness to Noise: GMM is similarly sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is way less volatile, and the quality of the results is more reliable than with clustering models using C-TF-IDF.


## API Reference

::: turftopic.models.gmm.GMM