
Commit

Merge branch 'llm_naming' of https://github.com/x-tabdeveloping/turftopic into llm_naming
x-tabdeveloping committed Nov 4, 2024
2 parents 75e5302 + 619c796 commit 0951473
Showing 37 changed files with 2,527 additions and 766 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/documentation.yml
@@ -23,12 +23,12 @@ jobs:
- name: Dependencies
run: |
python -m pip install --upgrade pip
pip install "turftopic[pyro-ppl,docs]"
pip install "turftopic[pyro-ppl]" "griffe" "mkdocstrings[python]" "mkdocs" "mkdocs-material"
- name: Build and Deploy
if: github.event_name == 'push'
run: mkdocs gh-deploy --force

- name: Build
if: github.event_name == 'pull_request'
run: mkdocs build
run: mkdocs build
74 changes: 46 additions & 28 deletions README.md
@@ -7,54 +7,71 @@
## Features
- Novel transformer-based topic models:
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑 (paper in progress ⏳)
- KeyNMF 🔑
- GMM :gem: (paper soon)
- Implementations of existing transformer-based topic models
- Implementations of other transformer-based topic models
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.
### New in version 0.4.0
### New in version 0.7.0

#### Online KeyNMF
#### Component re-estimation, refitting and topic merging

You can now online fit and finetune KeyNMF as you wish!
Some models can now be modified efficiently after training,
without having to recompute all attributes from scratch.
This is especially significant for clustering models and $S^3$.

```python
from itertools import batched
from turftopic import KeyNMF
from turftopic import SemanticSignalSeparation, ClusteringTopicModel

s3_model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
# Re-estimating term importances
s3_model.estimate_components(feature_importance="angular")
# Refitting S^3 with a different number of topics (very fast)
s3_model.refit(n_components=10, random_seed=42)

clustering_model = ClusteringTopicModel().fit(corpus)
# Reduces number of topics automatically with a given method
clustering_model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
# Merge topics manually
clustering_model.join_topics([0,3,4,5])
# Resets original topics
clustering_model.reset_topics()
# Re-estimates term importances based on a different method
clustering_model.estimate_components(feature_importance="centroid")
```

model = KeyNMF(10, top_n=5)
#### Manual topic naming

corpus = ["some string", "etc", ...]
for batch in batched(corpus, 200):
batch = list(batch)
model.partial_fit(batch)
You can now manually label topics in all models in Turftopic.

```python
# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
```

#### $S^3$ Concept Compasses
#### Saving, loading and publishing to HF Hub

You can now produce a compass of concepts along two semantic axes using $S^3$.
You can now load, save and publish models with dedicated functionality.

<table>
<tr>
<td>

```python
model = SemanticSignalSeparation(10).fit(corpus)
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```
from turftopic import load_model

model.to_disk("out_folder/")
model = load_model("out_folder/")

</td>
<td><img src="./docs/images/arxiv_ml_compass.png" width="350" style="margin-left: auto;margin-right: auto;"></td>
</tr>
</table>
model.push_to_hub("your_user/model_name")
model = load_model("your_user/model_name")
```


## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
@@ -180,8 +197,9 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to

## References
- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
- Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020, August 19). Top2Vec: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European
- Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
- Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791
15 changes: 15 additions & 0 deletions docs/FASTopic.md
@@ -0,0 +1,15 @@
# FASTopic

FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.

> Turftopic contains an implementation adapted to our API; the code is mostly from the [original FASTopic package](https://github.com/BobXWu/FASTopic).
:warning: This part of the documentation is still under construction :warning:
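
A minimal usage sketch (the constructor argument, `fit_transform()` and `print_topics()` follow the scikit-learn-style API used by other Turftopic models; the example corpus is made up):

```python
from turftopic import FASTopic

# A tiny illustrative corpus; in practice you would pass your own documents
corpus = [
    "The government should invest more in renewable energy.",
    "The new graphics card renders scenes much faster.",
    "Solar panels have become a lot cheaper over the last decade.",
]

model = FASTopic(n_components=3)
doc_topic_matrix = model.fit_transform(corpus)  # soft topic proportions per document
model.print_topics()  # highest-ranking terms for each topic
```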

## References

Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.

## API Reference

::: turftopic.models.fastopic.FASTopic
75 changes: 30 additions & 45 deletions docs/GMM.md
@@ -1,4 +1,4 @@
# GMM
# GMM (Gaussian Mixture Model)

GMM is a generative probabilistic model over the contextual embeddings.
The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -9,47 +9,51 @@ These Gaussian components are assumed to be the topics.
<figcaption>Components of a Gaussian Mixture Model <br>(figure from scikit-learn documentation)</figcaption>
</figure>

## The Model
## How does GMM work?

### 1. Generative Modeling

GMM assumes that the embeddings are generated according to the following stochastic process:

1. Select global topic weights: $\Theta$
2. For each component select mean $\mu_z$ and covariance matrix $\Sigma_z$ .
3. For each document:
- Draw topic label: $z \sim Categorical(\Theta)$
- Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
### Generative Modeling

GMM assumes that the embeddings are generated from a number of Gaussian components according to the stochastic process below (a toy simulation of this process follows the formula).
Priors can optionally be imposed on the model parameters.
The model is fitted using either expectation maximization or variational inference.

### 2. Topic Inference over Documents
??? info "Click to see formula"
1. Select global topic weights: $\Theta$
2. For each component, select a mean $\mu_z$ and a covariance matrix $\Sigma_z$.
3. For each document:
- Draw topic label: $z \sim Categorical(\Theta)$
- Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$

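For intuition, the assumed generative process can be simulated in a few lines of NumPy (an illustrative toy sketch only; dimensions and variable names are made up, and this is not how the model is fitted):

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics, n_dims, n_docs = 3, 16, 100

# 1. Global topic weights
theta = rng.dirichlet(np.ones(n_topics))
# 2. Mean and covariance matrix for each Gaussian component
means = rng.normal(size=(n_topics, n_dims))
covs = np.stack([np.eye(n_dims) * rng.uniform(0.5, 1.5) for _ in range(n_topics)])
# 3. For each document: draw a topic label, then a document embedding
labels = rng.choice(n_topics, size=n_docs, p=theta)
embeddings = np.stack([rng.multivariate_normal(means[z], covs[z]) for z in labels])
```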

### Calculate Topic Probabilities

After the model is fitted, soft topic labels are inferred for each document.
A document-topic matrix ($T$) is built from the likelihood of each component given the document embeddings (a short sketch follows the formula below).

Or in other words for document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
??? info "Click to see formula"
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$

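Continuing the toy example above, the matrix $T$ could be computed like this (a sketch of the formula, not the library's internal code):

```python
import numpy as np
from scipy.stats import multivariate_normal

# T[i, z] = p(rho_i | mu_z, Sigma_z), using the toy means, covs and embeddings from above
T = np.stack(
    [multivariate_normal(means[z], covs[z]).pdf(embeddings) for z in range(n_topics)],
    axis=1,
)  # shape: (n_docs, n_topics)
```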
### 3. Soft c-TF-IDF
### Soft c-TF-IDF

Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
an extension of __c-TF-IDF__ that can be used with continuous labels (a NumPy sketch follows the formula below).

Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
??? info "Click to see formula"

Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:

- Estimate weight of term $j$ for topic $z$: <br>
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
- Estimate inverse document/topic frequency for term $j$:
$idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where
$N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
$Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$
- Estimate weight of term $j$ for topic $z$: <br>
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
$w_{z} = \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
- Estimate inverse document/topic frequency for term $j$:
$idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where
$N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
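
A compact NumPy sketch of these formulas (illustrative only; it assumes a dense document-term matrix `X` and the document-topic matrix `T` from above, while the library may work with sparse matrices):

```python
import numpy as np

def soft_ctf_idf(T: np.ndarray, X: np.ndarray) -> np.ndarray:
    """T: (n_docs, n_topics) soft topic labels, X: (n_docs, n_terms) term counts."""
    t = T.T @ X                                        # t[z, j] = sum_i T[i, z] * X[i, j]
    w = np.abs(T).T @ X.sum(axis=1)                    # w[z] = sum_i |T[i, z]| * sum_j X[i, j]
    tf = t / w[:, None]
    idf = np.log(X.shape[0] / np.abs(t).sum(axis=0))   # log(N / sum_z |t[z, j]|)
    return tf * idf[None, :]                           # Soft-c-TF-IDF[z, j]
```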

### _(Optional)_ 4. Dynamic Modeling
### Dynamic Modeling

GMM is also capable of dynamic topic modeling. This is done by fitting one underlying mixture model over the entire corpus, as we expect that a single semantic model generates the documents.
To gain temporal representations of topics, the corpus is divided into equal or arbitrarily chosen time slices, and term importances are then estimated with Soft c-TF-IDF for each time slice separately.
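
A hedged usage sketch (assuming Turftopic's dynamic modeling interface with a `fit_transform_dynamic()` method and per-document timestamps; the corpus and timestamps here are placeholders):

```python
from datetime import datetime
from turftopic import GMM

corpus: list[str] = ["some string", "etc", ...]           # placeholder documents
timestamps: list[datetime] = [datetime(2020, 1, 1), ...]  # one timestamp per document

model = GMM(10)
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)
model.print_topics_over_time()
```
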
@@ -90,25 +94,6 @@ from sklearn.decomposition import IncrementalPCA
model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
```

## Considerations

### Strengths

- Efficiency, Stability: GMM relies on a rock solid implementation in scikit-learn, you can rest assured that the model will be fast and reliable.
- Coverage of Ingroup Variance: The model is very efficient at describing the extracted topics in all their detail.
This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
- Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
- Dynamic Modeling: You can model changes in topics over time using GMM.

### Weaknesses

- Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
- Assumption of Gaussianity: The model assumes that topics are Gaussian components, it might very well be that this is not the case.
Fortunately enough this rarely effects real-world perceived performance of models, and typically does not present an issue in practical settings.
- Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
- Moderate Robustness to Noise: GMM is similarly sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is way less volatile, and the quality of the results is more reliable than with clustering models using C-TF-IDF.


## API Reference

::: turftopic.models.gmm.GMM