Merge pull request #56 from x-tabdeveloping/hierarchical
Hierarchical Topic Modeling (Divisive)
x-tabdeveloping authored Jul 31, 2024
2 parents 4ae4047 + 9a48eaa commit eb7b00a
Showing 15 changed files with 691 additions and 62 deletions.
56 changes: 30 additions & 26 deletions README.md
@@ -9,53 +9,56 @@
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑 (paper in progress ⏳)
- GMM :gem: (paper soon)
- Implementations of existing transformer-based topic models
- Implementations of other transformer-based topic models
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.
### New in version 0.4.0
### New in version 0.5.0

#### Online KeyNMF
#### Hierarchical KeyNMF

You can now online fit and finetune KeyNMF as you wish!
You can now subdivide topics in KeyNMF at will.

```python
from itertools import batched
from turftopic import KeyNMF

model = KeyNMF(10, top_n=5)

corpus = ["some string", "etc", ...]
for batch in batched(corpus, 200):
    batch = list(batch)
    model.partial_fit(batch)
model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

#### $S^3$ Concept Compasses
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

You can now produce a compass of concepts along two semantic axes using $S^3$.

<table>
<tr>
<td>

```python
model = SemanticSignalSeparation(10).fit(corpus)
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```
#### FASTopic *(Experimental)*

</td>
<td><img src="./docs/images/arxiv_ml_compass.png" width="350" style="margin-left: auto;margin-right: auto;"></td>
</tr>
</table>
You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.

```python
from turftopic import FASTopic

model = FASTopic(10).fit(corpus)
model.print_topics()
```

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -180,6 +183,7 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to

## References
- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
- Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
15 changes: 15 additions & 0 deletions docs/FASTopic.md
@@ -0,0 +1,15 @@
# FASTopic

FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.

> Turftopic contains an implementation repurposed for our API, but the implementation is mostly from the [original FASTopic package](https://github.com/BobXWu/FASTopic).

:warning: This part of the documentation is still under construction :warning:

## References

Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.

## API Reference

::: turftopic.models.fastopic.FASTopic
41 changes: 41 additions & 0 deletions docs/KeyNMF.md
@@ -309,6 +309,47 @@ for batch in batched(zip(corpus, timestamps)):
model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

## Hierarchical Topic Modeling

When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.

This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
In other words:

1. Decompose keyword matrix $M \approx WH$
2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
    1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
    2. Perform multiplicative updates until convergence. <br>
    $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
    $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
    1. $\mathring{H}_{ij} = \mathring{H}_{ij} \odot \ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
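The steps above can be sketched in NumPy as follows. This is a minimal illustration of the update rules as they are written here, not Turftopic's actual implementation: the function names `weighted_nmf` and `pseudo_ctfidf`, the fixed iteration count (used in place of a convergence check), the small `eps` added to the denominators, and the random toy matrix are all assumptions made for the example.

```python
import numpy as np


def weighted_nmf(M, w, n_components, n_iter=200, eps=1e-10, seed=0):
    """Sketch of wNMF: M (documents x terms) ~ W @ H, with document weights w."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = M.shape
    # Step 3.1: random non-negative initialisation
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    w_col = w[:, None]  # one weight per document (row)
    # Step 3.2: multiplicative updates (fixed number of iterations here)
    for _ in range(n_iter):
        # Transposed form of the first update rule
        W *= ((w_col * M) @ H.T) / ((w_col * W) @ H @ H.T + eps)
        # Transposed form of the second update rule
        H *= (W.T @ (w_col * M)) / (W.T @ (w_col * W) @ H + eps)
    return W, H


def pseudo_ctfidf(H):
    """Step 4: down-weight terms that score high in every subcomponent."""
    A = H.mean()  # average of all elements in H
    return H * np.log(1 + A / (1 + H.sum(axis=0)))


# Toy keyword matrix standing in for M (20 documents, 12 terms)
rng = np.random.default_rng(42)
M = rng.random((20, 12))

# Step 1: plain NMF is the special case where all weights are 1
W_parent, H_parent = weighted_nmf(M, np.ones(20), n_components=2)
# Step 2: document weights are column j of W (here j = 0)
w = W_parent[:, 0]
# Steps 3 and 4: subcomponents of topic 0
W_sub, H_sub = weighted_nmf(M, w, n_components=3)
H_sub = pseudo_ctfidf(H_sub)
```

Documents that score close to zero on the parent topic contribute almost nothing to the weighted reconstruction error, so the subcomponents are driven by the documents that belong to the parent topic.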

To create a hierarchical model, you can use the `hierarchy` property of the model.

```python
# This divides each of the topics in the model into 3 subtopics.
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

For a detailed tutorial on hierarchical modeling, see [Hierarchical Topic Modeling](hierarchical.md).

## Considerations

### Strengths
2 changes: 1 addition & 1 deletion docs/dynamic.md
@@ -77,7 +77,7 @@ model.plot_topics_over_time(top_k=5)
<figcaption>Topics over time on a Figure</figcaption>
</figure>

## Interface
## API reference

All dynamic topic models have a `temporal_components_` attribute, which contains the topic-term matrices for each time slice, along with a `temporal_importance_` attribute, which contains the importance of each topic in each time slice.
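As a rough illustration of how these two attributes can be read, the sketch below uses random arrays in place of a fitted model. The shapes `(n_slices, n_topics, n_terms)` for `temporal_components_` and `(n_slices, n_topics)` for `temporal_importance_`, the toy vocabulary, and the helper `top_terms` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["disk", "god", "windows", "bible", "driver", "faith"])
n_slices, n_topics = 3, 2

# Random stand-ins for model.temporal_components_ and model.temporal_importance_
temporal_components = rng.random((n_slices, n_topics, len(vocab)))
temporal_importance = rng.random((n_slices, n_topics))


def top_terms(components, time_slice, topic, k=3):
    """Return the k highest-weighted terms of a topic in one time slice."""
    order = np.argsort(-components[time_slice, topic])[:k]
    return vocab[order].tolist()


for t in range(n_slices):
    # The most important topic in this time slice and its top terms
    dominant = int(temporal_importance[t].argmax())
    print(t, dominant, top_terms(temporal_components, t, dominant))
```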

152 changes: 152 additions & 0 deletions docs/hierarchical.md
@@ -0,0 +1,152 @@
# Hierarchical Topic Modeling

> Note: Hierarchical topic modeling in Turftopic is still in its early stages; you can expect more visualization utilities, tools and models in the future :sparkles:

You might expect some topics in your corpus to belong to a hierarchy of topics.
Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.

## Divisive Hierarchical Modeling

Currently, Turftopic, in contrast with other topic modeling libraries, only supports hierarchical modeling in a divisive context.
This means that topics can be divided into subtopics in a **top-down** manner.
[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
but you can manually instruct the model to find subtopics in larger topics.

As a demonstration, let's load a corpus that we know to have hierarchical themes.

```python
from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data
```

In this case, we have two base themes: **computers** and **religion**.
Let us fit a KeyNMF model with two topics to see if the model finds these.

```python
from turftopic import KeyNMF

model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft |
| 1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs |

The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion.
We can already suspect, however, that more granular topics could be discovered in this corpus.
For instance, Topic 0 contains terms related to operating systems, like *windows* and *dos*, but also to hardware components, like *disk* and *card*.

We can access the hierarchy of topics in the model at the current stage with the model's `hierarchy` property.

```python
print(model.hierarchy)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
</tt>
</div>

There isn't much to see yet: the model contains a flat hierarchy of the two topics we discovered, and we are at the root level.
We can dissect these topics by adding a level to the hierarchy.

Let us add 3 subtopics to each topic on the root level.

```python
model.hierarchy.divide_children(n_subtopics=3)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 was divided into a topic mostly concerned with DOS and Windows, a topic on operating systems in general, and one about hardware,
while Topic 1 contains a topic about newsgroups, one about atheism, and one about morality and Christianity.

You can also easily access nodes of the hierarchy by indexing it:
```python
model.hierarchy[0]
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
</tt>
</div>

You can also divide individual topics into a number of subtopics by using the `divide()` method.
Let us divide Topic 0.0 into 5 subtopics.

```python
model.hierarchy[0][0].divide(5)
model.hierarchy
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ │ ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
│ │ ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
│ │ ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
│ │ ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
│ │ └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>
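To make the top-down idea and the indexing pattern above more concrete, here is a toy tree in plain Python. `ToyNode` and its methods are hypothetical and are not Turftopic's `TopicNode`. The split simply deals words out round-robin, whereas a real divisive model estimates subtopics from the documents of the parent topic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical topic node for illustration; NOT Turftopic's TopicNode."""

    path: tuple = ()
    words: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def divide(self, n_subtopics):
        # Toy split: deal this node's words round-robin into n_subtopics groups.
        # A real divisive model would estimate subtopics from the data instead.
        self.children = [
            ToyNode(self.path + (i,), self.words[i::n_subtopics])
            for i in range(n_subtopics)
        ]

    def __getitem__(self, index):
        # node[0][1] walks down the tree one level per index.
        return self.children[index]

    def label(self):
        # Join the path into a dotted label; the empty path is the root.
        return ".".join(str(i) for i in self.path) or "Root"

    def render(self, depth=0):
        # One line per node, indented by its depth in the tree.
        lines = ["    " * depth + f"{self.label()}: {', '.join(self.words)}"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


root = ToyNode(children=[
    ToyNode((0,), ["windows", "dos", "disk", "card", "file", "driver"]),
    ToyNode((1,), ["atheism", "religion", "god", "bible"]),
])
root[0].divide(3)     # split one topic into three subtopics
root[0][0].divide(2)  # then split one of those subtopics again
print("\n".join(root.render()))
```

Only the nodes that are explicitly divided gain children, which is what distinguishes this top-down approach from methods that build the whole hierarchy at once.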

## Visualization
You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.

```python
model.hierarchy.plot_tree()
```

<figure>
<img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
<figcaption>Tree plot of the hierarchy.</figcaption>
</figure>


## API reference

::: turftopic.hierarchical.TopicNode



Binary file added docs/images/hierarchy_tree.png
4 changes: 3 additions & 1 deletion docs/index.md
@@ -23,14 +23,16 @@ pip install turftopic[pyro-ppl]
You can use most transformer-based topic models in Turftopic, these include:

- [Semantic Signal Separation - $S^3$](s3.md) :compass:
- [KeyNMF](KeyNMF.md) :key:
- [KeyNMF](KeyNMF.md) :key:
- [Gaussian Mixture Models (GMM)](gmm.md)
- [Clustering Topic Models](clustering.md):
- [BERTopic](clustering.md#bertopic_and_top2vec)
- [Top2Vec](clustering.md#bertopic_and_top2vec)
- [Auto-encoding Topic Models](ctm.md):
- CombinedTM
- ZeroShotTM
- [FASTopic](fastopic.md) :zap:



## Basic Usage
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -7,6 +7,7 @@ nav:
- Using Turftopic: basics.md
- Dynamic Topic Modeling: dynamic.md
- Online Topic Modeling: online.md
- Hierarchical Topic Modeling: hierarchical.md
- Model Persistence: persistence.md
- Models:
- Model Overview: model_overview.md
@@ -15,6 +16,7 @@ nav:
- GMM: GMM.md
- Clustering Models: clustering.md
- Autoencoding Models: ctm.md
- FASTopic: fastopic.md
- Encoders: encoders.md
theme:
name: material
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ line-length=79

[tool.poetry]
name = "turftopic"
version = "0.4.5"
version = "0.5.0"
description = "Topic modeling with contextual representations from sentence transformers."
authors = ["Márton Kardos <[email protected]>"]
license = "MIT"