Merge pull request #56 from x-tabdeveloping/hierarchical

Hierarchical Topic Modeling (Divisive)
x-tabdeveloping · Jul 31, 2024 · eb7b00a
2 parents 4ae4047 + 9a48eaa
commit eb7b00a

Showing 15 changed files with 691 additions and 62 deletions.
diff --git a/README.md b/README.md
@@ -9,53 +9,56 @@
    - Semantic Signal Separation - S³ 🧭
    - KeyNMF 🔑 (paper in progress ⏳)
    - GMM :gem: (paper soon)
- - Implementations of existing transformer-based topic models
+ - Implementations of other transformer-based topic models
    - Clustering Topic Models: BERTopic and Top2Vec
    - Autoencoding Topic Models: CombinedTM and ZeroShotTM
+   - FASTopic
  - Streamlined scikit-learn compatible API 🛠️
  - Easy topic interpretation 🔍
  - Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
  - Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️
 
 > This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.
 
-### New in version 0.4.0
+### New in version 0.5.0
 
-#### Online KeyNMF
+#### Hierarchical KeyNMF
 
-You can now online fit and finetune KeyNMF as you wish!
+You can now subdivide topics in KeyNMF at will.
 
 ```python
-from itertools import batched
 from turftopic import KeyNMF
 
-model = KeyNMF(10, top_n=5)
-
-corpus = ["some string", "etc", ...]
-for batch in batched(corpus, 200):
-    batch = list(batch)
-    model.partial_fit(batch)
+model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
+model.hierarchy.divide_children(n_subtopics=3)
+print(model.hierarchy)
 ```
 
-#### $S^3$ Concept Compasses
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b>Root </b><br>
+├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+│   ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
+│   ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
+│   └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
+└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
+.    ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
+.    ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
+.    └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
+</tt>
+</div>
 
-You can now produce a compass of concepts along two semantic axes using $S^3$.
 
-<table>
-  <tr>
-   <td>
-
-```python
-model = SemanticSignalSeparation(10).fit(corpus)
-fig = model.concept_compass(topic_x=1, topic_y=4)
-fig.show()
-```
+#### FASTopic *(Experimental)*
 
-   </td>
-   <td><img src="./docs/images/arxiv_ml_compass.png" width="350" style="margin-left: auto;margin-right: auto;"></td>
-  </tr>
-</table>
+You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.
 
+```python
+from turftopic import FASTopic
+
+model = FASTopic(10).fit(corpus)
+model.print_topics()
+```
 
 ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -180,6 +183,7 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to
 
 ## References
 - Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
+- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
  - Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
  - Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
  - Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974

diff --git a/docs/FASTopic.md b/docs/FASTopic.md
@@ -0,0 +1,15 @@
+# FASTopic
+
+FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.
+
+> Turftopic contains an implementation adapted to our API, but most of the code comes from the [original FASTopic package](https://github.com/BobXWu/FASTopic).
+
+:warning: This part of the documentation is still under construction :warning:
+
+## References
+
+Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
+
+## API Reference
+
+::: turftopic.models.fastopic.FASTopic
diff --git a/docs/KeyNMF.md b/docs/KeyNMF.md
@@ -309,6 +309,47 @@ for batch in batched(zip(corpus, timestamps)):
     model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
 ```
 
+## Hierarchical Topic Modeling
+
+When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.
+
+This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
+In other words:
+
+1. Decompose keyword matrix $M \approx WH$
+2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
+3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
+    1. Initialise $\mathring{H}$ and  $\mathring{W}$ randomly.
+    2. Perform multiplicative updates until convergence. <br>
+        $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
+        $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
+4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
+    1. $\mathring{H}_{ij} = \mathring{H}_{ij} \cdot \ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
+
+To create a hierarchical model, you can use the `hierarchy` property of the model.
+
+```python
+# This divides each of the topics in the model into 3 subtopics.
+model.hierarchy.divide_children(n_subtopics=3)
+print(model.hierarchy)
+```
+
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b>Root </b><br>
+├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+│   ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
+│   ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
+│   └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
+└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
+.    ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
+.    ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
+.    └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
+</tt>
+</div>
+
+For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).
+
 ## Considerations
 
 ### Strengths
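
The weighted NMF steps and the pseudo-c-tf-idf reweighting described in the `docs/KeyNMF.md` hunk above can be sketched in plain NumPy. This is a minimal illustration of the update rules as written in that hunk, not Turftopic's implementation: the function names `weighted_nmf` and `pseudo_ctfidf`, the fixed iteration count, the `eps` term that guards against division by zero, and the alternating order of the two updates are all assumptions made for this example.

```python
import numpy as np


def weighted_nmf(M, w, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate M (docs x terms) as W (docs x k) @ H (k x terms),
    with each document (row of M) weighted by the matching entry of w."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = M.shape
    # Step 3.1: random non-negative initialisation.
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    Mw = M * w[:, None]  # rows of M scaled by document weight
    # Step 3.2: multiplicative updates (fixed number of iterations here,
    # instead of a convergence check; this is an assumption of the sketch).
    for _ in range(n_iter):
        Ww = W * w[:, None]
        # Transposed form of the W^T update in the docs.
        W *= (Mw @ H.T) / (Ww @ (H @ H.T) + eps)
        Ww = W * w[:, None]
        # Transposed form of the H^T update in the docs.
        H *= (W.T @ Mw) / (W.T @ Ww @ H + eps)
    return W, H


def pseudo_ctfidf(H):
    """Step 4: H_ij * ln(1 + A / (1 + sum_k H_kj)), A = mean of H."""
    A = H.mean()
    return H * np.log(1 + A / (1 + H.sum(axis=0)))


# Toy keyword matrix standing in for M (20 documents, 8 terms).
rng = np.random.default_rng(42)
M = rng.random((20, 8))

# Step 1: top-level decomposition (uniform weights reduce to plain NMF).
W, H = weighted_nmf(M, np.ones(20), n_components=2)

# Step 2: document weights are the column of W for the parent topic j = 0.
w = W[:, 0]

# Steps 3 and 4: subtopics of topic 0, then reweighting.
W_sub, H_sub = weighted_nmf(M, w, n_components=3)
H_sub = pseudo_ctfidf(H_sub)
print(H_sub.shape)  # (3, 8)
```

Because the updates are multiplicative and start from non-negative values, `W_sub` and `H_sub` stay non-negative throughout, which is what lets the rows of `H_sub` be read as term importances for each subtopic.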

diff --git a/docs/dynamic.md b/docs/dynamic.md
@@ -77,7 +77,7 @@ model.plot_topics_over_time(top_k=5)
   <figcaption>Topics over time on a Figure</figcaption>
 </figure>
 
-## Interface
+## API reference
 
 All dynamic topic models have a `temporal_components_` attribute, which contains the topic-term matrices for each time slice, along with a `temporal_importance_` attribute, which contains the importance of each topic in each time slice.
 

diff --git a/docs/hierarchical.md b/docs/hierarchical.md
@@ -0,0 +1,152 @@
+# Hierarchical Topic Modeling
+
+> Note: Hierarchical topic modeling in Turftopic is still in its early stages; you can expect more visualization utilities, tools and models in the future :sparkles:
+
+You might expect some topics in your corpus to belong to a hierarchy of topics.
+Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.
+
+## Divisive Hierarchical Modeling
+
+Currently, Turftopic, in contrast with other topic modeling libraries, only allows for hierarchical modeling in a divisive context.
+This means that topics can be divided into subtopics in a **top-down** manner.
+[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
+ but you can manually instruct the model to find subtopics in larger topics.
+
+As a demonstration, let's load a corpus that we know to have hierarchical themes.
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+
+corpus = fetch_20newsgroups(
+    subset="all",
+    remove=("headers", "footers", "quotes"),
+    categories=[
+        "comp.os.ms-windows.misc",
+        "comp.sys.ibm.pc.hardware",
+        "talk.religion.misc",
+        "alt.atheism",
+    ],
+).data
+```
+
+In this case, we have two base themes: **computers** and **religion**.
+Let us fit a KeyNMF model with two topics to see if the model finds these.
+
+```python
+from turftopic import KeyNMF
+
+model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| 0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft |
+| 1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs |
+
+The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion.
+We can already suspect, however, that more granular topics could be discovered in this corpus.
+For instance, Topic 0 contains terms related to operating systems, like *windows* and *dos*, but also components, like *disk* and *card*.
+
+We can access the hierarchy of topics in the model at the current stage, with the model's `hierarchy` property.
+
+```python
+print(model.hierarchy)
+```
+
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b>Root </b><br>
+├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
+</tt>
+</div>
+
+There isn't much to see yet: the model contains a flat hierarchy of the two topics we discovered, and we are at the root level.
+We can dissect these topics by adding a level to the hierarchy.
+
+Let us add 3 subtopics to each topic on the root level.
+
+```python
+model.hierarchy.divide_children(n_subtopics=3)
+```
+
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b>Root </b><br>
+├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+│   ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
+│   ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
+│   └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
+└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
+.    ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
+.    ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
+.    └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
+</tt>
+</div>
+
+As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
+Topic 0 was divided into a topic mostly concerned with DOS and Windows, a topic on operating systems in general, and one about hardware,
+while Topic 1 was divided into a topic about newsgroups, one about atheism, and one about morality and Christianity.
+
+You can also easily access nodes of the hierarchy by indexing it:
+```python
+model.hierarchy[0]
+```
+
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
+├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
+└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
+</tt>
+</div>
+
+You can also divide individual topics into a number of subtopics by using the `divide()` method.
+Let us divide Topic 0.0 into 5 subtopics.
+
+```python
+model.hierarchy[0][0].divide(5)
+model.hierarchy
+```
+
+<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
+<tt style="font-size: 11pt">
+<b>Root </b><br>
+├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
+│   ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
+│   │   ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
+│   │   ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
+│   │   ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
+│   │   ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
+│   │   └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
+│   ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
+│   └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
+└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
+.    ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
+.    ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
+.    └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
+</tt>
+</div>
+
+## Visualization
+You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
+The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.
+
+```python
+model.hierarchy.plot_tree()
+```
+
+<figure>
+  <img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Tree plot of the hierarchy.</figcaption>
+</figure>
+
+
+## API reference
+
+::: turftopic.hierarchical.TopicNode
+
+
+
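To make the indexing and tree printing shown in the `docs/hierarchical.md` hunk above more concrete, here is a small self-contained sketch of a tree with path-style labels. The `Node` class and its `add_children`, `label` and `render` members are hypothetical names invented for this illustration; this is not Turftopic's `TopicNode`, and the subtopic word lists are typed in by hand rather than estimated by a model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a topic node (not Turftopic's TopicNode)."""

    path: tuple = ()
    words: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def __getitem__(self, index: int) -> "Node":
        # Indexing a node returns one of its children, so calls can be chained.
        return self.children[index]

    def add_children(self, word_lists) -> "Node":
        # Each child extends the parent's path with its own position.
        self.children = [
            Node(path=self.path + (i,), words=list(words))
            for i, words in enumerate(word_lists)
        ]
        return self

    @property
    def label(self) -> str:
        if not self.path:
            return "Root"
        return ".".join(map(str, self.path)) + ": " + ", ".join(self.words)

    def render(self) -> str:
        # Draw the subtree with box-drawing characters, one node per line.
        lines = [self.label]
        for i, child in enumerate(self.children):
            last = i == len(self.children) - 1
            branch = "└── " if last else "├── "
            indent = "    " if last else "│   "
            sub = child.render().split("\n")
            lines.append(branch + sub[0])
            lines.extend(indent + line for line in sub[1:])
        return "\n".join(lines)


root = Node().add_children([["windows", "dos"], ["atheism", "religion"]])
root[0].add_children([["dos", "file"], ["os", "unix"], ["card", "drivers"]])
print(root.render())
print(root[0][2].label)  # 0.2: card, drivers
```

Storing the full path on every node keeps labels such as `0.2` stable no matter where in the tree printing starts, which mirrors how the documented output labels nested subtopics.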
diff --git a/docs/images/hierarchy_tree.png b/docs/images/hierarchy_tree.png
diff --git a/docs/index.md b/docs/index.md
@@ -23,14 +23,16 @@ pip install turftopic[pyro-ppl]
 You can use most transformer-based topic models in Turftopic, these include:
 
  - [Semantic Signal Separation - $S^3$](s3.md) :compass:
-- [KeyNMF](KeyNMF.md) :key:
+ - [KeyNMF](KeyNMF.md) :key:
  - [Gaussian Mixture Models (GMM)](gmm.md)
  - [Clustering Topic Models](clustering.md):
     - [BERTopic](clustering.md#bertopic_and_top2vec)
     - [Top2Vec](clustering.md#bertopic_and_top2vec)
  - [Auto-encoding Topic Models](ctm.md):
     - CombinedTM
     - ZeroShotTM
+ - [FASTopic](fastopic.md) :zap:
+
 
 
 ## Basic Usage

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -7,6 +7,7 @@ nav:
     - Using Turftopic: basics.md
     - Dynamic Topic Modeling: dynamic.md
     - Online Topic Modeling: online.md
+    - Hierarchical Topic Modeling: hierarchical.md
     - Model Persistence: persistence.md
   - Models:
     - Model Overview: model_overview.md
@@ -15,6 +16,7 @@ nav:
     - GMM: GMM.md
     - Clustering Models: clustering.md
     - Autoencoding Models: ctm.md
+    - FASTopic: fastopic.md
   - Encoders: encoders.md
 theme:
   name: material

diff --git a/pyproject.toml b/pyproject.toml
@@ -6,7 +6,7 @@ line-length=79
 
 [tool.poetry]
 name = "turftopic"
-version = "0.4.5"
+version = "0.5.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <[email protected]>"]
 license = "MIT"