Skip to content

Commit

Permalink
Merge branch 'main' into dynamic_s3
Browse files Browse the repository at this point in the history
  • Loading branch information
x-tabdeveloping authored Nov 21, 2024
2 parents 3f3b117 + 1da38f1 commit ddb9197
Show file tree
Hide file tree
Showing 16 changed files with 612 additions and 75 deletions.
69 changes: 20 additions & 49 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,74 +5,45 @@


## Features
- Novel transformer-based topic models:
- Implementations of transformer-based topic models:
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑
- GMM :gem: (paper soon)
- Implementations of other transformer-based topic models
- GMM :gem:
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Dynamic, Online and Hierarchical Topic Modeling
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Automated topic naming with LLMs
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.
### New in version 0.7.0
### New in version 0.8.0

#### Component re-estimation, refitting and topic merging
#### Automated Topic Naming

Some models can now easily be modified after being trained in an efficient manner,
without having to recompute all attributes from scratch.
This is especially significant for clustering models and $S^3$.
Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!

```python
from turftopic import SemanticSignalSeparation, ClusteringTopicModel

s3_model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
# Re-estimating term importances
s3_model.estimate_components(feature_importance="angular")
# Refitting S^3 with a different number of topics (very fast)
s3_model.refit(n_components=10, random_seed=42)

clustering_model = ClusteringTopicModel().fit(corpus)
# Reduces number of topics automatically with a given method
clustering_model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
# Merge topics manually
clustering_model.join_topics([0,3,4,5])
# Resets original topics
clustering_model.reset_topics()
# Re-estimates term importances based on a different method
clustering_model.estimate_components(feature_importance="centroid")
```

#### Manual topic naming

You can now manually label topics in all models in Turftopic.

```python
# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
```

#### Saving, loading and publishing to HF Hub

You can now load, save and publish models with dedicated functionality.

```python
from turftopic import load_model
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer

model.to_disk("out_folder/")
model = load_model("out_folder/")
model = KeyNMF(10).fit(corpus)

model.push_to_hub("your_user/model_name")
model = load_model("your_user/model_name")
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| | ... |

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
Expand Down
34 changes: 33 additions & 1 deletion docs/basics.md
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,7 @@ document_topic_matrix = model.transform(new_documents, embeddings=None)
> Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.


## Interpreting Models

Turftopic comes with a number of pretty printing utilities for interpreting the models.
Expand Down Expand Up @@ -236,7 +237,7 @@ latex_table: str = model.export_topics(format="latex")
md_table: str = model.export_representative_documents(0, corpus, document_topic_matrix, format="markdown")
```

### Naming topics
### Manual topic naming

You can manually name topics in Turftopic models after having interpreted them.
If you find a more fitting name for a topic, feel free to rename it in your model.
Expand All @@ -246,8 +247,39 @@ from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(corpus)
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})

```

### Automated topic naming

You can also use large language models, or other NLP techniques to assign human-readable names to topics.
Here is an example of using ChatGPT to generate topic names from the highest ranking keywords.

Read more about namer models [here](namers.md).

```python
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |

### Visualization

Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation is fully compatible with Turftopic models.
Expand Down
9 changes: 9 additions & 0 deletions docs/finetuning.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,15 @@ model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
model.rename_topics([f"Topic {i}" for i in range(10)])
```

You can also automatically name topics with a [topic namer](namers.md) model.

```python
from turftopic.namers import LLMTopicNamer

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)
```

## Changing the number of topics

Multiple models allow you to change the number of topics in a model after fitting them.
Expand Down
136 changes: 136 additions & 0 deletions docs/namers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# Topic Namers

Sometimes, especially when the number of topics grows large,
it might be convenient to assign human-readable names to topics in an automated manner.

Turftopic allows you to accomplish this with a number of different topic namer models.

## Large Language Models

Turftopic lets you utilise Large Language Models for generating human-readable topic names.
This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.

### Running LLMs locally

You can use any LLM from the HuggingFace Hub to generate topic names on your own machine.
The default in Turftopic is [SmolLM](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), due to it's small size and speed, but we recommend using larger LLMs for higher quality topic names, especially in multilingual contexts.

```python
from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

model = KeyNMF(10).fit(corpus)

namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)

model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |

### Using OpenAI's LLMs

You might not have the computational resources to run a high-quality LLM locally.
Luckily Turftopic allows you to use OpenAI's chat models for topic naming too!


!!! info
You will also need to install the `openai` Python package.
```bash
pip install openai
export OPENAI_API_KEY="sk-<your key goes here>"
```

```python
from turftopic.namers import OpenAITopicNamer

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |

### Prompting

Since these namers use chat-finetuned LLMs you can freely define custom prompts for topic name generation:

```python
from turftopic.namers import OpenAITopicNamer

system_prompt = """
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
You only repond briefly with the name of the topic, and nothing else.
"""

prompt_template = """
You will be tasked with naming a topic.
Based on the keywords, create a short label that best summarizes the topics.
Only respond with a short, human readable topic name and nothing else.
The topic is described by the following set of keywords: {keywords}.
"""

namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
```

## N-gram Patterns

You can also name topics based on the semantically closest n-grams from the corpus to the topic descriptions.
This method typically results in lower quality names, but might be good enough for your use case.


```python
from turftopic.namers import NgramTopicNamer

namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
model.rename_topics(namer)
model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |


## API Reference

:::turftopic.namers.base.TopicNamer

:::turftopic.namers.hf_transformers.LLMTopicNamer

:::turftopic.namers.openai.OpenAITopicNamer

:::turftopic.namers.ngram.NgramTopicNamer
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ nav:
- Autoencoding Models: ctm.md
- FASTopic: FASTopic.md
- Encoders: encoders.md
- Namers: namers.md
theme:
name: material
logo: images/logo.svg
Expand Down
4 changes: 3 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ line-length=79

[tool.poetry]
name = "turftopic"
version = "0.7.0"
version = "0.8.1"
description = "Topic modeling with contextual representations from sentence transformers."
authors = ["Márton Kardos <[email protected]>"]
license = "MIT"
Expand All @@ -23,12 +23,14 @@ rich = "^13.6.0"
huggingface-hub = "^0.23.2"
joblib = "^1.2.0"
pyro-ppl = { version = "^1.8.0", optional = true }
openai = { version = "^1.40.0", optional = true }
mkdocs = { version = "^1.5.2", optional = true }
mkdocs-material = { version = "^9.5.12", optional = true }
mkdocstrings = { version = "^0.24.0", extras = ["python"], optional = true }

[tool.poetry.extras]
pyro-ppl = ["pyro-ppl"]
openai = ["openai"]
docs = ["mkdocs", "mkdocs-material", "mkdocstrings"]

[build-system]
Expand Down
Loading

0 comments on commit ddb9197

Please sign in to comment.