Dynamic $S^3$ #72

Merged
merged 10 commits into from
Nov 25, 2024
47 changes: 30 additions & 17 deletions README.md
@@ -20,30 +20,20 @@

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening relevant issues.

### New in version 0.8.0
### New in version 0.9.0

#### Automated Topic Naming

Turftopic now allows you to automatically assign human-readable names to topics using LLMs or n-gram retrieval!
#### Dynamic S³ 🧭

You can now use Semantic Signal Separation in a dynamic fashion.
This allows you to investigate how semantic axes fluctuate over time, and how their content changes.
```python
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer
from turftopic import SemanticSignalSeparation

model = KeyNMF(10).fit(corpus)
model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
model.plot_topics_over_time()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| | ... |

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -143,6 +133,29 @@ model.print_topic_distribution(

</center>

#### Automated Topic Naming

Turftopic now allows you to automatically assign human-readable names to topics using LLMs or n-gram retrieval!

```python
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer

model = KeyNMF(10).fit(corpus)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
```

| Topic ID | Topic Name | Highest Ranking |
| - | - | - |
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| | ... |

### Visualization

Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
6 changes: 3 additions & 3 deletions docs/KeyNMF.md
@@ -221,12 +221,12 @@ pip install plotly
```

```python
model.plot_topics_over_time(top_k=5)
model.plot_topics_over_time()
```

<figure>
<img src="../images/dynamic_keynmf.png" width="50%" style="margin-left: auto;margin-right: auto;">
<figcaption>Topics over time on a Figure</figcaption>
<iframe src="../images/dynamic_keynmf.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
</figure>

### Online Topic Modeling
9 changes: 5 additions & 4 deletions docs/dynamic.md
Expand Up @@ -11,14 +11,15 @@ In Turftopic you can currently use three different topic models for modeling top
1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
2. [GMM](GMM.md), where, similarly to clustering models, term importances are re-estimated per time slice.
3. [KeyNMF](KeyNMF.md), where an overall decomposition is done, and then topic-term matrices are recalculated using coordinate descent, based on document-topic importances in the given time slice.
4. [SemanticSignalSeparation](s3.md), a global model is fitted and then local models are inferred using linear regression from embeddings and document-topic signals in a given time-slice.

## Usage

Dynamic topic models in Turftopic have a unified interface.
To fit a dynamic topic model, you will need a corpus that has been annotated with timestamps.
The timestamps need to be Python `datetime` objects, but pandas `Timestamp` objects are also supported.

Models that have dynamic modeling capabilities (`KeyNMF`, `GMM` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method, that fits the model on the corpus over time.
Models that have dynamic modeling capabilities (`KeyNMF`, `GMM`, `SemanticSignalSeparation` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method that fits the model on the corpus over time.
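
To make the idea of time slices concrete, here is a minimal sketch of how timestamps could be mapped to bin indices. It assumes equal-width bins between the earliest and latest timestamp; `assign_bins` is a hypothetical helper written for illustration, not part of Turftopic, and the library's own binning may differ.

```python
from datetime import datetime


def assign_bins(timestamps: list[datetime], n_bins: int) -> list[int]:
    """Map each timestamp to an equal-width bin index in [0, n_bins - 1].

    Hypothetical helper for illustration only, not Turftopic's implementation.
    """
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / n_bins  # a timedelta; zero if all timestamps are equal
    bin_ids = []
    for t in timestamps:
        # timedelta / timedelta yields a float number of bin widths
        idx = int((t - start) / width) if width else 0
        # The latest timestamp would land on the right edge, so clamp it
        bin_ids.append(min(idx, n_bins - 1))
    return bin_ids


example_ts = [datetime(2018, 1, 1), datetime(2018, 7, 1), datetime(2019, 1, 1)]
example_bins = assign_bins(example_ts, n_bins=2)
```

Each document then belongs to exactly one time slice, and per-slice quantities such as term importances can be estimated from the documents sharing a bin index.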

```python
from datetime import datetime
@@ -69,12 +70,12 @@ pip install plotly
```

```python
model.plot_topics_over_time(top_k=5)
model.plot_topics_over_time()
```

<figure>
<img src="../images/dynamic_keynmf.png" width="80%" style="margin-left: auto;margin-right: auto;">
<figcaption>Topics over time on a Figure</figcaption>
<iframe src="../images/dynamic_keynmf.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
</figure>

## API reference
14 changes: 14 additions & 0 deletions docs/images/dynamic_keynmf.html

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions docs/images/dynamic_s3.html

Large diffs are not rendered by default.

Binary file added docs/images/dynamic_s3.png
33 changes: 33 additions & 0 deletions docs/s3.md
@@ -56,6 +56,39 @@ Based on our evaluations, however, we recommend that you use axial or combined t
Axial topics tend to result in the most coherent topics, while angular topics result in the most distinct ones.
The combined approach is a reasonable compromise between the two methods, and is thus the default.

### Dynamic Topic Modeling *(Optional)*

$S^3$ can also be used as a dynamic topic model.
Temporally changing components are found using the following steps:

1. Fit a global $S^3$ model over the whole corpus.
2. Estimate an unmixing matrix for each time slice by fitting a linear regression from the embeddings in that time slice to the corresponding rows of the document-topic matrix estimated by the global model.
3. Estimate term importances for each time slice the same way as the global model.

```python
from datetime import datetime
from turftopic import SemanticSignalSeparation

ts: list[datetime] = [datetime(year=2018, month=2, day=12), ...]
corpus: list[str] = ["First document", ...]

model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
model.plot_topics_over_time()
```
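
Step 2 of the procedure above can be sketched with plain NumPy. This is an illustration under stated assumptions rather than Turftopic's actual implementation: `unmixing_per_slice` is a hypothetical helper, the regression is an ordinary least-squares fit without an intercept, and the embeddings, document-topic signals and bin indices are synthetic.

```python
import numpy as np


def unmixing_per_slice(embeddings, doc_topic, bin_ids):
    """Fit one least-squares map from embeddings to topic signals per time slice.

    Hypothetical helper for illustration only; the no-intercept least-squares
    fit is an assumption, not necessarily what the library does.
    """
    result = {}
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        # Solve embeddings[mask] @ W ~ doc_topic[mask] for W
        W, *_ = np.linalg.lstsq(embeddings[mask], doc_topic[mask], rcond=None)
        result[int(b)] = W  # shape: (embedding_dim, n_topics)
    return result


# Synthetic stand-ins for document embeddings and global document-topic signals
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8))
global_unmixing = rng.normal(size=(8, 3))
doc_topic = embeddings @ global_unmixing
bin_ids = np.repeat(np.arange(4), 50)  # four time slices of 50 documents each

per_slice = unmixing_per_slice(embeddings, doc_topic, bin_ids)
```

In this synthetic setup the signals are an exact linear function of the embeddings, so every slice recovers the same matrix. On real data the per-slice matrices would differ, which is what lets the components change over time.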

!!! info
    Topics over time in $S^3$ are treated slightly differently to most other models.
    This is because topics are not proportional in $S^3$, and can tip below zero.
    In the time slices where a topic is below zero, its **negative definition** is displayed.



<figure>
<iframe src="../images/dynamic_s3.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a dynamic Semantic Signal Separation model. </figcaption>
</figure>
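
The sign-dependent display described in the note above can be sketched as follows. This assumes that a topic's negative definition consists of its lowest-ranking terms and its positive definition of its highest-ranking terms; `describe_topic`, the vocabulary and the importance values are made up for illustration and are not part of Turftopic.

```python
import numpy as np


def describe_topic(term_importance, vocab, signal, top_k=3):
    """Return the terms describing a topic in one time slice.

    Hypothetical helper: uses the highest-ranking terms when the topic's
    signal is non-negative, and the lowest-ranking terms otherwise.
    """
    order = np.argsort(term_importance)  # ascending by importance
    if signal >= 0:
        idx = order[::-1][:top_k]  # positive definition
    else:
        idx = order[:top_k]  # negative definition
    return [vocab[i] for i in idx]


vocab = ["cat", "dog", "car", "bus"]
importance = np.array([0.9, 0.7, -0.8, -0.6])

positive_terms = describe_topic(importance, vocab, signal=0.4, top_k=2)
negative_terms = describe_topic(importance, vocab, signal=-0.4, top_k=2)
```

With these toy values the same axis is described by animal terms when its signal is positive and by vehicle terms when it dips below zero.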


## Model Refitting

Unlike most other models in Turftopic, $S^3$ can be refit using different parameters and random seeds without needing to initialize the model from scratch.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ line-length=79

[tool.poetry]
name = "turftopic"
version = "0.8.1"
version = "0.9.0"
description = "Topic modeling with contextual representations from sentence transformers."
authors = ["Márton Kardos <[email protected]>"]
license = "MIT"