diff --git a/docs/basics.md b/docs/basics.md index 66d7f0f..8fd142a 100644 --- a/docs/basics.md +++ b/docs/basics.md @@ -170,6 +170,7 @@ document_topic_matrix = model.transform(new_documents, embeddings=None) > Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background. + ## Interpreting Models Turftopic comes with a number of pretty printing utilities for interpreting the models. @@ -236,7 +237,7 @@ latex_table: str = model.export_topics(format="latex") md_table: str = model.export_representative_documents(0, corpus, document_topic_matrix, format="markdown") ``` -### Naming topics +### Manual topic naming You can manually name topics in Turftopic models after having interpreted them. If you find a more fitting name for a topic, feel free to rename it in your model. @@ -246,8 +247,39 @@ from turftopic import SemanticSignalSeparation model = SemanticSignalSeparation(10).fit(corpus) model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"}) + ``` +### Automated topic naming + +You can also use large language models, or other NLP techniques to assign human-readable names to topics. +Here is an example of using ChatGPT to generate topic names from the highest ranking keywords. + +Read more about namer models [here](namers.md). + +```python +from turftopic import KeyNMF +from turftopic.namers import OpenAITopicNamer + +namer = OpenAITopicNamer("gpt-4o-mini") +model.rename_topics(namer) + +model.print_topics() +``` + +| Topic ID | Topic Name | Highest Ranking | +| - | - | - | +| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | +| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | +| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | +| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | +| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | +| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | +| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | +| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | +| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | +| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | + ### Visualization Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation is fully compatible with Turftopic models. diff --git a/docs/finetuning.md b/docs/finetuning.md index adf7451..79c236a 100644 --- a/docs/finetuning.md +++ b/docs/finetuning.md @@ -19,6 +19,15 @@ model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"}) model.rename_topics([f"Topic {i}" for i in range(10)]) ``` +You can also automatically name topics with a [topic namer](namers.md) model. + +```python +from turftopic.namers import LLMTopicNamer + +namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct") +model.rename_topics(namer) +``` + ## Changing the number of topics Multiple models allow you to change the number of topics in a model after fitting them. diff --git a/docs/namers.md b/docs/namers.md new file mode 100644 index 0000000..dd04c7c --- /dev/null +++ b/docs/namers.md @@ -0,0 +1,136 @@ +# Topic Namers + +Sometimes, especially when the number of topics grows large, +it might be convenient to assign human-readable names to topics in an automated manner. + +Turftopic allows you to accomplish this with a number of different topic namer models. + +## Large Language Models + +Turftopic lets you utilise Large Language Models for generating human-readable topic names. +This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic. + +### Running LLMs locally + +You can use any LLM from the HuggingFace Hub to generate topic names on your own machine. +The default in Turftopic is [SmolLM](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), due to it's small size and speed, but we recommend using larger LLMs for higher quality topic names, especially in multilingual contexts. + +```python +from turftopic import KeyNMF +from turftopic.namers import LLMTopicNamer + +model = KeyNMF(10).fit(corpus) + +namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct") +model.rename_topics(namer) + +model.print_topics() +``` + +| Topic ID | Topic Name | Highest Ranking | +| - | - | - | +| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | +| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | +| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | +| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | +| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | +| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | +| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | +| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | +| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | +| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | + +### Using OpenAI's LLMs + +You might not have the computational resources to run a high-quality LLM locally. +Luckily Turftopic allows you to use OpenAI's chat models for topic naming too! + + +!!! info + You will also need to install the `openai` Python package. + ```bash + pip install openai + export OPENAI_API_KEY="sk-" + ``` + +```python +from turftopic.namers import OpenAITopicNamer + +namer = OpenAITopicNamer("gpt-4o-mini") +model.rename_topics(namer) +model.print_topics() +``` + +| Topic ID | Topic Name | Highest Ranking | +| - | - | - | +| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | +| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | +| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | +| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | +| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | +| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | +| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | +| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | +| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | +| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | + +### Prompting + +Since these namers use chat-finetuned LLMs you can freely define custom prompts for topic name generation: + +```python +from turftopic.namers import OpenAITopicNamer + +system_prompt = """ +You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe. +You only repond briefly with the name of the topic, and nothing else. +""" + +prompt_template = """ +You will be tasked with naming a topic. +Based on the keywords, create a short label that best summarizes the topics. +Only respond with a short, human readable topic name and nothing else. + +The topic is described by the following set of keywords: {keywords}. +""" + +namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt) +``` + +## N-gram Patterns + +You can also name topics based on the semantically closest n-grams from the corpus to the topic descriptions. +This method typically results in lower quality names, but might be good enough for your use case. + + +```python +from turftopic.namers import NgramTopicNamer + +namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2") +model.rename_topics(namer) +model.print_topics() +``` + +| Topic ID | Topic Name | Highest Ranking | +| - | - | - | +| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | +| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | +| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | +| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | +| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | +| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | +| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | +| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | +| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | +| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | + + +## API Reference + +:::turftopic.namers.base.TopicNamer + +:::turftopic.namers.hf_transformers.LLMTopicNamer + +:::turftopic.namers.openai.OpenAITopicNamer + +:::turftopic.namers.ngram.NgramTopicNamer diff --git a/mkdocs.yml b/mkdocs.yml index 29b01ff..d49e4df 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -19,6 +19,7 @@ nav: - Autoencoding Models: ctm.md - FASTopic: FASTopic.md - Encoders: encoders.md + - Namers: namers.md theme: name: material logo: images/logo.svg