Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add dynamic topic modeling to clustering models #28

Merged
merged 9 commits into from
Mar 21, 2024

Conversation

rbroc
Copy link
Collaborator

@rbroc rbroc commented Mar 15, 2024

fixes #11

Need to:

  • add fit_transform_dynamic to ClusteringTopicModel
  • add tests for dynamic models

Still WIP and not tested, need to discuss what to do to compute temporal components when feature importance is based on distance from topic centroid, i.e.:

  • are topic centroids computed based on all documents, or only documents present in a given time bin?
  • what to do with terms that do not occur in the documents? set similarity to 0 or -1 for terms that do not occur within a given time bin, or use all terms?

@rbroc rbroc force-pushed the dynamic-clustering branch from cb3b2af to 17878fe Compare March 20, 2024 09:07
@rbroc rbroc requested a review from x-tabdeveloping March 20, 2024 09:34
@rbroc
Copy link
Collaborator Author

rbroc commented Mar 20, 2024

This might be ready for a first review.

Couple of notes when feature_importance='centroid':

  • We decided to compute topic centroids based only on documents present in a time slice. When no documents are present, the resulting vectors is all nans. For the purpose of computing feature importances, vectors are set to zero in a slightly hacky way because sklearn.metrics.pairwise_distances does not take non-finite values with predefined metrics;
  • In the topic-term matrix, I set values for both topics for which no documents are available and terms that are not present in that time slice to NaN. We might want to discuss whether this is what we want, or whether zeros are better (perhaps for consistency with other feature_importance options)

Docs deployment is failing, but this is a separate issue #30.

@rbroc
Copy link
Collaborator Author

rbroc commented Mar 20, 2024

oh also there's a few redundant commits because I accidentally pushed some unrelated stuff, let me know if you'd rather have a cleaner PR

Copy link
Owner

@x-tabdeveloping x-tabdeveloping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pwetty nice. Couple of comments, but feel free to merge if you feel so inclined.

turftopic/models/decomp.py Outdated Show resolved Hide resolved
turftopic/models/cluster.py Outdated Show resolved Hide resolved
tests/test_integration.py Outdated Show resolved Hide resolved
tests/test_integration.py Show resolved Hide resolved
turftopic/models/decomp.py Outdated Show resolved Hide resolved
@rbroc rbroc force-pushed the dynamic-clustering branch from 608557c to 21944dd Compare March 20, 2024 15:23
@rbroc rbroc changed the title [WIP] add dynamic topic modeling to clustering models add dynamic topic modeling to clustering models Mar 20, 2024
Copy link
Owner

@x-tabdeveloping x-tabdeveloping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff!! Feel free to merge.

@rbroc rbroc merged commit a37354d into x-tabdeveloping:main Mar 21, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Dynamic topic modeling to clustering methods
2 participants