serialisation and topWords info #127
base: master
Thank you for all of this! Some comments: Could you say more about the Lexer -> Pattern shift in CharSequence2TokenSequence? It looks like the validateTopics function is adding stopwords during training. Is there a reference for this? I'm reluctant to make something available without fully understanding when users should and shouldn't use it. I'm planning to release the HPPC version as 2.1, and I'd like to see this as part of it.
Hi David, To make the About the validateTopics function, the idea is to build a list of stopwords iteratively, from the words that appear as top words in multiple topics. This is similar to applying TF-IDF to topics instead of documents. I hope this is helpful.
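To make the idea concrete, here is a minimal sketch of one detection pass, assuming the top words of each topic are already available as plain string lists. It is not the validateTopics code from this PR: the class `TopWordStopwords`, the method `sharedTopWords`, and the `minTopics` threshold are hypothetical names chosen for illustration. A word that shows up among the top words of many topics carries little topic-specific information, much like a term with low IDF, so it becomes a stopword candidate.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TopWordStopwords {

    /**
     * Hypothetical helper (not part of the PR): returns the words that occur
     * in the top-word lists of at least minTopics distinct topics.
     */
    public static Set<String> sharedTopWords(List<List<String>> topWordsPerTopic,
                                             int minTopics) {
        // Count, for every word, the number of topics whose top words contain it.
        Map<String, Integer> topicCount = new HashMap<>();
        for (List<String> topWords : topWordsPerTopic) {
            // De-duplicate within a topic so each topic counts at most once.
            for (String word : new HashSet<>(topWords)) {
                topicCount.merge(word, 1, Integer::sum);
            }
        }

        // Keep the words shared by enough topics; TreeSet gives a stable order.
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : topicCount.entrySet()) {
            if (entry.getValue() >= minTopics) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> topics = List.of(
                List.of("said", "market", "stock"),
                List.of("said", "game", "team"),
                List.of("said", "market", "court"));

        // "said" is a top word in 3 topics and "market" in 2.
        System.out.println(sharedTopWords(topics, 2)); // prints [market, said]
    }
}
```

In an iterative setup, the returned candidates would be added to the stopword list and the model retrained, repeating until no new candidates appear. That loop is left out here because it depends on how training is wired up.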
Minor changes in serialisation process and added a method to get top words along with their weights per topic