Topic Modeling: Remove tags from display of topics #693

Merged
merged 1 commit into from
Aug 19, 2021

ajdapretnar
Collaborator

Issue

Fixes #689.
POS tags were shown alongside tokens when displaying top 10 words per topic.

Description of changes

Remove tags just for display, not for computing topics.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-commenter

codecov-commenter commented Jul 22, 2021

Codecov Report

Merging #693 (25a61da) into master (03df971) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #693      +/-   ##
==========================================
- Coverage   74.05%   74.03%   -0.02%     
==========================================
  Files          72       72              
  Lines        9447     9449       +2     
  Branches     1287     1287              
==========================================
  Hits         6996     6996              
- Misses       2209     2210       +1     
- Partials      242      243       +1     

@PrimozGodec
Collaborator

It works until a token has an underscore in it. For example, if one of the tokens is "abc_efg", it becomes "abc_efg_N" once it is POS tagged, and this solution will split it on the first underscore.

The fix would be to split on the last underscore (the second one in this example), but that causes issues when tokens are not POS tagged, because they would still be split.

This raises the question: do we need to use POS tags in the Topic Modeling process at all? In a way, I think they should be there if the user decided to use them, but on the other hand, do they make any difference to the results? @ajdapretnar, as you said in the issue, I think we should discuss where to use them and where not, and then fix that across the whole add-on.
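
To make the two failure modes concrete, here is a minimal sketch. The helper names `strip_first` and `strip_last` are hypothetical and are not taken from the add-on's code; they only illustrate plain string splitting.

```python
def strip_first(token: str) -> str:
    """Drop everything after the first underscore (hypothetical helper)."""
    return token.split("_", 1)[0]


def strip_last(token: str) -> str:
    """Drop everything after the last underscore (hypothetical helper)."""
    return token.rsplit("_", 1)[0]


tagged = "abc_efg_N"   # token "abc_efg" with POS tag "N" appended
untagged = "abc_efg"   # same token, no POS tagging applied

# Splitting on the first underscore cuts into the token itself.
print(strip_first(tagged))    # abc

# Splitting on the last underscore recovers the tagged token ...
print(strip_last(tagged))     # abc_efg

# ... but truncates a token that was never tagged.
print(strip_last(untagged))   # abc
```

Neither variant is safe unless the code knows whether tagging was applied, which is why string splitting alone cannot decide this.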

@ajdapretnar
Collaborator Author

Uf, I agree. It is indeed a suboptimal solution. I am honestly not sure how much difference the tags make, but I assume they do make some, more in English than in other languages. It mostly comes down to differentiating between verbs and nouns.
This paper reports:

Tags and topics can be thought of as orthogonal to each other. It is important to note that in LDA the same unigram is used throughout the document whenever a given topic is about to generate a word. But the same topic can have different word distribution under different tags. Knowing the tags should allow us to build a better model than using the topic model alone.

I would say we should keep the differentiation and just use self.tokens for display if possible.
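
A rough sketch of that idea, assuming the untagged tokens and their POS tags are available as parallel lists per document. The names `tag_tokens` and `display_lookup` and the sample data are hypothetical and do not reflect the add-on's actual API.

```python
def tag_tokens(tokens, pos_tags):
    """Build tagged tokens (token_TAG) for fitting the topic model."""
    return [
        [f"{tok}_{tag}" for tok, tag in zip(doc_tokens, doc_tags)]
        for doc_tokens, doc_tags in zip(tokens, pos_tags)
    ]


def display_lookup(tokens, pos_tags):
    """Map each tagged token back to its original, untagged form."""
    lookup = {}
    for doc_tokens, doc_tags in zip(tokens, pos_tags):
        for tok, tag in zip(doc_tokens, doc_tags):
            lookup[f"{tok}_{tag}"] = tok
    return lookup


# Hypothetical corpus: two documents with parallel token and tag lists.
tokens = [["abc_efg", "run", "fast"], ["run", "home"]]
pos_tags = [["N", "V", "R"], ["N", "N"]]

tagged = tag_tokens(tokens, pos_tags)      # what the model would see
lookup = display_lookup(tokens, pos_tags)  # used only for display

# Suppose these are the top words of one topic, as the model reports them.
top_words = ["run_V", "abc_efg_N", "run_N"]
print([lookup[word] for word in top_words])  # ['run', 'abc_efg', 'run']
```

Because the display form comes from the original tokens and not from splitting strings, tokens that contain underscores survive intact. Note that the verb and noun readings of "run" stay distinct in the model but look identical once displayed.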

@PrimozGodec
Collaborator

Looks good

@PrimozGodec PrimozGodec merged commit f306d03 into biolab:master Aug 19, 2021
Development

Successfully merging this pull request may close these issues.

Topic Modeling: Don't show pos_tags with top words