Topic Modeling: Remove tags from display of topics #693

ajdapretnar · 2021-07-22T12:37:40Z

Issue

Fixes #689.
POS tags were shown alongside tokens when displaying top 10 words per topic.

Description of changes

Remove tags just for display, not for computing topics.

Includes

Code changes
Tests
Documentation

codecov-commenter · 2021-07-22T12:49:32Z

Codecov Report

Merging #693 (25a61da) into master (03df971) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #693      +/-   ##
==========================================
- Coverage   74.05%   74.03%   -0.02%     
==========================================
  Files          72       72              
  Lines        9447     9449       +2     
  Branches     1287     1287              
==========================================
  Hits         6996     6996              
- Misses       2209     2210       +1     
- Partials      242      243       +1

PrimozGodec · 2021-08-04T13:55:12Z

It works unless the token has an underscore in it. For example, if one of the tokens is "abc_efg", it becomes "abc_efg_N" once it is POS tagged, and this solution will split that token on the first underscore.

The fix would be to split on the second (that is, the last) underscore, but that causes problems when tokens are not POS tagged, because they would still be split.
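A minimal sketch of both failure modes, using plain Python string methods. The helper names `strip_tag_first` and `strip_tag_last` are hypothetical and are not taken from the add-on's code:

```python
def strip_tag_first(token):
    # Split on the FIRST underscore: breaks tokens that contain "_".
    return token.split("_", 1)[0]


def strip_tag_last(token):
    # Split on the LAST underscore: correct for tagged tokens only.
    return token.rsplit("_", 1)[0]


tagged = "abc_efg_N"    # token "abc_efg" with POS tag "N"
untagged = "abc_efg"    # same token, no POS tagging applied

print(strip_tag_first(tagged))   # abc      (token truncated)
print(strip_tag_last(tagged))    # abc_efg  (tag removed correctly)
print(strip_tag_last(untagged))  # abc      (untagged token wrongly split)
```

Neither variant is safe without knowing whether the corpus was POS tagged in the first place.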

This raises the question: do we need to use POS tags in the Topic Modeling process at all? On the one hand, I think they should be used if the user decided to add them; on the other hand, do they make any difference in the results? @ajdapretnar, as you said in the issue, I think we should discuss where to use them and where not, and then fix that across the whole add-on.

ajdapretnar · 2021-08-04T14:07:52Z

Uf, I agree. It is indeed a suboptimal solution. I am honestly not sure how much difference they make, but I assume they do make some, more so in English than in other languages. It mostly comes down to differentiating between verbs and nouns.
This paper reports:

Tags and topics can be thought of as orthogonal to each other. It is important to note that in LDA the same unigram is used throughout the document whenever a given topic is about to generate a word. But the same topic can have different word distribution under different tags. Knowing the tags should allow us to build a better model than using the topic model alone.

I would say we should keep the differentiation and just use self.tokens for display if possible.
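One way this could look, as a rough sketch only: fit the model on tagged tokens and keep a lookup back to the untagged tokens for display, so no string splitting is needed. The functions `tag_tokens` and `display_map` and the sample data are hypothetical; how the add-on actually stores tokens and tags is an assumption here.

```python
def tag_tokens(tokens, pos_tags):
    # Join each token with its POS tag, e.g. ("run", "VB") -> "run_VB".
    return [f"{tok}_{tag}" for tok, tag in zip(tokens, pos_tags)]


def display_map(tokens, pos_tags):
    # Map every tagged token back to its original, untagged form.
    return dict(zip(tag_tokens(tokens, pos_tags), tokens))


# Hypothetical sample data: one token contains an underscore.
tokens = ["abc_efg", "run", "run"]
pos_tags = ["N", "VB", "NN"]

model_vocab = tag_tokens(tokens, pos_tags)  # what the topic model would see
lookup = display_map(tokens, pos_tags)

top_words = ["run_VB", "abc_efg_N"]         # e.g. top words of one topic
shown = [lookup[w] for w in top_words]      # untagged words for display
print(shown)  # ['run', 'abc_efg']
```

The model still distinguishes "run" as a verb from "run" as a noun, while tokens with underscores are displayed intact.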

PrimozGodec · 2021-08-19T09:02:51Z

Looks good

ajdapretnar force-pushed the topic-tag-words branch from c662b07 to b2745e2 Compare August 19, 2021 07:48

Remove tags from display of topics

25a61da

ajdapretnar force-pushed the topic-tag-words branch from b2745e2 to 25a61da Compare August 19, 2021 07:56

PrimozGodec merged commit f306d03 into biolab:master Aug 19, 2021
