[FIX] Corpus - remove dictionary and fix wrong types count on subsampled corpus #990

Merged
merged 6 commits into from
Aug 25, 2023

Conversation

PrimozGodec
Collaborator

@PrimozGodec PrimozGodec commented Aug 1, 2023

Issue

Fixes #920
Corpus stores a dictionary (a Gensim Dictionary) that is created on first use and then cached. The problem is that the dictionary stays the same after Corpus is subsampled (that is, after a new corpus is created from a subset of the documents), even though the number of unique tokens changes. Worse, the dictionary was used in several places in the add-on to get the number of unique tokens in Corpus, so that number was wrong once the corpus had been subsampled (the issue reported in #920).
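
To make the failure mode concrete, here is a minimal sketch of a stale cached dictionary. `ToyCorpus`, its `subsample` method, and the way the cache is carried over are all hypothetical stand-ins chosen for illustration; they are not the add-on's actual `Corpus` code, and a plain `set` is used in place of a Gensim Dictionary.

```python
class ToyCorpus:
    """Hypothetical stand-in for Corpus; not the add-on's real class."""

    def __init__(self, tokens):
        self.tokens = tokens      # list of token lists, one per document
        self._dictionary = None   # lazily built cache of unique tokens

    @property
    def dictionary(self):
        # Built on first access, then reused on every later access.
        if self._dictionary is None:
            self._dictionary = {t for doc in self.tokens for t in doc}
        return self._dictionary

    def subsample(self, indices):
        sub = ToyCorpus([self.tokens[i] for i in indices])
        # Assumed bug mechanism: the cache is carried over unchanged.
        sub._dictionary = self._dictionary
        return sub


corpus = ToyCorpus([["a", "b"], ["b", "c"], ["d"]])
full_types = len(corpus.dictionary)        # builds and caches the dictionary
subset = corpus.subsample([0])
stale_types = len(subset.dictionary)       # still describes the full corpus
actual_types = len(set(subset.tokens[0]))  # recomputed from the subset itself
```

In this sketch `stale_types` equals `full_types` even though the subset holds a single document, which is the kind of mismatch described above.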

Description of changes

The dictionary was introduced primarily for topic modelling, and topic modelling no longer uses it, so I decided to remove it from Corpus. Every piece of code that used the dictionary can be written without it.

This PR therefore removes the dictionary and updates all the code that used it.
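
As a rough illustration of how a count can be obtained without a cached dictionary, the sketch below recomputes the number of tokens and types directly from the token lists each time. The helper name `count_tokens_and_types` and the list-of-lists input are assumptions made for this example, not the code changed in this PR.

```python
from itertools import chain


def count_tokens_and_types(tokens):
    """Return (n_tokens, n_types) for a list of token lists.

    Hypothetical helper: nothing is cached, so the result always
    reflects the documents that were passed in.
    """
    n_tokens = sum(len(doc) for doc in tokens)
    n_types = len(set(chain.from_iterable(tokens)))
    return n_tokens, n_types


docs = [["a", "b"], ["b", "c"], ["d"]]
counts_full = count_tokens_and_types(docs)        # whole corpus
counts_subset = count_tokens_and_types(docs[:1])  # first document only
```

Recomputing costs a pass over the tokens on every call, but the counts cannot go stale after subsampling.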

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-commenter

codecov-commenter commented Aug 1, 2023

Codecov Report

Merging #990 (d76acaa) into master (9c0faca) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head d76acaa differs from pull request most recent head 946cab6. Consider uploading reports for the commit 946cab6 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
+ Coverage   79.66%   79.69%   +0.02%     
==========================================
  Files          87       87              
  Lines       12319    12326       +7     
  Branches     1617     1620       +3     
==========================================
+ Hits         9814     9823       +9     
+ Misses       2211     2210       -1     
+ Partials      294      293       -1     

@PrimozGodec PrimozGodec force-pushed the fix-corpus-info branch 2 times, most recently from 6628a38 to 2b48e94 Compare August 1, 2023 08:40
@PrimozGodec PrimozGodec marked this pull request as ready for review August 1, 2023 09:51
@markotoplak markotoplak merged commit 7de9c65 into biolab:master Aug 25, 2023
Development

Successfully merging this pull request may close these issues.

Wrong information about the number of tokens and types
3 participants