Bag of Words stores occurrences of 'hci' in corpus as 'hcus' in sparse data (or is lemmatizer to blame?) #1041

Closed
wvdvegte opened this issue Mar 7, 2024 · 1 comment


wvdvegte commented Mar 7, 2024

Describe the bug
For documents in the corpus that contain the string 'hci' (short for human-computer interaction), Bag of Words stores 'hci' as 'hcus' in its sparse-matrix representation.

To Reproduce
See the attached workflow. The dataset is shared through a Google Drive link.

Expected behavior
'hci' should be kept as 'hci' in the sparse data. Could the lemmatizer be automatically converting Latin plurals ending in '-i' to singulars ending in '-us' (such as nuclei -> nucleus)?

Orange version:
3.36.2

Text add-on version:
1.15.0

Operating system:
macOS 14.3.1

Example workflow
hcus bug.ows.zip

@ajdapretnar
Collaborator

Almost 100% certain it is the fault of the lemmatizer.
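
To illustrate the suspected mechanism, here is a minimal sketch of how a suffix rule that maps '-i' to '-us' would turn 'hci' into 'hcus', and how a keep-list could shield acronyms from it. This is not the code of any lemmatizer in the Text add-on; `naive_latin_singular`, `lemmatize_with_exceptions`, and the `keep` parameter are hypothetical names made up for this example, and the real lemmatizer may use learned rules rather than a single hard-coded one.

```python
def naive_latin_singular(token: str) -> str:
    """Hypothetical rule: treat a trailing '-i' as a Latin plural and map it to '-us'."""
    if len(token) > 2 and token.endswith("i"):
        return token[:-1] + "us"
    return token


def lemmatize_with_exceptions(tokens, lemmatize, keep=frozenset()):
    """Apply `lemmatize` to each token, leaving tokens listed in `keep` untouched."""
    return [t if t.lower() in keep else lemmatize(t) for t in tokens]


tokens = ["nuclei", "hci", "interaction"]

# Without protection, the acronym is rewritten like a Latin plural.
plain = [naive_latin_singular(t) for t in tokens]

# With a keep-list (an assumption, not an existing add-on option), 'hci' survives.
protected = lemmatize_with_exceptions(tokens, naive_latin_singular, keep={"hci"})

print(plain)      # ['nucleus', 'hcus', 'interaction']
print(protected)  # ['nucleus', 'hci', 'interaction']
```

If the lemmatizer is indeed the cause, running the same workflow with lemmatization switched off in preprocessing should leave 'hci' unchanged in the Bag of Words output, which would confirm the diagnosis.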
