Bug in extra penalty calculations in divvunspell #18
Comments
Intentional. To get behavior & weights resembling
Is there any documentation on how these weighting differences work? Is there any system to them? As your examples show, the weight is not changed by the same amount for all words, which means that the order of suggestions does not match the one given by For sjd, there is e.g. an edit distance rule
which is not the case when enabling
As seen, both lists contain the same corrections, but standard
I assume this is connected to the lines
in the configuration, although these penalties do not seem to match the actual penalties exactly, and I do not understand what this has to do with case handling. If this behavior is intended, how are we expected to write our edit distance rules so that we still prioritize what we wish to prioritize? |
@trondtynnol you are correct in this:
but the name of the option is partly misleading. It does affect case handling (it is turned off), but it also turns off the position-based extra weights.
It is, but the present system is crude and definitely hard to debug, and it is completely undocumented outside the source code for The basic idea is this: we know from many studies that spelling errors are not evenly distributed within words. Spelling errors are rare in first and last position, and typically most frequent in the middle of the word. This generalisation holds relatively well across languages, but I am not aware of any studies that look into how this plays out in morphologically rich languages with mainly suffix morphology. In any case, the present implementation is a very rough first attempt at modelling this. What should be done to improve
@flammie could probably look into this, and anyone can help with documentation. |
@trondtynnol & @flammie the relevant code section seems to be this: divvunspell/divvunspell/src/speller/mod.rs Lines 230 to 261 in cb27a91
More specifically it seems to be a bug in these lines: divvunspell/divvunspell/src/speller/mod.rs Lines 246 to 249 in cb27a91
The bug:
Possible fix: chopping off the first and last letters of both the input and the suggestion before calculating the value of With this bug fixed, the extra penalties for the first three cases above would be (buggy weight in parentheses):
which would reorder the first two suggestions:
Fixing this bug should be done first, and a new version released. @flammie could you do it? |
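A minimal sketch of the double penalty and the proposed fix as I read this thread, not the actual divvunspell code. `PositionPenalties`, `extra_penalty` and `trim_edges` are hypothetical names, the start and mid values used in the example are assumptions (only an end penalty of 10 is mentioned in this thread), and a plain Levenshtein distance stands in for whatever distance function the real code uses.

```rust
/// Hypothetical penalty settings; names and default values are assumptions.
#[derive(Debug, Clone, Copy)]
struct PositionPenalties {
    start: f32,
    end: f32,
    mid: f32,
}

/// Plain Levenshtein distance over chars (a stand-in for the real distance).
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let subst = prev[j] + usize::from(ca != cb);
            cur.push(subst.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// The word without its first and last character.
fn middle(chars: &[char]) -> &[char] {
    if chars.len() <= 2 {
        &chars[0..0]
    } else {
        &chars[1..chars.len() - 1]
    }
}

/// Extra penalty for a suggestion. With `trim_edges == false`, a mismatch in
/// the first or last letter is charged twice: once as a start/end penalty and
/// once more through the edit distance over the whole word.
fn extra_penalty(input: &str, suggestion: &str, p: PositionPenalties, trim_edges: bool) -> f32 {
    let a: Vec<char> = input.chars().collect();
    let b: Vec<char> = suggestion.chars().collect();
    let start = if a.first() != b.first() { p.start } else { 0.0 };
    let end = if a.last() != b.last() { p.end } else { 0.0 };
    let distance = if trim_edges {
        levenshtein(middle(&a), middle(&b))
    } else {
        levenshtein(&a, &b)
    };
    start + end + p.mid * distance as f32
}
```

With the assumed values `start: 10.0, end: 10.0, mid: 5.0`, the pair куэдь / куэдҍ (which differ only in the last letter) gets 15 without trimming and 10 with trimming, while куэдь / куэжь gets 5 either way. These numbers only illustrate the mechanism and are not the weights reported above.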
Last time I was debugging this, I concluded that if the finite-state error model is crafted carefully, it is probably best to zero out the programmatic weighting, but it has probably been beneficial for spell checkers where the default edit distance etc. is untouched. It will be easier to debug this and many Rust-based things if you all set I am going to try to decouple the recasing and reweighting a bit first and see from there; maybe we need more metadata and settings for all of these. |
My experience is that position-based error modelling is really hard to do well using FSTs, so I see the programmatic weighting as a useful complement to the FST-based error model. But we need documentation, and we need access to fine-tune the position-based weights to fit better with the rest of the error model. The present algorithm is also very crude, and might need replacement with something more sophisticated. There should be good proposals in the literature (not that I have any references right now).
Sounds good. |
After @flammie's changes in 71d27ee (and the following commits), the
echo куэдь|divvunspell suggest -n 10 -a tools/spellcheckers/sjd.zhfst
Reading from stdin...
Input: куэдь [INCORRECT]
куэдҍ 20.8799
куэжь 24.415596
куэкь 26.179182
куэда 31.87233
кӯль 35.916504
куэдтҍ 35.975212
кӯдҍ 36.179184
пуэдҍ 40.603645
нуэдҍ 41.026505
вуэдҍ 42.87233
which means that the desired suggestion is on top also for @trondtynnol it would be good if you too could test whether the latest changes in |
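For comparing suggestion orderings between runs, output like the listing above can be turned into ranked pairs. This is a small sketch that assumes each suggestion line has the form `<suggestion> <weight>` as shown in this thread; `parse_suggestions` and `rank_of` are hypothetical helper names, not part of divvunspell.

```rust
/// Parse lines of the form "<suggestion> <weight>". Lines whose last field is
/// not a number (e.g. "Reading from stdin...") are skipped.
fn parse_suggestions(output: &str) -> Vec<(String, f32)> {
    output
        .lines()
        .filter_map(|line| {
            let (word, weight) = line.trim().rsplit_once(char::is_whitespace)?;
            let weight: f32 = weight.parse().ok()?;
            Some((word.trim().to_string(), weight))
        })
        .collect()
}

/// 1-based position of `wanted` in the suggestion list, if present.
fn rank_of(suggestions: &[(String, f32)], wanted: &str) -> Option<usize> {
    suggestions
        .iter()
        .position(|(word, _)| word.as_str() == wanted)
        .map(|i| i + 1)
}
```

Feeding it the listing above and asking for `rank_of(&parsed, "куэдҍ")` should give `Some(1)`, which is a quick way to check a whole test corpus for first-position hits.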
This (i.e. fixing the double penalty?) definitely seems to improve suggestions somewhat for our test material in general. Suggestions for our small palatalization corpus improved from
to
and our larger speller corpus with many complicated misspellings also improved slightly from
to
There are still weights whose calculation I do not quite understand (the rules do not seem to match the suggestions), but this is probably down to something other than the position penalties. I have tried setting the penalties to zero, and that does worsen the suggestions, so we should probably keep this feature for sjd as well. However, reducing the end penalty from 10 to 6 does improve results somewhat, which suggests that configurable position penalties would be a good thing. |
Thanks for the report, it confirms that the penalty bug fix is a real improvement. I suggest we ASAP release a bug fix version of |
I checked some other Sámi languages; the changes are mostly positive. I think sme gets -.1% shifted away from first position, but that is probably alright and can be adjusted with weights. Some of the changes touch public datatypes; I am not sure whether that requires changes in other software? |
It does need changes in |
Would it still be possible to run locally compiled spellchecking in LibreOffice on Linux if we abandon |
It would definitely be possible, but requires that we build a Linux version of the divvunspell oxt for LO. That should not be too hard, but must be done first. |
Compare the following two commands and their output:
What is strange about the weight differences is that the weights are encoded in the FSTs (acceptor and error model). So the expectation would be that identical input should give identical weight for identical output.
On the surface, it looks like divvunspell is giving wrong weights: if one takes the acceptor weight of the suggestion + the weight of each editing operation, one comes close to the hfst-ospell weight:
The lowest weight is the one used, and there are four editing operations applied to the input string, with the following weight:
hfst-ospell is still 2 off, but that is nevertheless way closer than divvunspell's 40.458984.
These differences are problematic for two reasons: it indicates a bug in the weight calculation, and it makes it hard to debug the suggestions and their ordering.
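The manual check described above (acceptor weight plus the weight of each editing operation, taking the lowest total when several analyses exist) can be sketched as follows. `Path`, `expected_weight` and `discrepancy` are hypothetical names for illustration, and any numbers passed in have to come from inspecting the acceptor and error model by hand; the concrete weights are not reproduced here.

```rust
/// One way of deriving a suggestion: the suggestion's weight in the acceptor
/// plus the weights of the error-model edits applied to the input.
struct Path {
    acceptor_weight: f32,
    edit_weights: Vec<f32>,
}

/// Total weight of a single path.
fn path_weight(path: &Path) -> f32 {
    path.acceptor_weight + path.edit_weights.iter().sum::<f32>()
}

/// The lowest total over all paths, or `None` if there are no paths.
fn expected_weight(paths: &[Path]) -> Option<f32> {
    paths.iter().map(path_weight).reduce(f32::min)
}

/// How far a reported weight is from the hand-computed expectation.
fn discrepancy(reported: f32, paths: &[Path]) -> Option<f32> {
    expected_weight(paths).map(|expected| reported - expected)
}
```

Running `discrepancy` on the weight reported by each tool gives a direct measure of how much extra weight is added outside the FSTs, which is the quantity this issue is about.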