
Bug in extra penalty calculations in divvunspell #18

Open
snomos opened this issue Mar 19, 2020 · 15 comments

snomos commented Mar 19, 2020

Compare the following two commands and their output:

$ echo nikinitawiwapamew | divvunspell --always-suggest --zhfst tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
Reading from stdin...
Input: nikinitawiwapamew		[INCORRECT]
nikî-nitawi-wâpamâw		40.458984
naki-nitawi-wâpamêw		49.151657
naki-nitawi-wâpamâw		53.151657
nika-nitawi-wâpamâw		53.151657
nikê-nitawi-wâpamâw		53.151657
nikî-nitawi-asamâw		53.151657
nikî-natawi-wâpamâw		57.151657
nikî-nitawi-wanâmâw		57.151657
niwî-nitawi-wâpamâw		57.151657
kikî-nitawi-wâpamâw		67.15166

$ echo nikinitawiwapamew | hfst-ospell -S -b 16 tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
"nikinitawiwapamew" is NOT in the lexicon:
Corrections for "nikinitawiwapamew":
nikî-nitawi-wâpamâw    15.458984
naki-nitawi-wâpamêw    24.151655
nikî-nitawi-wanâmâw    27.151655
niwî-nitawi-wâpamâw    27.151655
kikî-nitawi-wâpamâw    27.151655
nikî-natawi-wâpamâw    27.151655
nika-nitawi-wâpamâw    28.151655
nikê-nitawi-wâpamâw    28.151655
nikî-nitawi-asamâw    28.151655
naki-nitawi-wâpamâw    28.151655
nikî-nitawi-wêpimâw    31.151655

What is strange about the weight differences is that the weights are encoded in the FSTs (acceptor and error model), so one would expect identical input to give identical weights for identical output.

On the surface, it looks like divvunspell is giving wrong weights: taking the acceptor weight of the suggestion plus the weight of each editing operation comes close to the hfst-ospell weight:

$ echo nikî-nitawi-wâpamâw | hfst-lookup -q tools/spellcheckers/fstbased/desktop/hfst/acceptor.default.hfst 
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	9,458984
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	11,151655

The lowest weight is the one used, and there are four editing operations applied to the input string, with the following weights:

# strings.default.regex:
{in} -> {î-n}::1,
{iw} -> {i-w}::1;

## editdist.default.txt:
a	â	2

# final_strings.default.txt:
mew:mâw	4

9,458984 + 1 + 1 + 2 + 4 = 17,458984
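The sum above can be sketched in code. A minimal illustration (not divvunspell's actual implementation) of how an hfst-ospell-style total is composed from the acceptor weight plus the error-model edit weights:

```rust
// Minimal sketch: total suggestion weight as acceptor weight plus the
// weight of each editing operation applied by the error model.
fn total_weight(acceptor_weight: f32, edit_weights: &[f32]) -> f32 {
    acceptor_weight + edit_weights.iter().sum::<f32>()
}

fn main() {
    // 9.458984 (acceptor) + 1 + 1 (regex rules) + 2 (editdist) + 4 (final_strings)
    let total = total_weight(9.458984, &[1.0, 1.0, 2.0, 4.0]);
    assert!((total - 17.458984).abs() < 1e-4);
    println!("{total}");
}
```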

hfst-ospell is still 2 off, but that is nevertheless far closer than divvunspell's 40.458984.

These differences are problematic for two reasons: it indicates a bug in the weight calculation, and it makes it hard to debug the suggestions and their ordering.


snomos commented Mar 19, 2020

Intentional. To get behavior & weights resembling hfst-ospell, use the option --no-case-handling:

$ echo nikinitawiwapamew | divvunspell --no-case-handling --always-suggest --zhfst tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
Reading from stdin...
Input: nikinitawiwapamew		[INCORRECT]
nikî-nitawi-wâpamâw		15.458984
naki-nitawi-wâpamêw		23.151655
nîmi-nitawi-wâpamêw		23.151655
kikî-nitawi-wâpamâw		27.151655
naki-nitawi-wâpamâw		27.151655
nikî-natawi-wâpamâw		27.151655
nikî-nitawi-wanâmâw		27.151655
niwî-nitawi-wâpamâw		27.151655
nîmi-nitawi-wâpamâw		27.151655
nika-nitawi-wâpamâw		28.151655

@snomos snomos closed this as completed Mar 19, 2020
@trondtynnol

Is there any documentation on how these weighting differences work? Is there any system to them? As seen in your examples, the weight is not changed by the same amount for all words, which means that the order of suggestions does not match the one given by hfst-ospell. It also seems to contradict the edit distance rules.

For sjd, there is e.g. an edit distance rule ь ҍ 1, but no rule for the pair д ж, yet normal divvunspell still prioritizes the latter correction:

echo куэдь|divvunspell suggest -n 10 -a sjd.zhfst 
Reading from stdin...
Input: куэдь		[INCORRECT]
куэжь		24.414778
куэдҍ		25.879082
куэкь		26.178366
кӯль		35.915688
куэда		36.871513
куэдтҍ		40.974392
кӯдҍ		41.178368
пуэдҍ		50.53614
нуэдҍ		51.02569
вуэдҍ		52.871513

which is not the case when enabling --no-case-handling:

echo куэдь|divvunspell suggest -n 10 --no-case-handling -a sjd.zhfst 
Reading from stdin...
Input: куэдь		[INCORRECT]
куэдҍ		10.879083
кӯдҍ		16.178366
куэжь		19.414778
пуэдҍ		20.536139
кӯль		20.915686
куэдтҍ		20.974392
нуэдҍ		21.025686
куэкь		21.178366
куэда		21.871513
вуэдҍ		22.871513

As seen, both lists contain the same corrections, but standard divvunspell is adding a different number of points to each result, thus changing the ordering. Here I've listed the results in the order given with --no-case-handling, and indicated how many extra points divvunspell adds to each suggestion in normal mode:

куэдҍ		 + 15
кӯдҍ		 + 25
куэжь		 + 5
пуэдҍ		 + 30
кӯль		 + 15
куэдтҍ		 + 20
нуэдҍ		 + 30
куэкь		 + 5
куэда		 + 15
вуэдҍ		 + 30

I assume this is connected to the lines

"case_handling": {
    "start_penalty": 10,
    "end_penalty": 10,
    "mid_penalty": 5
  },

in the configuration, although these penalties do not seem to match the actual penalties exactly, and I do not understand what this has to do with case handling.

If this behavior is intended, in which way are we expected to write our edit distance rules so that we still prioritize what we wish to prioritize?

@trondtynnol

@snomos


snomos commented Nov 8, 2024

@trondtynnol you are correct in this:

I assume this is connected to the lines

but the name of the option is partly misleading. It does affect case handling (turning it off), but it also implies turning off the position-based extra weights.

If this behavior is intended

It is, but the present system is crude, and definitely hard to debug, and it is completely undocumented outside the source code for divvunspell.

The basic idea is this: we know from many studies that spelling errors are not evenly distributed throughout words. Spelling errors are rare in first and last position, and typically most frequent in the middle of the word. This generalisation holds relatively well across languages, but I am not aware of any studies looking into how it plays out in morphology-rich languages with mainly suffix morphology. In any case, the present implementation is a very rough first attempt at modelling this.

What should be done to improve divvunspell in this regard is at least the following:

  • document the behaviour
  • make the added penalties configurable both on the command line and in the speller (.zhfst) files
  • make it debuggable: with a trace or debug option, the added positional weights should be printed, preferably along with the position(s) getting extra weight

@flammie could probably look into this, and anyone can help with documentation.


snomos commented Nov 8, 2024

@trondtynnol & @flammie the relevant code section seems to be this:

match mode {
    CaseMode::MergeAll => {
        for sugg in suggestions.into_iter() {
            let penalty_start =
                if !sugg.value().starts_with(word.chars().next().unwrap()) {
                    case_handling.start_penalty
                } else {
                    0.0
                };
            let penalty_end =
                if !sugg.value().ends_with(word.chars().rev().next().unwrap()) {
                    case_handling.end_penalty
                } else {
                    0.0
                };
            let distance =
                strsim::damerau_levenshtein(&words[0].as_str(), &word.as_str())
                    + strsim::damerau_levenshtein(&word.as_str(), sugg.value());
            let penalty_middle = case_handling.mid_penalty * distance as f32;
            let additional_weight = penalty_start + penalty_end + penalty_middle;
            best.entry(sugg.value.clone())
                .and_modify(|entry| {
                    let weight = sugg.weight + additional_weight;
                    if entry as &_ > &weight {
                        *entry = weight
                    }
                })
                .or_insert(sugg.weight + additional_weight);
        }
    }
}
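To see how this code produces the extra-penalty deltas reported for sjd above, here is a minimal, self-contained re-implementation of the formula. Assumptions: the first damerau_levenshtein term is taken as zero for an already lower-case input, plain Levenshtein is equivalent here since no transpositions are involved, and the penalties are the 10/10/5 values from the case_handling configuration quoted earlier.

```rust
// Standard Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Extra penalty as currently computed: first/last characters are
// penalised separately AND counted again in the middle distance,
// because the distance is taken over the whole string.
fn extra_penalty(word: &str, sugg: &str) -> f32 {
    let start = if sugg.chars().next() != word.chars().next() { 10.0 } else { 0.0 };
    let end = if sugg.chars().last() != word.chars().last() { 10.0 } else { 0.0 };
    start + end + 5.0 * levenshtein(word, sugg) as f32
}

fn main() {
    // Reproduces the deltas observed for sjd above:
    assert_eq!(extra_penalty("куэдь", "куэдҍ"), 15.0); // end differs + 1 edit
    assert_eq!(extra_penalty("куэдь", "кӯдҍ"), 25.0);  // end differs + 3 edits
    assert_eq!(extra_penalty("куэдь", "куэжь"), 5.0);  // 1 middle edit only
}
```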


snomos commented Nov 8, 2024

More specifically it seems to be a bug in these lines:

let distance =
    strsim::damerau_levenshtein(&words[0].as_str(), &word.as_str())
        + strsim::damerau_levenshtein(&word.as_str(), sugg.value());
let penalty_middle = case_handling.mid_penalty * distance as f32;

The bug:

  • when calculating the distance (i.e. the edit distance, if I read this correctly), it is calculated over the whole string, first and last letters included. This means that first- and last-letter edits are penalised twice: once as the separate first- and last-letter penalties, and again when the middle edit penalty is calculated. They should of course only be counted once.

Possible fix: chopping off the first and last letters of both the input and the suggestion before calculating the distance. That could, however, give wrong results when the edit itself removes the first/last character, so care must be taken in those cases.

With this bug fixed, the extra penalties for the first three cases above would be (buggy weight in parentheses):

куэдҍ		 + 10		(+15)
кӯдҍ		 + 20		(+25)
куэжь		 + 5 		(+5)

which would reorder the first two suggestions:

куэдҍ		20.879082
куэжь		24.414778
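A minimal sketch of the trimming approach described above (hypothetical helper names, penalties 10/10/5 from the configuration; the caveat about edits that delete the first/last character is deliberately not handled here):

```rust
// Standard Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Hypothetical helper: the string with its first and last characters
// chopped off, so the middle distance no longer sees edge edits.
fn trimmed_middle(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() <= 2 {
        String::new()
    } else {
        chars[1..chars.len() - 1].iter().collect()
    }
}

// Fixed penalty: edge edits counted once, middle edits once.
fn fixed_penalty(word: &str, sugg: &str) -> f32 {
    let start = if sugg.chars().next() != word.chars().next() { 10.0 } else { 0.0 };
    let end = if sugg.chars().last() != word.chars().last() { 10.0 } else { 0.0 };
    start + end + 5.0 * levenshtein(&trimmed_middle(word), &trimmed_middle(sugg)) as f32
}

fn main() {
    assert_eq!(fixed_penalty("куэдь", "куэдҍ"), 10.0); // was +15
    assert_eq!(fixed_penalty("куэдь", "кӯдҍ"), 20.0);  // was +25
    assert_eq!(fixed_penalty("куэдь", "куэжь"), 5.0);  // unchanged +5
}
```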

Fixing this bug should be done first, and a new version released. @flammie could you do it?

@snomos snomos reopened this Nov 8, 2024
@snomos snomos assigned flammie and unassigned bbqsrc Nov 8, 2024
@snomos snomos changed the title Suggestions weight differences between divvunspell and hfst-ospell Bug in extra penalty calculations in divvunspell Nov 8, 2024

flammie commented Nov 8, 2024

Last time I was debugging this, I concluded that if the finite-state error model is crafted carefully, it is probably best to zero out the programmatic weighting; but the programmatic weighting has probably been beneficial for spell checkers where the default edit distance etc. is untouched.

It will be easier to debug this and many other Rust-based components if you all set RUST_LOG=trace (while debugging; eventually keep it at debug or so for the rest); most of the relevant info is already logged by the code.

I am going to try to decouple the recasing and the reweighting first and see from there; maybe we need more metadata and settings for all of these.


snomos commented Nov 8, 2024

My experience is that position-based error modelling is really hard to do well using FST's, so I see the programmatic weighting as a useful complement to the FST based error model. But we need documentation, and we need access to fine-tune the position-based weights to fit better with the rest of the error model. The present algorithm is also very crude, and might need replacement with something more sophisticated. There should be good proposals in the literature (not that I have any references right now).

I am going to try to decouple the recasing and the reweighting first and see from there; maybe we need more metadata and settings for all of these.

Sounds good.


snomos commented Nov 12, 2024

After @flammie 's changes in 71d27ee (and the following commits), the sjd test above gives the following output:

echo куэдь|divvunspell suggest -n 10 -a tools/spellcheckers/sjd.zhfst 
Reading from stdin...
Input: куэдь		[INCORRECT]
куэдҍ		20.8799
куэжь		24.415596
куэкь		26.179182
куэда		31.87233
кӯль		35.916504
куэдтҍ		35.975212
кӯдҍ		36.179184
пуэдҍ		40.603645
нуэдҍ		41.026505
вуэдҍ		42.87233

which means that the desired suggestion is now on top also for divvunspell. It is probably too early to close this as fixed; more testing is needed, but it looks good.

@trondtynnol it would be good if you too could test whether the latest changes in divvunspell give better suggestions.

@trondtynnol

This (i.e. fixing the double penalty?) definitely seems to improve suggestions somewhat for our test material in general. Suggestions for our small palatalization corpus improved from

[#1] 76.10% [^5] 95.05% [any] 98.63% [none] 0.27% [wrong] 1.10%

to

[#1] 81.59% [^5] 96.70% [any] 98.63% [none] 0.27% [wrong] 1.10%

and our larger speller corpus with many complicated misspellings also improved slightly from

[#1] 62.46% [^5] 82.35% [any] 85.51% [none] 1.55% [wrong] 12.94%

to

[#1] 64.03% [^5] 82.85% [any] 85.53% [none] 1.55% [wrong] 12.92%

There are still weights whose calculation I do not quite understand (the rules do not seem to match the suggestions), but this is probably down to something other than the position penalties.

I have tried setting the penalties to zero, and that does worsen the suggestions, i.e. we should probably keep this feature also for sjd. However, reducing the end penalty from 10 to 6 does improve results somewhat, suggesting configurable position penalties would be a good thing.


snomos commented Nov 13, 2024

Thanks for the report; it confirms that the penalty bug fix is a real improvement. I suggest we release a bug-fix version of divvunspell (1.0.1) with the present code ASAP, and then work on an incremental feature release (1.1.0) with the configurable position penalties. For the configurability to work, we also need to add support for it in the index*.xml file in the .zhfst archives, make sure it is read from there, and also make sure the configuration is kept in the .thfs and .bhfst formats.
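As a purely hypothetical sketch (not an existing format extension), the .bhfst JSON metadata could carry the same block already used in the configuration quoted earlier in this thread:

```json
{
  "case_handling": {
    "start_penalty": 10,
    "end_penalty": 10,
    "mid_penalty": 5
  }
}
```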


flammie commented Nov 14, 2024

I checked some other Sámi languages; the changes are mostly positive. I think sme gets -0.1% shifted away from first position, but that is probably alright and can be adjusted with weights.

Some of the changes touch public datatypes, not sure if it needs changes in other software?


snomos commented Nov 15, 2024

Some of the changes touch public datatypes, not sure if it needs changes in other software?

It does need changes in .zhfst and in the software that reads those files. Alternatively, we abandon the .zhfst format altogether and move to .bhfst; both are supported by divvunspell, and .bhfst uses a simpler JSON file for metadata, where the new flags should be stored.

@trondtynnol

Would it still be possible to run locally compiled spell checking in LibreOffice on Linux if we abandon .zhfst? That is a very useful feature for the current work on sjd.


snomos commented Nov 15, 2024

It would definitely be possible, but it requires that we build a Linux version of the divvunspell oxt for LO. That should not be too hard, but it must be done first.
