Standardising run segmentation #37
Comments
Some prior work: Martin Hosken’s 2018 review of existing algorithms and issues: Specific issue with Common script bases:
I definitely think this would be worthwhile. My inclination would be for this to happen in a Unicode context, since the algorithm would need to be driven by Unicode character properties: existing properties, or perhaps new properties if needed.
Useful overview:
@PeterConstable and others have added more thoughts on this topic in a separate issue #44.
I think we need to include other segmentation that happens between a text rendering API and cluster- or run-level shaping in this discussion, along with script and run segmentation. As Raph’s article (thank you @NeilSureshPatel for the reference!) points out, “a text layout engine breaks the input into finer and finer grains”, from paragraph to cluster. Any incorrect breaks introduced along the path can adversely affect shaping or even the text the user gets to see. One goal for this project therefore should be to identify the clusters and runs that must be kept intact for shaping to work correctly, and to ensure that these clusters and runs are indeed kept intact in all segmentation algorithms involved in text rendering. One example for the kind of errors we’re currently seeing is this (abbreviated) HTML document:
Browsers render this in different ways. The rendering of the first paragraph in Safari, Firefox, and Legacy Edge is correct. The rendering of that paragraph in Chrome is broken, as are all renderings of the second paragraph (with Safari getting it half right). The broken renderings often add dotted circles, which indicates that somewhere on the way to the shaping engine the original dotted circle in the text got separated from the marks attached to it, so that the shaping engine adds another one, or possibly two in the case of the two-part vowel. Some of the additional dotted circles clearly come from a font other than Noto Sans Javanese, indicating that the separation likely occurred during font fallback handling. In cases where the additional dotted circle comes from the Javanese font, or where there’s no additional dotted circle, other segmentation algorithms are more likely at fault.
Another unusual segmentation case to be investigated: biscript orthographies, e.g. the use of Greek characters within Latin orthographies for First Nations languages in British Columbia.
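To make the failure mode concrete, here is a minimal sketch of a check that flags break offsets which would separate a base from the marks that follow it. It assumes, as a rough approximation only, that any character with a General_Category starting with "M" must stay with the preceding character; real cluster rules are more involved. The function name `unsafe_breaks` and the sample string (a dotted circle followed by the two Javanese vowel signs U+A9BA and U+A9B4) are my own illustration, not the test document discussed above.

```python
import unicodedata

def unsafe_breaks(text, breaks):
    """Return the break offsets that fall immediately before a combining mark.

    Simplifying assumption: a character whose General_Category starts
    with "M" (Mn, Mc, Me) must not be separated from what precedes it.
    """
    return [
        b for b in breaks
        if 0 < b < len(text) and unicodedata.category(text[b]).startswith("M")
    ]

# Hypothetical sample: DOTTED CIRCLE + JAVANESE VOWEL SIGN TALING + TARUNG.
sample = "\u25CC\uA9BA\uA9B4"

# A break after the dotted circle (offset 1) strands both marks;
# a break between the two vowel parts (offset 2) strands the second one.
print(unsafe_breaks(sample, [1, 2, 3]))  # [1, 2]
```

Note that the dotted circle itself has script Common while the marks are Javanese, so a segmenter that breaks purely on script changes, or a font fallback step that treats U+25CC separately, would introduce exactly the kind of break this check reports.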
Script itemisation and run segmentation is the first step of OpenType Layout text processing, and like much else in OTL it lacks an implementation specification. Over the years, I have noted inconsistent outcomes in different environments in the handling of script=common characters, such as punctuation, at run boundaries. This indicates that different implementers use different algorithms, or that some algorithms may be broken (broken in terms of their developers’ intentions, since they can’t be said to be broken in terms of a non-existent specification). Lack of consistency in run segmentation sets up some OTL GSUB and GPOS lookups for failure: lookups are not applied across run boundaries, so the font maker is unable to predict their input.
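As an illustration of how such inconsistencies can arise, the sketch below segments the same string under two plausible policies for script=common characters: attach them to the preceding run, or attach them to the following run. Everything here is an assumption made for the example: `toy_script` is a drastically simplified stand-in for the Unicode Script property covering only a few ranges, and neither policy is claimed to be what any particular implementation does.

```python
from itertools import groupby

COMMON = "Zyyy"

def toy_script(ch):
    """Hypothetical, heavily simplified stand-in for the Script property."""
    cp = ord(ch)
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "Latn"
    if 0x0370 <= cp <= 0x03FF:
        return "Grek"
    return COMMON

def fill(scripts):
    """Replace each Common entry with the nearest preceding real script."""
    out, last = [], COMMON
    for s in scripts:
        if s != COMMON:
            last = s
        out.append(last)
    return out

def segment(text, policy):
    """Split text into (script, substring) runs under the given policy."""
    scripts = [toy_script(c) for c in text]
    if policy == "previous":
        resolved = fill(scripts)                 # Common joins preceding run
        resolved = fill(resolved[::-1])[::-1]    # leading Common joins first run
    elif policy == "next":
        resolved = fill(scripts[::-1])[::-1]     # Common joins following run
        resolved = fill(resolved)                # trailing Common joins last run
    else:
        raise ValueError(f"unknown policy: {policy}")
    runs, pos = [], 0
    for script, group in groupby(resolved):
        n = len(list(group))
        runs.append((script, text[pos:pos + n]))
        pos += n
    return runs

print(segment("abc (αβγ)", "previous"))  # [('Latn', 'abc ('), ('Grek', 'αβγ)')]
print(segment("abc (αβγ)", "next"))      # [('Latn', 'abc'), ('Grek', ' (αβγ)')]
```

Under the first policy the parentheses end up in different runs, and under the second the space and both parentheses go to the Greek run. A font whose lookups expect the opening parenthesis next to the Greek text would see different input in each case, which is the unpredictability described above.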
It seems to me that standardising run segmentation may, as well as being practically useful, provide a useful test case for this group. It is a bite-size chunk of OTL processing, necessary but not overwhelming in scope. It would require applying the kinds of decisions that we have discussed in organisational meetings: determining the appropriate venue for the work, determining how it should be published, determining what input from current implementers is needed and available, negotiating possible breaking changes for some implementers, and developing both a written specification and a test suite.