diff --git a/modules/ROOT/pages/functions/string.adoc b/modules/ROOT/pages/functions/string.adoc index 81eaf618e..c06d2cb5a 100644 --- a/modules/ROOT/pages/functions/string.adoc +++ b/modules/ROOT/pages/functions/string.adoc @@ -153,13 +153,19 @@ RETURN ltrim(' hello') [[functions-normalize]] == normalize() -`normalize()` returns the given `STRING` normalized using the `NFC` normalization form. +`normalize()` returns the given `STRING` normalized using the `NFC` Unicode normalization form. + +[NOTE] +==== +Unicode normalization is a process that transforms different representations of the same string into a standardized form. +For more information, see the documentation for link:https://unicode.org/reports/tr15/#Norm_Forms[Unicode normalization forms]. +==== The `normalize()` function is useful for converting `STRING` values into comparable forms. When comparing two `STRING` values, it is their Unicode codepoints that are compared. -In Unicode, a codepoint for a character that looks the same may actually be represented by two, or more, different codepoints. -For example, the character `<` can be represented as `\uFE64` or `\u003C`. Visually, the character may look the same, -but if compared, Cypher will return false as `\uFE64` does not equal `\u003C`. Using the `normalize()` function one can +In Unicode, a character that looks the same may be represented by two or more different codepoints. +For example, the character `<` can be represented as `\uFE64` (﹤) or `\u003C` (<). Visually, the character may look the same, +but if compared, Cypher will return false as `\uFE64` does not equal `\u003C`. Using the `normalize()` function, it is possible to normalize the codepoint `\uFE64` to `\u003C`, creating a single codepoint representation, allowing them to be successfully compared. *Syntax:* @@ -225,16 +231,18 @@ RETURN normalize('\u212B') = '\u00C5' AS result `normalize()` returns the given `STRING` normalized using the specified normalization form.
The normalization form can be of type `NFC`, `NFD`, `NFKC` or `NFKD`. -There are two main types of normalization forms. One is based on the concept of canonical equivalence, and the other is based on compatibility. +There are two main types of normalization forms: -The two forms `NFC` (default) and `NFD` are forms of canonical equivalence. This means that codepoints which represent the same abstract character will -be normalized to the same codepoint. The same abstract character means that the character has the same visual appearance and behavior. -The difference between `NFC` and `NFD` is that `NFC` form will always give the *composed* canonical form, one where combined codes are replaced with the single representation, if possible. -Whereas, `NFD` gives the *decomposed* form, this is the opposite of composed, converting combined codepoints into the split form if possible. +* *Canonical equivalence*: `NFC` (default) and `NFD` are forms of canonical equivalence. +This means that codepoints that represent the same abstract character will +be normalized to the same codepoint (and have the same appearance and behavior). +The `NFC` form will always give the *composed* canonical form (in which the combined codes are replaced with a single representation, if possible). +The `NFD` form gives the *decomposed* form (the opposite of the composed form, which converts the combined codepoints into a split form if possible). -The two forms `NFKC` and `NFKD` are forms of compatibility normalization. All canonically equivalent sequences are compatible, but not all compatible sequences are canonical. -This means that a character normalized in `NFC` or `NFD` should also be normalized in `NFKC` and `NFKD`, and other characters that may have a slightly different visual appearance, -but are considered close enough in appearance to be compatibly equivalent. +* *Compatibility normalization*: `NFKC` and `NFKD` are forms of compatibility normalization.
+All canonically equivalent sequences are compatible, but not all compatible sequences are canonical. +This means that a character normalized in `NFC` or `NFD` should also be normalized in `NFKC` and `NFKD`. +Other characters with only slight differences in appearance should be compatibly equivalent. For example, the Greek Upsilon with Acute and Hook Symbol `ϓ` can be represented by the Unicode codepoint: `\u03D3`.
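
The difference between canonical and compatibility forms described in this hunk can be checked outside Cypher. The following is a minimal sketch that uses Python's standard `unicodedata` module rather than `normalize()` itself; the helper `codepoints` and the constant names are illustrative assumptions and are not part of the patched page. It suggests that `\u212B` is folded to `\u00C5` by the canonical `NFC` form, whereas `\uFE64` is only folded to `\u003C` by a compatibility form such as `NFKC`.

```python
import unicodedata


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Illustrative inputs taken from the codepoints mentioned in the page text.
ANGSTROM_SIGN = "\u212B"    # canonically equivalent to U+00C5
SMALL_LESS_THAN = "\uFE64"  # compatibility-equivalent to U+003C

# Canonical forms: NFC composes, NFD decomposes.
nfc_angstrom = unicodedata.normalize("NFC", ANGSTROM_SIGN)
nfd_angstrom = unicodedata.normalize("NFD", ANGSTROM_SIGN)

# A canonical form leaves U+FE64 alone; a compatibility form folds it.
nfc_less_than = unicodedata.normalize("NFC", SMALL_LESS_THAN)
nfkc_less_than = unicodedata.normalize("NFKC", SMALL_LESS_THAN)

for label, value in [
    ("NFC(U+212B)", nfc_angstrom),
    ("NFD(U+212B)", nfd_angstrom),
    ("NFC(U+FE64)", nfc_less_than),
    ("NFKC(U+FE64)", nfkc_less_than),
]:
    print(f"{label:13} -> {codepoints(value)}")
```

If Cypher's `normalize()` follows the same Unicode rules, comparing `\uFE64` with `\u003C` would need `NFKC` or `NFKD` rather than the default `NFC`; that is an inference from the Unicode data, not something this diff states.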