Give explanation on normalization

neo4j · Jan 15, 2024 · 5d44934
1 parent 14f5279
commit 5d44934
Showing 1 changed file with 27 additions and 0 deletions.
diff --git a/modules/ROOT/pages/functions/string.adoc b/modules/ROOT/pages/functions/string.adoc
@@ -155,6 +155,13 @@ RETURN ltrim('   hello')
 
 `normalize()` returns the given `STRING` normalized using the `NFC` normalization form.
 
+The `normalize()` function is useful for converting `STRING` values into comparable forms.
+When two `STRING` values are compared, it is their Unicode codepoints that are compared.
+In Unicode, characters that look the same may be represented by two or more different codepoints.
+For example, the character `Å` can be represented as `\u212B` (Angstrom Sign) or `\u00C5` (Latin Capital Letter A with Ring Above). Visually, the characters look the same,
+but if they are compared, Cypher will return `false` because `\u212B` does not equal `\u00C5`. The `normalize()` function
+normalizes the codepoint `\u212B` to `\u00C5`, giving both values the same codepoint representation so that they compare as equal.
+
 *Syntax:*
 
 [source, syntax, role="noheader"]
@@ -218,6 +225,26 @@ RETURN normalize('\u212B') = '\u00C5' AS result
 `normalize()` returns the given `STRING` normalized using the specified normalization form.
 The normalization form can be of type `NFC`, `NFD`, `NFKC` or `NFKD`.
 
+There are two main types of normalization forms. One is based on the concept of canonical equivalence, and the other is based on compatibility.
+
+The two forms `NFC` (default) and `NFD` are based on canonical equivalence. This means that codepoint sequences which represent the same abstract character
+are normalized to the same sequence. Canonically equivalent sequences have the same visual appearance and behavior.
+The difference between `NFC` and `NFD` is that `NFC` always gives the *composed* canonical form, in which sequences of combining codepoints are replaced with a single precomposed codepoint where possible.
+`NFD`, by contrast, gives the *decomposed* canonical form, in which precomposed codepoints are split into their component codepoints where possible.
+
+The two forms `NFKC` and `NFKD` are based on compatibility equivalence. All canonically equivalent sequences are also compatibility equivalent, but not all compatibility-equivalent sequences are canonically equivalent.
+This means that any character normalized by `NFC` or `NFD` is also normalized by `NFKC` and `NFKD`, and that these forms additionally normalize characters that may have a slightly different visual appearance
+but are considered close enough to be compatibility equivalent.
+
+For example, the Greek Upsilon with Acute and Hook Symbol `ϓ` can be represented by the Unicode codepoint: `\u03D3`.
+
+* Normalized in `NFC`: `\u03D3` Greek Upsilon with Acute and Hook Symbol (ϓ)
+* Normalized in `NFD`: `\u03D2\u0301` Greek Upsilon with Hook Symbol + Combining Acute Accent (ϓ)
+* Normalized in `NFKC`: `\u038E` Greek Capital Letter Upsilon with Tonos (Ύ)
+* Normalized in `NFKD`: `\u03A5\u0301` Greek Capital Letter Upsilon + Combining Acute Accent (Ύ)
+
+In the compatibility normalization forms (`NFKC` and `NFKD`), the character looks visually different because it no longer contains the hook symbol.
+
 *Syntax:*
 
 [source, syntax, role="noheader"]