Give explanation on normalization
gem-neo4j committed Jan 15, 2024
1 parent 14f5279 commit 5d44934
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions modules/ROOT/pages/functions/string.adoc
@@ -155,6 +155,13 @@ RETURN ltrim(' hello')

`normalize()` returns the given `STRING` normalized using the `NFC` normalization form.

The `normalize()` function is useful for converting `STRING` values into comparable forms.
When comparing two `STRING` values, it is their Unicode codepoints that are compared.
In Unicode, characters that look the same may be represented by two or more different codepoints.
For example, the character `Å` can be represented as `\u212B` (Angstrom sign) or `\u00C5` (Latin capital letter A with ring above). Visually, the two look the same,
but if compared, Cypher will return false as `\u212B` does not equal `\u00C5`. Using the `normalize()` function, one can
normalize the codepoint `\u212B` to `\u00C5`, creating a single codepoint representation, allowing them to be successfully compared.
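
The comparison problem can be reproduced outside Cypher. The following is a minimal sketch that assumes Python's standard `unicodedata` module as a stand-in for `normalize()`; it is not Cypher code, and the variable names are illustrative only. It uses the Angstrom sign pair `\u212B` / `\u00C5`, which `NFC` maps to the same codepoint.

```python
import unicodedata

# Two codepoints that both render as "Å".
angstrom_sign = "\u212B"   # ANGSTROM SIGN
a_with_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE

# A plain comparison works on codepoints, so the two strings differ.
raw_equal = angstrom_sign == a_with_ring

# After NFC normalization both sides are the same single codepoint.
normalized_equal = (
    unicodedata.normalize("NFC", angstrom_sign)
    == unicodedata.normalize("NFC", a_with_ring)
)

print(raw_equal)         # False
print(normalized_equal)  # True
```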

*Syntax:*

[source, syntax, role="noheader"]
@@ -218,6 +225,26 @@ RETURN normalize('\u212B') = '\u00C5' AS result
`normalize()` returns the given `STRING` normalized using the specified normalization form.
The normalization form can be of type `NFC`, `NFD`, `NFKC` or `NFKD`.

There are two main types of normalization forms. One is based on the concept of canonical equivalence, and the other is based on compatibility.

The two forms `NFC` (default) and `NFD` are based on canonical equivalence. This means that codepoint sequences which represent the same abstract character will
be normalized to the same codepoint sequence. The same abstract character means that the character has the same visual appearance and behavior.
The difference between `NFC` and `NFD` is that `NFC` always gives the *composed* canonical form, in which a base character followed by combining codepoints is replaced with a single precomposed codepoint, if one exists.
`NFD`, by contrast, gives the *decomposed* form, in which precomposed codepoints are split into a base character and combining codepoints, if possible.
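
As an illustration of composed versus decomposed forms, here is a small sketch using Python's standard `unicodedata` module (an assumption for demonstration purposes, not Cypher). The `codepoints` helper is hypothetical and only formats the result for reading.

```python
import unicodedata

def codepoints(text):
    """Return the codepoints of text as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u00E9"      # é as a single precomposed codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFC composes the pair into one codepoint; NFD splits it again.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(codepoints(nfc))  # ['U+00E9']
print(codepoints(nfd))  # ['U+0065', 'U+0301']
```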

The two forms `NFKC` and `NFKD` are based on compatibility equivalence. All canonically equivalent sequences are also compatibility equivalent, but not all compatibility equivalent sequences are canonically equivalent.
This means that sequences which `NFC` or `NFD` treat as equivalent are also treated as equivalent by `NFKC` and `NFKD`, and that these forms additionally unify characters that may have a slightly different visual appearance
but are considered close enough to be compatibility equivalent.
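
A short sketch of the distinction, assuming Python's standard `unicodedata` module rather than Cypher: the ligature `ﬁ` (`\uFB01`) is only compatibility equivalent to the two letters `fi`, so the canonical form leaves it alone while the compatibility form rewrites it. The ligature is an illustrative choice and does not come from the text above.

```python
import unicodedata

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI

# Canonical normalization keeps the ligature as it is.
canonical = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces it with the letters "f" and "i".
compatible = unicodedata.normalize("NFKC", ligature)

print(canonical == ligature)  # True
print(compatible == "fi")     # True
```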

For example, the Greek Upsilon with Acute and Hook Symbol `ϓ` can be represented by the Unicode codepoint: `\u03D3`.

* Normalized in `NFC`: `\u03D3` Greek Upsilon with Acute and Hook Symbol (ϓ)
* Normalized in `NFD`: `\u03D2\u0301` Greek Upsilon with Hook Symbol + Combining Acute Accent (ϓ)
* Normalized in `NFKC`: `\u038E` Greek Capital Letter Upsilon with Tonos (Ύ)
* Normalized in `NFKD`: `\u03A5\u0301` Greek Capital Letter Upsilon + Combining Acute Accent (Ύ)

In the compatibility normalization forms (`NFKC` and `NFKD`) the character looks visually different as it no longer contains the hook symbol.
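
The four results listed above can be cross-checked with a short sketch. It assumes Python's standard `unicodedata` module, which implements the same Unicode normalization forms; it is not Cypher, and the `codepoints` helper is a hypothetical name used only to print the result in `\uXXXX` notation.

```python
import unicodedata

def codepoints(text):
    """Render each character of text in \\uXXXX notation."""
    return "".join(f"\\u{ord(ch):04X}" for ch in text)

upsilon = "\u03D3"  # GREEK UPSILON WITH ACUTE AND HOOK SYMBOL

# Normalize the same input under each of the four forms.
forms = {
    form: codepoints(unicodedata.normalize(form, upsilon))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, result in forms.items():
    print(form, result)
# NFC \u03D3
# NFD \u03D2\u0301
# NFKC \u038E
# NFKD \u03A5\u0301
```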

*Syntax:*

[source, syntax, role="noheader"]