PR review changes
gem-neo4j committed Jan 15, 2024
1 parent 5d44934 commit 215eb9c
Showing 1 changed file with 20 additions and 12 deletions.
32 changes: 20 additions & 12 deletions modules/ROOT/pages/functions/string.adoc
@@ -153,13 +153,19 @@ RETURN ltrim(' hello')
 [[functions-normalize]]
 == normalize()

-`normalize()` returns the given `STRING` normalized using the `NFC` normalization form.
+`normalize()` returns the given `STRING` normalized using the `NFC` Unicode normalization form.

+[NOTE]
+====
+Unicode normalization is a process that transforms different representations of the same string into a standardized form.
+For more information, see the documentation for link:https://unicode.org/reports/tr15/#Norm_Forms[Unicode normalization forms].
+====
+
 The `normalize()` function is useful for converting `STRING` values into comparable forms.
 When comparing two `STRING` values, it is their Unicode codepoints that are compared.
-In Unicode, a codepoint for a character that looks the same may actually be represented by two, or more, different codepoints.
-For example, the character `<` can be represented as `\uFE64` or `\u003C`. Visually, the character may look the same,
-but if compared, Cypher will return false as `\uFE64` does not equal `\u003C`. Using the `normalize()` function one can
+In Unicode, a codepoint for a character that looks the same may be represented by two, or more, different codepoints.
+For example, the character `<` can be represented as `\uFE64` (﹤) or `\u003C` (<). Visually, the character may look the same,
+but if compared, Cypher will return false as `\uFE64` does not equal `\u003C`. Using the `normalize()` function, it is possible to
 normalize the codepoint `\uFE64` to `\u003C`, creating a single codepoint representation, allowing them to be successfully compared.

 *Syntax:*
@@ -225,16 +231,18 @@ RETURN normalize('\u212B') = '\u00C5' AS result
 `normalize()` returns the given `STRING` normalized using the specified normalization form.
 The normalization form can be of type `NFC`, `NFD`, `NFKC` or `NFKD`.

-There are two main types of normalization forms. One is based on the concept of canonical equivalence, and the other is based on compatibility.
+There are two main types of normalization forms:

-The two forms `NFC` (default) and `NFD` are forms of canonical equivalence. This means that codepoints which represent the same abstract character will
-be normalized to the same codepoint. The same abstract character means that the character has the same visual appearance and behavior.
-The difference between `NFC` and `NFD` is that `NFC` form will always give the *composed* canonical form, one where combined codes are replaced with the single representation, if possible.
-Whereas, `NFD` gives the *decomposed* form, this is the opposite of composed, converting combined codepoints into the split form if possible.
+* *Canonical equivalence*: The `NFC` (default) and `NFD` are forms of canonical equivalence.
+This means that codepoints that represent the same abstract character will
+be normalized to the same codepoint (and have the same appearance and behavior).
+The `NFC` form will always give the *composed* canonical form (in which the combined codes are replaced with a single representation, if possible).
+The `NFD` form gives the *decomposed* form (the opposite of the composed form, which converts the combined codepoints into a split form if possible).

-The two forms `NFKC` and `NFKD` are forms of compatibility normalization. All canonically equivalent sequences are compatible, but not all compatible sequences are canonical.
-This means that a character normalized in `NFC` or `NFD` should also be normalized in `NFKC` and `NFKD`, and other characters that may have a slightly different visual appearance,
-but are considered close enough in appearance to be compatibly equivalent.
+* *Compatibility normalization*: `NFKC` and `NFKD` are forms of compatibility normalization.
+All canonically equivalent sequences are compatible, but not all compatible sequences are canonical.
+This means that a character normalized in `NFC` or `NFD` should also be normalized in `NFKC` and `NFKD`.
+Other characters with only slight differences in appearance should be compatibly equivalent.

 For example, the Greek Upsilon with Acute and Hook Symbol `ϓ` can be represented by the Unicode codepoint: `\u03D3`.
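
Reviewer note: the four normalization forms described in this hunk can be tried outside Cypher. The sketch below is not part of the commit and does not call Neo4j. It assumes that Python's standard `unicodedata` module implements the same Unicode normalization forms, and the helper names `normalize` and `codepoints` are hypothetical, chosen only to mirror the documented function. In Python's implementation, `\u212B` (the codepoint used in the existing `normalize('\u212B') = '\u00C5'` example) changes under `NFC`, whereas `\uFE64` has only a compatibility mapping to `\u003C` and therefore changes under `NFKC` and `NFKD` but not under `NFC` or `NFD`.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Hypothetical stand-in for Cypher's normalize(); NFC mirrors the documented default."""
    return unicodedata.normalize(form, text)


def codepoints(text: str) -> list:
    """Render each character of text as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


def demo() -> None:
    samples = {
        "ANGSTROM SIGN": "\u212B",
        "SMALL LESS-THAN SIGN": "\uFE64",
        "GREEK UPSILON WITH ACUTE AND HOOK SYMBOL": "\u03D3",
    }
    for name, char in samples.items():
        print(name, codepoints(char))
        # Show what each normalization form does to the same input.
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            print(f"  {form}: {codepoints(normalize(char, form))}")


if __name__ == "__main__":
    demo()
```

Running the script prints the codepoints produced by each form, which makes the composed/decomposed and canonical/compatibility distinctions visible side by side for the Upsilon example introduced above.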
