Incorrect parsing of JSON strings with surrogate pair escape sequences #158

mbrock · 2023-03-01T22:19:39Z

The JSON string "\ud83d\udc95" has one codepoint, not two.

This is because the JSON spec allows characters outside the Basic Multilingual Plane to be escaped as a pair of 16-bit values, called a "surrogate pair".

From RFC 4627:

> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.  So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".

But SWI-Prolog's JSON parser reads that string as two (invalid) characters.
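
For reference, here is a minimal sketch of the decoding I would expect, written in Python rather than Prolog so it is easy to check independently. The helper name `combine_surrogates` is hypothetical and is not part of SWI-Prolog's JSON library; the arithmetic is the standard UTF-16 surrogate pair formula.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The G clef example from the RFC.
g_clef = combine_surrogates(0xD834, 0xDD1E)

# The string from this issue.
two_hearts = combine_surrogates(0xD83D, 0xDC95)

# Python's own JSON parser combines the pair into one character.
decoded = json.loads('"\\ud83d\\udc95"')

print(hex(g_clef), hex(two_hearts), len(decoded))
```

A parser that sees a `\uXXXX` escape in the high surrogate range can look ahead for a second `\uXXXX` escape in the low surrogate range and emit the combined code point instead of two separate ones.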

I have fixed this in my fork and will submit a pull request.


This commit fixes the JSON parser to handle such surrogate pairs.
