Incorrect parsing of JSON strings with surrogate pair escape sequences #158

Open
mbrock opened this issue Mar 1, 2023 · 0 comments

mbrock commented Mar 1, 2023

The JSON string "\ud83d\udc95" encodes one code point (U+1F495), not two.

This is because the spec allows characters outside the Basic Multilingual Plane to be escaped as a pair of 16-bit values, called a "surrogate pair".

From RFC 4627:

> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.  So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".

But SWI-Prolog's JSON parser reads that string as two (invalid) characters.
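
To make the expected behaviour concrete, here is a minimal Python sketch of how a decoder can combine a high and a low surrogate into one code point. It is not the SWI-Prolog code and not the fix from the fork; the names `combine_surrogates` and `decode_u_escapes` are hypothetical, and the decoder deliberately handles only `\uXXXX` escapes (no `\\`, `\n`, and so on) to keep the example short.

```python
import re

# UTF-16 surrogate ranges.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF

# Matches a single \uXXXX escape (hypothetical helper pattern).
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not LOW_START <= low <= LOW_END:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the code point above U+FFFF.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def decode_u_escapes(text: str) -> str:
    """Decode \\uXXXX escapes in text, pairing surrogates where possible.

    Simplification: other JSON escapes are left untouched, and an
    unpaired surrogate is passed through as a lone surrogate character.
    """
    out = []
    pos = 0
    while pos < len(text):
        match = _U_ESCAPE.match(text, pos)
        if match is None:
            out.append(text[pos])
            pos += 1
            continue
        unit = int(match.group(1), 16)
        pos = match.end()
        if HIGH_START <= unit <= HIGH_END:
            # Look ahead for a low surrogate escape immediately following.
            following = _U_ESCAPE.match(text, pos)
            if following is not None:
                low = int(following.group(1), 16)
                if LOW_START <= low <= LOW_END:
                    unit = combine_surrogates(unit, low)
                    pos = following.end()
        out.append(chr(unit))
    return "".join(out)
```

With this sketch, `decode_u_escapes(r"\ud83d\udc95")` returns a string of length 1 whose only code point is U+1F495, and the RFC's G clef example `\uD834\uDD1E` decodes to U+1D11E.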

I have fixed this in my fork and will submit a pull request.

mbrock added a commit to mbrock/swipl-http that referenced this issue Mar 1, 2023
The JSON string "\ud83d\udc95" has one codepoint, not two.

This is because the spec allows extended characters to be
encoded as a pair of 16-bit values, called a "surrogate pair".

From RFC 4627:

> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.  So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".

This commit fixes the JSON parser to handle such surrogate pairs.
JanWielemaker pushed a commit that referenced this issue Mar 2, 2023