genUnicodeString generates invalid unicode #167
Comments
On a related note, it appears that the unicode …
It's wrong as a bound, but 65536 is just getting turned into 65535 via …
Does something like this seem right? At least it passes the utf-8 round trip
Seems reasonable to me 👍
The unicode character generator picks a random `CodePoint` in the BMP. The unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid unicode via unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7

One solution here would be to restrict the code points to avoid such cases. Another would be to figure out a more complicated but correct way to generate unicode, which cannot be done `CodePoint` by `CodePoint`.

For context, I discovered this while trying to write a quickcheck test for utf8 encoding/decoding; you can see the failing test here
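To make the first option concrete, here is a minimal sketch in Python (not PureScript, and not this library's code) of restricting the generated code points so that the surrogate block U+D800..U+DFFF is never produced. All names (`gen_bmp_scalar`, `gen_unicode_string`, `utf8_round_trips`) are hypothetical and exist only for illustration. Note that a Python `str` is a sequence of code points rather than UTF-16 code units, so this only approximates the situation described above.

```python
import random

SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF
BMP_MAX = 0xFFFF  # inclusive upper bound of the BMP (65535, not 65536)


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the UTF-16 surrogate block."""
    return SURROGATE_LO <= cp <= SURROGATE_HI


def gen_bmp_scalar(rng: random.Random) -> int:
    """Pick a BMP code point uniformly, skipping the surrogate block."""
    n_surrogates = SURROGATE_HI - SURROGATE_LO + 1  # 2048 code points
    # Draw from a range shortened by the size of the gap (randint is inclusive),
    # then shift everything at or above the gap past it.
    cp = rng.randint(0, BMP_MAX - n_surrogates)
    return cp if cp < SURROGATE_LO else cp + n_surrogates


def gen_unicode_string(rng: random.Random, max_len: int = 16) -> str:
    """Build a string from independently generated non-surrogate code points."""
    length = rng.randint(0, max_len)
    return "".join(chr(gen_bmp_scalar(rng)) for _ in range(length))


def utf8_round_trips(s: str) -> bool:
    """Encode to UTF-8 and back; lone surrogates fail to encode."""
    try:
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeEncodeError:
        return False
```

With this restriction every generated string passes `utf8_round_trips`, whereas a string containing a lone surrogate such as `chr(0xD800)` does not. The trade-off is that the generator never produces characters outside the BMP; covering those is the second option mentioned above.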