Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using decodeUtf8 #53

Closed
pbrisbin opened this issue Jul 24, 2024 · 4 comments
Closed

Stop using decodeUtf8 #53

pbrisbin opened this issue Jul 24, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@pbrisbin
Copy link
Member

We use decodeUtf8 in a few places, which will throw on invalid data. Since we don't know what we're logging, and us throwing would be pretty disruptive, let's not do that. decodeUtf8With lenientDecode should be better, as that'll insert replacement characters instead.

@github-project-automation github-project-automation bot moved this to 👜 To do in Open Source Jul 24, 2024
@pbrisbin pbrisbin added the good first issue Good for newcomers label Jul 24, 2024
@pbrisbin pbrisbin moved this from 👜 To do to 👷 Shovel ready in Open Source Jul 24, 2024
@benjaminweb
Copy link

Nah, if the invalid data is due to the data being latin1, why not falling back to it?

utf8OrLatin1ToText :: BL.ByteString -> Text
utf8OrLatin1ToText bs = case decodeUtf8' bs of
        Left _ -> decodeLatin1 bs
        Right x -> x

@pbrisbin
Copy link
Member Author

if the invalid data is due to the data being latin1

Can you elaborate on why we should assume (or optimize for) that?

@benjaminweb
Copy link

My assumption might be wrong here (wrongly assuming it’s HTML, where it mostly is latin1 or utf8). Coming only from my recent experience with it with HTML.

What’s the cause of the invalid data in your case? Is it something different than utf8 or latin1?

@pbrisbin
Copy link
Member Author

I haven't encountered a known case, this was more of a hypothetical that the content "could be anything" since we don't control it as the 3rd-party library. At least, that's how I interpreted it. It was raised here by @chris-martin.

So, if the nature of this issue was "this can be anything", then a fix that assumed (or favored) latin1 felt like it was kind of missing the point.

But I thank you for your comment because it did get me to take a second look here. And this whole Issue is kind of silly.

This function is currently used on:

  • request method, which can only ever be GET, POST, etc...
  • response status message, which can only ever be OK, CREATED, SEE OTHER, etc
  • request path and query, which (as per the URI RFC) would be "composed from a limited set of characters consisting of digits, letters, and a few graphic symbols"

I think it's safe to say our current uses of decodeUtf8 will never see non-utf8 characters.

Given that, I'd be OK with using the throwing-version of decodeUtf8 here. But I guess we aren't even doing that, we were already using an alias that is decodeUtf8With lenientDecode anyway:

decodeUtf8 :: ByteString -> Text
decodeUtf8 = decodeUtf8With lenientDecode

@github-project-automation github-project-automation bot moved this from 👷 Shovel ready to ✅ Done in Open Source Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Status: ✅ Done
Development

No branches or pull requests

2 participants