Stop using decodeUtf8 #53

pbrisbin · 2024-07-24T18:39:23Z

We use decodeUtf8 in a few places, which will throw on invalid data. Since we don't know what we're logging, and us throwing would be pretty disruptive, let's not do that. decodeUtf8With lenientDecode should be better, as that'll insert replacement characters instead.
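For illustration, here is a minimal sketch of the difference, assuming the strict `Data.Text.Encoding` API from the `text` package. The names `decodeLenient`, `invalidSample`, and `strictFails` are hypothetical and exist only for this example; they are not taken from the library's code.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | Hypothetical helper: decode without throwing. Each invalid byte is
-- replaced by U+FFFD (the Unicode replacement character).
decodeLenient :: ByteString -> Text
decodeLenient = decodeUtf8With lenientDecode

-- | The bytes "caf", then 0xE9, then "!". A lone 0xE9 is Latin-1 for
-- 'é' but is not valid UTF-8 when followed by an ASCII byte.
invalidSample :: ByteString
invalidSample = BS.pack [0x63, 0x61, 0x66, 0xE9, 0x21]

-- | True when strict decoding rejects the input. decodeUtf8' reports the
-- failure as a Left value, where plain decodeUtf8 would throw.
strictFails :: ByteString -> Bool
strictFails = either (const True) (const False) . decodeUtf8'

-- | True when the lenient result contains a replacement character.
hasReplacement :: ByteString -> Bool
hasReplacement = T.any (== '\xFFFD') . decodeLenient
```

With these definitions, `strictFails invalidSample` is `True`, while `decodeLenient invalidSample` returns a `Text` with a replacement character where the bad byte was.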


benjaminweb · 2024-07-25T19:06:10Z

Nah, if the invalid data is due to the data being latin1, why not fall back to it?

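import qualified Data.ByteString.Lazy as BL
import Data.Text.Lazy (Text)  -- assuming lazy Text, to match BL.ByteString
import Data.Text.Lazy.Encoding (decodeLatin1, decodeUtf8')

utf8OrLatin1ToText :: BL.ByteString -> Text
utf8OrLatin1ToText bs = case decodeUtf8' bs of
  Left _ -> decodeLatin1 bs
  Right x -> x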
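To make the behaviour of this fallback concrete, here is a small sketch that repeats the function so the block is self-contained. It assumes the lazy `Text` API, since the signature uses `BL.ByteString`. The sample byte strings are made up for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Text.Lazy (Text)
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (decodeLatin1, decodeUtf8')

-- Same shape as the function above (assumption: lazy Text).
utf8OrLatin1ToText :: BL.ByteString -> Text
utf8OrLatin1ToText bs = case decodeUtf8' bs of
  Left _ -> decodeLatin1 bs
  Right x -> x

-- 'é' encoded as UTF-8 (two bytes) and as Latin-1 (one byte).
utf8Bytes, latin1Bytes :: BL.ByteString
utf8Bytes = BL.pack [0xC3, 0xA9]
latin1Bytes = BL.pack [0xE9]

-- Bytes that are not valid UTF-8 (0x82 cannot start a sequence) and are
-- not meaningful Latin-1 text either (0x82 is a C1 control code there).
otherBytes :: BL.ByteString
otherBytes = BL.pack [0x82, 0xA0]

-- decodeLatin1 never fails: it maps every byte to the code point with
-- the same value, so the fallback accepts otherBytes without complaint.
decodedOther :: Text
decodedOther = utf8OrLatin1ToText otherBytes
```

Both `utf8Bytes` and `latin1Bytes` decode to "é". `otherBytes` also decodes without error, to the two code points U+0082 and U+00A0, so the fallback cannot tell Latin-1 apart from any other non-UTF-8 input.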

pbrisbin · 2024-07-25T19:33:30Z

> if the invalid data is due to the data being latin1

Can you elaborate on why we should assume (or optimize for) that?

benjaminweb · 2024-07-25T21:07:37Z

My assumption might be wrong here: I was assuming it's HTML, which is mostly latin1 or utf8. That comes only from my recent experience with HTML.

What’s the cause of the invalid data in your case? Is it something different than utf8 or latin1?

pbrisbin · 2024-07-25T21:39:05Z

I haven't encountered a known case. This was more of a hypothetical: the content "could be anything", since we, as the 3rd-party library, don't control it. At least, that's how I interpreted it. It was raised here by @chris-martin.

So, if the nature of this issue was "this can be anything", then a fix that assumed (or favored) latin1 felt like it was kind of missing the point.

But I thank you for your comment because it did get me to take a second look here. And this whole Issue is kind of silly.

This function is currently used on:

- request method, which can only ever be GET, POST, etc.
- response status message, which can only ever be OK, CREATED, SEE OTHER, etc.
- request path and query, which (as per the URI RFC) would be "composed from a limited set of characters consisting of digits, letters, and a few graphic symbols"

I think it's safe to say our current uses of decodeUtf8 will never see non-utf8 characters.
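The reasoning rests on ASCII being a subset of UTF-8: a byte string made up only of 7-bit bytes always decodes strictly. A rough sketch of that check, assuming the strict `Data.Text.Encoding` API. The helper names and sample values are hypothetical, chosen to resemble the kinds of fields listed above.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Text.Encoding (decodeUtf8')

-- | True when every byte is in the 7-bit ASCII range.
isAsciiOnly :: ByteString -> Bool
isAsciiOnly = BS.all (< 0x80)

-- | True when strict UTF-8 decoding succeeds.
decodesStrictly :: ByteString -> Bool
decodesStrictly = either (const False) (const True) . decodeUtf8'

-- Hypothetical sample values: a method, two status messages, and a
-- percent-encoded path with query.
samples :: [ByteString]
samples = map BC.pack ["GET", "POST", "OK", "SEE OTHER", "/search?q=caf%C3%A9"]

-- | Every ASCII-only sample should also decode strictly.
allSafe :: Bool
allSafe = all (\bs -> isAsciiOnly bs && decodesStrictly bs) samples
```

Note that the non-ASCII character in the last sample arrives percent-encoded, so the raw bytes stay within ASCII.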

Given that, I'd be OK with using the throwing version of decodeUtf8 here. But I guess we aren't even doing that. We were already using a local alias named decodeUtf8 that is decodeUtf8With lenientDecode anyway:

blammo/Blammo-wai/src/Network/Wai/Middleware/Logging.hs

Lines 227 to 228 in ca12432

    decodeUtf8 :: ByteString -> Text
    decodeUtf8 = decodeUtf8With lenientDecode

pbrisbin added this to Open Source Jul 24, 2024

github-project-automation bot moved this to 👜 To do in Open Source Jul 24, 2024

pbrisbin added the good first issue Good for newcomers label Jul 24, 2024

pbrisbin moved this from 👜 To do to 👷 Shovel ready in Open Source Jul 24, 2024

pbrisbin closed this as completed Jul 25, 2024

github-project-automation bot moved this from 👷 Shovel ready to ✅ Done in Open Source Jul 25, 2024
