Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-Escaping strings using a provided buffer #83

Merged
merged 14 commits into from
Aug 6, 2024

Conversation

sammhicks
Copy link
Contributor

Here's a partial implementation of the design suggested in #79 (comment)_

How about the following design?:

  • The deserializer takes a shared &str, as per the design before this pull request.
  • When deserializing JSON strings, the deserializer scans for escape sequences
    • If there are no escape sequences, it calls visitor.visit_borrowed_str(v), which will allow zero-copy deserialization
    • If there are escape sequences, it decodes the escape sequence using a provided buffer, and calls visitor.visit_str(v)
  • serde_json_core has a EscapedString newtype struct which contains an escaped &str, with utility methods to iterator over it, where the iterator returns either a &str of characters with no escape sequences, or an unescaped char
    • EscapedString is the only structure that is allowed to borrow escaped string data, and uses a special constant to signal to the deserializer that it's special.

I believe that this design will also solve #74

An example usage of the new design can be seen at https://github.com/sammhicks/picoserve/tree/json-parse-demo

src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
Copy link
Member

@ryan-summers ryan-summers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this, its intriguing functionality. I have a few questions listed in-line below.

src/de/mod.rs Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
… the input when deserializing, and has an iterator of unescaped fragments. Useful for zero-copy deserialization.
src/de/mod.rs Outdated Show resolved Hide resolved
src/de/mod.rs Show resolved Hide resolved
src/de/mod.rs Show resolved Hide resolved
src/ser/mod.rs Show resolved Hide resolved
src/de/mod.rs Outdated Show resolved Hide resolved
src/str.rs Outdated Show resolved Hide resolved
src/str.rs Outdated Show resolved Hide resolved
@ryan-summers ryan-summers mentioned this pull request Aug 5, 2024
@ryan-summers
Copy link
Member

Pending a doc update here, this all looks good to me! Sorry for the sporadic reviews, I've been quite busy recently.

@sammhicks sammhicks changed the title Draft - De-Escaping strings using a provided buffer De-Escaping strings using a provided buffer Aug 6, 2024
Copy link
Member

@ryan-summers ryan-summers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

@ryan-summers ryan-summers merged commit 534a12a into rust-embedded-community:master Aug 6, 2024
7 checks passed
@sammhicks
Copy link
Contributor Author

Any chance you could publish a release so that I can integrate JSON deserialization into picoserve?

@ryan-summers
Copy link
Member

Indeed, I'll publish a release as soon as #91 gets reviewed. I also realized that this only handles de-serializing escaped strings, but we still don't have any support for serializing them with escaping. I'll add an issue for this. Thanks for the work on this!

@jordens
Copy link
Contributor

jordens commented Aug 14, 2024

I'm unsure about the rationale behind this (very big) change. Why is this approach preferrable over the one in serde_json, where it supports zero-copy no-escape deserialization of &str and escape-capable deserialization of owned String? Why not let a target heapless::String be the "provided serialization buffer". It certainly can't be better in terms of lifetimes as they need to unify anyway.

@sammhicks
Copy link
Contributor Author

Why is this approach preferrable over the one in serde_json, where it supports zero-copy no-escape deserialization of &str

This can be achieved using EscapedStr, or if the source has no escape sequences. This is similar to serde_json, which only allows borrowing from the input of there are no escape sequences.

escape-capable deserialization of owned String?

Deserializing a struct with a alloc::string::String or heapless::String will involve their visitors being called with visit_str if the string is escaped or visit_borrowed_str, if it's not, which will cause the String to take ownership.

Why not let a target heapless::String be the "provided serialization buffer"

This would involve Deserializer having an additional generic parameter, which would be a breaking change.

@jordens
Copy link
Contributor

jordens commented Aug 14, 2024

This can be achieved using EscapedStr, or if the source has no escape sequences. This is similar to serde_json, which only allows borrowing from the input of there are no escape sequences.

Why is there a need for deviation from the serde_json behavior?

Isn't this case (EscapedStr) where you absolutely need a &str (as opposed to a heapless/alloc String<N>) and need to be escape-capable a bit exotic and much beyond what serde_json supports? (1) You end up having to maintain the scratch buffer along with the value. (2) The amount of copying/scanning looks to be the same as with a transient scratch buffer (3) You need to reason about the buffer size if there are multiple escaped strings borrowing from it.

if it's not, which will cause the String to take ownership.

Sure. That's inherent. String (heapless or std) always ends up owning its data. They only implement visit_str by copying and don's special-case visit_borrowed_str.

This would involve Deserializer having an additional generic parameter, which would be a breaking change.

That's not what I mean. I'm proposing (1) making your change look a bit more like the scratch Vec in serde_json (just support transient de-escaping: same purpose, same functionality), (2) fixing #74 by behaving like serde_json, having deserialize_str scan for escape sequences first and then dispatch to visit_borrowed_str if it can be borrowed, else try un-escape using the scratch buffer and call visit_str, else error out the same way serde_json does.

I agree that a scratch buffer is needed (and it should be optional and borrowed, not owned, just as you implement it). I just don't get the other deviations from the serde_json behavior. And I don't get EscapedStr.

@sammhicks
Copy link
Contributor Author

serde_json is designed for systems where it's reasonable to require applications to allocate heap memory to use functionality, serde_json_core is designed for systems where, for example, there may be enough memory to store an incoming payload, but not enough to also store buffers for storing strings.

Suppose that a user wishes to deserialize a structure containing a list of strings, some of which may include escape sequences when JSON encoded. Without EscapedStr, the user must use something along the lines of heapless::Vec<heapless::String<32>,16>, most of which is unused space, unless the list happens to be of length almost equal to 16, and all strings happen to be of length almost equal to 32. This structure must also exist at the same time as the buffer which the JSON encoded text was stored, and thus this code is not particularly space-efficient. If however, the data type becomes heapless::Vec<EscapedStr>, and as EscapedStr is a lot smaller than heapless::String<32>, the code is much more space-efficient.

@jordens
Copy link
Contributor

jordens commented Aug 14, 2024

Right. This is an allocator!

The real-world cases where the maximum sum of lengths is known to be significantly less than the sum of maximum lengths such that this makes a difference are exceedingly exotic.

It doesn't justify the practical disadvantages and unergonomics of having to keep the allocator arena alive and having to reason about its size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants