Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for deserializing list-encoded JSON structs [#6558] #6643

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jagill
Copy link

@jagill jagill commented Oct 29, 2024

Which issue does this PR close?

Closes #6558.

Rationale for this change

Currently, a StructArray can only be deserialized from a JSON object (e.g. {a: 1, b: "c"}), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. [1, "c"]) because this is more compact, and the schema is known.
Arrow-json cannot currently deserialize these.

What changes are included in this PR?

This PR adds the ability to parse JSON lists into StructArrays, if the StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior.

Are there any user-facing changes?

Users may set the StructParsingMode enum to ListOnly to parse list-style structs. The associated enum,
variants, and method have been documented. I'm happy to update any other documentation.

Discussion topics

  1. I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.
  2. I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.
  3. I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.
  4. One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 29, 2024
@jagill jagill force-pushed the json-struct-from-list branch from 8fc7592 to 7181f92 Compare October 29, 2024 15:20
@tustvold
Copy link
Contributor

tustvold commented Nov 8, 2024

Sorry this is on my list to look into, but I've struggled to find time to look into it yet... I appreciate your patience and restraint from repeatedly tagging me (unlike some people are wont to do).

@jagill
Copy link
Author

jagill commented Dec 5, 2024

Are there any changes I could make to this PR to make it easier to review?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @jagill and i very much apologize for the delay in reviewing -- we are always looking for more help:

TLDR thoughts:

  1. Are you planning / willing to implement serializing support too (aka support in json_writer)? That would then allow us to verify that data can be round-tripped
  2. I believe this PR slows down performance (notes below)

API

I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.

I agree that a struct is nice for the reasons you mention

Errors

I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.

I think improved error messages are always better, but we can do them as a follow on PR

I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.

💯

Performance

One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.

I am not sure what benchmarks you ran -- I think the most relevant ones are in arrow/src/json_reader.rs

I ran them like cargo bench --bench json_reader

Here are three back to back runs I did on my test bench:

++ critcmp master json-struct-from-list
group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.17      4.2±0.05ms        ? ?/sec    1.00      3.6±0.01ms        ? ?/sec
small_bench_list         1.03     14.5±0.05µs        ? ?/sec    1.00     14.1±0.17µs        ? ?/sec
small_bench_primitive    1.00      7.2±0.04µs        ? ?/sec    1.00      7.2±0.02µs        ? ?/sec
group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.16      4.2±0.03ms        ? ?/sec    1.00      3.6±0.01ms        ? ?/sec
small_bench_list         1.03     14.5±0.05µs        ? ?/sec    1.00     14.0±0.13µs        ? ?/sec
small_bench_primitive    1.00      7.2±0.03µs        ? ?/sec    1.01      7.2±0.03µs        ? ?/sec
++ critcmp master json-struct-from-list
group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.15      4.2±0.01ms        ? ?/sec    1.00      3.7±0.01ms        ? ?/sec
small_bench_list         1.00     14.4±0.10µs        ? ?/sec    1.00     14.4±0.08µs        ? ?/sec
small_bench_primitive    1.00      7.1±0.03µs        ? ?/sec    1.06      7.6±0.05µs        ? ?/sec
++ popd
~/datafusion-benchmarking

It shows a pretty consistent slowdown in large_bench_primitive -- I left some suggestions on how to improve this

Again, sorry for the delay in reviewing

@@ -71,42 +75,70 @@ impl ArrayDecoder for StructArrayDecoder {
.then(|| BooleanBufferBuilder::new(pos.len()));

for (row, p) in pos.iter().enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance wise, this code now has another conditional in the hot loop (each row)

        for (row, p) in pos.iter().enumerate() {
         if self.struct_parse_mode { 
           ..
        }
      }

I suspect the code would be fast (and easier to understand / cleaner) if you hoisted the check, something like

         if self.struct_parse_mode { 
            for (row, p) in pos.iter().enumerate() {
             ..
            }
          else {
                   for (row, p) in pos.iter().enumerate() {
             ..
            }
      }

@@ -170,12 +170,30 @@ mod struct_array;
mod tape;
mod timestamp_array;

/// Specifies what is considered valid JSON when parsing StructArrays.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link to documentation / prior art for this type of JSON encoding of structs? Maybe to trino / presto docs that describe it in more full detail that we can add to this doc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, the documentation of the format of the returned data is a little lacking. They describe that the top-level row is a list with one entry per column (https://prestodb.io/docs/current/develop/client-protocol.html#important-queryresults-attributes, https://trino.io/docs/current/develop/client-protocol.html#important-queryresults-attributes), but do not specify how any structural elements are serialized (or non-structural things like Datetimes etc). If it helps, I could add this to the Presto documentation.

Currently, a StructArray can only be deserialized from a JSON object
(e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino)
encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more
compact, and the schema is known.

This PR adds the ability to parse JSON lists into StructArrays, if the
StructParseMode is set to ListOnly.  In ListOnly mode, object-encoded
structs raise an error.  Setting to ObjectOnly (the default) has the
original parsing behavior.

Some notes/questions/points for discussion:
1. I've made a JsonParseMode struct instead of a bool flag for two
   reasons.  One is that it's self-descriptive (what would `true` be?),
   and the other is that it allows a future Mixed mode that could
   deserialize either.  The latter isn't currently requested by anyone.
2. I kept the error messages as similar to the old messages as possible.
   I considered having more specific error messages (like "Encountered a
   '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or
   similar), but wanted to hear opinions before I went that route.
3. I'm not attached to any name/code-style/etc, so happy to modify to
   fit local conventions.

Fixes apache#6558
@jagill jagill force-pushed the json-struct-from-list branch from 7181f92 to c3b5fc9 Compare December 18, 2024 16:24
@jagill
Copy link
Author

jagill commented Dec 18, 2024

Ok, I've hoisted the conditional outside the loop. It also had the effect (as you predicted) that the gnarly match statement for the delimiters is a bit easier to understand.

  1. Are you planning / willing to implement serializing support too (aka support in json_writer)? That would then allow us to verify that data can be round-tripped

I was not planning to, because there was no use-case requiring it. But I'm certainly willing to, both for symmetry and for round-tripping. Is that something you'd want in this PR, or a follow-up?

  1. I believe this PR slows down performance (notes below)

When I check, I'm consistently seeing no difference between the main branch and either of the variants:

❯ critcmp main inner-conditional
group                    inner-conditional                      main
-----                    -----------------                      ----
large_bench_primitive    1.00      2.7±0.01ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
small_bench_list         1.01      8.4±0.09µs        ? ?/sec    1.00      8.2±0.06µs        ? ?/sec
small_bench_primitive    1.01      4.6±0.03µs        ? ?/sec    1.00      4.6±0.04µs        ? ?/sec

❯ critcmp main outer-conditional
group                    outer-conditional                      main
-----                    -----------------                      ----
large_bench_primitive    1.00      2.7±0.03ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
small_bench_list         1.00      8.3±0.11µs        ? ?/sec    1.00      8.2±0.06µs        ? ?/sec
small_bench_primitive    1.00      4.6±0.05µs        ? ?/sec    1.00      4.6±0.04µs        ? ?/sec

❯ critcmp inner-conditional outer-conditional
group                    outer-conditional                      inner-conditional
-----                    -----------------                      -----------------
large_bench_primitive    1.00      2.7±0.03ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
small_bench_list         1.00      8.3±0.11µs        ? ?/sec    1.01      8.4±0.09µs        ? ?/sec
small_bench_primitive    1.00      4.6±0.05µs        ? ?/sec    1.01      4.6±0.03µs        ? ?/sec

(btw, I didn't know about critcmp; thanks for the pointer!)

However, I'm not running in an isolated environment, so the differences I saw in the past seemed to be related to other processes. I'm running on a M2 Mac and haven't tried on with x86 or other architectures, and haven't been able to isolate from the helpful random processes that Macos spins up, so I'll trust your benchmarks.

@alamb
Copy link
Contributor

alamb commented Dec 24, 2024

Thank you very much again @jagill for this contribution.

Sorry for the delay in reviewing / that this PR is somewhat stuck

From my perspective having partial support for a feature (list encoded json structs) is non ideal and will increase the maintenance burden if we are not careful. I think it runs the risk of being a little used / unknown feature.

So in order to move this forward I think we need:

  1. Additional documentation explaining the feature along with its rationale and an example. Perhaps as an example on https://docs.rs/arrow-json/54.0.0/arrow_json/reader/struct.ReaderBuilder.html -- for rationale you could repurpose some of the comments / dicussion in this PR linked to the trino docs, etc
  2. File a ticket (maybe even prepare a PR) to track the work to add writing this version of JSON

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow JSON deserialization of StructArray from JSON List
3 participants