-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add substrait tpch round trip tests from sql query #13888
base: main
Are you sure you want to change the base?
Add substrait tpch round trip tests from sql query #13888
Conversation
l_linestatus | ||
ORDER BY | ||
l_returnflag, | ||
l_linestatus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that there are already have copies of the TCPH queries in the repo:
https://github.com/apache/datafusion/tree/main/benchmarks/queries
I think it would be better to re-use them if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the schemas are also available as well.
datafusion/benchmarks/src/tpch/mod.rs
Lines 44 to 45 in 405b99c
/// Get the schema for the benchmarks derived from TPC-H | |
pub fn get_tpch_table_schema(table: &str) -> Schema { |
There are already some tests for TPCH functionality in https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/consumer_integration.rs, but IMO those are weaker than what you're proposing because they only verify that we can read the TPCH plans that have been generated in https://github.com/substrait-io/consumer-testing/tree/7c1f5f1876f00c2685f722b592dbd00030662d5d/substrait_consumer/testdata/integration/tpch The full roundtrip you're proposing is more comprehensive and lets us find bugs in both consumer and producer.
It looks like your approach has already found some missing features 🐛 🔨
In many cases plans roundtrip the same, but not necessarily in every case. DataFusion might consume a plan and then emit a plan that is semantically equivalent but has different nodes. For example, all emit kind remaps might be converted into projections. It is a desirable property that plans roundtrip the same, and probably something we can work towards in most cases. |
I like the idea, more testing the better! We already have some Substrait TCP testing, but I think that's from "known Substrait" -> DF, so it only tests the consumer, while this would test the producer as well. I think roughly there should be three levels of "equality":
I feel like (2) should be "easier" check than (3), though the tests you have here show that some plans would pass unoptimized but not optimized, which confuses me. Overall, I think (3) is nice to have, but not something I'd spent too much effort for - it doesn't really bring that much benefits to users compared to (2). (2) is nice compared to (1) in that it guarantees a bit more correctness, but in many cases it's unfortunately tough to support even (2) (even in theory, roundtrip isn't lossless, since multiple DF plans may result in the same Substrait plan, and the other way around too). So I'd suggest splitting the TPCH tests into 4 categories:
That allows merging the tests, preventing regressions, and tracking needed improvements :) |
Haha, looks like @vbarua commented pretty pretty much the same thing while I was writing my own reply! 😄 |
Thank you @vbarua and @Blizzara for your review and comments! Yes, I think the existing tests do not go far enough, and I encountered bugs not covered by these tests already, which is why i wrote these. I did not notice the existing schemas and sql in the benchmarks crate. I'll look to reuse them. Should we move them somewhere so that bench and test can access them without referring to each other? Lastly, is it convention that we would add |
I think that is a reasonable approach -- thank you. Please also leave a link to the relevant github issue in the comments so future readers can find the relevant issue |
(I really like the idea of merging the tests, even if they don't all pass, in one PR and then working on fixes to the tests as additional follow on PRs) |
Which issue does this PR close?
I've been investigating and experimenting with federating tpch query plans and sending the federated portion downstream encoded as substrait protos (Side node, is this useful or worth sharing?). When doing this I discovered the bug in issue #13860 . CC @alamb @Blizzara as we discussed the PR for that issue, and your input would be appreciated here as well.
In the course of that work, I discovered that there isn't testing coverage for round trip from TPCH query -> logical plan -> substrait -> logical plan.
My understanding of how do this is the following:
After this the recreated plan should be the same and produce the same result when executed. This is my understanding from looking at tests for round trip in to substrait for arbitrary queries from https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266
This has worked for me when federating portions of the plan, and out of curiosity I wanted to know if it works for the entire plan. It does not work for the entire plan in most cases as these tests show. In most cases it looks like we need to support more of the
LogicalPlan
nodes when serializing to substrait.So, I've added this round trip for all TPCH queries. I generated the TPCH queries from https://duckdb.org/docs/extensions/tpch.html. I then generated serialized json schemas which can be loaded into a context with a fake
TableProvider
.The other way I could see this happening is to skip optimization before, and compared the unoptimized plan upon deserialization for correctness. Then optimize and execute.
I've added tests for both paths, but if the unoptimized way is not intended to work, then we can remove those.
If my understanding of how this works is correct, then these tests are probably useful. If it is not, then I hope that conversation here can help me understand it and I'll improve the tests. 😄
Rationale for this change
More clearly highlight and identify missing features in substrait serialization.
What changes are included in this PR?
tests added
Are these changes tested?
CI will test them
Are there any user-facing changes?
No, just tests.