Add substrait tpch round trip tests from sql query #13888

robtandy · 2024-12-23T15:27:33Z

Which issue does this PR close?

I've been investigating and experimenting with federating tpch query plans and sending the federated portion downstream encoded as substrait protos (Side note: is this useful or worth sharing?). When doing this I discovered the bug in issue #13860. CC @alamb @Blizzara as we discussed the PR for that issue, and your input would be appreciated here as well.

In the course of that work, I discovered that there isn't testing coverage for round trip from TPCH query -> logical plan -> substrait -> logical plan.

My understanding of how to do this is the following:

1. generate optimized logical plan from tpch query
2. encode as substrait
3. send elsewhere
4. decode into logical plan
5. optimize this logical plan, as our local table providers are different than the ones where the previous optimization was made

After this the recreated plan should be the same and produce the same result when executed. This is my understanding from looking at the tests that round trip arbitrary queries through substrait in https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266
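The five steps above can be sketched in plain Rust to show the order of operations. This is only an illustration: `Plan`, `Encoded`, and `roundtrip` are hypothetical stand-ins (not DataFusion or Substrait types), and the `optimize`, `encode`, and `decode` closures are assumptions that a real test would replace with the actual optimizer, producer, and consumer.

```rust
/// Hypothetical stand-in for a logical plan: only its textual rendering.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Plan(String);

/// Hypothetical stand-in for a serialized Substrait plan.
struct Encoded(Vec<u8>);

/// Runs the five steps and returns the plan that was sent together with
/// the plan that was recreated and re-optimized on the receiving side.
fn roundtrip<O, E, D>(
    query_plan: Plan,
    optimize: O,
    encode: E,
    decode: D,
) -> Result<(Plan, Plan), String>
where
    O: Fn(&Plan) -> Plan,
    E: Fn(&Plan) -> Result<Encoded, String>,
    D: Fn(&Encoded) -> Result<Plan, String>,
{
    // 1. Optimize the plan on the sending side.
    let sent = optimize(&query_plan);
    // 2. Encode the optimized plan.
    let bytes = encode(&sent)?;
    // 3. "Send elsewhere": in this sketch the bytes are simply handed over.
    // 4. Decode back into a plan.
    let received = decode(&bytes)?;
    // 5. Optimize again on the receiving side. A real setup would use the
    //    receiver's own context here; this sketch reuses one optimizer.
    let local = optimize(&received);
    Ok((sent, local))
}
```

A test would then compare the two returned plans, and an error from `encode` or `decode` would surface as a failed roundtrip rather than a plan mismatch.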

This has worked for me when federating portions of the plan, and out of curiosity I wanted to know if it works for the entire plan. It does not work for the entire plan in most cases as these tests show. In most cases it looks like we need to support more of the LogicalPlan nodes when serializing to substrait.

So, I've added this round trip for all TPCH queries. I generated the TPCH queries from https://duckdb.org/docs/extensions/tpch.html. I then generated serialized json schemas which can be loaded into a context with a fake TableProvider.

The other way I could see this happening is to skip optimization beforehand and compare the unoptimized plan upon deserialization for correctness, then optimize and execute.

I've added tests for both paths, but if the unoptimized way is not intended to work, then we can remove those.

If my understanding of how this works is correct, then these tests are probably useful. If it is not, then I hope that conversation here can help me understand it and I'll improve the tests. 😄

Rationale for this change

More clearly highlight and identify missing features in substrait serialization.

What changes are included in this PR?

tests added

Are these changes tested?

CI will test them

Are there any user-facing changes?

No, just tests.

vbarua · 2024-12-23T19:20:46Z

datafusion/substrait/tests/testdata/tpch_queries/query_01.sql

+    l_linestatus
+ORDER BY
+    l_returnflag,
+    l_linestatus;


I believe there are already copies of the TPCH queries in the repo:
https://github.com/apache/datafusion/tree/main/benchmarks/queries

I think it would be better to re-use them if possible.

It looks like the schemas are available as well.

datafusion/benchmarks/src/tpch/mod.rs

Lines 44 to 45 in 405b99c

/// Get the schema for the benchmarks derived from TPC-H

pub fn get_tpch_table_schema(table: &str) -> Schema {

vbarua · 2024-12-23T19:36:37Z

There are already some tests for TPCH functionality in https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/consumer_integration.rs, but IMO those are weaker than what you're proposing because they only verify that we can read the TPCH plans that have been generated in https://github.com/substrait-io/consumer-testing/tree/7c1f5f1876f00c2685f722b592dbd00030662d5d/substrait_consumer/testdata/integration/tpch

The full roundtrip you're proposing is more comprehensive and lets us find bugs in both consumer and producer.

> In most cases it looks like we need to support more of the LogicalPlan nodes when serializing to substrait.

It looks like your approach has already found some missing features 🐛 🔨

> After this the recreated plan should be the same

In many cases plans roundtrip the same, but not necessarily in every case. DataFusion might consume a plan and then emit a plan that is semantically equivalent but has different nodes. For example, all emit kind remaps might be converted into projections. It is a desirable property that plans roundtrip the same, and probably something we can work towards in most cases.

Blizzara · 2024-12-23T19:41:47Z

I like the idea, the more testing the better! We already have some Substrait TPCH testing, but I think that's from "known Substrait" -> DF, so it only tests the consumer, while this would test the producer as well.

I think roughly there should be three levels of "equality":

1. a plan goes through roundtrip without crashing, but doesn't match equality (this doesn't necessarily mean the result is correct, but it shows that there aren't any totally unsupported LogicalPlan nodes)
2. "optimized" plan equality - calling optimize on both plans before comparing, like you do.
3. "unoptimized" plan equality - comparing the plans directly.

I feel like (2) should be an "easier" check than (3), though the tests you have here show that some plans would pass unoptimized but not optimized, which confuses me.

Overall, I think (3) is nice to have, but not something I'd spend too much effort on - it doesn't really bring that much benefit to users compared to (2). (2) is nice compared to (1) in that it guarantees a bit more correctness, but in many cases it's unfortunately tough to support even (2) (even in theory, roundtrip isn't lossless, since multiple DF plans may result in the same Substrait plan, and the other way around too).

So I'd suggest splitting the TPCH tests into 4 categories:

1. those that pass (3)
2. those that pass (2)
3. those that pass (1)
   - maybe eyeballed into a) those that look correct after roundtrip even if the plan doesn't fully match, and b) those that look incorrect, if any
4. those that fail to convert for the roundtrip

That allows merging the tests, preventing regressions, and tracking needed improvements :)
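As a rough illustration of how a test harness could sort each query into these four categories, here is a minimal sketch in plain Rust. Everything in it is an assumption made for the example: plans are compared by their textual rendering, and `RoundtripOutcome`, `classify`, and the `optimize` closure are hypothetical names rather than DataFusion APIs.

```rust
/// The four categories, from strongest to weakest guarantee.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RoundtripOutcome {
    /// Passes (3): plans match without any optimization.
    UnoptimizedEqual,
    /// Passes (2): plans match only after optimizing both sides.
    OptimizedEqual,
    /// Passes (1): the roundtrip completes, but the plans differ.
    ConvertsOnly,
    /// The roundtrip itself returns an error.
    FailsToConvert,
}

/// Classifies one roundtrip. `original` and the recreated plan are the
/// textual renderings of the plans; `optimize` stands in for the optimizer.
fn classify<F>(
    original: &str,
    roundtripped: Result<&str, String>,
    optimize: F,
) -> RoundtripOutcome
where
    F: Fn(&str) -> String,
{
    let recreated = match roundtripped {
        Ok(plan) => plan,
        Err(_) => return RoundtripOutcome::FailsToConvert,
    };
    if original == recreated {
        RoundtripOutcome::UnoptimizedEqual
    } else if optimize(original) == optimize(recreated) {
        RoundtripOutcome::OptimizedEqual
    } else {
        RoundtripOutcome::ConvertsOnly
    }
}
```

In this toy model a deterministic `optimize` means anything that passes (3) also passes (2). The tests in this PR show that this does not always hold for real plans, so an actual harness would probably record the two checks independently.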

Blizzara · 2024-12-23T19:50:35Z

Haha, looks like @vbarua commented pretty much the same thing while I was writing my own reply! 😄

robtandy · 2024-12-23T20:18:07Z

Thank you @vbarua and @Blizzara for your review and comments!

Yes, I think the existing tests do not go far enough, and I encountered bugs not covered by these tests already, which is why I wrote these.

I did not notice the existing schemas and sql in the benchmarks crate. I'll look to reuse them. Should we move them somewhere so that bench and test can access them without referring to each other?

Lastly, is it convention that we would add #[ignore] to the cases that do not pass at the moment, and open those up as we fix bugs?

alamb · 2024-12-24T14:28:32Z

> Lastly, is it convention that we would add #[ignore] to the cases that do not pass at the moment, and open those up as we fix bugs?

I think that is a reasonable approach -- thank you.

Please also leave a link to the relevant GitHub issue in the comments so future readers can find it.
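For illustration, a currently failing case could be marked along these lines. This is a sketch only: `assert_roundtrip` is a hypothetical placeholder for the real roundtrip check, `query_02.sql` is an assumed file name, and `<issue-number>` is deliberately left unfilled because it depends on which issue tracks the failure.

```rust
/// Hypothetical placeholder for the real roundtrip assertion. Here it only
/// checks that a query file name was supplied.
fn assert_roundtrip(query_file: &str) -> Result<(), String> {
    if query_file.is_empty() {
        return Err("no query file given".to_string());
    }
    Ok(())
}

#[test]
fn tpch_q1_roundtrip() {
    assert_roundtrip("query_01.sql").unwrap();
}

// Ignored until the roundtrip for this query is fixed.
// Tracking issue: https://github.com/apache/datafusion/issues/<issue-number>
#[test]
#[ignore = "fails roundtrip; see the tracking issue linked above"]
fn tpch_q2_roundtrip() {
    assert_roundtrip("query_02.sql").unwrap();
}
```

Ignored tests are skipped by a plain `cargo test` and can still be run on demand with `cargo test -- --ignored`, which makes it easy to check whether a follow-on fix lets a case be re-enabled.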

alamb · 2024-12-24T14:29:13Z

(I really like the idea of merging the tests, even if they don't all pass, in one PR and then working on fixes to the tests as additional follow on PRs)

lots of substrait tpch round trip tests

5d585a0

github-actions bot added the substrait label Dec 23, 2024
