Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add substrait tpch round trip tests from sql query #13888

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

robtandy
Copy link
Contributor

Which issue does this PR close?

I've been investigating and experimenting with federating tpch query plans and sending the federated portion downstream encoded as substrait protos (Side node, is this useful or worth sharing?). When doing this I discovered the bug in issue #13860 . CC @alamb @Blizzara as we discussed the PR for that issue, and your input would be appreciated here as well.

In the course of that work, I discovered that there isn't testing coverage for round trip from TPCH query -> logical plan -> substrait -> logical plan.

My understanding of how do this is the following:

  • generate optimized logical plan from tpch query
  • encode as substrait
  • send elsewhere
  • decode into logical plan
  • optimize this logical plan as our local table providers are different than the ones where the previous optimization was made

After this the recreated plan should be the same and produce the same result when executed. This is my understanding from looking at tests for round trip in to substrait for arbitrary queries from https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266

This has worked for me when federating portions of the plan, and out of curiosity I wanted to know if it works for the entire plan. It does not work for the entire plan in most cases as these tests show. In most cases it looks like we need to support more of the LogicalPlan nodes when serializing to substrait.

So, I've added this round trip for all TPCH queries. I generated the TPCH queries from https://duckdb.org/docs/extensions/tpch.html. I then generated serialized json schemas which can be loaded into a context with a fake TableProvider.

The other way I could see this happening is to skip optimization before, and compared the unoptimized plan upon deserialization for correctness. Then optimize and execute.

I've added tests for both paths, but if the unoptimized way is not intended to work, then we can remove those.

If my understanding of how this works is correct, then these tests are probably useful. If it is not, then I hope that conversation here can help me understand it and I'll improve the tests. 😄

Rationale for this change

More clearly highlight and identify missing features in substrait serialization.

What changes are included in this PR?

tests added

Are these changes tested?

CI will test them

Are there any user-facing changes?

No, just tests.

l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that there are already have copies of the TCPH queries in the repo:
https://github.com/apache/datafusion/tree/main/benchmarks/queries

I think it would be better to re-use them if possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the schemas are also available as well.

/// Get the schema for the benchmarks derived from TPC-H
pub fn get_tpch_table_schema(table: &str) -> Schema {

@vbarua
Copy link
Contributor

vbarua commented Dec 23, 2024

There are already some tests for TPCH functionality in https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/consumer_integration.rs, but IMO those are weaker than what you're proposing because they only verify that we can read the TPCH plans that have been generated in https://github.com/substrait-io/consumer-testing/tree/7c1f5f1876f00c2685f722b592dbd00030662d5d/substrait_consumer/testdata/integration/tpch

The full roundtrip you're proposing is more comprehensive and lets us find bugs in both consumer and producer.

In most cases it looks like we need to support more of the LogicalPlan nodes when serializing to substrait.

It looks like your approach has already found some missing features 🐛 🔨

After this the recreated plan should be the same

In many cases plans roundtrip the same, but not necessarily in every case. DataFusion might consume a plan and then emit a plan that is semantically equivalent but has different nodes. For example, all emit kind remaps might be converted into projections. It is a desirable property that plans roundtrip the same, and probably something we can work towards in most cases.

@Blizzara
Copy link
Contributor

I like the idea, more testing the better! We already have some Substrait TCP testing, but I think that's from "known Substrait" -> DF, so it only tests the consumer, while this would test the producer as well.

I think roughly there should be three levels of "equality":

  1. a plan goes through roundtrip without crashing, but doesn't match equality (this doesn't necessarily mean the result is correct, but it shows that there aren't any totally unsupported LogicalPlan nodes)
  2. "optimized" plan equality - calling optimize on both plans before comparing, like you do.
  3. "unoptimized" plan equality - comparing the plans directly.

I feel like (2) should be "easier" check than (3), though the tests you have here show that some plans would pass unoptimized but not optimized, which confuses me.

Overall, I think (3) is nice to have, but not something I'd spent too much effort for - it doesn't really bring that much benefits to users compared to (2). (2) is nice compared to (1) in that it guarantees a bit more correctness, but in many cases it's unfortunately tough to support even (2) (even in theory, roundtrip isn't lossless, since multiple DF plans may result in the same Substrait plan, and the other way around too).

So I'd suggest splitting the TPCH tests into 4 categories:

  • those that pass (3)
  • those that pass (2)
  • those that pass (1)
    • maybe eyeballed into a) those that look correct after roundtrip even if the plan doesn't fully match, and b) those that look incorrect, if any
  • those that fail to convert for the roundtrip

That allows merging the tests, preventing regressions, and tracking needed improvements :)

@Blizzara
Copy link
Contributor

Haha, looks like @vbarua commented pretty pretty much the same thing while I was writing my own reply! 😄

@robtandy
Copy link
Contributor Author

Thank you @vbarua and @Blizzara for your review and comments!

Yes, I think the existing tests do not go far enough, and I encountered bugs not covered by these tests already, which is why i wrote these.

I did not notice the existing schemas and sql in the benchmarks crate. I'll look to reuse them. Should we move them somewhere so that bench and test can access them without referring to each other?

Lastly, is it convention that we would add #[ignore] to the cases that do not pass at the moment, and open those up as we fix bugs?

@alamb
Copy link
Contributor

alamb commented Dec 24, 2024

Lastly, is it convention that we would add #[ignore] to the cases that do not pass at the moment, and open those up as we fix bugs?

I think that is a reasonable approach -- thank you.

Please also leave a link to the relevant github issue in the comments so future readers can find the relevant issue

@alamb
Copy link
Contributor

alamb commented Dec 24, 2024

(I really like the idea of merging the tests, even if they don't all pass, in one PR and then working on fixes to the tests as additional follow on PRs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants