
[SPARK-50618][SS][SQL] Make DataFrameReader and DataStreamReader leverage the analyzer more #49238

Open · wants to merge 6 commits into master
Conversation

@brkyvz (Contributor) commented Dec 19, 2024

What changes were proposed in this pull request?

Introduces two logical nodes:

  • UnresolvedDataSource
  • UnresolvedJDBCRelation

DataFrameReader and DataStreamReader now create these unresolved nodes instead and call the analyzer to resolve them. The nodes are then analyzed as part of the ResolveDataSource rule, into which all of the resolution logic from DataFrameReader and DataStreamReader has been moved.

Some logic around parsing text-based formats from an existing Dataset remains; I will refactor it in a subsequent PR.
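The node-plus-rule pattern described above can be sketched roughly as follows. This is a simplified, Spark-free illustration: the class and rule names mirror the PR, but the plan hierarchy here is a hypothetical stand-in, not Catalyst's actual LogicalPlan API.

```scala
// Simplified stand-ins for Catalyst's LogicalPlan hierarchy (not Spark's real classes).
sealed trait LogicalPlan { def resolved: Boolean }

// Leaf node carrying only what .load() received; resolution is deferred to the analyzer.
case class UnresolvedDataSource(
    format: String,
    options: Map[String, String],
    paths: Seq[String]) extends LogicalPlan {
  val resolved = false
}

// What the analyzer would produce once the data source has been looked up.
case class ResolvedRelation(format: String, paths: Seq[String]) extends LogicalPlan {
  val resolved = true
}

// Analogue of the ResolveDataSource analyzer rule: rewrite unresolved nodes in place,
// so the same rule fires regardless of which API (SQL, Scala, Python) built the plan.
object ResolveDataSource {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case UnresolvedDataSource(fmt, _, paths) => ResolvedRelation(fmt, paths)
    case other => other
  }
}
```

The point of the refactor is that `.load()` only builds the unresolved leaf; everything that used to happen eagerly inside the reader now happens when the analyzer runs this rule.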

Why are the changes needed?

DataFrameReader and DataStreamReader typically create already-analyzed relations as part of their respective .load() methods.

This causes inconsistencies in which Catalyst rules get applied to the query plan depending on the API of choice, e.g. SQL vs. Python or SQL vs. Scala.

The goal of this JIRA is to refactor the DataFrameReader and DataStreamReader classes to create unresolved plans that are analyzed as part of Catalyst.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests; new tests will be added.

Was this patch authored or co-authored using generative AI tooling?

No

}
val relation = JDBCRelation(parts, options)(sparkSession)
sparkSession.baseRelationToDataFrame(relation)
Dataset.ofRows(sparkSession, UnresolvedJDBCRelation(url, table, predicates, params))
Contributor
shall we reuse UnresolvedDataSource with the source set as "jdbc"?

Contributor
hmm, the parameters are different. I'm wondering how this is done in SQL. It seems people can't use arbitrary predicates to build JDBCPartition with the SQL API.
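For context on the discussion above: the DataFrame-side `spark.read.jdbc(url, table, predicates, connectionProperties)` overload lets callers pass one arbitrary WHERE-clause predicate per partition, which the OPTIONS-based SQL JDBC source cannot express. A rough sketch of that predicate-to-partition mapping, with `JDBCPartition` as a simplified stand-in for Spark's internal partition descriptor:

```scala
// Simplified stand-in for Spark's internal JDBC partition descriptor:
// each predicate becomes the WHERE clause of one partition's query.
case class JDBCPartition(whereClause: String, idx: Int)

// Turn caller-supplied predicates into one read partition per predicate.
// Each partition would issue roughly: SELECT ... FROM table WHERE <whereClause>
def partitionsFromPredicates(predicates: Seq[String]): Array[JDBCPartition] =
  predicates.zipWithIndex.map { case (pred, i) => JDBCPartition(pred, i) }.toArray
```

Because the predicates are free-form strings chosen by the caller, there is no obvious way to smuggle them through a generic `UnresolvedDataSource(source = "jdbc", options, paths)` node, which is why this overload is the awkward case.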

Contributor
maybe we should add this feature to SQL first, and then unify it.

Contributor Author
having this available at the moment can help us unify it in SQL afterwards :) This seems to be the only edge case - it looks like a new API added since I last looked at Spark. Do you want me to leave it as is and just migrate the file-based and generic data sources?

Contributor
yea let's leave out the JDBC one for now. Once we have feature parity between SQL and DataFrame, we can revisit this.

@brkyvz (Contributor Author) commented Dec 20, 2024

Thanks for the feedback @cloud-fan! Addressed

@HyukjinKwon HyukjinKwon changed the title [SPARK-50618] Make DataFrameReader and DataStreamReader leverage the analyzer more [SPARK-50618][SS] Make DataFrameReader and DataStreamReader leverage the analyzer more Dec 22, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-50618][SS] Make DataFrameReader and DataStreamReader leverage the analyzer more [SPARK-50618][SS][SQL] Make DataFrameReader and DataStreamReader leverage the analyzer more Dec 22, 2024
assertFirstUnresolved(spark.read.format("org.apache.spark.sql.test").load())
assertFirstUnresolved(spark.read.format("org.apache.spark.sql.test").load(dir))
assertFirstUnresolved(spark.read.format("org.apache.spark.sql.test").load(dir, dir, dir))
assertFirstUnresolved(spark.read.format("org.apache.spark.sql.test").load(Seq(dir, dir): _*))
Option(dir).map(spark.read.format("org.apache.spark.sql.test").load)
Contributor
unrelated to your PR but what does this line test?

2 participants