Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Add more cast tests and improve test framework #351

Merged
merged 6 commits into from
Apr 30, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Apr 29, 2024

Which issue does this PR close?

Part of #286

Rationale for this change

Get a better idea of which casts are failing.

What changes are included in this PR?

  • Add test to make sure we have coverage for valid cast combinations
  • Add tests to cover all cast operations supported by Spark (for the data types we currently support). Some of the tests are ignored and/or contain TODO statements
  • Bring in some framework improvements from pending PRs

How are these changes tested?

New tests

Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments LGTM

}

private def generateDecimals(): DataFrame = {
Seq(BigDecimal("123456.789"), BigDecimal("-123456.789"), BigDecimal("0.0")).toDF("a")
}

private def generateString(r: Random, chars: String, maxLen: Int): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: String is the only type that does not return DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO for this

0.0f,
-0.0f) ++
Range(0, dataSize).map(_ => r.nextFloat())
values.toDF("a")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add nulls here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have pushed an update to add nulls to all generators except for string. I will address this in a future PR when I change the string generator to also return a DataFrame.

castTest(generateBytes(), DataTypes.BooleanType)
}

ignore("cast ByteType to ByteType") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the same type cast needed? It seems that assertTestsExist ignores this kind of cast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for redundant casts between same type and removed these tests. Spark should optimize redundant casts out.

ignore("cast long to short") {
castTest(generateLongs, DataTypes.ShortType)
test("all valid cast combinations covered") {
val names = testNames
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is awesome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The nice thing is that this will tell us if Spark adds or removes any supported cast operations in the future.

}

// try_cast() should always return null for invalid inputs
val df2 = spark.sql(s"select try_cast(a as ${toType.sql}) from t")
val df2 =
spark.sql(s"select a, try_cast(a as ${toType.sql}) from t order by a")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if a is null, the try_cast return null and will be treated as invalid input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_cast should return null for null inputs and also for invalid inputs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I mean. If there is a test to cast an int to string and this is supported and valid, but I have a NULL in my dataset, so try_cast returns NULL as well. will that be treated as false negative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input dataset has a mix of valid, invalid, and null values, we shoul expect to see try_cast produce null for the rows that are null or invalid. The test is checking that we produce the same results as Spark, so I think we have everything covered, but maybe I am missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to merge this one so that we can start rebasing the other pending CAST PRs, but let's continue the conversation.

@andygrove
Copy link
Member Author

Thanks for the review @kazuyukitanimura. I have addressed your feedback. Could you take another look?

Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andygrove LGTM

@@ -176,28 +697,38 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {

// cast() should throw exception on invalid inputs when ansi mode is enabled
val df = data.withColumn("converted", col("a").cast(toType))
val (expected, actual) = checkSparkThrows(df)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was improved to handle the case where no exceptions are thrown (some cast operations are infallible).

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @andygrove
I think its great, especially with dynamic test discovery

@andygrove andygrove merged commit 26af020 into apache:main Apr 30, 2024
29 checks passed
@andygrove andygrove deleted the more-cast-tests branch April 30, 2024 19:23
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. This is comprehensive. Thanks @andygrove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants