Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sort order for second dataset when using orderedComparison = false & ignoreColumnNames = true #93

Open
pkoplik24 opened this issue Apr 8, 2021 · 4 comments

Comments

@pkoplik24
Copy link

If orderedComparison is set to false, then the unordered comparison will sort the columns to provide a sort order for the dataset.

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
val colNames = ds.columns.sorted
val cols = colNames.map(col)
ds.sort(cols: _*)
}

If the expected and actual datasets have different column sort orders because the names are different (and then ignoreColumnNames set to true), then the rows are sorted differently and the assertion fails.

Proposed fixes:

  1. use the column sort order for one of the datasets to sort the columns of the other dataset
  2. do not sort the columns at all if ignoreColumnNames = true
@MrPowers
Copy link
Collaborator

MrPowers commented Apr 9, 2021

@pkoplik24 - Thanks for pointing out this edge case.

I think the function should error out if orderedComparison=false and ignoreColumnNames=true. We can have it return a descriptive error message that explains why the combination of options doesn't make sense.

Does that sound like an OK approach with you?

@pkoplik24
Copy link
Author

pkoplik24 commented Apr 21, 2021

Hey @MrPowers sorry for the delay, busy week.

I actually do think this combination of parameters makes sense, which is how I came across this. I think the fix will be something similar to #91

As an example, I would expect this test to pass but it does not due to the row ordering.

def transformation(inputDf: DataFrame): DataFrame = {
    import inputDf.sparkSession.implicits._
    inputDf.map(x => x.getString(0))
      .flatMap(x => x.split(" "))
      .map(x => (x.toLowerCase(), 1))
      .groupByKey(x => x._1)
      .mapValues(x => x._2)
      .reduceGroups((a,b) => a + b)
      .map(x => (x._1, x._2))
      .toDF("word", "count")
  }

  "Wordcount transformation" should {

    "sample test" in {
        import sqlContext.implicits._

        val inputDf = Seq(
          "Hello this",
          "yes this is",
          "some text"
        ).toDF

      val expectedOutput = Seq(("hello", 1),
        ("this", 2),
        ("text", 1),
        ("is", 1),
        ("yes", 1),
        ("some", 1)).toDF

      assertSmallDataFrameEquality(transformation(inputDf), expectedOutput,
        ignoreColumnNames = true, orderedComparison = false)


    }

@pkoplik24
Copy link
Author

Hey @MrPowers, any thoughts on this?

@pkoplik24
Copy link
Author

pkoplik24 commented Dec 15, 2021

@MrPowers Bump

Why do this

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
    val colNames = ds.columns.sorted
    val cols     = colNames.map(col)
    ds.sort(cols: _*)
  }

Instead of this

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
    val colNames = ds.columns
    val cols     = colNames.map(col)
    ds.sort(cols: _*)
  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants