Use sort order for second dataset when using orderedComparison = false & ignoreColumnNames = true #93

pkoplik24 · 2021-04-08T23:30:11Z

If orderedComparison is set to false, the unordered comparison sorts the column names alphabetically and then sorts the rows by those columns, which gives each dataset a deterministic row order.

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
  val colNames = ds.columns.sorted
  val cols     = colNames.map(col)
  ds.sort(cols: _*)
}

If the expected and actual datasets have different column names (with ignoreColumnNames set to true), the alphabetically sorted column orders can differ. The rows of the two datasets are then sorted by different keys, and the assertion fails.
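To make the mismatch concrete, here is a minimal pure-Scala sketch that needs no Spark. It only models which column positions end up as sort keys once the names are sorted alphabetically. `SortOrderDemo` and `sortKeyPositions` are hypothetical names introduced for illustration and are not part of the library.

```scala
object SortOrderDemo {
  // Positions of the columns in the order produced by sorting the names
  // alphabetically (mirrors `ds.columns.sorted` in defaultSortDataset).
  def sortKeyPositions(columns: Seq[String]): Seq[Int] =
    columns.zipWithIndex.sortBy { case (name, _) => name }.map { case (_, i) => i }

  def main(args: Array[String]): Unit = {
    // Actual dataset: named columns. "count" sorts before "word",
    // so the second column becomes the primary sort key.
    println(sortKeyPositions(Seq("word", "count"))) // List(1, 0)

    // Expected dataset: default tuple names. "_1" sorts before "_2",
    // so the first column stays the primary sort key.
    println(sortKeyPositions(Seq("_1", "_2")))      // List(0, 1)
  }
}
```

The two datasets hold the same data in the same positions, but one is sorted by (count, word) and the other by (word, count).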

Proposed fixes:

- use the column sort order for one of the datasets to sort the columns of the other dataset
- do not sort the columns at all if ignoreColumnNames = true
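The first option could be sketched roughly as follows. This is an assumption about how it might work, not the library's implementation: `AlignedSort` and `alignedSortColumns` are hypothetical names, and the helper assumes both datasets have the same number of columns and that columns correspond by position.

```scala
object AlignedSort {
  // Hypothetical helper: take the alphabetical sort order of the reference
  // dataset's column names and translate it, by position, into the column
  // names of the other dataset.
  def alignedSortColumns(reference: Seq[String], other: Seq[String]): Seq[String] = {
    require(reference.length == other.length,
      s"column counts differ: ${reference.length} vs ${other.length}")
    reference.zipWithIndex
      .sortBy { case (name, _) => name }
      .map { case (_, position) => other(position) }
  }

  def main(args: Array[String]): Unit = {
    // The reference sorts by "count" then "word", i.e. positions 1 then 0,
    // so the other dataset should sort by "_2" then "_1".
    println(alignedSortColumns(Seq("word", "count"), Seq("_1", "_2"))) // List(_2, _1)
  }
}
```

The resulting names could then replace `ds.columns.sorted` when sorting the second dataset, for example `ds.sort(names.map(col): _*)`.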

MrPowers · 2021-04-09T12:23:09Z

@pkoplik24 - Thanks for pointing out this edge case.

I think the function should error out if orderedComparison=false and ignoreColumnNames=true. We can have it return a descriptive error message that explains why the combination of options doesn't make sense.
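A guard along these lines might look like the sketch below. `OptionGuard` and `validateOptions` are hypothetical names, and the message wording is only an example; where such a check would live in the library is left open.

```scala
object OptionGuard {
  // Hypothetical check: reject the combination of an unordered comparison
  // with ignored column names, and explain why.
  def validateOptions(orderedComparison: Boolean, ignoreColumnNames: Boolean): Unit =
    require(
      orderedComparison || !ignoreColumnNames,
      "orderedComparison = false cannot be combined with ignoreColumnNames = true: " +
        "rows are sorted by alphabetically ordered column names, so datasets with " +
        "different column names may be sorted by different keys"
    )

  def main(args: Array[String]): Unit = {
    validateOptions(orderedComparison = true, ignoreColumnNames = true)   // accepted
    validateOptions(orderedComparison = false, ignoreColumnNames = false) // accepted
    // require throws IllegalArgumentException for the rejected combination
    val rejected = scala.util.Try(
      validateOptions(orderedComparison = false, ignoreColumnNames = true))
    println(rejected.isFailure) // true
  }
}
```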

Does that sound like an OK approach to you?

pkoplik24 · 2021-04-21T19:27:00Z

Hey @MrPowers sorry for the delay, busy week.

I actually do think this combination of parameters makes sense, which is how I came across this. I think the fix will be something similar to #91.

As an example, I would expect this test to pass but it does not due to the row ordering.

import org.apache.spark.sql.DataFrame

// Word count: one row per distinct lower-cased word, with its number of occurrences
def transformation(inputDf: DataFrame): DataFrame = {
  import inputDf.sparkSession.implicits._
  inputDf.map(x => x.getString(0))
    .flatMap(x => x.split(" "))
    .map(x => (x.toLowerCase(), 1))
    .groupByKey(x => x._1)
    .mapValues(x => x._2)
    .reduceGroups((a, b) => a + b)
    .map(x => (x._1, x._2))
    .toDF("word", "count")
}

"Wordcount transformation" should {

  "sample test" in {
    import sqlContext.implicits._

    val inputDf = Seq(
      "Hello this",
      "yes this is",
      "some text"
    ).toDF

    // toDF without arguments yields the default column names _1 and _2
    val expectedOutput = Seq(
      ("hello", 1),
      ("this", 2),
      ("text", 1),
      ("is", 1),
      ("yes", 1),
      ("some", 1)
    ).toDF

    assertSmallDataFrameEquality(transformation(inputDf), expectedOutput,
      ignoreColumnNames = true, orderedComparison = false)
  }
}

pkoplik24 · 2021-06-20T19:11:19Z

Hey @MrPowers, any thoughts on this?

pkoplik24 · 2021-12-15T20:53:12Z

@MrPowers Bump

Why do this:

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
  val colNames = ds.columns.sorted
  val cols     = colNames.map(col)
  ds.sort(cols: _*)
}

instead of this?

def defaultSortDataset[T](ds: Dataset[T]): Dataset[T] = {
  val colNames = ds.columns
  val cols     = colNames.map(col)
  ds.sort(cols: _*)
}
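The difference between the two variants can be simulated without Spark. In the sketch below, rows are modelled as vectors of strings and `sortRows` is a hypothetical stand-in for `ds.sort`. It assumes columns correspond by position, which is what ignoreColumnNames = true implies, and it says nothing about datasets whose columns are declared in different orders.

```scala
object PositionalSortDemo {
  // A row is modelled as a vector of string cells (stand-in for a Spark Row).
  type Row = Vector[String]

  // Sort rows by the cells at keyPositions, compared left to right.
  def sortRows(rows: Seq[Row], keyPositions: Seq[Int]): Seq[Row] =
    rows.sortWith { (a, b) =>
      val keysA = keyPositions.map(a)
      val keysB = keyPositions.map(b)
      // The first differing key decides; equal keys mean "not less than".
      keysA.zip(keysB).find { case (x, y) => x != y }.exists { case (x, y) => x < y }
    }

  // Same data, different row order, as in an unordered comparison.
  val actual: Seq[Row]   = Seq(Vector("hello", "1"), Vector("this", "2"), Vector("yes", "1"))
  val expected: Seq[Row] = actual.reverse

  def main(args: Array[String]): Unit = {
    // Sorted names: actual (word, count) sorts by positions 1, 0;
    // expected (_1, _2) sorts by positions 0, 1.
    println(sortRows(actual, Seq(1, 0)) == sortRows(expected, Seq(0, 1))) // false

    // Declared order: both datasets sort by positions 0, 1.
    println(sortRows(actual, Seq(0, 1)) == sortRows(expected, Seq(0, 1))) // true
  }
}
```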
