Initcap behaves differently in Spark and in DataFusion (also Comet) #1052

Blizzara · 2024-11-04T12:13:11Z

Describe the bug

DataFusion's initcap behaves differently from Spark's. Both "upper-case the first letter of each word and lowercase the others", but Spark treats as a word anything separated by whitespace (' '), while DataFusion treats as a word anything separated by a character that is not ASCII alphanumeric. (DF's code would also fail to uppercase or lowercase non-ASCII chars, but that doesn't show up as a separate issue, since it already treats them as separators in the first place.)

#1051 shows the problem by adding two cases to the test, one using a dash and one using non-ascii letters (from Finnish).

== Results ==
!== Correct Answer - 7 ==       == Spark Answer - 7 ==
 struct<initcap(name):string>   struct<initcap(name):string>
 [James Smith]                  [James Smith]
 [James Smith]                  [James Smith]
![James Ähtäri]                 [James äHtäRi]
 [Michael Rose]                 [Michael Rose]
 [Rames Rose]                   [Rames Rose]
![Robert Rose-smith]            [Robert Rose-Smith]
 [Robert Williams]              [Robert Williams]
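
To make the two word-splitting rules concrete, here is a minimal Python sketch that models them. It is not the actual Spark or DataFusion code: the function names `initcap_whitespace` and `initcap_ascii_alnum` are made up for illustration, the sample inputs are assumed lowercase strings, and the models only follow the behavior as described in this issue.

```python
def initcap_whitespace(s: str) -> str:
    """Model of the Spark-like rule: a new word starts only after a space."""
    out = []
    start_of_word = True
    for ch in s:
        if ch == " ":
            out.append(ch)
            start_of_word = True
        else:
            # First character of a word is upper-cased, the rest lower-cased.
            out.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
    return "".join(out)


def initcap_ascii_alnum(s: str) -> str:
    """Model of the DataFusion-like rule described in this issue: every
    character that is not ASCII alphanumeric acts as a separator and is
    copied through unchanged."""
    out = []
    start_of_word = True
    for ch in s:
        if ch.isascii() and ch.isalnum():
            out.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
        else:
            # Separator (includes '-', ' ' and non-ASCII letters such as 'ä').
            out.append(ch)
            start_of_word = True
    return "".join(out)


# Assumed sample inputs, chosen to mirror the two differing rows above.
for name in ["robert rose-smith", "james ähtäri"]:
    print(initcap_whitespace(name), "|", initcap_ascii_alnum(name))
```

Under these assumptions, the whitespace model gives `Robert Rose-smith` and `James Ähtäri`, while the ASCII-alphanumeric model gives `Robert Rose-Smith` and `James äHtäRi`, which matches the two differing rows in the results.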

Steps to reproduce

Call initcap with an input containing non-ascii-alphanumeric non-whitespace characters

Expected behavior

Match Spark

Additional context

No response


viirya · 2024-11-04T16:53:39Z

Thanks for reporting this bug.

comphead · 2024-12-13T01:26:18Z

DataFusion InitCap doesn't support non-ASCII characters.
Tracked in apache/datafusion#13711

comphead · 2024-12-13T01:28:52Z

> ![Robert Rose-smith]            [Robert Rose-Smith]

already fixed in latest DF

> select initcap('robert rose-smith');
+------------------------------------+
| initcap(Utf8("robert rose-smith")) |
+------------------------------------+
| Robert Rose-Smith                  |
+------------------------------------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

UPD: Spark returns

scala> spark.sql("select initcap('robert rose-smith')").show(false)
+--------------------------+
|initcap(robert rose-smith)|
+--------------------------+
|Robert Rose-smith         |
+--------------------------+

But according to the DF SQL policy (apache/datafusion#13706), DF keeps its builtin functions in sync with PG, which returns Robert Rose-Smith, so Comet likely needs to implement its own UDF to cover this case.

Blizzara · 2024-12-13T16:20:02Z

> already fixed in latest DF

How is that fixed? :) The diff I showed is exactly the same as the diff you mention below, no?

comphead · 2024-12-13T16:41:39Z

> > already fixed in latest DF
>
> How is that fixed? :) The diff I showed is exactly the same as the diff you mention below, no?

Please refer to the message update above.

kazuyukitanimura · 2024-12-16T23:46:35Z

Looks like Unicode initcap is getting fixed in DF: apache/datafusion#13752

Blizzara added the bug label Nov 4, 2024

andygrove added this to the 0.4.0 milestone Nov 4, 2024
