[Do not merge] Unicode Normalization APIs #75298

karwa · 2024-07-17T16:09:16Z

Implements (almost) everything in swiftlang/swift-evolution#2512

PR is just for building a toolchain.

karwa · 2024-07-17T16:11:31Z

@swift-ci please build toolchain

karwa · 2024-07-17T17:02:20Z

@swift-ci please build toolchain

karwa · 2024-07-17T19:02:31Z

Some strange test failure is preventing the toolchain from being built 🫤

Failed CI job: https://ci.swift.org/job/swift-PR-toolchain-Linux/923/

Error: Invalid option "--package-path"
Usage: fooPackageTests.xctest [OPTION]
       fooPackageTests.xctest [TESTCASE]
Run and report results of test cases.

...

********************
Failed Tests (2):
  swift-package-tests :: test-codecov-package/test-codecov-package.txt
  swift-package-tests :: test-complex-xctest-package/test-xctest-package.txt

lorentey

Nice!

This PR currently treats the Equatable/Hashable/Comparable conformances as something of an afterthought. They are really important in their own right, they aren't trivial, and we'll need to implement them properly.

Throughout this PR (and some parts of the existing stdlib) we have this bad pattern of initializing strings through temporary buffers. I don't think this is right at all, and this work is a good opportunity to do better. We should be creating new strings by directly writing into _StringGuts storage; now is a good time to introduce internal operations to support that.

lorentey · 2024-07-19T20:45:43Z

stdlib/public/core/String.swift

+
+    // Unicode Scalars encode to a maximum of 4 bytes of UTF-8.
+    var utf8 = [UInt8]()
+    utf8.reserveCapacity(scalars.underestimatedCount * 4)


I expect this 4x factor will often be a wild overestimate. We should do this with zero temporary allocations -- we can and should just populate a _StringGuts instance directly.

lorentey · 2024-07-19T20:49:41Z

stdlib/public/core/String.swift

+  ) {
+    // Unicode Scalars encode to a maximum of 4 bytes of UTF-8.
+    self = _withUnprotectedUnsafeTemporaryAllocation(
+      of: UInt8.self, capacity: scalars.count * 4


This temp allocation seems unnecessary, and I expect it will wildly overestimate the necessary storage -- we should just traverse the collection twice, and directly initialize a String instance of the exact size we need.
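
The two-pass idea can be sketched with public API. `_StringGuts` is internal to the standard library, so this sketch assumes `String(unsafeUninitializedCapacity:initializingUTF8With:)` as a stand-in for writing into native storage directly. The names `utf8Width` and `makeString(fromScalars:)` are hypothetical and do not appear in the PR.

```swift
// Hypothetical helpers; not part of the PR. On Apple platforms the
// initializer used below requires macOS 11 / iOS 14 or later.

/// Number of UTF-8 code units needed to encode `scalar`.
func utf8Width(_ scalar: Unicode.Scalar) -> Int {
  switch scalar.value {
  case 0..<0x80: return 1
  case 0x80..<0x800: return 2
  case 0x800..<0x1_0000: return 3
  default: return 4
  }
}

func makeString<C: Collection>(fromScalars scalars: C) -> String
where C.Element == Unicode.Scalar {
  // Pass 1: compute the exact byte count instead of assuming 4 bytes per scalar.
  var byteCount = 0
  for scalar in scalars { byteCount += utf8Width(scalar) }

  // Pass 2: encode straight into the new string's storage.
  return String(unsafeUninitializedCapacity: byteCount) { buffer in
    var offset = 0
    for scalar in scalars {
      UTF8.encode(scalar, into: { byte in
        buffer[offset] = byte
        offset += 1
      })
    }
    return offset // number of bytes actually initialized
  }
}

let scalars = Array("h\u{E9}llo \u{1F30D}".unicodeScalars)
let rebuilt = makeString(fromScalars: scalars)
print(rebuilt.utf8.count, "bytes used; a 4x estimate would reserve", scalars.count * 4)
```

The second traversal costs time, but it avoids both the oversized temporary and the copy out of it. Whether that trade-off holds for a given collection type is something to measure rather than assume.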

lorentey · 2024-07-19T20:51:44Z

stdlib/public/core/String.swift

+    _ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFC
+  ) {
+    // Unicode.NormalizedScalars cannot provide a good underestimatedCount.
+    var utf8 = [UInt8]()


Why the temporary buffer? We should be directly initializing a _StringGuts instance here, too.

lorentey · 2024-07-19T20:54:00Z

stdlib/public/core/Character.swift

+  }
+}
+
+// FIXME: Improve these and move them somewhere logical.


Indeed. Please remember to move these to the files that contain the definitions of these structs. (The right place is usually in an extension immediately following the struct declaration.)

We'll also need to remember to add test coverage for the new Equatable/Comparable/Hashable conformances -- checkHashable/checkComparable will be fine. (E.g., we need to avoid a repeat of shipping a substring with broken hashing.)
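
As a rough illustration of what such a check verifies, here is a minimal sketch. `checkHashable` itself lives in StdlibUnittest; the `checkHashableSketch` function below is a hypothetical stand-in with an assumed shape, not that API.

```swift
// Hypothetical stand-in for a Hashable conformance check. The name and
// signature are illustrative only.
func checkHashableSketch<T: Hashable>(
  _ instances: [T],
  equalityOracle: (Int, Int) -> Bool
) -> Bool {
  for i in instances.indices {
    for j in instances.indices {
      let expectedEqual = equalityOracle(i, j)
      // == must agree with the oracle.
      if (instances[i] == instances[j]) != expectedEqual { return false }
      // Equal values must always produce equal hashes.
      if expectedEqual && instances[i].hashValue != instances[j].hashValue {
        return false
      }
    }
  }
  return true
}

// Substrings with equal contents but different base strings and offsets,
// plus one substring that differs.
let first = "xxabc".dropFirst(2)
let second = "abcyy".dropLast(2)
let third = Substring("abd")
let ok = checkHashableSketch([first, second, third]) { i, j in
  (i < 2) == (j < 2) // indices 0 and 1 are equal; index 2 stands alone
}
print(ok)
```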

lorentey · 2024-07-19T20:58:43Z

stdlib/public/core/Character.swift

+  }
+
+  public func hash(into hasher: inout Hasher) {
+    for cu in self { hasher.combine(cu) }


~~Two~~ Three things:

This is a variable-length hash encoding, so it requires a terminator or some other discriminator to avoid becoming a hash collision generator when aggregates get hashed. For UTF-8 data, the right pattern is to terminate the sequence by hashing an extra byte that can never appear in valid UTF-8. String uses 0xFF, and I expect we'd do the same here.

In the common case where the string is already in fast UTF-8 form, we should be hashing the entire storage in one go, rather than combining each byte one by one.

I think all these implementations should be backdeployable, to simplify adoption in code that needs to run on earlier stdlibs. (Note: unless we do extra work, this will etch the hash encodings in stone, so we had better be sure we have the right implementations. The extra work is to have the ==/</hash(into:) implementations dispatch to opaque functions on new enough stdlibs, so that we only set in stone the code that we run on older systems.)

Suggested change

-    for cu in self { hasher.combine(cu) }
+    let done = self.withContiguousStorageIfAvailable { buf in
+      hasher.combine(bytes: UnsafeRawBufferPointer(buf))
+      hasher.combine(0xFF) // terminator
+      return true
+    }
+    if done == true { return }
+    for codeUnit in self { hasher.combine(codeUnit) }
+    hasher.combine(0xFF) // terminator
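
The terminator point can be demonstrated with a small sketch. The `Unterminated` and `Terminated` wrapper types and the `hashOfPair` helper are hypothetical and exist only for this illustration. The terminator is written as `0xFF as UInt8` so that exactly one byte is hashed, which is an assumption about the intended encoding.

```swift
// Hypothetical wrapper types, used only to compare the two encodings.
struct Unterminated: Hashable {
  var bytes: [UInt8]
  func hash(into hasher: inout Hasher) {
    for byte in bytes { hasher.combine(byte) }
  }
}

struct Terminated: Hashable {
  var bytes: [UInt8]
  func hash(into hasher: inout Hasher) {
    for byte in bytes { hasher.combine(byte) }
    hasher.combine(0xFF as UInt8) // 0xFF never occurs in valid UTF-8
  }
}

/// Hashes two values as an aggregate, the way a struct with two fields would.
func hashOfPair<T: Hashable>(_ first: T, _ second: T) -> Int {
  var hasher = Hasher()
  hasher.combine(first)
  hasher.combine(second)
  return hasher.finalize()
}

let ab = Array("ab".utf8), c = Array("c".utf8)
let a = Array("a".utf8), bc = Array("bc".utf8)

// ("ab", "c") and ("a", "bc") feed the same byte stream without a terminator.
let collides =
  hashOfPair(Unterminated(bytes: ab), Unterminated(bytes: c))
  == hashOfPair(Unterminated(bytes: a), Unterminated(bytes: bc))

// With a terminator the two byte streams differ.
let separated =
  hashOfPair(Terminated(bytes: ab), Terminated(bytes: c))
  != hashOfPair(Terminated(bytes: a), Terminated(bytes: bc))

print(collides, separated)
```

Without the terminator, the two pairs are guaranteed to hash identically, because the hasher sees the same sequence of bytes. With it, the streams differ, so a collision is still possible but only by chance.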

lorentey · 2024-07-19T21:40:26Z

stdlib/public/core/String.swift

+    _ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFKC
+  ) {
+    // Unicode.NormalizedScalars cannot provide a good underestimatedCount.
+    var utf8 = [UInt8]()


Same issue -- just append to a native _StringGuts directly.
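
For a sequence that cannot report a useful count, the append-directly approach might look like the sketch below. It uses the public `String.unicodeScalars` view as a stand-in for appending to internal `_StringGuts` storage. The `makeString(appendingScalars:)` name is hypothetical, and the lazy filter only plays the role of a normalized-scalars sequence.

```swift
// Hypothetical helper; not part of the PR.
func makeString<S: Sequence>(appendingScalars scalars: S) -> String
where S.Element == Unicode.Scalar {
  var result = String()
  // Append scalar by scalar into the string's own storage; no byte array
  // is built first.
  result.unicodeScalars.append(contentsOf: scalars)
  return result
}

// A lazy sequence with no cheap count, standing in for the output of a
// normalization pass.
let source = "cafe\u{301}".unicodeScalars.lazy.filter { _ in true }
let appended = makeString(appendingScalars: source)
print(appended.unicodeScalars.count)
```

This makes no claim about matching the performance of a dedicated internal operation; it only shows that the intermediate `[UInt8]` is not needed to get a correct result.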

lorentey · 2024-07-19T21:41:40Z

stdlib/public/core/StringCreate.swift

+        String._uncheckedFromUTF8($0)
+      }
+    } else {
+      result = Array(str.utf8).withUnsafeBufferPointer {


I think now is our chance to get rid of these temp arrays.

lorentey · 2024-07-19T21:45:43Z

stdlib/public/core/StringNormalization.swift

+  ///   - form: The canonical normalization form.
+  ///
+  @_specialize(where Self == String)
+  @_specialize(where Self == Substring)


I think we should avoid hanging new functionality on StringProtocol, but even if we want to keep this here, we should make sure we add explicit entry points for this functionality directly on String and Substring.

lorentey · 2024-07-19T21:47:11Z

stdlib/public/core/StringNormalization.swift

+      //
+      // This is not true of NFD.
+
+      return _withUnprotectedUnsafeTemporaryAllocation(


Again, I don't think this is a good pattern. We should do away with these temporary buffers and instead initialize a _StringGuts instance directly.

_gutsSlice.utf8Count has linear complexity for bridged Cocoa strings, and that makes creating the temp buffer even more questionable.

lorentey · 2024-07-19T22:04:29Z

stdlib/public/core/Unicode+NormalizedScalars.swift

+//
+// This source file is part of the Swift.org open source project
+//
+// Copyright (c) 2023 Apple Inc. and the Swift project authors


Suggested change

-// Copyright (c) 2023 Apple Inc. and the Swift project authors
+// Copyright (c) 2024 Apple Inc. and the Swift project authors

karwa · 2024-07-21T12:25:05Z

Thank you @lorentey!

> This PR currently treats the Equatable/Hashable/Comparable conformances as something of an afterthought. They are really important in their own right, they aren't trivial, and we'll need to implement them properly.

Yes, these parts have very naive implementations - I haven't gotten around to looking at them in depth, but since the normalisation pitch suggests using them as a way to achieve predictable performance, I wanted them to be available in some form in the accompanying toolchain. They definitely should be improved, and in particular thanks for your insight on implementing hashing.

AFAIK it is not possible to backdeploy protocol conformances, though - only functions 🫤

> Throughout this PR (and some parts of the existing stdlib) we have this bad pattern of initializing strings through temporary buffers. I don't think this is right at all, and this work is a good opportunity to do better. We should be creating new strings by directly writing into _StringGuts storage; now is a good time to introduce internal operations to support that.

I very much agree and have mentioned exactly that in String-related PRs over the years. I'll take a look to see what can be done.

karwa requested review from a team and ktoso as code owners July 17, 2024 16:09

karwa marked this pull request as draft July 17, 2024 16:10

karwa added 6 commits July 17, 2024 18:57

[stdlib] Use broader fast-path for Unicode.Scalar.Properties.canonica…

113ad82

…lCombiningClass

[UnicodeData] Fix GenNormalization

86be4bb

[UnicodeData] Sort NormData for easier diffing (NFC)

0ddc2b2

[UnicodeData] Create directory when emitting output data

9eccae9

Unicode Normalization API

a5975fe

Temporary changes for toolchain

39655ef

karwa force-pushed the unicode-resplit branch from 22be66c to 39655ef Compare July 17, 2024 17:01


