-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Do not merge] Unicode Normalization APIs #75298
base: main
Are you sure you want to change the base?
Conversation
@swift-ci please build toolchain |
@swift-ci please build toolchain |
Some strange test failure is preventing the toolchain from being built 🫤 Failed CI job: https://ci.swift.org/job/swift-PR-toolchain-Linux/923/
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
This PR currently sort of treats the Equatable/Hashable/Comparable conformances as an afterthought right now. They are really important on their own right, they aren't trivial, and we'll need to properly implement them.
Throughout this PR (and some parts of the existing stdlib) we have this bad pattern of initializing strings through temporary buffers. I don't think this is right at all, and this work is a good opportunity to do better. We should be creating new strings by directly writing into _StringGuts storage; now is a good time to introduce internal operations to support that.
|
||
// Unicode Scalars encode to a maximum of 4 bytes of UTF-8. | ||
var utf8 = [UInt8]() | ||
utf8.reserveCapacity(scalars.underestimatedCount * 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this 4x factor will often be a wild overestimate. We should do this with zero temporary allocations -- we can and should just populate a _StringGuts
instance directly.
) { | ||
// Unicode Scalars encode to a maximum of 4 bytes of UTF-8. | ||
self = _withUnprotectedUnsafeTemporaryAllocation( | ||
of: UInt8.self, capacity: scalars.count * 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This temp allocation seems unnecessary, and I expect it will wildly overestimate necessary storage -- we should just traverse the collection twice, and directly initialize a String
instance of the exact size we need.
_ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFC | ||
) { | ||
// Unicode.NormalizedScalars cannot provide a good underestimatedCount. | ||
var utf8 = [UInt8]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the temporary buffer? We should be directly initializing a _StringGuts
instance here, too.
} | ||
} | ||
|
||
// FIXME: Improve these and move them somewhere logical. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Please remember to move these to the files that contain the definitions of these structs. (The right place is usually in an extension immediately following the struct declaration.)
We'll also need to remember to add test coverage for the new Equatable/Comparable/Hashable conformances -- checkHashable
/checkComparable
will be fine. (E.g., we need to avoid a repeat of shipping a substring with broken hashing.)
} | ||
|
||
public func hash(into hasher: inout Hasher) { | ||
for cu in self { hasher.combine(cu) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two Three things:
-
This is a variable-length hash encoding, so it requires a terminator or some other discriminator to avoid becoming a hash collision generator when aggregates get hashed. For UTF-8 data, the right pattern is to terminate the sequence by hashing an extra byte at the end that isn't valid. String uses
0xFF
, and I expect we'd do the same here. -
In the common case where the string is already in fast UTF-8 form, we should be hashing the entire storage in one go, rather than combining each byte one by one.
-
I think all these implementations should be backdeployable, to simplify adoption in code that needs to run on earlier stdlibs. (Note: unless we do extra work, this will etch the hash encodings in stone, so we better be sure we have the right implementations. The extra work is to have the
==
/<
/hash(into:)
implementations dispatch to opaque functions on new enough stdlibs, so that we only set in stone the code that we run on older systems.)
for cu in self { hasher.combine(cu) } | |
let done = self.withContiguousStorageIfAvailable { buf in | |
hasher.combine(bytes: UnsafeRawBufferPointer(buf)) | |
hasher.combine(0xFF) // terminator | |
return true | |
} | |
if done == true { return } | |
for codeUnit in self { hasher.combine(codeUnit) } | |
hasher.combine(0xFF) // terminator |
_ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFKC | ||
) { | ||
// Unicode.NormalizedScalars cannot provide a good underestimatedCount. | ||
var utf8 = [UInt8]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue -- just append to a native _StringGuts
directly.
String._uncheckedFromUTF8($0) | ||
} | ||
} else { | ||
result = Array(str.utf8).withUnsafeBufferPointer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now is our chance to get rid of these temp arrays.
/// - form: The canonical normalization form. | ||
/// | ||
@_specialize(where Self == String) | ||
@_specialize(where Self == Substring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid hanging new functionality on StringProtocol
, but even if we want to keep this here, we should make sure we add explicit entry points for this functionality directly on String
and Substring
.
// | ||
// This is not true of NFD. | ||
|
||
return _withUnprotectedUnsafeTemporaryAllocation( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't think this is a good pattern. We should do away with these temporary buffers, and we should rather be initializing a _StringGuts
instance directly.
_gutsSlice.utf8Count
has linear complexity for bridged Cocoa strings, and that makes creating the temp buffer even more questionable.
// | ||
// This source file is part of the Swift.org open source project | ||
// | ||
// Copyright (c) 2023 Apple Inc. and the Swift project authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Copyright (c) 2023 Apple Inc. and the Swift project authors | |
// Copyright (c) 2024 Apple Inc. and the Swift project authors |
Thank you @lorentey!
Yes, these parts have very naive implementations - I haven't gotten around to looking at them in depth, but since the normalisation pitch suggests using them as a way to achieve predictable performance, I wanted them to be available in some form in the accompanying toolchain. They definitely should be improved, and in particular thanks for your insight on implementing hashing. AFAIK it is not possible to backdeploy protocol conformances, though - only functions 🫤
I very much agree and have mentioned exactly that in String-related PRs over the years. I'll take a look to see what can be done. |
Implements (almost) everything in swiftlang/swift-evolution#2512
PR is just for building a toolchain.