Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Do not merge] Unicode Normalization APIs #75298

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

karwa
Copy link
Contributor

@karwa karwa commented Jul 17, 2024

Implements (almost) everything in swiftlang/swift-evolution#2512

PR is just for building a toolchain.

@karwa karwa requested review from a team and ktoso as code owners July 17, 2024 16:09
@karwa karwa marked this pull request as draft July 17, 2024 16:10
@karwa
Copy link
Contributor Author

karwa commented Jul 17, 2024

@swift-ci please build toolchain

@karwa karwa force-pushed the unicode-resplit branch from 22be66c to 39655ef Compare July 17, 2024 17:01
@karwa
Copy link
Contributor Author

karwa commented Jul 17, 2024

@swift-ci please build toolchain

@karwa
Copy link
Contributor Author

karwa commented Jul 17, 2024

Some strange test failure is preventing the toolchain from being built 🫤

Failed CI job: https://ci.swift.org/job/swift-PR-toolchain-Linux/923/

Error: Invalid option "--package-path"
Usage: fooPackageTests.xctest [OPTION]
       fooPackageTests.xctest [TESTCASE]
Run and report results of test cases.

...

********************
Failed Tests (2):
  swift-package-tests :: test-codecov-package/test-codecov-package.txt
  swift-package-tests :: test-complex-xctest-package/test-xctest-package.txt

Copy link
Member

@lorentey lorentey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

This PR currently sort of treats the Equatable/Hashable/Comparable conformances as an afterthought right now. They are really important on their own right, they aren't trivial, and we'll need to properly implement them.

Throughout this PR (and some parts of the existing stdlib) we have this bad pattern of initializing strings through temporary buffers. I don't think this is right at all, and this work is a good opportunity to do better. We should be creating new strings by directly writing into _StringGuts storage; now is a good time to introduce internal operations to support that.


// Unicode Scalars encode to a maximum of 4 bytes of UTF-8.
var utf8 = [UInt8]()
utf8.reserveCapacity(scalars.underestimatedCount * 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this 4x factor will often be a wild overestimate. We should do this with zero temporary allocations -- we can and should just populate a _StringGuts instance directly.

) {
// Unicode Scalars encode to a maximum of 4 bytes of UTF-8.
self = _withUnprotectedUnsafeTemporaryAllocation(
of: UInt8.self, capacity: scalars.count * 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This temp allocation seems unnecessary, and I expect it will wildly overestimate necessary storage -- we should just traverse the collection twice, and directly initialize a String instance of the exact size we need.

_ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFC
) {
// Unicode.NormalizedScalars cannot provide a good underestimatedCount.
var utf8 = [UInt8]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the temporary buffer? We should be directly initializing a _StringGuts instance here, too.

}
}

// FIXME: Improve these and move them somewhere logical.
Copy link
Member

@lorentey lorentey Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Please remember to move these to the files that contain the definitions of these structs. (The right place is usually in an extension immediately following the struct declaration.)

We'll also need to remember to add test coverage for the new Equatable/Comparable/Hashable conformances -- checkHashable/checkComparable will be fine. (E.g., we need to avoid a repeat of shipping a substring with broken hashing.)

}

public func hash(into hasher: inout Hasher) {
for cu in self { hasher.combine(cu) }
Copy link
Member

@lorentey lorentey Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two Three things:

  1. This is a variable-length hash encoding, so it requires a terminator or some other discriminator to avoid becoming a hash collision generator when aggregates get hashed. For UTF-8 data, the right pattern is to terminate the sequence by hashing an extra byte at the end that isn't valid. String uses 0xFF, and I expect we'd do the same here.

  2. In the common case where the string is already in fast UTF-8 form, we should be hashing the entire storage in one go, rather than combining each byte one by one.

  3. I think all these implementations should be backdeployable, to simplify adoption in code that needs to run on earlier stdlibs. (Note: unless we do extra work, this will etch the hash encodings in stone, so we better be sure we have the right implementations. The extra work is to have the ==/</hash(into:) implementations dispatch to opaque functions on new enough stdlibs, so that we only set in stone the code that we run on older systems.)

Suggested change
for cu in self { hasher.combine(cu) }
let done = self.withContiguousStorageIfAvailable { buf in
hasher.combine(bytes: UnsafeRawBufferPointer(buf))
hasher.combine(0xFF) // terminator
return true
}
if done == true { return }
for codeUnit in self { hasher.combine(codeUnit) }
hasher.combine(0xFF) // terminator

_ scalars: consuming Unicode.NormalizedScalars<some Sequence>.NFKC
) {
// Unicode.NormalizedScalars cannot provide a good underestimatedCount.
var utf8 = [UInt8]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue -- just append to a native _StringGuts directly.

String._uncheckedFromUTF8($0)
}
} else {
result = Array(str.utf8).withUnsafeBufferPointer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now is our chance to get rid of these temp arrays.

/// - form: The canonical normalization form.
///
@_specialize(where Self == String)
@_specialize(where Self == Substring)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid hanging new functionality on StringProtocol, but even if we want to keep this here, we should make sure we add explicit entry points for this functionality directly on String and Substring.

//
// This is not true of NFD.

return _withUnprotectedUnsafeTemporaryAllocation(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think this is a good pattern. We should do away with these temporary buffers, and we should rather be initializing a _StringGuts instance directly.

_gutsSlice.utf8Count has linear complexity for bridged Cocoa strings, and that makes creating the temp buffer even more questionable.

//
// This source file is part of the Swift.org open source project
//
// Copyright (c) 2023 Apple Inc. and the Swift project authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copyright (c) 2023 Apple Inc. and the Swift project authors
// Copyright (c) 2024 Apple Inc. and the Swift project authors

@karwa
Copy link
Contributor Author

karwa commented Jul 21, 2024

Thank you @lorentey!

This PR currently sort of treats the Equatable/Hashable/Comparable conformances as an afterthought right now. They are really important on their own right, they aren't trivial, and we'll need to properly implement them.

Yes, these parts have very naive implementations - I haven't gotten around to looking at them in depth, but since the normalisation pitch suggests using them as a way to achieve predictable performance, I wanted them to be available in some form in the accompanying toolchain. They definitely should be improved, and in particular thanks for your insight on implementing hashing.

AFAIK it is not possible to backdeploy protocol conformances, though - only functions 🫤

Throughout this PR (and some parts of the existing stdlib) we have this bad pattern of initializing strings through temporary buffers. I don't think this is right at all, and this work is a good opportunity to do better. We should be creating new strings by directly writing into _StringGuts storage; now is a good time to introduce internal operations to support that.

I very much agree and have mentioned exactly that in String-related PRs over the years. I'll take a look to see what can be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants