Add parallel removal of items #3008
Conversation
I'll need a hand with the API breakage check. I can't reproduce the result locally, and I'm not entirely clear how to prevent the breakage reported in CI—would we not add a new
This looks great so far, I left a couple of notes inline.
This PR refactors the logic to remove items. Instead of sequentially discovering and deleting items one by one, we concurrently traverse all directories that we find and then sequentially issue file deletions. This new deletion logic is currently unbounded and will spawn a bunch of tasks no matter the underlying system.

Benchmark using `--configuration=release` for both the `main` and `parallel-removal` branches:

```
$ rm -rf .build && git switch main && swift build --configuration=release && cp .build/release/NIOTestCLI NIOTestCLI_main
[...]
Building for production...
[134/134] Linking NIOPerformanceTester
Build complete! (32.65s)

$ rm -rf .build && git switch parallel-removal && swift build --configuration=release
[...]
Building for production...
[134/134] Linking NIOPerformanceTester
Build complete! (32.19s)

$ hyperfine --max-runs 10 --prepare "rm -rf /tmp/swift-nio && cp -r $(pwd) /tmp/swift-nio" "./NIOTestCLI_main /tmp/swift-nio" ".build/release/NIOTestCLI /tmp/swift-nio" "rm -rf /tmp/swift-nio"
Benchmark 1: ./NIOTestCLI_main /tmp/swift-nio
  Time (mean ± σ):      3.019 s ±  0.032 s    [User: 0.601 s, System: 11.123 s]
  Range (min … max):    2.985 s …  3.072 s    10 runs

Benchmark 2: .build/release/NIOTestCLI /tmp/swift-nio
  Time (mean ± σ):      2.295 s ±  0.065 s    [User: 0.673 s, System: 7.677 s]
  Range (min … max):    2.212 s …  2.370 s    10 runs

Benchmark 3: rm -rf /tmp/swift-nio
  Time (mean ± σ):      1.517 s ±  0.037 s    [User: 0.015 s, System: 1.442 s]
  Range (min … max):    1.480 s …  1.590 s    10 runs

Summary
  rm -rf /tmp/swift-nio ran
    1.51 ± 0.06 times faster than .build/release/NIOTestCLI /tmp/swift-nio
    1.99 ± 0.05 times faster than ./NIOTestCLI_main /tmp/swift-nio
```

Resolves apple#2933
We don't need to do any state tracking if we rely on a simple depth-first search. Each DFS is kicked off on a separate task, and we await all tasks within that tree before we delete the given root. This makes things faster than `rm -rf`.
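A minimal sketch of that control flow, assuming hypothetical `collectEntries(at:)` and `deleteItem(at:)` helpers rather than the actual NIOFileSystem API:

```swift
import SystemPackage

// Hypothetical stand-ins for the real directory listing and unlink calls;
// only the control flow below reflects the comment above.
func collectEntries(at path: FilePath) async throws -> (directories: [FilePath], files: [FilePath]) {
    (directories: [], files: [])
}
func deleteItem(at path: FilePath) async throws {}

// Depth-first removal: every subdirectory gets its own child task running
// this same function, and we only delete `path` itself once the task group
// has drained, i.e. once the whole subtree is gone. No shared state needed.
func removeTree(at path: FilePath) async throws -> Int {
    let entries = try await collectEntries(at: path)
    let removed = try await withThrowingTaskGroup(of: Int.self) { group in
        for file in entries.files {
            group.addTask {
                try await deleteItem(at: file)
                return 1
            }
        }
        for directory in entries.directories {
            group.addTask { try await removeTree(at: directory) }
        }
        // Summing the children's return values replaces any shared counter.
        return try await group.reduce(0, +)
    }
    try await deleteItem(at: path)  // the root directory is empty now
    return removed + 1
}
```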
1. Do not rely on Atomics to sum up deleted files. We can return values from the TaskGroup instead.
2. Close the directory handle as quickly as possible, to free up capacity.
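For the first point, a compact before/after sketch (the `Atomics` usage here is illustrative, not the removed code):

```swift
import Atomics

// Before: child tasks bump a shared atomic counter.
func countWithAtomics() async -> Int {
    let counter = ManagedAtomic(0)
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<4 {
            group.addTask { counter.wrappingIncrement(ordering: .relaxed) }
        }
    }
    return counter.load(ordering: .relaxed)
}

// After: each child task returns its own count and the group sums them,
// so no shared mutable state crosses task boundaries.
func countWithReturns() async -> Int {
    await withTaskGroup(of: Int.self) { group in
        for _ in 0..<4 {
            group.addTask { 1 }
        }
        return await group.reduce(0, +)
    }
}
```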
Adds `maxDescriptor` as an associated value to `parallel`, so we can rate-limit the number of open file descriptors (directories we are scanning at the same time).
Now rate-limits the number of open directories to `maxDescriptors`.
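A hedged sketch of what such an API shape could look like; the type and member names here are assumptions, not necessarily what the PR adds:

```swift
// A removal strategy carrying a file-descriptor limit as an associated
// value, so callers opt into bounded parallelism explicitly.
public struct RemovalStrategy: Sendable {
    enum Backing: Sendable {
        case sequential
        case parallel(maxDescriptors: Int)
    }
    let backing: Backing

    /// Discover and delete items one at a time, as on `main`.
    public static let sequential = RemovalStrategy(backing: .sequential)

    /// Scan at most `maxDescriptors` directories concurrently, bounding
    /// the number of directory handles open at any one time.
    public static func parallel(maxDescriptors: Int) -> RemovalStrategy {
        RemovalStrategy(backing: .parallel(maxDescriptors: maxDescriptors))
    }
}
```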
Attempting to not break the API
Turns out you cannot run the breakage check on macOS. In the end, the easiest thing to do was to go into the container and run the check manually.
Updated the PR's description with the latest benchmarks.
```swift
// Discover current directory and find all files/directories. Free up
// the handle as fast as possible.
let (directoriesToRecurseInto, itemsToDelete) = try await bucket.withToken {
    try await self.withDirectoryHandle(atPath: path) { directory in
```
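For context, `bucket.withToken` gates each directory scan on available capacity. A minimal actor-based sketch of that primitive (illustrative only; the `TokenBucket` in this PR is lock-based):

```swift
// At most `tokens` bodies run at once; additional callers suspend until a
// token is handed back.
actor SimpleTokenBucket {
    private var tokens: Int
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(tokens: Int) {
        precondition(tokens > 0)
        self.tokens = tokens
    }

    func withToken<R: Sendable>(_ body: @Sendable () async throws -> R) async rethrows -> R {
        if self.tokens > 0 {
            self.tokens -= 1
        } else {
            // No capacity left: park until someone returns a token.
            await withCheckedContinuation { (cc: CheckedContinuation<Void, Never>) in
                self.waiters.append(cc)
            }
        }
        defer {
            if self.waiters.isEmpty {
                self.tokens += 1
            } else {
                // Hand the token straight to the next waiter.
                self.waiters.removeFirst().resume()
            }
        }
        return try await body()
    }
}
```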
We could actually re-use part of `removeItemSequentially` here. The entire `try await self.withDirectoryHandle {}` piece can be made the same if we change when items are deleted (either in parallel in separate tasks, or still sequentially but before we dive into the next subdirectory).
I don't mind either way to be honest.
I went ahead and refactored it in 338ff35.
This is looking really good so far!
```swift
self.lock.lock()
defer { self.lock.unlock() }
```
You should be able to use `withLock` here, although you should return the waiter from the `withLock` call and then resume it.
Got that to work. Not convinced it's more readable than this simple two-liner though, but happy to adapt if that's the way.
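The resulting shape looks roughly like this; `NIOLock.withLock` is real, while the surrounding type and its `tokens`/`waiters` fields are assumptions about the bucket's internals:

```swift
import NIOConcurrencyHelpers

final class LockedBucket: @unchecked Sendable {
    private let lock = NIOLock()
    private var tokens = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    func returnToken() {
        // Mutate state inside withLock, but pass the continuation out of
        // the critical section instead of resuming it there...
        let waiter: CheckedContinuation<Void, Never>? = self.lock.withLock {
            if self.waiters.isEmpty {
                self.tokens += 1
                return nil
            }
            return self.waiters.removeFirst()
        }
        // ...so that nothing the waiter triggers runs while the lock is held.
        waiter?.resume()
    }
}
```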
Make them rely on the default behavior instead.
One nit, but otherwise this looks great!
Refactor `removeItemSequentially` to use `_collectItemsInDirectory`. Also use the new function in `ParallelRemoval.swift`
We are passing it around different concurrency contexts, and need to mark it as `Sendable` for when we enable stricter Swift 6 checks.
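A minimal sketch of what that marking implies, assuming the state is lock-guarded (field names assumed):

```swift
import NIOConcurrencyHelpers

// `@unchecked Sendable` is a promise to the compiler: every access to the
// mutable state happens while `lock` is held, so instances may safely be
// shared across concurrency contexts.
final class TokenBucket: @unchecked Sendable {
    private let lock = NIOLock()
    private var tokens: Int

    init(tokens: Int) {
        self.tokens = tokens
    }
}
```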
```diff
-    at path: FilePath
-) async throws -> Int {
-    var (subdirectories, filesRemoved) = try await self.withDirectoryHandle(
+func _collectItemsInDirectory(at path: FilePath) async throws -> ([FilePath], [FilePath]) {
```
Can you add labels to the tuples? It's too easy to mix them up otherwise.
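What that suggestion looks like on the signature (illustrative stub; this helper is reverted later in the thread):

```swift
import SystemPackage

// Labelled tuple elements make a transposed call site impossible to miss.
func _collectItemsInDirectory(
    at path: FilePath
) async throws -> (directoriesToRecurseInto: [FilePath], itemsToDelete: [FilePath]) {
    // Stub body; the real helper would walk the directory handle.
    (directoriesToRecurseInto: [], itemsToDelete: [])
}

// Call sites then document themselves:
// let items = try await _collectItemsInDirectory(at: path)
// items.directoriesToRecurseInto / items.itemsToDelete
```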
Actually, do we even need to split them? We could just return the directory entries and the caller can then iterate over them and do as they need. This would avoid allocating an array unnecessarily as well (not that I think it'll have any meaningful impact on perf).
I can get behind adding labels. If we only return a single array, then the callers would need to perform the sorting themselves, and thus replicate the bulk of what this utility does. I'm not wed to this helper existing, but I think if we have it then we might as well use it to remove some duplication?
Had a quick discussion offline. Reverted the addition of `_collectItemsInDirectory` in e9df08e.
This reverts commit 338ff35.
This LGTM, thanks @mimischi!
API break is expected; we are changing the requirement to add the strategy. These fall out of that:
Introduce parallel removal of items, instead of only doing a sequential removal (and discovery) one by one
Motivation:
#2933 mentioned that this function call appears to be slow. If we attempt to delete a directory with lots of subdirectories, we end up discovering each subdirectory one by one on a single thread.
Modifications:
This PR refactors the logic to remove items. Instead of sequentially discovering and deleting items one by one, we traverse all directories up to a maximum concurrency, and issue file deletions in parallel within each directory.
Result:
The new concurrent implementation was measured to be ~1.4x faster than `FileManager`, ~1.5x faster than `rm -rf`, and ~3x faster than the current sequential implementation on `main`. This new concurrent implementation does use the most CPU out of all the approaches compared.

Results are broken down into two sections. We delete a copy of the `swift-nio` repository, where the `.build/` folder has grown to a large size, providing lots of additional files and subdirectories. We produce a single CLI using `--configuration=release` that can switch between the `.sequential` mode (what is currently available on the `main` branch), `.parallel` (the new concurrent implementation in this PR), and `filemanager` (which uses `Foundation.FileManager` as a comparison).

Resolves #2933
Normal size

Description: The `swift-nio` library as found when building the CLI used for benchmarking.

Benchmark run with: 3579 directories, 11657 files
Large size

Description: The `swift-nio` library as found after re-running the `.github` workflow many times locally. I'm unable to reproduce this directory size, but have benchmarked against it.

Benchmark run with: 15260 directories, 46842 files