-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pipe name collisions (all pipe instances are busy) #481
Comments
I think it's better to make this configurable rather than fixing a specific structure. Also, when processing large number of files like 100K concurrently, would it not be better to extract the logic into a CLI and run it through containers, so it can also be scalable on the cloud as well and any temporary or undeleted files won't be a concern? In my opinion if the library supports processing just one file at a given time, that's good enough. The run instances should be encapsulated from each other in isolated environments. Otherwise the question will be, how many files this library should support to process, 1K, 100K, 10M? But maybe you have a more specific use case? |
@ep1kt3t0s Thanks for weighing in.
In my opinion surfacing such implementation details and forcing the library user to choose sensible values would be an unnecessary burden. I'd prefer if the library would just work without issues without jumping through additional configuration hoops. Is there any reason why you would be opposed to using 16 random characters in the pipe names?
I'd rather not invest engineering resources into the additional complexity and orchestration of containers when there's a simpler solution at hand. We're never going to run our workload in the cloud anyway.
Why shouldn't the library support processing multiple files concurrently when the effort to do so is low? Requiring single file processing severely limits the usefulness of the library. E.g. you wouldn't be able to create a desktop application that allows concurrent batch processing of multimedia files. Shipping a container runtime with a desktop application does not sound sensible to me. Besides, the intent of the library author is to support concurrent processing, which is evident from the fact that the pipe name is randomized at all. Otherwise, two fixed pipe names (one for input and one for output) would suffice.
I don't understand why concurrent processing in a single environment would require to define a limit of files that can be processed. The important factor is how probable a name collision is when processing X files concurrently. Processing more files just increases the likeliness that at least one collision occurs, but doesn't affect the overall probability. |
There can be a default convention with a possibility of configuration change in the library but my point stands the same. You don't want to invest engineering hours to support processing 100K files but then you ask for others to change the library to support your extensive use case. |
There's a difference between a single line change (which I am happy to do myself) and the work needed for infrastructure to spin up and orchestrate a swarm of containers. I'd rather invest into a simpler solution, both in terms of implementation effort as well as future maintenance. As there haven't been any objections to my proposed changes, I went ahead and submitted a PR. |
When processing a large number of files (> 100,000) in parallel (>= 8 threads) using pipes as inputs and outputs we frequently get this exception:
We're running .NET 6.0 on Windows 10/11.
I believe this problem was introduced by PR #374, which has shortened the random part of the pipe name to 5 hex digits. This leads to multiple pipes accidentally sharing the same name.
With 5 hex digits there are only 16^5 = 1,048,576 different pipe names to choose from. To demonstrate the high likeliness of collisions, consider the following program that simulates the concurrent use of pipes:
(run it on .NET Fiddle)
Here are a couple of results:
I understand that using a full GUID for the random part has hit a name length limitation on Mac OS (#365). But shortening the random part to only 5 hex digits is unwarranted (IMO).
This is the reported error from the issue:
The pipe path here has a length of 107 characters. The maximum length is 104 characters. Dropping the dashes from the GUID string using
ToString("N")
already loses 4 characters, barely bringing it into the required length range with 103 characters. To get some breathing room, dropping an additional 16 of the GUID characters brings us down to 87 characters, well within the allowed length range.This still leaves us with 16 hex characters for the random part, which massively increases the number of unique pipe names to 18,446,744,073,709,551,616. Collisions should be orders of magnitude less likely to occur.
(Note that to maximize randomness, one should not use the first 16 characters of the GUID string. This would include the character at index 13, which is always
4
and provides no additional randomness. Instead use e.g. the last 16 characters.)Therefore my suggestion is to use 16 characters of the GUID instead of 5.
@ep1kt3t0s Pinging you to get your thoughts on this, as you submitted the change that shortened the GUID to 5 hex characters.
The text was updated successfully, but these errors were encountered: