A network computation feature for y-cruncher #7
Is it possible to make a distributed computing project for finding constants in y-cruncher?

Comments
The storage is not a large problem. Work for a given range of n would be sent out, computed, and returned, and only one server would store the data, with backups. The largest problem would be communication, since only the result and its verification would be sent. By how much would that help reduce file size?
This requires a digit extraction algorithm. Unfortunately, no efficient digit extraction algorithm is known for the majority of constants, including Pi. The most efficient such algorithm for Pi (the BBP formula) only works in base 2, and even that requires quadratic run-time to obtain all the digits from 1 to N. Using the classic approaches: in order to compute the digits of Pi between 72 trillion and 73 trillion on a single computer, that computer will need O(73 trillion) bytes of data storage and quasi-linear run-time. And doing so will give you all the digits from 1 to 72 trillion as well for free, thereby defeating the initial goal of distribution.
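For reference, a minimal sketch of the kind of base-16 digit extraction being described (the BBP formula) might look like the following C++. This is not y-cruncher code; it uses plain doubles, which only hold up for modest positions, but it illustrates the point: extracting digits near position d costs work proportional to d (times a log factor), so sweeping every position from 1 to N costs quadratic time overall, while the classic full computation gets all N digits in quasi-linear time.

```cpp
// Minimal BBP-style hex digit extraction for Pi (illustrative only, doubles).
#include <cmath>
#include <cstdint>
#include <cstdio>

// Compute (16^p) mod m using binary exponentiation.
static double pow16_mod(int64_t p, double m) {
    double result = 1.0;
    double base = std::fmod(16.0, m);
    while (p > 0) {
        if (p & 1) result = std::fmod(result * base, m);
        base = std::fmod(base * base, m);
        p >>= 1;
    }
    return result;
}

// Fractional part of sum_{k>=0} 16^(d-k) / (8k + j).
static double series(int j, int64_t d) {
    double s = 0.0;
    for (int64_t k = 0; k <= d; ++k) {            // head terms: modular arithmetic
        double denom = 8.0 * k + j;
        s += pow16_mod(d - k, denom) / denom;
        s -= std::floor(s);                        // keep only the fractional part
    }
    for (int64_t k = d + 1; k <= d + 100; ++k) {   // tail terms shrink like 16^(d-k)
        s += std::pow(16.0, double(d - k)) / (8.0 * k + j);
    }
    return s - std::floor(s);
}

int main() {
    int64_t d = 1000;  // extract hex digits starting after position d
    double x = 4.0 * series(1, d) - 2.0 * series(4, d) - series(5, d) - series(6, d);
    x -= std::floor(x);
    if (x < 0) x += 1.0;
    // Print the next 8 hex digits of the fractional part of 16^d * Pi.
    std::printf("hex digits of Pi after position %lld: %08X\n",
                (long long)d, (unsigned)(x * 4294967296.0));
    return 0;
}
```

Each call is independent, which is exactly why it looks attractive for distribution, but it only yields digits at one position and only in base 16/base 2.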
For computations upwards of 100 trillion digits and beyond, the expectation is that no single system will be large enough to hold either the final output (50 TB) or the intermediate steps (500 TB). In short, you cannot "just hand out" different parts of Pi to different computers to compute. They must all work together from start to finish. This is akin to building a 100-story skyscraper: you cannot "just hand out" different floors to different contractors to build. They must work together to build each floor before they can build the floor above it. The parallelism is horizontal, not vertical.
Okay, I see now. No algorithm that uses a summation can be used, since the results need to be added together. In a future version of y-cruncher, could you see if you could implement the Gauss-Legendre algorithm? I think it could be used to distribute the computation of Pi.
The Gauss-Legendre algorithm is even worse. Each iteration depends on the previous iteration. The only exploitable parallelism is within each iteration, and the iterations are extremely communication-bound.
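To make the dependency concrete, here is a minimal double-precision sketch of the Gauss-Legendre (AGM) iteration. It is illustrative only (a record computation uses arbitrary-precision operands), but it shows that every iteration consumes all of the previous iteration's state, so there is nothing to hand out across machines; the only parallelism is inside the big multiplications and square roots within one iteration.

```cpp
// Gauss-Legendre (AGM) iteration for Pi, sketched in doubles.
// At double precision this converges after only 3 iterations.
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0;
    double b = 1.0 / std::sqrt(2.0);
    double t = 0.25;
    double p = 1.0;

    for (int i = 0; i < 4; ++i) {
        // Every quantity below depends on the *previous* iteration's a, b, t, p,
        // so iterations cannot be distributed to different machines.
        double a_next = (a + b) / 2.0;
        double b_next = std::sqrt(a * b);
        double t_next = t - p * (a - a_next) * (a - a_next);
        double p_next = 2.0 * p;

        a = a_next; b = b_next; t = t_next; p = p_next;

        double pi = (a + b) * (a + b) / (4.0 * t);
        std::printf("iteration %d: %.15f\n", i + 1, pi);
    }
    return 0;
}
```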
How feasible is computation on new object storage or distributed parallel file systems like Ceph, SeaweedFS, or OpenIO while one compute node still does the crunching? Latency will surely be higher, but throughput will be better than block storage or traditional file systems.
The only thing that matters is the sustained I/O bandwidth. Latency does matter, but not nearly as much. So if the system can sustain 10 GB/s+ of I/O, you should be in good shape.
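For a rough sense of scale (using the ~500 TB intermediate-storage figure mentioned above): at a sustained 10 GB/s, a single full pass over 500 TB takes about 500,000 GB / 10 GB/s = 50,000 seconds, or roughly 14 hours. Per-access latency adds microseconds to milliseconds on top of hours of streaming, which is why the sustained bandwidth is the figure that dominates.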
What I/O benchmark configuration, say in FIO, would represent real-life y-cruncher well? I can't work out what exactly the strided read/write in the y-cruncher benchmark corresponds to (the sequential R/W in y-cruncher also does not match well with dd, for example). Does IOPS factor in too?
y-cruncher's own I/O benchmark is the most representative since it literally calls the same routines. |
I found what the issue was. Last thing: what is the "RawFile::close(): Unable to close file." error? The I/O bench finished as expected, but this error keeps showing up. Is it because of POSIX non-compliance? Also, Legacy R0/3 seems quite a bit faster than RAID0; why is it legacy?
It just means that the POSIX close() function returned an error value. I've never really seen it, and the check is only there as good error-checking practice. So I wouldn't know what's causing it in your case.
It's legacy because it's slow (for local disks at least) and a huge pile of tech debt. So it's no longer maintained. It'll get removed as soon as the framework changes in a way that makes it a non-trivial effort to update it. As for performance, the RAID0 has checksums enabled by default to catch silent data corruption (which happens despite hardware CRC). You can try turning that off to see if it speeds things up, but I don't recommend turning it off for any serious computations.
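As an aside, the checksum idea is simple to picture. The sketch below is a generic C++ illustration (not y-cruncher's actual on-disk format or hash function): each block gets a checksum on write, and it is re-verified on read, so a silent bit flip becomes a detectable error instead of corrupting the computation.

```cpp
// Generic per-block checksumming sketch (not y-cruncher's format).
// A 64-bit FNV-1a hash is stored alongside each block; reads re-hash the
// payload and compare, turning silent corruption into a detectable error.
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

static uint64_t fnv1a(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint64_t h = 1469598103934665603ull;          // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;                    // FNV prime
    }
    return h;
}

struct Block {
    std::vector<uint8_t> payload;
    uint64_t checksum;
};

Block write_block(const std::vector<uint8_t>& data) {
    return Block{data, fnv1a(data.data(), data.size())};
}

std::vector<uint8_t> read_block(const Block& b) {
    if (fnv1a(b.payload.data(), b.payload.size()) != b.checksum)
        throw std::runtime_error("silent data corruption detected");
    return b.payload;
}
```

The trade-off is exactly the one described above: the hash costs CPU time on every read and write, which is why disabling it can be faster but is risky for a serious computation.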
https://github.com/chrislusf/seaweedfs/wiki/FUSE-Mount#supported-features Question: I think object storage is likely to be the solution that would make "network computation" possible, at least for the I/O component. The y-cruncher program is an edge case of very high I/O intensity and bandwidth requirements, and proper support would prove that object storage is viable for basically every real-life workload that isn't latency- or IOPS-bound (see https://www.openio.io/blog/storage-systems-performance). I would really like this to be properly investigated. I got a file I/O failure error message while using CephFS; I don't remember what it was. I will reproduce it and post here to check whether it is caused by unsupported low-level POSIX operations.
The program uses a dedicated I/O thread for each logical path you give it. So if the device is fast enough that a single thread cannot keep up, then using multiple threads (by giving it multiple logical paths) will speed it up. I believe this might be the case for NVMe drives and such, though a sufficiently large hard drive array presenting itself as a single logical path to y-cruncher can achieve the same. The Threshold Strided patterns are inherently slower from the device side, so it's easier for one thread to keep up with all the work.
This is actually most important on Windows. The default buffering is so bad that it's completely unusable. Unfortunately, non-sequential access is unavoidable for y-cruncher. Thus the need to go through this entire mess.
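To illustrate the "one dedicated I/O thread per logical path" behavior described a couple of comments up, here is a hedged C++ sketch of the general pattern (not y-cruncher's actual code): the data is split across the configured paths and each path gets its own writer thread, so several paths can overlap their I/O even if a single thread cannot saturate the device.

```cpp
// Sketch: one writer thread per logical path (illustrative only).
#include <fstream>
#include <string>
#include <thread>
#include <vector>

void write_across_paths(const std::vector<std::string>& paths,
                        const std::vector<char>& buffer) {
    if (paths.empty()) return;
    const size_t n = paths.size();
    const size_t slice = buffer.size() / n;
    std::vector<std::thread> workers;

    for (size_t i = 0; i < n; ++i) {
        const char* data = buffer.data() + i * slice;
        const size_t len = (i + 1 == n) ? buffer.size() - i * slice : slice;
        workers.emplace_back([path = paths[i], data, len]() {
            std::ofstream out(path, std::ios::binary);
            out.write(data, static_cast<std::streamsize>(len));   // one thread, one path
        });
    }
    for (auto& t : workers) t.join();   // wait for all paths to finish
}
```

With a single path there is only one I/O thread, so a fast device can sit idle waiting on it; giving y-cruncher several logical paths on the same device effectively multiplies the I/O threads, which is presumably why splitting the same directory into four paths (mentioned further down) helped.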
(https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/ceph_file_system_guide_technology_preview/what_is_the_ceph_file_system_cephfs#differences_from_posix_compliance and https://docs.ceph.com/en/latest/cephfs/posix/)
(https://github.com/chrislusf/seaweedfs/wiki/FUSE-Mount#supported-features) It seems that object storage systems do all support non-sequential access through their FUSE interfaces. Most programs work very well on object storage FUSE interfaces without error (whether they do random reads/writes, appends, etc.) as long as IOPS is not a factor. It is just some low-level POSIX operations that cause the error, and only on y-cruncher, even with
This is in progress.
I found that this message showed up in Timothy Mullican's Pi record computation screenshot. I guess it's not a big deal.
I divided the same directory into four paths and it seems to have gone past that phase. I will continue to see if it poses any problems.
So it looks like writing the digits to disk also uses Raw I/O, and I don't see a way to turn off the Raw I/O. Could this be the problem?

The digit output doesn't get as much configurability attention as the swap paths since it's not as performance critical.
The relevant code is: … which calls: … It sets the sector alignment to … The error is then thrown from: …
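Since the actual code references did not survive here, the following is only a generic POSIX-style illustration (not the real RawFile implementation) of the kind of call sequence being discussed: open with O_DIRECT, pre-size the file with ftruncate() (which is where a filesystem size limit or missing POSIX feature would surface), write sector-aligned buffers, and check the return value of close().

```cpp
// Generic illustration of "raw" (unbuffered, sector-aligned) file I/O.
// This is NOT the actual RawFile code, just the shape of the calls involved.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    const size_t SECTOR = 4096;                      // assumed sector alignment
    const char* path = "swapfile.bin";               // hypothetical swap file

    int fd = open(path, O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) { std::perror("open"); return 1; }

    // Pre-size the file. On a filesystem with a maximum-file-size limit (or
    // incomplete POSIX support), this is a call that can fail.
    if (ftruncate(fd, 1 << 20) != 0) { std::perror("ftruncate"); return 1; }

    // O_DIRECT requires the buffer, offset, and length to be sector-aligned.
    void* buf = nullptr;
    if (posix_memalign(&buf, SECTOR, SECTOR) != 0) return 1;
    std::memset(buf, 0xAB, SECTOR);

    if (pwrite(fd, buf, SECTOR, 0) != (ssize_t)SECTOR) { std::perror("pwrite"); return 1; }

    // This is the kind of check that produces "Unable to close file." style
    // errors when close() itself reports a failure.
    if (close(fd) != 0) { std::perror("close"); return 1; }

    std::free(buf);
    return 0;
}
```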
I would say there is a need to strip away all remnants of Raw I/O when it is set to false. I also saw something related to the RawFile namespace in the error message 57% into the Pi series computation of another run (1T hex and 1.2T dec). This means raw I/O also exists during the computation and is thus prone to errors. I did some searching, and object storage systems support most POSIX operations, because not supporting them would lead to errors in other applications as well; yet it definitely shows an error for RawIO underneath POSIX. The next milestone for Pi will definitely be nearly petascale, and traditional storage servers will not work out. It has to be parallel, using distributed, fault-tolerant parallel file systems or object storage systems. These filesystems usually do not support low-level raw I/O underneath their POSIX interfaces, because normal file reads/writes are done in the most efficient way underneath the distributed system and abstracted away at the user interface. This is arguably the "network computation" y-cruncher could utilize. Could you do some searches or screening in the codebase to see if there are any remnants of Raw I/O left when it is disabled?
The only things that use Raw I/O are the swap files and the output digits. Disabling Raw I/O in the swap configuration should completely disable it for the computation. So I'd expect it to not work at all, or work completely. Not 57% in. Off the top of my head, the only other things which touch the file system are:
Adding configurability to the output files will take some work since there's no menu for it at the moment.
What this calls for is a new Far Memory framework optimized for network storage. I foresaw this years ago and added the interface for it. Currently, there are two implementations: the legacy R03 and the newer Disk Raid 0. The interface actually includes the gather/scatter operations, which the math algorithms call directly and which could potentially be forwarded to a filesystem implementation that natively supports them. Currently, both the R03 and Disk Raid 0 implementations break them up serially, relying only on NCQ for any optimizations. So it seems that things have finally caught up. Not sure if/when I'll get the time to look at the distributed filesystem APIs.
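A hypothetical far-memory interface along those lines might look roughly like the sketch below (the names are invented for illustration; this is not y-cruncher's actual interface): a backend exposes bulk read/write plus gather/scatter over (offset, length) runs, and a filesystem-backed implementation can serialize the runs while a distributed or object-storage backend could dispatch them natively.

```cpp
// Hypothetical far-memory backend interface (illustrative; names invented).
// The math layer asks for gather/scatter over runs of (offset, length); a
// disk-backed implementation may loop over the runs serially, while a
// network/object-storage backend could issue them as parallel requests.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Run {
    uint64_t offset;   // byte offset in far memory
    size_t   length;   // bytes in this run
};

class FarMemoryBackend {
public:
    virtual ~FarMemoryBackend() = default;

    virtual void read (uint64_t offset, void* dst, size_t bytes) = 0;
    virtual void write(uint64_t offset, const void* src, size_t bytes) = 0;

    // Default gather/scatter: break the request into serial reads/writes,
    // mirroring what a plain filesystem backend would do.
    virtual void gather(const std::vector<Run>& runs, void* dst) {
        char* out = static_cast<char*>(dst);
        for (const Run& r : runs) { read(r.offset, out, r.length); out += r.length; }
    }
    virtual void scatter(const std::vector<Run>& runs, const void* src) {
        const char* in = static_cast<const char*>(src);
        for (const Run& r : runs) { write(r.offset, in, r.length); in += r.length; }
    }
};
```

An S3-style backend, as suggested later in the thread, would then just be another implementation of this interface.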
Error successfully reproduced.
Checkpoint:
Strange to find that. Maybe it requires a look at all the constants, custom formulas, and the binary splitting algorithms.
I've never seen this before, but does your file system impose a size limit for files?
|
While it is not a low-level operation, this is probably something that object storage does not support well. https://indico.cern.ch/event/304944/contributions/1672715/attachments/578894/797101/dfs.pdf
The problems here seem to be stemming from a non-filesystem trying to present itself as a filesystem, but coming up short on multiple fronts. There's only so much that can be done to bridge gaps like that before it simply won't work. Both the R03 and Raid0 far memory implementations are filesystem implementations that use a filesystem interface and thus require that the underlying system adequately supplies that interface. If it can't, then they cannot be used and something else is needed. Which in this case means you're reaching the limit of y-cruncher's current capability, as it doesn't have any far-memory implementation for such object storage.
Alright, understood. Other object storage systems are inherently based on S3; most FUSE implementations they provide are wrappers around S3. If you could implement a new far-memory configuration, S3 itself could be a target, since there is an increasing trend of libraries, including Spack and TensorFlow, directly accessing the S3 protocol. But this requires a LOT of development.
I tried y-cruncher on a block storage layer (RADOS Block Device) formatted with XFS on top of my object storage (Ceph) system instead of the CephFS FUSE interface, and it is 3-4 times slower on the same infrastructure. Could you make a quick patch where, if RawIO is off for far memory, the digit writer also turns it off? The people playing with distributed filesystems could then get by with this for a long time until a new interface comes along. An interface talking directly to the S3 and/or Swift API could then be considered in the long term as object storage adoption expands.
And at the same time, could you reconfirm something? When using Chudnovsky (reduced), with RawIO off in the settings (leaving the ftruncate error message aside for now), is y-cruncher supposed to use this set_size() call at all?
There's a new build that I'm still testing that will add RawIO configurability to the digit output. Give me a couple of weeks as I'll be traveling. There's no API usage difference between the normal and reduced algorithms. set_size() would be called everywhere, on every single swap file. The only difference would be the sizes and instances of the allocations. I don't recall off the top of my head, but the reduced algorithms may create larger files (though fewer of them, since the overall storage usage is less). So if that pushes over the limit, then it could be the cause of this error.
Great, after the new build is out, I will continue testing and see if issues are fixed. |
Just an update. The new build is taking longer than usual. Since the project has been on hiatus for about 2 years now, rolling forward 2 years of compiler updates isn't that straightforward. So I'm currently running my "long" suite of integration tests.
I can finally reproduce that this error is indeed caused by a size limit. There was a Ceph configuration setting related to maximum file size that we didn't know about.