A network computation feature for y-cruncher #7

Open · fffffgggg54 opened this issue Dec 24, 2017 · 30 comments

@fffffgggg54

Is it possible to make a distributed computing project for finding constants in y-cruncher?

@Mysticial changed the title from "A metwork computation feature for y-cruncher" to "A network computation feature for y-cruncher" on Dec 24, 2017
@fffffgggg54 (Author)

The storage is not a large problem. Work for a given n range would be sent out, computed, and returned. Only one server would store the data, with backups. The largest problem would be communication. Only the result and verification would be sent. By how much does it help reduce file size?

@Mysticial (Owner) commented Dec 24, 2017

Work for a given n range would be sent out, computed, and returned.

This requires a digit extraction algorithm. Unfortunately, no efficient such algorithm is known for the majority of constants (including Pi).

The most efficient such algorithm for Pi works only in base 2. And even that requires quadratic run-time to obtain all the digits from 1 to N.
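
(Presumably this refers to the BBP family of formulas. The classic Bailey-Borwein-Plouffe formula extracts hexadecimal/binary digits of Pi at an arbitrary position without computing the preceding ones:

    \pi = \sum_{k=0}^{\infty} \frac{1}{16^k} \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right)

Bellard's base-2 variant is faster, but the cost of extracting the digit at position n still grows roughly linearly with n, hence the quadratic total for all digits 1 to N mentioned above.)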

Using the classic approaches: in order to compute the digits of Pi between 72 trillion and 73 trillion on a single computer, that computer will need O(73 trillion) bytes of data storage and quasi-linear run-time. And doing so will give you all the digits from 1 to 72 trillion for free as well - thereby defeating the initial goal of distribution.

Only one server would store the data, with backups.

For computations upwards of 100 trillion digits and beyond, the expectation is that no single system will be large enough to hold either the final output (50 TB) or the intermediate steps (500 TB).


In short, you cannot "just hand out" different parts of Pi to different computers to compute. They must all work together from start to finish. This is akin to building a 100-story skyscraper: you cannot "just hand out" different floors to different contractors to build. They must work together to build each floor before they can build the floor above it. The parallelism is horizontal, not vertical.

@fffffgggg54 (Author)

Okay, I see now. No algorithm that uses summation can be split up that way, since the results need to be added together. In a future version of y-cruncher, could you see if you could implement the Gauss-Legendre algorithm? I think it could be used to distribute the computation of Pi.

@Mysticial (Owner)

The Gauss-Legendre algorithm is even worse. Each iteration is dependent on the previous iteration. The only exploitable parallelism is within each iteration, and that is extremely communication-bound.
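
To make the dependency concrete, here is a minimal double-precision sketch of the Gauss-Legendre (AGM) iteration (illustrative only; a real run uses arbitrary-precision arithmetic, and this is not y-cruncher code). Every quantity in one pass needs the results of the previous pass, so the iterations themselves cannot be farmed out to different machines:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Gauss-Legendre: a(0)=1, b(0)=1/sqrt(2), t(0)=1/4, p(0)=1 */
        double a = 1.0, b = 1.0 / sqrt(2.0), t = 0.25, p = 1.0;
        for (int i = 1; i <= 4; i++) {               /* digits roughly double per pass */
            double a_next = 0.5 * (a + b);           /* needs previous a and b         */
            b = sqrt(a * b);                         /* needs previous a and b         */
            t -= p * (a - a_next) * (a - a_next);    /* needs previous t, p, a         */
            p *= 2.0;
            a = a_next;
            printf("iteration %d: pi ~= %.15f\n", i, (a + b) * (a + b) / (4.0 * t));
        }
        return 0;
    }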

@ehfd commented Mar 30, 2021

How feasible is computation on newer object storage or distributed parallel file systems like Ceph, SeaweedFS, or OpenIO, while one compute node still does the crunching? Latency will surely be higher, but throughput will be better than with block storage or traditional file systems.
Also, how might y-cruncher work with S3-compatible file systems (https://github.com/kahing/goofys), since this is the recommended object storage gateway?

@Mysticial (Owner)

The only thing that matters is the sustained I/O bandwidth. Latency does matter, but not nearly as much. So if the system can sustain 10GB/s+ of I/O, you should be in good shape.

@ehfd commented Apr 1, 2021

The only thing that matters is the sustained I/O bandwidth. Latency does matter, but not nearly as much. So if the system can sustain 10GB/s+ of I/O, you should be in good shape.

What I/O benchmark configuration in, say, FIO would represent real-life y-cruncher well? I can't work out what exactly the strided read/write in the y-cruncher benchmark is (the sequential R/W in y-cruncher also doesn't match dd well, for example). Does IOPS factor in too?

@Mysticial (Owner)

y-cruncher's own I/O benchmark is the most representative since it literally calls the same routines.

@ehfd commented Apr 10, 2021

I found what the issue was.
For people who are using distributed file systems or object storage, turn off "Raw I/O" in Far Memory Config.
I think object storage could be the closest we could get to distributed computing since all we need is mediocre CPU performance.
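
For anyone else hitting this, the relevant switch sits in the FarMemoryConfig block (same syntax as the checkpoint dump further down this thread; the path here is just an example, and other fields such as InterleaveWidth, BufferPerLane, and BufferAllocator are omitted):

    FarMemoryConfig : {
        Framework : "disk-raid0"
        Checksums : "true"
        RawIO : "false"        //  set to "false" for FUSE / distributed filesystems
        Lanes : [
            {   //  Lane 0
                Path : "/mnt/cephfs/cruncher"    //  example path
            }
        ]
    }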

Last thing: what is the "RawFile::close(): Unable to close file." error? The I/O bench finished as expected but this error keeps showing up. Is it because of POSIX non-compliance?

Legacy R0/3 also seems quite a bit faster than RAID0; why is it legacy?

@Mysticial (Owner)

RawFile::close(): Unable to close file.

It just means that the POSIX close() function returned an error value. I've never really seen it and it's there just as error-checking practice. So I wouldn't know what's causing it in your case.
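
For context, the pattern is just standard error checking, roughly like this (a generic sketch, not the actual y-cruncher code). close() can legitimately fail, e.g. when a network/FUSE filesystem flushes buffered writes at close time, which is why the return value is reported rather than ignored:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int close_checked(int fd) {
        if (close(fd) != 0) {   /* report errno instead of silently ignoring it */
            fprintf(stderr, "Unable to close file: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }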

Legacy R0/3 also seems quite a bit faster than RAID0; why is it legacy?

It's legacy because it's slow (for local disks at least) and a huge pile of tech debt. So it's no longer maintained. It'll get removed as soon as the framework changes in a way that makes it a non-trivial effort to update.

As for performance, the RAID0 by default has checksums enabled to catch silent data corruption (which happens despite hardware CRC). You can try turning that off to see if it speeds things up. But I don't recommend turning it off for any serious computations.
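
The idea behind the checksums is roughly this (an illustrative sketch, not y-cruncher's actual scheme): compute a checksum for each block as it is written, store it alongside the data, and recompute and compare on read so that silently corrupted data is caught instead of propagating into the computation:

    #include <stddef.h>
    #include <stdint.h>

    /* Simple Fletcher-style checksum over a block of bytes. */
    static uint64_t block_checksum(const uint8_t *data, size_t len) {
        uint64_t lo = 0, hi = 0;
        for (size_t i = 0; i < len; i++) {
            lo = (lo + data[i]) % 0xffffffffu;
            hi = (hi + lo) % 0xffffffffu;
        }
        return (hi << 32) | lo;
    }

    /* On write: store block_checksum(block) next to the block.
     * On read:  recompute and compare; a mismatch means the storage
     *           returned silently corrupted data.                    */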

@ehfd commented May 4, 2021

It just means that the POSIX close() function returned an error value. I've never really seen it and it's there just as error-checking practice. So I wouldn't know what's causing it in your case.

https://github.com/chrislusf/seaweedfs/wiki/FUSE-Mount#supported-features
@Mysticial Any clues here?
https://docs.ceph.com/en/latest/cephfs/posix/
Here as well.

Question: when RawIO is disabled ("false"), do InterleaveWidth and BytesPerSeek change the performance of the filesystem when there is only one path in RAID0 and the underlying file system manages fault tolerance and parallelism? If yes, how? Making multiple paths point to the same filesystem and linking them with RAID0 (perhaps extracting some more bandwidth through parallelism) also seems to help; this seems to multiply the bandwidth to some extent, while only the Threshold Strided Write stays the same. How do InterleaveWidth and BytesPerSeek take part in this situation?

For example, when RawIO is disabled and I run the y-cruncher IO benchmark on BeeGFS (a distributed parallel fault-tolerant file system) over NFS with one lane specified, I observe a situation where Sequential Write, Sequential Read, and Threshold Strided Read are all around 2.5-3 GB/s and only Threshold Strided Write is stuck around 150 MB/s (which I suspect is the write speed of one hard drive lane in the underlying parallel file system).

I think when RawIO is "false" (or at least under a separate option for object storage with only the most basic POSIX operations), the POSIX operations performed should refrain from random writes, minimize appending to files, and avoid low-level interfaces such as fsync, keeping the interface close to what the object storage API does best (basic POSIX operations like sequential file writes and creating new files/directories); otherwise it will throw loads of errors during computation.

Object storage is likely the solution that would make "network computation" possible, at least for the I/O component. The y-cruncher program is an edge case of very high I/O intensity and bandwidth requirements, and proper support would show that object storage is viable for basically every real-life workload that doesn't require low latency or high IOPS (see https://www.openio.io/blog/storage-systems-performance). I would really like this to be properly investigated.

I got an error message of file I/O failure while using CephFS, don't remember what it was. Will reproduce and post here to check if this is because of unsupported low-level POSIX operations.

@Mysticial (Owner)

Question: when RawIO is disabled ("false"), do InterleaveWidth and BytesPerSeek change the performance of the filesystem when there is only one path in RAID0 and the underlying file system manages fault tolerance and parallelism? If yes, how?

BytesPerSeek affects the algorithm selection. That's way above the I/O layer, so the answer is yes, it will have an effect. With only one logical path, InterleaveWidth will still affect how the memory is copied from the DMA buffer to/from the program's compute memory. But I believe the disk access pattern remains the same, all other factors equal. (InterleaveWidth may put constraints on other parameters.)

Making multiple paths point to the same filesystem and linking them with RAID0 (perhaps extracting some more bandwidth through parallelism) also seems to help; this seems to multiply the bandwidth to some extent, while only the Threshold Strided Write stays the same.

The program uses a dedicated I/O thread for each logical path you give it. So if the device is fast enough that a single thread cannot keep up, then using multiple threads (by using multiple logical paths) will speed it up. I believe this might be the case for NVMe drives and such, though a sufficiently large hard drive array presenting itself as a single logical path to y-cruncher can achieve the same.

The Threshold Strided patterns are inherently slower from the device-side. So it's easier for one thread to keep up with all the work.
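
A rough sketch of what "a dedicated I/O thread for each logical path" implies (assumed structure for illustration, not y-cruncher's actual code; the paths are examples): each lane gets its own worker thread, so pointing two lanes at the same filesystem means two threads issuing requests concurrently (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static void *lane_worker(void *arg) {
        const char *path = arg;
        /* in the real program: pop I/O requests for this lane from a queue
         * and issue reads/writes against files under `path`               */
        printf("I/O worker running for lane %s\n", path);
        return NULL;
    }

    int main(void) {
        const char *lanes[] = { "/mnt/beegfs/lane0", "/mnt/beegfs/lane1" };  /* example paths */
        pthread_t tid[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&tid[i], NULL, lane_worker, (void *)lanes[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }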

I think when RawIO is "false" (or at least under a separate option for object storage with only the most basic POSIX operations), the POSIX operations performed should refrain from random writes, minimize appending to files

This is actually most important on Windows. The default buffering is so bad that it's completely unusable.

Unfortunately, non-sequential access is unavoidable for y-cruncher. Thus the need to go through this entire mess.

@ehfd commented May 5, 2021

However, there are a few places where CephFS diverges from strict POSIX semantics for various reasons:
If a client is writing to a file and fails, its writes are not necessarily atomic. That is, the client may call write(2) on a file opened with O_SYNC with an 8 MB buffer and then crash and the write may be only partially applied. (Almost all file systems, even local file systems, have this behavior.)
In shared simultaneous writer situations, a write that crosses object boundaries is not necessarily atomic. This means that you could have writer A write “aa|aa” and writer B write “bb|bb” simultaneously (where | is the object boundary), and end up with “aa|bb” rather than the proper “aa|aa” or “bb|bb”.
Sparse files propagate incorrectly to the stat(2) st_blocks field. Because CephFS does not explicitly track which parts of a file are allocated/written, the st_blocks field is always populated by the file size divided by the block size. This will cause tools like du(1) to overestimate consumed space. (The recursive size field, maintained by CephFS, also includes file “holes” in its count.)
When a file is mapped into memory via mmap(2) on multiple hosts, writes are not coherently propagated to other clients’ caches. That is, if a page is cached on host A, and then updated on host B, host A’s page is not coherently invalidated. (Shared writable mmap appears to be quite rare–we have yet to hear any complaints about this behavior, and implementing cache coherency properly is complex.)
CephFS clients present a hidden .snap directory that is used to access, create, delete, and rename snapshots. Although the virtual directory is excluded from readdir(2), any process that tries to create a file or directory with the same name will get an error code. The name of this hidden directory can be changed at mount time with -o snapdirname=.somethingelse (Linux) or the config option client_snapdir (libcephfs, ceph-fuse).

(https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/ceph_file_system_guide_technology_preview/what_is_the_ceph_file_system_cephfs#differences_from_posix_compliance and https://docs.ceph.com/en/latest/cephfs/posix/)

With "weed mount", the files can be operated as a local file. The following operations are supported.
file read / write
create new file
mkdir
list
remove
rename
chmod
chown
soft link
hard link
display free disk space

(https://github.com/chrislusf/seaweedfs/wiki/FUSE-Mount#supported-features)

It seems like object storage systems do all support non-sequential access through their FUSE interfaces. Most programs work very well on object storage FUSE interfaces without error (whether they do random reads/writes, appends, etc.) as long as IOPS is not a factor. It is just some low-level POSIX operations that cause errors, and only in y-cruncher, even with RawIO off.

I got an error message of file I/O failure while using CephFS, don't remember what it was. Will reproduce and post here to check if this is because of unsupported low-level POSIX operations.

This is in progress.

@ehfd commented May 8, 2021

RawFile::close(): Unable to close file.
It just means that the POSIX close() function returned an error value. I've never really seen it and it's there just as error-checking practice. So I wouldn't know what's causing it in your case.

I found that this message showed up in a screenshot from Timothy Mullican's Pi record computation. I guess it's not a big deal.

I got an error message of file I/O failure while using CephFS, don't remember what it was. Will reproduce and post here to check if this is because of unsupported low-level POSIX operations.

I divided the same directory into four paths and it seems to have gotten past that phase. Will continue to see if it poses any problems.

@ehfd commented May 10, 2021

I got an error message of file I/O failure while using CephFS, don't remember what it was. Will reproduce and post here to check if this is because of unsupported low-level POSIX operations.

@Mysticial

Writing Hexadecimal Digits:
Time:    29376.495 seconds  ( 8.160 hours )
Base Converting:
Time:    27269.042 seconds  ( 7.575 hours )
Writing Decimal Digits:
Writing...    1,059,014,400,000  digits written
Time:    32105.920 seconds  ( 8.918 hours )

Destroying checkpoint with live checkpoint files.

Attempting to destroy a FileAllocator with active allocations.
List of Active Allocations:
    ycr_checkpoint57_0_Final.sf


Exception Encountered: FileException

In Function:        RawFile::load()
Error Code:         22
Path:   /scratch/cruncher/Pi - Dec - Chudnovsky.txt

lseek64() failed.

Invalid parameter. This can happen if the sector alignment is too small.

@Mysticial (Owner)

So it looks like writing the digits to disk also uses Raw I/O. And I don't see a way to turn off the Raw I/O.
It's also using a sector alignment of 4096 - which appears to be hardcoded with no way to adjust it.

Could this be the problem? The digit output doesn't get as much configurability attention as the swap paths since it's not as performance critical.
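
For reference, if "Raw I/O" maps to O_DIRECT on Linux (which the alignment requirement suggests), then the buffer address, the file offset, and the transfer size all have to be multiples of the sector alignment (4096 here). Filesystems or FUSE mounts that don't support O_DIRECT, or that need a coarser alignment, reject such calls with EINVAL, which would match the lseek64()/"Invalid parameter" error above. A minimal sketch of the constraint (illustrative only, not the y-cruncher code; the filename is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        const size_t align = 4096;                                           /* sector alignment */
        int fd = open("digits.out", O_WRONLY | O_CREAT | O_DIRECT, 0644);    /* example file     */
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf = NULL;
        if (posix_memalign(&buf, align, align) != 0) return 1;   /* buffer address must be aligned */
        memset(buf, '3', align);

        if (lseek(fd, 0, SEEK_SET) < 0) perror("lseek");   /* offset must be a multiple of align  */
        if (write(fd, buf, align) < 0) perror("write");    /* and so must the transfer size       */

        free(buf);
        close(fd);
        return 0;
    }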

@ehfd commented May 10, 2021

So it looks like writing the digits to disk also uses Raw I/O. And I don't see a way to turn off the Raw I/O.
It's also using a sector alignment of 4096 - which appears to be hardcoded with no way to adjust it.

Could this be the problem? The digit output doesn't get as much configurability attention as the swap paths since it's not as performance critical.

I would say there is a need to clear away all remnants of Raw I/O when it is set to false. I also saw something related to the RawFile namespace in an error message 57% into the Pi series computation of another run (1T hex and 1.2T dec digits). This means raw I/O also exists during the computation and is thus prone to errors.

I did some searching, and object storage systems support most POSIX operations, because not supporting them would lead to errors in other applications as well. It definitely shows an error for raw I/O underneath POSIX.

The next phase of Pi will definitely be nearly petascale, and traditional storage servers will not work out. It has to be parallel, using distributed parallel fault-tolerant systems or object storage systems. These filesystems usually do not support low-level raw I/O underneath the POSIX interface, because normal file reads/writes are done in the most efficient way underneath the distributed system and abstracted away at the user interface. This is arguably the "network computation" y-cruncher can utilize.

Could you do some searches or screening in the codebase to see if there are RawIO sections even when it is set to false?

@Mysticial (Owner)

The only things that use Raw I/O are the swap files and the output digits. Disabling Raw I/O in the swap configuration should completely disable it for the computation.

So I'd expect it to not work at all, or work completely. Not 57% in.

Off the top of my head, the only other things which touch the file system are:

  1. Checksum files for the swap files. No Raw I/O. Just the C FILE interface.
  2. Checkpoint files. Same as above.
  3. Config files. Didn't check, but these aren't used during a computation.

Adding configurability to the output files will take some work since there's no menu for it atm.

The next phase of Pi will definitely be nearly petascale, and traditional storage servers will not work out. It has to be parallel, using distributed parallel fault-tolerant systems or object storage systems.

What this calls for is a new Far Memory framework optimized for network storage. I foresaw this years ago and added the interface for it. Currently, there are two implementations: the legacy R03 and the newer Disk Raid 0.

The interface actually includes the gather/scatter operations which the math algorithms call directly, and these could potentially be forwarded to a filesystem implementation that natively supports them. Currently, both the R03 and Disk Raid 0 implementations break them up serially, relying only on NCQ for any optimization.

So it seems that things have finally caught up. Not sure if/when I'll get the time to look at the distributed filesystem APIs.

@ehfd commented May 14, 2021

@Mysticial

So I'd expect it to not work at all, or work completely. Not 57% in.

Error successfully reproduced.

Auto-Selecting: 13-HSW ~ Airi

/scratch/cruncher/y-cruncher/Binaries/13-HSW ~ Airi


Launching y-cruncher...
================================================================



Insufficient permissions to set thread priority. Please retry as root.

Further messages for this warning will be suppressed.

Checking processor/OS features...

Required Features:
    x64, ABM, BMI1, BMI2,
    SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
    AVX, FMA3, AVX2



Parsing Core -> Handle Mappings...
    Cores:  0-71

Parsing NUMA -> Core Mappings...
    Node  0:  0-17 36-53
    Node  1:  18-35 54-71

Constant:   Pi
Algorithm:  Chudnovsky (1988) (reduced memory)

Decimal Digits:       1,204,119,982,655
Hexadecimal Digits:   1,000,000,000,000

Computation Mode:     Swap Mode
Multi-Threading:      Push Pool  ->  144 / ?  (randomization on)
Far Memory Config:    Disk Raid 0:  /scratch/cruncher
Far Memory Tuning:    80.0 MiB/seek, no parallelism

Start Time: Tue May 11 08:33:09 2021

Working Memory...   668 GiB  (spread: 98%/2)
Twiddle Tables...   224 MiB  (spread: 97%/2)
I/O Buffers...     2.00 GiB  (spread: 100%/2)

This process does not have "CAP_IPC_LOCK". Page locking will not be possible.
Please run y-cruncher with elevation to enable page locking.

Begin Computation:

Series CommonP2B3...  84,906,918,325 terms  (Expansion Factor = 3.351)
Summing: 8%  ( 9 ) E ( 108,206,950,423 )
fallocate() is not supported on this volume. Expect performance degradation.

Path: : /scratch/cruncher/ycs-0/ycr_00005589c36fab50.sf

Further messages for this warning will be suppressed.

Summing: 54%  ( 2 ) S ( 653,314,534,035 ) 1C-R0 MA-
IoThread::thread_loop(): An exception was thrown from a worker thread.


System Exception Encountered: SystemException

In Function:        RawFile::set_size()
Error Code:         27

ftruncate() failed.

Checkpoint:

y-cruncher Checkpoint File - DO NOT EDIT!!!

///////////////////////////////////////////////////////////////////////////////
//  -3

Program Version:        0.7.8 Build 9507 (Linux/13-HSW ~ Airi)
Sequence Number:        31
Largest Checkpoint:     1189551828752

Hash:   (REDACTED)

///////////////////////////////////////////////////////////////////////////////
//  -2
{
    Constant : {
        Constant : "pi"
        Algorithm : "chudnovsky-reduced"
    }
    ComputeSize : {
        DecimalDigits : 1204119982655
        EnableHexDigits : "true"
    }
    Output : {
        Path : "/scratch/cruncher"
        OutputEnable : "true"
        DigitsPerFile : 0
    }
    OutputVerify : "true"
    Mode : "swap"
    Parallelism : {
        TaskDecomposition : 144
        Framework : "pushpool"
        WorkerThreads : 0
        Randomization : "true"
        MaxSequentialDispatch : 64
    }
    Allocator : {
        Allocator : "interleave-libnuma"
        LockedPages : "attempt"
        Nodes : (REDACTED)
    }
    Memory : 719407022080
    Checkpointing : {
        Enabled : "true"
        PostCheckpointCommand : ""
    }
    FarMemoryTuning : {
        BytesPerSeek : 83886080
        ParallelAccess : "none"
    }
    FarMemoryConfig : {
        Framework : "disk-raid0"
        InterleaveWidth : 262144
        BufferPerLane : 2147483648
        Checksums : "true"
        RawIO : "false"
        Lanes : [
            {   //  Lane 0
                Path : "/scratch/cruncher"
                BufferAllocator : {
                    Allocator : "interleave-libnuma"
                    LockedPages : "attempt"
                    Nodes : (REDACTED)
                }
                WorkerThreadCores : (REDACTED)
                WorkerThreadPriority : 2
            }
        ]
    }
}

///////////////////////////////////////////////////////////////////////////////
//  -1
seed              :     (REDACTED)
date_start        :     (REDACTED)
epoch_millis_start:     (REDACTED)
total.wall_time   :     (REDACTED)
total.user_time   :     (REDACTED)
total.kernel_time :     (REDACTED)
{
    bytes_peak : 3426291781112
    bytes_read : 33631799604928
    bytes_written : 30767304274864
}

///////////////////////////////////////////////////////////////////////////////
//  0
Frame:  NormalFloatSessionS::run()
Step:   0

///////////////////////////////////////////////////////////////////////////////
//  1
Frame:  Functions::Pi_Series_SwapFloatO::run()
Step:   0

///////////////////////////////////////////////////////////////////////////////
//  2
Frame:  BinarySplitting::CommonP2B3::PSR_SeriesS::run()
Step:   0

///////////////////////////////////////////////////////////////////////////////
//  3
Frame:  BinarySplitting::CommonP2B3::PSR_SeriesPartsS::run()
Step:   1
PSR:    2
Object: ycr_checkpoint24_2_P1.sf
    T   :       0
    exp :       81778118541
    L   :       33910373522
    AL  :       33910373526
    sign:       1
Object: ycr_checkpoint24_2_Q1.sf
    T   :       0
    exp :       81778118540
    L   :       33910373522
    AL  :       33910373525
    sign:       1

///////////////////////////////////////////////////////////////////////////////
//  4
Frame:  BinarySplitting::CommonP2B3::PSR_BlockS::run()
Step:   1
Index a:        26182149167
Index b:        38839313295
Object: ycr_checkpoint31_0_P0.sf
    T   :       0
    exp :       0
    L   :       31250000005
    AL  :       31250000009
    sign:       1
Object: ycr_checkpoint31_0_Q0.sf
    T   :       8
    exp :       3559827411
    L   :       27690172594
    AL  :       27690172597
    sign:       1
Object: ycr_checkpoint31_0_R0.sf
    T   :       0
    exp :       0
    L   :       21933058892
    AL  :       21933058897
    sign:       1

///////////////////////////////////////////////////////////////////////////////
//  2147483647
Tue May 11 08:33:09 2021        0.023   Working Memory
Tue May 11 08:35:08 2021        119.375 Working Memory:  668 GiB  (spread: 98%/2)
Tue May 11 08:35:08 2021        119.376 Twiddle Tables
Tue May 11 08:35:09 2021        119.504 Twiddle Tables:  224 MiB  (spread: 97%/2)
Tue May 11 08:35:09 2021        119.504 I/O Buffers
Tue May 11 08:35:09 2021        119.893 I/O Buffers: 2.00 GiB  (spread: 100%/2)
Tue May 11 08:35:09 2021        119.893 Begin Computation
Tue May 11 08:35:09 2021        119.893 Series CommonP2B3...  84,906,918,325 terms  (Expansion Factor = 3.351)
Tue May 11 08:35:09 2021        119.893 Series: A ( 40 ) 0.000%
Tue May 11 08:35:09 2021        119.944 Series: A ( 39 ) 0.000%
Tue May 11 08:35:10 2021        120.932 Series: A ( 38 ) 0.003%
Tue May 11 08:35:11 2021        122.136 Series: A ( 37 ) 0.006%
Tue May 11 08:35:12 2021        123.393 Series: E ( 36 ) 0.009%
Tue May 11 08:35:14 2021        124.706 Series: E ( 35 ) 0.012%
(REDACTED)
Tue May 11 10:12:20 2021        5951.121        Series: E ( 9 ) 8.986%
Tue May 11 10:47:24 2021        8054.836        Series: S ( 8 ) 11.604%
Tue May 11 12:04:53 2021        12703.692       Series: S ( 7 ) 14.987%
Tue May 11 13:52:01 2021        19132.291       Series: S ( 6 ) 19.363%
Tue May 11 16:42:49 2021        29380.122       Series: S ( 5 ) 25.025%
Wed May 12 03:16:01 2021        67371.952       Series: S ( 4 ) 32.359%
Wed May 12 17:41:45 2021        119316.004      Series: S ( 3 ) 41.876%
Thu May 13 15:31:41 2021        197912.246      Series: S ( 2 ) 54.257%

Strange to find that Chudnovsky (1988) (reduced memory) causes an error mid-computation, when plain Chudnovsky (1988) gets through this phase all the way to writing the decimal digits.

Maybe this requires a look at all the constants, custom formulas, and binary splitting algorithms.

@Mysticial (Owner) commented May 14, 2021

ftruncate() failed with POSIX error code 27 which means "The argument length is larger than the maximum file size. (XSI)".

I've never seen this before, but does your file system impose a size limit for files?

ftruncate() is not a low-level operation. But fallocate() is - which the program has a fallback for.
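
A sketch of what that fallback looks like in practice (a generic POSIX/Linux pattern, not the actual y-cruncher code): try to preallocate, and if the volume doesn't support it (as with the "fallocate() is not supported on this volume" warning in the log above), fall back to ftruncate(), which merely sets the logical file size and which essentially every filesystem is expected to handle:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int set_file_size(int fd, off_t bytes) {
        if (fallocate(fd, 0, 0, bytes) == 0)             /* preallocate: performance only   */
            return 0;
        if (errno == EOPNOTSUPP || errno == ENOSYS)      /* e.g. a FUSE volume without it   */
            return ftruncate(fd, bytes);                 /* essential: set the logical size */
        return -1;
    }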

@ehfd commented May 15, 2021

While it is not a low-level operation, this is probably something that object storage does not support well.
Object storage stores files as objects, not blocks. Similar POSIX operations that assume the file system can access blocks directly would also not work out on object storage.
These things are normally handled by the underlying file system itself (they have their own ways of clearing out files and other dirty work), and thus in FUSE only basic interfaces are exposed (file read/write/append, create new file, mkdir, list, remove, rename, chmod/chown, soft/hard link, df -h).

https://indico.cern.ch/event/304944/contributions/1672715/attachments/578894/797101/dfs.pdf

@Mysticial (Owner)

fallocate() is a performance optimization. ftruncate() is not. It is an essential part of the interface, as it is the only way to set a file to a specific size, and parts of the program (especially the digit output) require the concept of a file size.

The problems here seem to be stemming from a non-filesystem trying to fake itself as a filesystem, but coming up short on multiple fronts. There's only so much that can be done to bridge gaps like that before it simply won't work.

Both the R03 and Raid0 far memory implementations are filesystem implementations that use a filesystem interface and thus require that the underlying system adequately supplies that interface. If it can't, then these cannot be used and something else is needed. Which in this case means you're reaching the limit of y-cruncher's current capability, as it doesn't have any far-memory implementation for such object storage.

@ehfd commented May 15, 2021

Alright, understood.
While CephFS emulates POSIX best among object storage systems, distributed file systems that are not object storage systems, like BeeGFS, Lustre, and GlusterFS, work well with ftruncate. I will focus on these file systems for now.

Other object storage systems are inherently based on S3. Most FUSE implementations they provide are wrappers around S3. If you could implement a new far-memory configuration, S3 itself could be a target, since there is an increasing trend of libraries, including Spack and TensorFlow, directly accessing the S3 protocol. But this requires a LOT of development.

@ehfd commented Jun 9, 2021

The relevant code is:

https://github.com/Mysticial/y-cruncher/blob/master/trunk/Source/DigitViewer2/DigitWriters/BasicTextWriter.cpp#L34

which calls:

https://github.com/Mysticial/y-cruncher/blob/master/trunk/Source/PublicLibs/SystemLibs/FileIO/RawFile/RawFile_Linux.h#L43

It sets the sector alignment to FileIO::DEFAULT_FILE_ALIGNMENT_K which is 2^12 = 4096 and leaves the raw_io defaulting to true.

The error is then thrown from:

https://github.com/Mysticial/y-cruncher/blob/master/trunk/Source/PublicLibs/SystemLibs/FileIO/RawFile/RawFile_Linux.ipp#L467

I tried y-cruncher on a block storage layer (RADOS Block Device) formatted with XFS on top of my object storage (Ceph) system instead of the CephFS FUSE interface; it is 3-4 times slower on the same infrastructure. Could you make a quick patch so that if RawIO is off for far memory, the digit writer also turns it off? People playing with distributed filesystems could then get by with this for a long time, until a new interface comes up.

Then, in the long term, an interface connecting directly to the S3 and/or Swift API could be considered as the object storage movement expands.

@ehfd commented Jun 9, 2021

ftruncate() failed with POSIX error code 27 which means "The argument length is larger than the maximum file size. (XSI)".

I've never seen this before, but does your file system impose a size limit for files?

ftruncate() is not a low-level operation. But fallocate() is - which the program has a fallback for.

And at the same time, could you reconfirm something? When using Chudnovsky (reduced) with RawIO off in the settings (leaving the ftruncate error message aside for now), is y-cruncher supposed to use this RawFile::set_size() operation shown in my error message, which seems to be a method from the RawFile namespace, in other algorithms that are not "reduced" as well?
I looked at a commit in the code base of the file system I am using, and I might have focused on an issue not totally related to this.
I am asking because, even if ftruncate is officially supported in a nearly POSIX file system, the error on ftruncate could show up because the program was in the middle of another operation that is considered raw, with the ftruncate call thrown into the midst of that. It would then also be understandable that my non-reduced Chudnovsky computation went all the way to the end on CephFS, apart from the digit-writing interface, if RawFile::set_size() were only a thing for reduced algorithms.

@Mysticial (Owner)

Could you make a quick patch so that if RawIO is off for far memory, the digit writer also turns it off?

There's a new build that I'm still testing that will add RawIO configurability to the digit output. Give me a couple weeks as I'll be traveling.

There's no API usage difference between the normal and reduced algorithms. set_size() would be called everywhere - on every single swap file.

The only difference would be the sizes and instances of the allocations. I don't recall off the top of my head, but the reduced algorithms may be creating larger files (though fewer of them since the overall storage usage is less). So if that pushes over the limit, then it could be the cause of this error.

@ehfd commented Jun 12, 2021

Great, after the new build is out, I will continue testing and see if issues are fixed.

@Mysticial (Owner)

Just an update. The new build is taking longer than usual. Since the project has been on hiatus for about 2 years now, rolling forward 2 years of compiler updates isn't that straightforward. So I'm currently running my "long" suite of integration tests.

@ehfd commented Nov 29, 2021

ftruncate() failed with POSIX error code 27 which means "The argument length is larger than the maximum file size. (XSI)".

I've never seen this before, but does your file system impose a size limit for files?

ftruncate() is not a low-level operation. But fallocate() is - which the program has a fallback for.

I can finally confirm that this error was indeed a file size limit issue. There was a configuration related to max file size in Ceph that we didn't know about.
