Why does the documentation suggest "tiered caching" require 2 separate pools? I'm looking to replace unraid with mergerfs #1081

TheLinuxGuy · 2022-10-23T09:48:52Z

TheLinuxGuy
Oct 23, 2022

I think the documentation explaining "tiered caching" is a bit unclear: it says two mergerfs pools must be set up to achieve a simple cache, but it doesn't explain why, or whether that would be prudent for my use case.

My use case: I would like to replace unraid with mergerfs. I have an NVME disk "pool" where all my first writes go; later, the NVME data gets moved to the slower spinning disks. The top-level folder in mergerfs would be shared via NFS/SMB to clients on the network.

The mergerfs documentation makes it clear that some sort of script to move files around with rsync needs to be set up, but what is unclear is:

1. Why do I need two mergerfs pools? Can't the NVME disk just be in the same fstab mount entry as all the other, slower disks, just listed first?
2. If I need two pools, how do these pools relate to each other (or how are they supposed to be layered) when my goal is to transparently have a folder that spans both the NVME and the slower disks?
3. If I can get away with a single mergerfs pool where writes always go to the NVME by default, and a later rsync script moves the files to a different physical disk to free up NVME space, could this break files that are still open or being read off the NVME? The example script doesn't seem to restrict itself to files that are not in use or locked.

Answered by TheLinuxGuy

Nov 6, 2022

trapexit · 2022-10-23T23:41:52Z

trapexit
Oct 23, 2022
Maintainer

1. Because mergerfs doesn't support moving files like that. Of course you can change the policy and branch listings, but then what? Once the SSDs fill, you need the files moved to the slower tier. There isn't a built-in feature for such things yet. Hence rsync and multiple pools.
2. One pool is all branches and one is just the slower branches.
3. POSIX filesystems don't work like FAT or NTFS. Locks are almost always advisory, and unlinking an open file is totally fine.
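
For anyone who wants to see that last point in action, here is a minimal Python sketch. It is not part of mergerfs, the function name is made up for illustration, and it assumes a Linux or other POSIX system. It opens a file, unlinks it, and then still reads the data through the descriptor that was already open.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a temp file, unlink it, then read it back via the open fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                   # remove the directory entry
        assert not os.path.exists(path)   # the name is gone...
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))  # ...but the open descriptor still works
    finally:
        os.close(fd)


print(read_after_unlink(b"still readable"))
```

The same mechanism is why removing the source file after a move should not cut off a reader that already has that file open: the data stays reachable until the last descriptor is closed.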

The reason for two pools is that you have to answer the question: "How do you choose where a file is created when moving it from the fast tier to the slow tier?" Using mergerfs as the answer is the obvious choice. You set the create policy on the slow-tier pool and just write to that pool.
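
A rough sketch of that idea in Python, for illustration only. In practice rsync does this job; the function name `move_tree` and the `.part` temp suffix are hypothetical, and the two directories stand in for the fast branch (for example /cache) and the mount point of the slow-only pool (for example /mnt/slow-storage). The point is that the mover reads from the fast branch directly and writes through the slow pool's mount point, so that pool's create policy decides which physical disk receives each file.

```python
import os
import shutil
import tempfile
from pathlib import Path


def move_tree(fast_branch: str, slow_pool: str) -> list:
    """Move regular files from fast_branch to the same relative path in slow_pool."""
    src_root, dst_root = Path(fast_branch), Path(slow_pool)
    moved = []
    # Collect the file list first so the tree is not modified while walking it.
    files = sorted(p for p in src_root.rglob("*")
                   if p.is_file() and not p.is_symlink())
    for src in files:
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        # Writing through the slow pool's mount point is what lets its
        # create policy pick the destination disk.
        dst.parent.mkdir(parents=True, exist_ok=True)
        tmp = dst.with_name(dst.name + ".part")  # hypothetical temp suffix
        shutil.copy2(src, tmp)   # copy data and basic metadata
        os.replace(tmp, dst)     # rename into place once the copy is complete
        src.unlink()             # only now free the space on the fast tier
        moved.append(rel.as_posix())
    return moved


if __name__ == "__main__":
    # Demo on throwaway directories instead of real mount points.
    with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
        folder = Path(fast, "movies", "second_level_folder")
        folder.mkdir(parents=True)
        (folder / "file.mkv").write_bytes(b"data")
        print(move_tree(fast, slow))
```

With a single pool there would be no separate path to write to that excludes the fast branch, which is the gap the second pool fills.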

3 replies

TheLinuxGuy Nov 1, 2022
Author

Thank you for this feedback. Yes, I am learning some concepts as I work on building my own 'unraid' replacement.

I'm documenting my setup in https://github.com/TheLinuxGuy/free-unraid/blob/main/mergerfs.md if you don't mind taking a look at my /etc/fstab - I think it is implementing your feedback.

I haven't had time to do a lot of performance benchmarks, but I am seeing some slowness when writing to mergerfs mount using 'fio' locally on the same system.

Do you have experience using fio for benchmarking mergerfs? If so do you have any 'bare-minimum, most performant' /etc/fstab setting for me to try? I was reading your documentation on performance and have tried a few things recommended there like disabling security and whatnot - the performance penalty seems to be 50% when using mergerfs mount (vs. zfs pool original mount /cache)

This is the hybrid/cache pool

/cache:/mnt/slow-storage /mnt/cached fuse.mergerfs defaults,nonempty,allow_other,use_ino,noforget,inodecalc=path-hash,security_capability=false,cache.files=partial,category.create=lfs,moveonenospc=true,dropcacheonclose=true,minfreespace=4G,fsname=mergerfs 0 0

trapexit Nov 1, 2022
Maintainer

I really don't have any more to say outside what I already explain in the docs.

There is no "bare-minimum, most performant" setup. Everything depends on usage patterns, kernel support, etc. You have to define a specific metric and then I can comment.

No, I don't use fio for testing, but if you articulate specifically how you have it set up I can comment.

TheLinuxGuy Nov 4, 2022
Author

Okay. Here's my benchmark setup and comparison notes/results : https://github.com/TheLinuxGuy/free-unraid/blob/main/performance_benchmarks.md

Note my surprising results: mergerfs + ZFS = performance penalty. If I use btrfs instead I only see a 15% performance penalty. Would you mind trying this same test yourself and letting me know if I should file a bug for the ZFS behavior (I assume it's unexpected)?

trapexit · 2022-11-04T12:52:51Z

trapexit
Nov 4, 2022
Maintainer

There aren't going to be any "bugs" per se. Performance is always going to differ depending on a number of factors, as I mention in the docs. I don't know enough about fio to comment on what might be the issue. I will have to look at the code and/or IO access patterns. It could simply be that ZFS has higher latency for some commands, and since certain actions are sequential, higher latency means lower throughput.

1 reply

TheLinuxGuy Nov 5, 2022
Author

After some more debugging, I found that with a newer Linux kernel and an upgrade to ZFS 2.1.4, the performance penalty observed in a mergerfs+ZFS configuration is now gone. Thanks for the pointer here.

A final ask: here is my "tiered storage with mergerfs" setup. Would you mind taking a look to see if it makes sense?

# mergerfs - stitch all slow disks.
/mnt/disk* /mnt/slow-storage fuse.mergerfs defaults,nonempty,allow_other,use_ino,category.create=eplus,cache.files=off,moveonenospc=true,dropcacheonclose=true,minfreespace=300G,fsname=mergerfs 0 0
# mergerfs - fast nvme cache w/ NFS settings.
/cache:/mnt/slow-storage /mnt/cached fuse.mergerfs defaults,nonempty,allow_other,use_ino,noforget,inodecalc=path-hash,security_capability=false,cache.files=partial,category.create=lfs,moveonenospc=true,dropcacheonclose=true,minfreespace=4G,fsname=mfs-cache 0 0

I think what I need to do is create the top-level folders in /cache ("movies", "tv", "iso"). This should ensure that any new write, say a copy of "movies/second_level_folder/third_level/file.mkv", is stored on the /cache drive. It is my understanding that the /cache disk does not need to hold the entire branch hierarchy on the SSD. In other words, the folder /cache/movies/second_level_folder/third_level/ does not need to exist: mergerfs will create the missing hierarchy ("second_level_folder/third_level/") when writing file.mkv to the SSD (/cache) storage.

My sync process (probably via rsync) can then clean out the SSD /cache drive by moving everything inside "movies/." to the slower disks at /mnt/slow-storage.

The /mnt/slow-storage disks will fill up evenly under eplus (existing path, least used space). It does not matter yet whether the /mnt/slow-storage/movies folder exists on each of /dev/disk{1...4}, since mergerfs would create the folders for me. Is this right?

Finally, when a newly formatted /dev/disk5 gets added to /mnt/slow-storage, it will have the least used space of all /dev/disk*, so mergerfs will create the full path it needs and write "movies/second_level_folder/third_level/file.mkv" to /dev/disk5/movies/second_level_folder/third_level/file.mkv.

trapexit · 2022-11-05T21:22:21Z

trapexit
Nov 5, 2022
Maintainer

Glad to hear updates helped. FUSE and ZFS change somewhat regularly and versions can really matter.

Re: creating paths... that only matters if you're using a config that restricts creation by path, i.e. the "ep*" policies. See the docs for details on policies. To be clear: policies choose the branch. Path creation comes afterwards, based on that selection. Policies don't decide what gets created; they choose where. If the path doesn't exist on the chosen branch, it is always created after that selection.
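
To make the "select first, create the path afterwards" order concrete, here is a toy Python model. It is a simplification and not mergerfs code: the `Branch` class, the `choose_branch` function, and the numbers are invented for illustration, and only the lfs, lus, mfs policies and their "ep" variants are modelled. An "ep" policy first drops every branch where the parent directory does not already exist, then the rest of the policy picks among what is left.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    name: str
    used: int                                # space used (any consistent unit)
    free: int                                # space available
    dirs: set = field(default_factory=set)   # relative directories present


def choose_branch(policy, branches, parent_dir, min_free=0):
    """Pick a branch for a new file under parent_dir, then create the path there."""
    candidates = [b for b in branches if b.free >= min_free]
    if policy.startswith("ep"):
        # Path-preserving: only branches that already have the directory.
        candidates = [b for b in candidates if parent_dir in b.dirs]
        policy = policy[2:]
    if not candidates:
        raise FileNotFoundError(f"no branch satisfies the policy for {parent_dir}")
    rank = {
        "lfs": lambda b: b.free,    # least free space
        "lus": lambda b: b.used,    # least used space
        "mfs": lambda b: -b.free,   # most free space
    }[policy]
    chosen = min(candidates, key=rank)
    chosen.dirs.add(parent_dir)     # the path is created after the selection
    return chosen


disks = [
    Branch("disk1", used=3000, free=1000, dirs={"movies"}),
    Branch("disk2", used=1000, free=3000, dirs={"movies"}),
    Branch("disk5", used=0, free=4000),   # newly formatted, no directories yet
]
print(choose_branch("eplus", disks, "movies").name)  # disk5 is not a candidate
print(choose_branch("lus", disks, "movies").name)    # disk5 is a candidate
```

In this model the empty disk5 is never picked by eplus because "movies" does not exist on it yet, while the non-"ep" policy lus picks it and the directory is created there afterwards.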

1 reply

TheLinuxGuy Nov 6, 2022
Author

> To be clear... policies choose the branch. The path creation comes after based on selection.

Just so I don't get confused: the policy that "finds" or "selects" the branch within the mergerfs mount (e.g. /dev/disk2 vs. /dev/disk1) is category.create, and not the category.search policy, correct?

Or is it implied (in some cases only)?

Example: category.create=lfs is described as "Search: Same as eplfs. Action: Same as eplfs.", so setting the single option category.create=lfs sets the other two values implicitly. This is only true for policies whose description states what is implied.

trapexit · 2022-11-06T03:00:18Z

trapexit
Nov 6, 2022
Maintainer

No.

https://github.com/trapexit/mergerfs#functions-categories-and-policies

All relevant functions have a policy. A category is just a set of functions.

0 replies

TheLinuxGuy · 2022-11-06T04:21:07Z

TheLinuxGuy
Nov 6, 2022
Author

Okay, I think I got it now. The short answer to my original question, "Why does the documentation suggest 2 separate pools", is that the recommended rsync scripts need to write to the second pool's path, and the create policy (category.create) of the secondary (slower-disk) pool is the one we want applied to what the script writes.

If there were a single mergerfs pool, we could not really 'load balance' the data off the NVME disks onto the slower drives via the magic of mergerfs. So the second pool is simply there to facilitate data-moving operations (or, when one wants to skip the cache completely, to write to the specific path that mergerfs has for the slow disks only).

This is working for me now on my test bed. The only concern I have left is that rsync -axqHAXWESR --preallocate --remove-source-files is a silent command and doesn't seem to be verbose at all (even with the --progress and --stats flags enabled). But that's a separate issue from my original question in this discussion. Thanks again for all your support and for clarifying things.

0 replies

trapexit · 2022-11-06T16:27:28Z

trapexit
Nov 6, 2022
Maintainer

Yes, you need some logic to choose where files are moved to. mergerfs doesn't yet have the ability to move files actively so two pools are used to give you placement logic and rsync used to transfer. That's it.

Yes, it is quiet because the command is told to be quiet: -q, i.e. --quiet, "suppress non-error messages".

https://linux.die.net/man/1/rsync
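
A small Python sketch of the difference, under a few assumptions. The helper name `mover_argv` is made up, the source and destination arguments are placeholders (the "/./" form with -R is my assumption about how the paths would be passed, not something from the docs), and `--info=progress2` needs rsync 3.1 or newer. The function only builds the argument list and prints it; it does not run rsync.

```python
def mover_argv(src: str, dst: str, quiet: bool = False) -> list:
    """Build the thread's rsync command line, with or without quiet mode."""
    flags = "-axHAXWESR"   # the flags from the thread with the 'q' removed
    argv = ["rsync", flags, "--preallocate", "--remove-source-files"]
    if quiet:
        argv.insert(1, "--quiet")                # same effect as the original -q
    else:
        argv += ["--info=progress2", "--stats"]  # overall progress plus a summary
    return argv + [src, dst]


# Placeholder paths; adjust to the real branch and pool mount points.
print(" ".join(mover_argv("/cache/./", "/mnt/slow-storage/")))
```

The list form can be handed to `subprocess.run` as-is once the paths have been checked.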

8 replies

TheLinuxGuy Nov 6, 2022
Author

> I'm not sure exactly why you built it that way. With the manual removal and such after. rsync should handle all these concerns.

I took the rsync options from unraid's own script (https://gist.github.com/fabioyamate/4087999).

I'm running a test workload: my old NAS is pushing data with rsync over an NFS mount to this test server running mergerfs. I am observing what happens when the NVME 256GB disks fill up, and I want to optimize for purging the cache drives quickly.

With the example rsync -axqHAXWESR --preallocate --remove-source-files running in parallel with NFS copies onto a full NVME disk, evicting data from the NVME seemed to take quite a long time.

Stopping the NFS file transfers and monitoring disk IO showed that rsync -axqHAXWESR --preallocate --remove-source-files was not copying data at the full available speed of the destination slow disk.
I also noticed that rsync would only delete files and free up space after completion. That would be fine, except that my writes to the slow disk ran at 30% of the total write performance available, which slowed down the "clone" of my 30TB libraries onto this test server.

> Perhaps it would be good for me to explain the script options and why they were chosen in the docs.

Yes this would help. I actually deconstructed your options using the rsync-manual and compared it to my options: https://github.com/TheLinuxGuy/free-unraid/blob/main/storage_tiered_cache.md

TheLinuxGuy Nov 6, 2022
Author

As I read more about --inplace, it looks like it is only safe to use when the file is not open. The unraid script first checks whether a file is open (using the fuser command) and copies it only if it is not. This may be why? https://explainshell.com/explain?cmd=rsync+--inplace

trapexit Nov 6, 2022
Maintainer

Checking whether a file is open prior to copying doesn't mean much. Files can be opened at any time. Unless you are using non-standard locking, this is fake security. That's why rsync compares the files before and after the transfer; otherwise you cannot guarantee anything. Not only that, but --inplace is not atomic by definition. As it says in the docs: "WARNING: you should not use this option to update files that are being accessed by others, so be careful when choosing to use this for a copy."

inplace invites having a corrupted setup where you have 2 instances of the file in the pool and depending on your settings that file could be accessed while in the middle of being copied.
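
A hedged Python sketch of the alternative to "check if open, then copy". This is not rsync's actual algorithm (rsync uses its own verification); comparing size and mtime before and after is a cheaper stand-in chosen for illustration, and the function name and `.part` suffix are hypothetical. The copy goes to a temporary name and is only renamed into place if the source looks unchanged, so a half-written file never appears under the final name.

```python
import os
import shutil


def copy_if_stable(src: str, dst: str) -> bool:
    """Copy src to dst via a temp file; commit only if src did not change meanwhile."""
    before = os.stat(src)
    tmp = dst + ".part"        # hypothetical temp name in the same directory as dst
    shutil.copy2(src, tmp)
    after = os.stat(src)
    changed = (before.st_size, before.st_mtime_ns) != (after.st_size, after.st_mtime_ns)
    if changed:
        os.remove(tmp)         # source was modified during the copy: discard it
        return False
    os.replace(tmp, dst)       # atomic rename within one filesystem
    return True
```

A caller would remove the source only when this returns True, and retry later otherwise.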

TheLinuxGuy Nov 6, 2022
Author

> inplace invites having a corrupted setup where you have 2 instances of the file in the pool and depending on your settings that file could be accessed while in the middle of being copied.

This makes sense. Do you have any suggestions on how I can troubleshoot why rsync -axqHAXWESR --preallocate --remove-source-files is very slow? 40MB/s writes vs. 150 MB/s disk writes available even when no other IO except this command is running. I am not sure which one of these options may be causing the behavior and if it can be dropped.

trapexit Nov 6, 2022
Maintainer

I'll have to look them over. It's been years since I wrote all that.
