Do we have an IO issue on storage node? #170

Closed
benoit74 opened this issue Mar 5, 2024 · 6 comments
Labels
question Further information is requested

Comments

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Global overview of the situation, only some food for thought for now

[monitoring graph screenshots]

dev-library consumes about 100 to 180 IOPS (Read+Write) and 10 to 25 MB/s (Read+Write)

[monitoring graph screenshot]

dev-library-generator is quite fast (2-3 mins) but consumes even more.

[monitoring graph screenshot]

As a comparison, each library-data (prod serving ZIMs) consumes 3-4 IOPS on average (Read+Write, with some peaks at 10 to 30) and about 1 MB/s (Read+Write, with some peaks at 4)

[monitoring graph screenshot]

But rsyncd is even more intensive

[monitoring graph screenshots]
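
To give an order of magnitude for the observations above, here is a rough back-of-the-envelope comparison of dev-library against a single library-data instance, using the approximate (and fluctuating) figures quoted above:

```python
# Rough comparison of dev-library vs a single library-data instance,
# using the approximate Read+Write figures quoted above (they fluctuate).
dev_iops, dev_mbps = (100, 180), (10, 25)    # dev-library
prod_iops, prod_mbps = (3, 4), (1, 4)        # library-data (avg ~1 MB/s, peaks at 4)

print(f"IOPS ratio:       {dev_iops[0] / prod_iops[1]:.0f}x to {dev_iops[1] / prod_iops[0]:.0f}x")
print(f"Throughput ratio: {dev_mbps[0] / prod_mbps[0]:.0f}x to {dev_mbps[1] / prod_mbps[0]:.0f}x")
# -> roughly 25-60x more IOPS and 10-25x more MB/s than one prod library-data
```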

One idea from @rgaudin: should we move the prod library (the most time-sensitive application on this server) to a new server, with prod ZIMs mirrored from storage, where the service could be quieter? (it would only need about 4G of ZIMs, no need for the double copy, no need for dev ZIMs, nightlies, ...)
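
To illustrate the idea (not a proposal for the actual implementation), such a server would basically only need a periodic one-way pull from the storage node; the host, rsync module and paths below are made-up placeholders:

```python
#!/usr/bin/env python3
"""Sketch of a one-way mirror of prod ZIMs onto the hypothetical new server.
The rsync source URL and target path are placeholders, not our real ones."""
import subprocess

SOURCE = "rsync://storage.example.org/zims/"  # assumed rsyncd module on the storage node
TARGET = "/data/zims/"                        # assumed local path on the new server

# -a preserves attributes, --delete drops ZIMs removed upstream,
# --partial resumes interrupted transfers; run from a cron job / systemd timer.
subprocess.run(["rsync", "-a", "--delete", "--partial", SOURCE, TARGET], check=True)
```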

@benoit74 benoit74 added the question Further information is requested label Mar 5, 2024
@kelson42
Contributor

kelson42 commented Mar 5, 2024

@benoit74 You mean 4TB of ZIM I guess for the prod library?

@rgaudin
Member

rgaudin commented Mar 5, 2024

Current prod library is 4.23TiB

@kelson42
Contributor

kelson42 commented Mar 5, 2024

One idea from @rgaudin: should we move the prod library (the most time-sensitive application on this server) to a new server, with prod ZIMs mirrored from storage, where the service could be quieter?

If there is no obvious technical optimisation in view, this looks like the logical approach. But we should have more buffer and probably plan for around 8TB in at least RAID5. How much would that cost?

@benoit74
Collaborator Author

benoit74 commented Mar 5, 2024

If there is no obvious technical optimisation in view, this looks like the logical approach.

As I said, this is only food for thought for now. Having thought a bit (I just had a shower 🤣) I think we have other tracks to follow:

  • move to another server with an SSD cache (NVMe preferably); dm-cache (LVM) seems to be a standard solution to handle this at the device-mapper level (i.e. it has no knowledge of the filesystem, it works on blocks, usually 4096 B); since we are mostly reading only some parts of the ZIMs, mostly only fresh ones, and significantly only their metadata, this could be a valuable solution which would avoid having a second machine to maintain, mirroring between the ZIM upload and the prod library, ... (see the sketch after this list)
  • limit access to the dev library (user/password, IP whitelisting, whatever): I'm quite sure some folks are abusing the dev library for whatever purpose, I do not have any other rational explanation for why it is consuming so much IO; it could at least be worth a try for a few hours (days?) to observe the impact; I'm not a big fan of limiting access to systems, but at some point, when a non-prod system impacts us more than is acceptable, solutions must be found
  • clean up the dev library (not sure it would be a game changer, dev is "only" 4.3TiB, the whole hidden directory is 5.8T)
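
For the first bullet, a minimal sketch of what attaching an NVMe cache with LVM's dm-cache would look like; the device, VG/LV names and cache size are assumptions for illustration, not our actual layout:

```python
#!/usr/bin/env python3
"""Minimal sketch: attach an NVMe dm-cache (LVM cache) in front of the ZIM volume.
Device, volume group, LV names and cache size below are assumptions."""
import subprocess

NVME_DEV = "/dev/nvme0n1"   # assumed NVMe device
VG = "vg_storage"           # assumed volume group holding the ZIM volume
ORIGIN_LV = "zims"          # assumed logical volume serving the ZIMs
CACHE_SIZE = "500G"         # assumed cache size

def run(*cmd: str) -> None:
    """Run an LVM command, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. make the NVMe device a physical volume and add it to the existing VG
run("pvcreate", NVME_DEV)
run("vgextend", VG, NVME_DEV)

# 2. create a cache pool on the NVMe PV
run("lvcreate", "--type", "cache-pool", "-L", CACHE_SIZE,
    "-n", f"{ORIGIN_LV}_cache", VG, NVME_DEV)

# 3. attach the cache pool to the origin LV; writethrough keeps the HDDs
#    authoritative, so losing the NVMe loses only the cache, not data
run("lvconvert", "--type", "cache",
    "--cachepool", f"{VG}/{ORIGIN_LV}_cache",
    "--cachemode", "writethrough",
    f"{VG}/{ORIGIN_LV}")
```

The interesting part for our read pattern is that dm-cache promotes frequently accessed blocks, so fresh ZIMs and their metadata should naturally end up on the NVMe.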

But we should have more buffer and probably plan for around 8TB in at least RAID5. How much would that cost?

  • RAID is probably not mandatory since it would be a mirror; RAID5 is probably rather a maximum for perf + resilience (a quick sizing sketch follows this list)
  • cost: I would say $80-120/month at Hetzner, especially if we accept running on an "old" (~2020) machine; no idea for Scaleway
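
As a quick sizing sanity check (example disk counts and sizes only, not a quote):

```python
# Quick sizing sanity check for ~8 TB usable; example layouts only.
def raid1_usable(disk_tb: float) -> float:
    """Two-disk mirror: usable capacity is one disk's worth."""
    return disk_tb

def raid5_usable(n_disks: int, disk_tb: float) -> float:
    """RAID5: one disk's worth of capacity goes to parity."""
    return (n_disks - 1) * disk_tb

print(raid1_usable(8.0))      # 2 x 8 TB mirror -> 8 TB usable
print(raid5_usable(3, 4.0))   # 3 x 4 TB RAID5  -> 8 TB usable, survives 1 disk loss
print(raid5_usable(4, 4.0))   # 4 x 4 TB RAID5  -> 12 TB usable, more buffer
```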

@benoit74
Collaborator Author

benoit74 commented Oct 4, 2024

Duplicate of #227, solved by moving the workload to another cloud provider.

@benoit74 benoit74 closed this as completed Oct 4, 2024
@benoit74 benoit74 unpinned this issue Oct 4, 2024
@rgaudin
Member

rgaudin commented Oct 4, 2024

Ah, I wanted to comment that we haven't enabled SSD, but there's #246 just for that :)
