Do we have an IO issue on storage node? #170

Closed
benoit74 opened this issue Mar 5, 2024 · 6 comments
Labels
question Further information is requested

Comments

@benoit74
Collaborator

benoit74 commented Mar 5, 2024

Global overview of the situation, only some food for thought for now

[monitoring graph screenshots]

dev-library consumes about 100 to 180 IOPS (Read+Write) and 10 to 25 MB/s (Read+Write)

[monitoring graph screenshot]

dev-library-generator is quite fast (2-3 mins) but consumes even more.

[monitoring graph screenshot]

As a comparison, each library-data (prod serving ZIMs) consumes 3-4 IOPS on average (Read+Write, with some peaks at 10 to 30) and about 1 MB/s (Read+Write, with some peaks at 4)

[monitoring graph screenshot]

But rsyncd is even more intensive

[monitoring graph screenshots]
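
To give an order of magnitude for the observations above, here is a rough back-of-the-envelope comparison of dev-library against a single library-data instance, using the approximate (and fluctuating) figures quoted above:

```python
# Rough comparison of dev-library vs a single library-data instance,
# using the approximate Read+Write figures quoted above (they fluctuate).
dev_iops, dev_mbps = (100, 180), (10, 25)    # dev-library
prod_iops, prod_mbps = (3, 4), (1, 4)        # library-data (avg ~1 MB/s, peaks at 4)

print(f"IOPS ratio:       {dev_iops[0] / prod_iops[1]:.0f}x to {dev_iops[1] / prod_iops[0]:.0f}x")
print(f"Throughput ratio: {dev_mbps[0] / prod_mbps[0]:.0f}x to {dev_mbps[1] / prod_mbps[0]:.0f}x")
# -> roughly 25-60x more IOPS and 10-25x more MB/s than one prod library-data
```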

One idea from @rgaudin: should we move the prod library (the most time-sensitive application on this server) to a new server, with prod ZIMs mirrored from storage, where the service could be quieter? (it would only need about 4G of ZIMs, no need for the double copy, no need for dev ZIMs, nightlies, ...)
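
To illustrate the idea (not a proposal for the actual implementation), such a server would basically only need a periodic one-way pull from the storage node; the host, rsync module and paths below are made-up placeholders:

```python
#!/usr/bin/env python3
"""Sketch of a one-way mirror of prod ZIMs onto the hypothetical new server.
The rsync source URL and target path are placeholders, not our real ones."""
import subprocess

SOURCE = "rsync://storage.example.org/zims/"  # assumed rsyncd module on the storage node
TARGET = "/data/zims/"                        # assumed local path on the new server

# -a preserves attributes, --delete drops ZIMs removed upstream,
# --partial resumes interrupted transfers; run from a cron job / systemd timer.
subprocess.run(["rsync", "-a", "--delete", "--partial", SOURCE, TARGET], check=True)
```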

@benoit74 benoit74 added the question Further information is requested label Mar 5, 2024
@kelson42
Contributor

kelson42 commented Mar 5, 2024

@benoit74 You mean 4TB of ZIM I guess for the prod library?

@rgaudin
Member

rgaudin commented Mar 5, 2024

Current prod library is 4.23TiB

@kelson42
Contributor

kelson42 commented Mar 5, 2024

One idea from @rgaudin: should we move the prod library (the most time-sensitive application on this server) to a new server, with prod ZIMs mirrored from storage, where the service could be quieter?

If there is no obvious technical optimisation in view, this looks like the logical approach. But we should have more buffer and probably plan for around 8TB in at least RAID5. How much would that cost?

@benoit74
Collaborator Author

benoit74 commented Mar 5, 2024

If there is no obvious technical optimisation in view, this looks like the logical approach.

As I said, this is only food for thought for now. Having thought a bit (I just had a shower 🤣) I think we have other tracks to follow:

  • move to another server with an SSD cache (NVMe preferably); dm-cache (LVM) seems to be a standard solution to handle this at the device-mapper level (i.e. it has no knowledge of the filesystem, it works on blocks, usually 4096 B); since we are mostly reading only some parts of the ZIMs, mostly only fresh ones, and significantly only their metadata, this could be a valuable solution which would avoid having a second machine to maintain, mirroring between the ZIM upload and the prod library, ... (see the sketch after this list)
  • limit access to the dev library (user/password, IP whitelisting, whatever): I'm quite sure some folks are abusing the dev library for whatever purpose, I do not have any other rational explanation for why it is consuming so much IO; it could at least be worth a try for a few hours (days?) to observe the impact; I'm not a big fan of limiting access to systems, but at some point, when a non-prod system impacts us more than is acceptable, solutions must be found
  • clean up the dev library (not sure it would be a game changer, dev is "only" 4.3TiB, the whole hidden directory is 5.8T)
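
For the first bullet, a minimal sketch of what attaching an NVMe cache with LVM's dm-cache would look like; the device, VG/LV names and cache size are assumptions for illustration, not our actual layout:

```python
#!/usr/bin/env python3
"""Minimal sketch: attach an NVMe dm-cache (LVM cache) in front of the ZIM volume.
Device, volume group, LV names and cache size below are assumptions."""
import subprocess

NVME_DEV = "/dev/nvme0n1"   # assumed NVMe device
VG = "vg_storage"           # assumed volume group holding the ZIM volume
ORIGIN_LV = "zims"          # assumed logical volume serving the ZIMs
CACHE_SIZE = "500G"         # assumed cache size

def run(*cmd: str) -> None:
    """Run an LVM command, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. make the NVMe device a physical volume and add it to the existing VG
run("pvcreate", NVME_DEV)
run("vgextend", VG, NVME_DEV)

# 2. create a cache pool on the NVMe PV
run("lvcreate", "--type", "cache-pool", "-L", CACHE_SIZE,
    "-n", f"{ORIGIN_LV}_cache", VG, NVME_DEV)

# 3. attach the cache pool to the origin LV; writethrough keeps the HDDs
#    authoritative, so losing the NVMe loses only the cache, not data
run("lvconvert", "--type", "cache",
    "--cachepool", f"{VG}/{ORIGIN_LV}_cache",
    "--cachemode", "writethrough",
    f"{VG}/{ORIGIN_LV}")
```

The interesting part for our read pattern is that dm-cache promotes frequently accessed blocks, so fresh ZIMs and their metadata should naturally end up on the NVMe.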

But we should have more buffer and probably plan for around 8TB in at least RAID5. How much would that cost?

  • RAID is probably not mandatory since it would be a mirror; RAID5 is probably rather a maximum for perf + resilience (a quick sizing sketch follows this list)
  • cost: I would say $80-120/month at Hetzner, especially if we accept running on an "old" (~2020) machine; no idea for Scaleway
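
As a quick sizing sanity check (example disk counts and sizes only, not a quote):

```python
# Quick sizing sanity check for ~8 TB usable; example layouts only.
def raid1_usable(disk_tb: float) -> float:
    """Two-disk mirror: usable capacity is one disk's worth."""
    return disk_tb

def raid5_usable(n_disks: int, disk_tb: float) -> float:
    """RAID5: one disk's worth of capacity goes to parity."""
    return (n_disks - 1) * disk_tb

print(raid1_usable(8.0))      # 2 x 8 TB mirror -> 8 TB usable
print(raid5_usable(3, 4.0))   # 3 x 4 TB RAID5  -> 8 TB usable, survives 1 disk loss
print(raid5_usable(4, 4.0))   # 4 x 4 TB RAID5  -> 12 TB usable, more buffer
```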

@benoit74
Collaborator Author

benoit74 commented Oct 4, 2024

Duplicate of #227, solved by moving the workload to another cloud provider.

@benoit74 benoit74 closed this as completed Oct 4, 2024
@benoit74 benoit74 unpinned this issue Oct 4, 2024
@rgaudin
Member

rgaudin commented Oct 4, 2024

Ah, I wanted to comment that we haven't enabled SSD, but there's #246 just for that :)
