Version 0.30 - Memory consumption explodes after a number of hours #10526
Comments
@bgrahamen thank you for the report, we will triage next week, but a quick ask: 0.28 started prioritizing pin roots (release notes) and that introduced a slight memory increase when the strategy is set to all. Would it be possible for you to switch it to flat and see if the memory behavior changes? |
This behavior was present in 0.29 but was much, much worse in terms of speed of consumption and the subsequent OOM kill + restart of the service. For testing purposes I will see if I can change that setting. That being said, we are not setting Routing.Strategy, we are setting Reprovider.Strategy, which is what I think you meant @lidel? |
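For anyone following along, a minimal sketch of checking and switching the setting being discussed, assuming a stock Kubo install with the `ipfs` CLI on PATH (the daemon needs a restart afterwards for the change to apply):

```sh
# Show the currently configured reprovide strategy (if the key is set)
ipfs config Reprovider.Strategy

# Switch to the pre-0.28 "flat" behaviour, then restart the daemon
ipfs config Reprovider.Strategy flat
```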
That's not quite right. To make sure we're on the same page there are a few subsystems within kubo that may be relevant here:
So by advertising unpinned data, all you're doing is telling the network that you have data in your blockstore that you might garbage collect in the future. The only way you're fetching "any CID" is if you're already fetching "any CID" for people because they are using something like a publicly exposed HTTP Gateway API. |
This is exactly what happens in our setup today. We also intend to change some settings along those lines, but I need to resolve this memory consumption issue first, as that is presently a blocker. |
I've switched the Routing.Strategy to flat and will update you tomorrow on memory usage |
I would say, given how the memory consumption/growth looks so far and the rate of consumption with flat set, that setting flat for the Reprovider strategy made things much worse. The blue line is our alert limit of 126GB; the process is OOM killed around 212-225 GB. |
Could you post the profile information with the config changed as well? |
Trying to do that now. |
We can't capture a pprof when |
Maybe you can grab a smaller slice of the info, like just the heap and memstats: https://github.com/ipfs/kubo/blob/master/docs/debug-guide.md |
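For reference, the lighter-weight captures suggested here map to commands along these lines from the linked debug guide, assuming the Kubo RPC API is listening on the default 127.0.0.1:5001:

```sh
# Heap profile (pprof format)
curl -o ipfs.heap "http://127.0.0.1:5001/debug/pprof/heap"

# Runtime memory statistics (JSON; see the "memstats" object)
curl -o ipfs.vars "http://127.0.0.1:5001/debug/vars"
```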
latest_capture_with_flat_set.tar.gz
Attached are the outputs from these commands:
- goroutine dump
- 30 second CPU profile
- heap trace dump
- memory statistics (in JSON, see the "memstats" object)
- system information |
Another interesting note: after this last OOM kill and restart, the memory consumption has remained stable. Req/s, throughput, and all other traffic metrics are effectively the same between the periods of rapid memory consumption and OOM kill and periods like now, where it is steady at 30GB of memory used and slowly growing (like before we set flat). |
Triage notes |
Given QUIC is the most popular transport, I wonder if we would be able to turn it off without impacting our services. It looks like go-libp2p might pull in quic-go 0.47 in the next update (via go get) -- libp2p/go-libp2p#2901. @lidel I compiled kubo against a local copy of go-libp2p v0.36.4 that was updated to use quic-go v0.47.0. If it would be helpful for your investigation I can provide that binary here, or push my modified Docker image up to a repo for you to grab and test. It seems to be running so far on my local setup, but my local setup is pretty much just a test to see if it will even start. Docker image with kubo compiled against go-libp2p v0.36.4 with quic-go updated to v0.47 --
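For anyone wanting to reproduce a build like the one described above, a rough sketch follows; the exact fork/branch used isn't shown in the thread, so the `../go-libp2p` path is a placeholder for a local go-libp2p checkout that already pins quic-go v0.47.0:

```sh
# Build kubo against a local, modified go-libp2p checkout
git clone https://github.com/ipfs/kubo.git && cd kubo
go mod edit -replace github.com/libp2p/go-libp2p=../go-libp2p
go mod tidy
make build   # binary should end up at cmd/ipfs/ipfs
```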
Quick update: We will be testing the following changes to our config in an effort to see if we can resolve the issue:
We are attempting to avoid having to turn off QUIC as a transport, or run the version compiled against an updated libp2p. The theory is that if the issue is QUIC connections hanging around for too long, reaping them and putting a limit on the memory used by libp2p might help. These changes (with lower limits) are running on our test setup now, and we are monitoring metrics for 24hrs before deploying them to our production setup. Test Setup Limits: |
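The exact limits used on the test setup are not preserved above, but a minimal sketch of the kind of change described (shorter connection grace period plus a cap on libp2p memory) might look like this, with purely illustrative values:

```sh
# Illustrative values only; the actual test-setup limits are not shown above
ipfs config --json Swarm.ConnMgr.LowWater 2000
ipfs config --json Swarm.ConnMgr.HighWater 3000
ipfs config Swarm.ConnMgr.GracePeriod 30s
ipfs config Swarm.ResourceMgr.MaxMemory "8GB"
# restart the kubo daemon for the new limits to take effect
```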
Additional side note. With the testnet configuration in place I am seeing this warning on startup:
I assume (per this issue) -- Are there any "known good" values for those high/low water marks that don't interfere with Accelerated-DHT? |
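Not an answer to the question above, but as a side note the relevant values can at least be inspected with something like the following (keys print only if set in your config):

```sh
# Check whether the accelerated DHT client is enabled
ipfs config Routing.AcceleratedDHTClient

# Inspect the current connection manager water marks
ipfs config Swarm.ConnMgr.LowWater
ipfs config Swarm.ConnMgr.HighWater
```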
Update: It has been just about 4.5 days since we put these changes in place and the process is still running. Memory is still growing slowly over time, but we have yet to see the runaway memory consumption effect. It will be interesting to see what happens when it reaches the memory limit (120GB). If it is anything like our test setup, it will hopefully reclaim a bunch of memory when it nears that limit and then work its way back up to the limit again. This remains to be seen, but at present we are hopeful this occurs. |
Thank you for testing quic-go v0.47.0 @bgrahamen. Sidenotes: |
@lidel -- On our test setup we have it set that low mostly to keep things as small as possible from an instance size/cost perspective. I will check to see if we can move things up to allow it at least 4GB or so of RAM. I am also not sure about the history of the connection limits we have set; I agree they are very high, but I haven't figured out why yet, so I can't adjust them down with confidence. I will also look into the routing change. We have not tested the version compiled with quic-go 0.47 yet. We are hoping to avoid that with settings changes until it can be pulled into libp2p officially. However, if needed we will certainly try it. |
Triage note: parking until Kubo 0.32.0-rc1 with latest quic-go is ready |
Interesting additional data, perhaps. Our changes (we have not touched connection limits yet) have brought a very marked increase in stability. However, we have noticed something else crop up, perhaps hidden before by all the memory exhaustion issues. I am not sure how the interaction plays out in kubo, but HTTP API requests also seem to increase memory as well? We received large spikes of HTTP requests to our production IPFS in the two examples provided below. In both cases you can see the kubo process explode in terms of memory utilization. In the first example it died several times while under high memory pressure; I am still pulling logs to see if it was an OOM kill or something else. The second time it was able to recover and then slowly start growing again (as it seems to do in our case). Top is HTTP requests sum (per 15 min) / bottom is kubo memory utilization. |
@bgrahamen spikes in gateway traffic likely translate to increased libp2p/bitswap/QUIC activity, so nothing unexpected (but an additional data point that more QUIC == more memory). We've just shipped go-libp2p 0.37 with quic-go v0.48.1 and it got merged into Kubo's master branch. |
Triage note: Profile in the staging environment to see if this is QUIC or something else. If it is QUIC, then open a libp2p issue if there is not one already. |
We have upgraded to v0.32.1 and are monitoring memory consumption to see how the changes pulled in with the updated libp2p might impact things (hopefully in a positive way) for us. |
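One way to confirm which go-libp2p and quic-go versions a given Kubo binary was actually built with (assuming the `ipfs` on PATH is the upgraded binary):

```sh
# List the Go module versions compiled into the binary
ipfs version deps | grep -E 'go-libp2p|quic-go'
```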
Checklist
Installation method
ipfs-update or dist.ipfs.tech
Version
Config
Description
I recently upgraded from 0.29 to 0.30 to address a memory leak I was experiencing.
Everything seemed to be going fine for the first 20 or so hours, but then there was an explosion of memory consumption around the 21st hour. After a restart of the process the behavior is the same, though the time window between stable memory consumption and the explosion is not predictable.
When the process first starts up it consumes approximately 27GB of memory. While it is in "steady state" the memory consumption grows by between 3GB and 7GB, at which point something gives way and the process shoots to 197GB of memory and continues to grow until it is OOM killed. The box this is running on has 256GB of RAM total.
Attached are the allocs, cpu, goroutines, and heap pprof files:
heap.tar.gz
goroutines.tar.gz
cpu.tar.gz
alloc.tar.gz
goroutine stacks file:
goroutines-stacks.tar.gz
Graph of the behavior (memory utilization over a 24-hour period after a process restart):
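For completeness, captures like the ones attached above can typically be collected in one shot with Kubo's built-in profiler (output archive name will vary):

```sh
# Collects goroutine, CPU, heap and other profiles into a single zip archive
ipfs diag profile
```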