Excessive memory and CPU usage #97
Comments
Hi Michael,
Thanks for the detailed description. I think the first place to look is whether there is a memory leak somewhere. Maybe emitting Go metrics at this point is the right thing to do. Then we can start looking deeper. Another option is to deploy coraza with Go profiling enabled.
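For example, a minimal sketch of what that could look like (the debug address and wiring here are illustrative, not something coraza-spoa ships with today):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Illustrative only: run a localhost-only debug listener next to the SPOA daemon.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the coraza-spoa agent here as usual ...
	select {} // placeholder so the example keeps running
}
```

While the daemon is misbehaving, `go tool pprof http://localhost:6060/debug/pprof/heap` (or `/debug/pprof/profile` for CPU) should show where the memory and CPU are going.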
|
Hi José, thanks for your reply. I don't think it is a "classic" memory leak, because the memory consumption does not gradually increase over time. Rather, it is a very sudden event: normally the coraza-spoa daemon consumes very little memory, but when it "goes nuts" it suddenly allocates GBs of memory within seconds, and all available system memory is quickly used up until the process gets killed by the OOM killer. Also, as mentioned, those events seem to happen (when they happen) with high time correlation across multiple of our load-balancers.

So I suspect it may be a kind of regexp denial of service, where certain requests cause coraza to go into an endless loop. Or they cause a deadlock, so that processing of requests gets stuck and the queue of new incoming requests keeps filling up (but never gets emptied), and memory consumption just keeps increasing without limit (but I don't know the internal architecture of coraza and coraza-spoa, so would something like this be possible at all?).

I can try enabling Go profiling, but if it is the kind of issue I suspect, the profiling information may not help us much, right? But is there some way to debug on which request coraza got stuck (if that's what is happening)? Most of the time I can't witness the issue "live", because by the time I receive an alert about the excessive memory usage from our monitoring, the coraza-spoa daemon has already been killed before I can connect to the machine. So this really happens quite fast...
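If profiling does get enabled, one way to see where a request got stuck would be to dump all goroutine stacks on demand. A rough sketch (the helper name and the signal choice are made up for illustration):

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutinesOnSignal (hypothetical helper) writes every goroutine stack to
// stderr when the process receives SIGUSR1, so a looping or blocked handler can
// be identified even shortly before the OOM killer strikes.
func dumpGoroutinesOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			// debug=2 prints full stack traces for all goroutines.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}
```

With the pprof HTTP listener from the previous comment, fetching `/debug/pprof/goroutine?debug=2` from a monitoring script would give the same information without any code change.
|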
From your description, the symptoms look exactly like a bug in HAProxy v2.9.1. If you are using HAProxy 2.9.1, please downgrade to 2.9.0 because of this issue: haproxy/haproxy#2395. It will be fixed in the next 2.9.x release. |
Thanks for your suggestion, but we're still on haproxy v2.0.34, not v2.9. Also, in our case it's not the haproxy process that uses 100% CPU; it's the coraza-spoa daemon process that "goes nuts" and consumes all available CPU and memory. Hence I don't think the issue is in haproxy; it's rather either coraza itself or the coraza-spoa wrapper... |
I'm now also seeing the following stack traces when it crashes, and it always seems to be the same crash reason. So it looks like it's actually crashing inside corazawaf itself then, right?
|
What OS are you running, and what version of Go? |
AlmaLinux 9.4 and go1.21.13 (Red Hat 1.21.13-3.el9_4) |
This should be fixed now, but I still want to put some more love into it. Right now we are not limiting the creation of transactions, which can still create "memory leaks". I put that in quotes because we are pooling that memory, but since we will never use those objects afterwards, they are lost. Afaik a sync.Pool will never shrink (which is why it exists in the first place), but we need a pool with an upper limit plus some spare to allow us to limit the transactions properly.
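As a rough illustration of that direction (not the actual coraza-spoa code; names and sizes are made up), a bounded pool can be built on a buffered channel so that acquiring a transaction blocks once the upper limit is reached:

```go
package pool

// boundedPool caps how many transaction objects can exist at once. Acquire
// blocks when the limit is reached, which backpressures new work instead of
// letting memory grow without bound.
type boundedPool[T any] struct {
	items chan *T
}

// newBoundedPool pre-allocates `limit` objects; that is the hard upper bound.
func newBoundedPool[T any](limit int, newFn func() *T) *boundedPool[T] {
	p := &boundedPool[T]{items: make(chan *T, limit)}
	for i := 0; i < limit; i++ {
		p.items <- newFn()
	}
	return p
}

// Acquire blocks until a pooled object is free again.
func (p *boundedPool[T]) Acquire() *T { return <-p.items }

// Release returns an object to the pool for reuse.
func (p *boundedPool[T]) Release(v *T) { p.items <- v }
```

The trade-off is that a channel-based pool never gives its objects back to the GC, so the limit plus spare has to be sized deliberately.
|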
Hi,
we would like to use coraza-spoa in production, but unfortunately, as soon as we put production traffic on it, after a few hours the coraza-spoa daemon "goes nuts": it starts using up all available CPU resources and shows rapidly increasing memory consumption, up to tens of GBs of RAM. The system then starts swapping and becomes entirely unresponsive until the OOM killer finally kills the coraza-spoa daemon. And this happens quite frequently for us, several times a day.
We run multiple haproxy load-balancing servers, and what I observed is that often, when coraza-spoa on one of them "goes nuts", the issue also propagates to further load-balancers, so the coraza daemon on other servers starts excessively using memory and CPU at almost the same time. My conclusion therefore is that the problem is externally triggered, i.e. caused by certain (malicious?) requests that come in. Only this would explain the time correlation of the OOM events among multiple load-balancer servers.
However, I went through our access logs at the respective times and couldn't find any obviously unusual or suspicious requests. There also weren't unusually many requests at those times, so it's not an issue of general load or request rate. But apparently there must still be something, i.e. certain problematic requests that trigger the issue and cause the excessive resource use. I also checked the coraza logs, but there is absolutely nothing logged when it happens. So whatever it is, coraza apparently didn't attempt to block those requests (or wasn't even able to get that far in processing the request).
So I don't really have a handle on how to get to the root of the problem. I don't even know whether the issue is in coraza itself or whether the problem is unique to the haproxy SPOE integration. So if you have any ideas what the problem might be or how we could debug it more effectively, please let me know.
Thanks!
Michael