Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement Coraza-SPOA #103

Merged
merged 21 commits into from
Nov 7, 2024
Merged

Reimplement Coraza-SPOA #103

merged 21 commits into from
Nov 7, 2024

Conversation

fionera
Copy link
Contributor

@fionera fionera commented Feb 18, 2024

Another go at updating coraza-spoa to a the up to date coraza version and a faster spop library. It also uses upstream e2e tests


Q: Why rewrite instead of incremental changes?
A: I didn't had time nor motivation to figure out why some things are how they are. It was just easier to do it like this.


TODOs:

  • Config files
  • Allow listening to a unix-socket
  • Correct logger setup
  • Figure out how to work with errors inside the handler
  • Fixup docker containers

Other PRs that are similar:

Closes #84 #85 #90 #88 #75

@fionera fionera marked this pull request as draft February 18, 2024 21:46
@fionera fionera requested a review from sts February 18, 2024 21:53
jptosso
jptosso previously approved these changes Feb 19, 2024
@jptosso jptosso self-requested a review February 19, 2024 08:23
with:
go-version: '1.21'

- name: setup environment
Copy link
Member

@jcchavezs jcchavezs Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really avoid going in this direction, this assumes a haproxy is running in local which is going to be a big friction to those willing to contribute. Personally I would never install haproxy locally to debug a problem so I suggest we spin up a haproxy instance using docker with testcontainers api (see https://pkg.go.dev/github.com/testcontainers/testcontainers-go) or just docker-compose so the test is selfcontained and does not assume I have something installed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explicitly did this without using containers (atleast inside the CI) as its faster to run. Locally and in production I would always recommend running without docker as the performance difference is huge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also I dont have docker on my machine since its just very slow on macos)

Copy link
Member

@jcchavezs jcchavezs Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really prefer to have something that is portable. It is OK if we want CI to run haproxy locally but we also need to provide a way to run it with docker. Again here we prioritize user experience if I have to install locally haproxy to contribute I'd rather not contribute hence we use containers. For advanced users they can definitively use local haproxy.

"github.com/dropmorepackets/haproxy-go/pkg/testutil"
)

const directives = `
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use go:embed for this, also in general we would like to keep a consistent http e2e suite by suing https://github.com/corazawaf/coraza/tree/main/http/e2e which is being used in coraza http handler, caddy and coraza-proxy-wasm. Running it is as easy as spin up an instance and run a go command, see https://github.com/corazawaf/coraza-caddy/blob/main/magefile.go#L74

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should I use embed for a single string? Inside the coraza tests its done the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah I can migrate the e2e tests to run the command but I am not a fan of running commands and basically ignoring the go test framework. If the Upstream would properly expose the tests as a list we could run these in parallel and reduce testing times while also reducing the amount of external dependencies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue on coraza itself to expose this and make it easier for other repos to use the e2e test suite.
corazawaf/coraza#1006

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for a coraza release to fix this up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better you use master for now.

internal/application.go Outdated Show resolved Hide resolved
Copy link
Member

@jcchavezs jcchavezs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I added a few comments and I also think we are missing some unit tests, specifically for functions around parsing request elements. Great work!

@fionera fionera force-pushed the rewrite branch 3 times, most recently from ee4e7f7 to f0db161 Compare February 19, 2024 19:39
@fionera
Copy link
Contributor Author

fionera commented Feb 25, 2024

Sorry that I didn't update on this again but I am currently working on getting a talk done. Will try to get the other points done by the end of next week

@fionera
Copy link
Contributor Author

fionera commented Mar 11, 2024

I think this can be reviewed and tested now. There is still a fd leak when reloading and changing the logger, also when there is a reload with logger change and then an a reload that reverts this, the old file can be raced. There should probably a small wrapper around the logger to provide prevent multiple writers to the same file, but I guess this can wait.

@fionera fionera marked this pull request as ready for review March 11, 2024 00:58
@fionera fionera requested a review from jcchavezs March 11, 2024 00:58
@fzipi
Copy link
Member

fzipi commented Jun 12, 2024

@fionera Can you solve the conflicts here?

@fionera
Copy link
Contributor Author

fionera commented Jun 14, 2024

Sure will do that

@fionera
Copy link
Contributor Author

fionera commented Jun 21, 2024

Updated :)

fionera added 6 commits June 21, 2024 22:04
This allows us to not rely on the http header,
but instead use a variable inside the transaction.
This does leak fds as the logging target never gets closed,
this has to be fixed before release but this will work for now.
It is not correctly configured anyway and
we have go specific linters enabled
I don't know why I didn't just close it after leaving the function.
There is no other use of the fd...
@fionera
Copy link
Contributor Author

fionera commented Jun 21, 2024

And now without missing to remove an old file 🗡️

@mmilitzer
Copy link

Since we're having stability issues with the current main branch (See: #97) I've been following this rewrite closely hoping it would get merged and fix our issues.

But since merging is stalled since a while I got curious and went ahead with deploying the rewrite branch anyway and putting some of our production traffic against it for a test. But unfortunately, this didn't go well - I couldn't make any conclusions on the stability vs. main because the rewrite branch has a massive memory leak issue and so I couldn't keep it running it for long. So after less than two hours the coraza-spoa daemon's memory use grew above 4 GB and additionally it also used a lot more CPU (about twice as much or more) than the main branch version. And so I ended the experiment again.

My setup uses the CRS v4.7.0 plus a few custom exclusion rules to deal with false positives. Response checking is disabled. Config options and rules were the same when doing this comparison, so the only thing that differs is that one time I ran the coraza-spoa daemon binary built from the rewrite branch and the other time that from the main branch.

First, here's a graph of how the overall system's memory consumption developed after enabling the rewrite branch coraza-spoa daemon. There's an obvious memory leak which forced me to stop the experiment and downgrade to the main branch version again after ~2 hrs. And, after switching back to main, the memory use instantly became stable again:
System_Memory_Use

And also for CPU usage there's a clear difference, so see the CPU consumption graph below that depicts the CPU usage of the coraza-spoa daemon process. While running the rewrite branch version the CPU consumption is consistently much higher and after downgrading to the main branch version it drops again to a much lower level:
coraza-spoa-CPU_Usage

What I further noticed is that I got very frequently messages like the following below in our /var/log/messages when running the rewrite branch version:

coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection

With the main branch coraza-spoa daemon we're also sometimes seeing such "broken pipe" log messages but they are very rare, so they happen just every couple of days. But while running the rewrite branch version this happened all the time, like there were some such messages logged almost every minute. So I'm wondering if there are also functional issues then (beyond just the resource consumption problems). Or if maybe these "broken pipe" errors also cause the memory leak (perhaps under these error conditions memory isn't properly released?).

But so unfortunately, I have no clue what the root cause of these issues might be. Perhaps, it's not even the code in the rewrite branch itself, so the rewrite branch also brings in a lot of newer dependencies - hence the memory leak could also be caused by a dependency (though less likely). So I can only share these observations from our first real-world test now and unfortunately it seems the rewrite branch isn't quite ready for general use yet. But hopefully these remaining issues can still be solved now...

@csuka
Copy link

csuka commented Sep 29, 2024

Or if maybe these "broken pipe" errors also cause the memory leak.

Perhaps this could help you out to find the root cause of the issue:

"A "broken pipe" error typically indicates that a process was attempting to write data to another process or socket, but the receiving process was closed or became unavailable before the data transfer was complete. This can occur for various reasons, including:

Network Issues: If the communication was happening over a network, a drop in network connectivity or a closed connection can result in a broken pipe.
Process Termination: If a receiving process crashes, exits, or is terminated unexpectedly, any ongoing write operation to that process will fail with a broken pipe error.
Resource Limits: Resource limitations like memory or file descriptors reaching their maximum capacity can cause one process to close, resulting in a broken pipe.
Timeouts: The receiving process may have a timeout set and, upon expiration, could close the connection while data is still being sent."

@fionera
Copy link
Contributor Author

fionera commented Sep 29, 2024

Did you check with before the fixes I pushed a couple days ago? I found the resource leak which was because of a wrong usage of defer statements. The brokenn pipe issues are happening when haproxy decides to close the connection because of either too long evaluation times or too many requests. I am currently investivating why exactly haproxy does that and will keep you updated.

@mmilitzer
Copy link

Did you check with before the fixes I pushed a couple days ago?

Yes, I used the latest version available (commit 6ea3e43)

@fionera
Copy link
Contributor Author

fionera commented Sep 30, 2024

I cannot reproduce this behaviour on my side. Can you run the newest commit with the memory profile flag?

@fionera
Copy link
Contributor Author

fionera commented Sep 30, 2024

Did you check with before the fixes I pushed a couple days ago?

Yes, I used the latest version available (commit 6ea3e43)

Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)

@mmilitzer
Copy link

Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)

I've tried the latest rewrite branch code and I have partially good news: The memory leak is gone, so I couldn't reproduce this anymore. Also the CPU usage was much better and on around the same level than the version from current main branch. So it looks like the memory leak previously was also eating CPU cycles and that's why CPU usage was higher. So the resource usage issues seem to be solved now, not worse than main branch anymore.

However, I'm still seeing "broken pipe" etc. output in /var/log/messages frequently which doesn't happen when running the code from the main branch:

2024-09-30T17:26:00.221389+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too 
2024-09-30T17:26:00.221466+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019387+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: write: broken pipe
2024-09-30T17:27:46.019485+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019564+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: use of closed network connection
2024-09-30T17:29:05.046278+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:05.046511+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683271+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683348+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 write tcp 127.0.0.1:9000->127.0.0.1:59350: write: broken pipe

So this is just a short snippet from the log but it's those same three errors ("frame is too big", "use of closed network connection" and "broken pipe") that repeat again and again. And "frame is too big" seems new, so I think I didn't see this before when I tested the earlier version that still had the memory leak.

NOTE: I tested this all still with go 1.21 because I don't have a newer version on AlmaLinux 9 and was too lazy to build the newer one from source. So I've reverted your latest commit d0ba9f8 and also changed the go version in go.mod back to 1.21 to make it work. And I don't know if that's a problem, so if there's anything why go >= 1.23.1 is really required...

@@ -1,27 +1,30 @@
module github.com/corazawaf/coraza-spoa

go 1.19
go 1.23.1
Copy link

@superstes superstes Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With go 1.23.1 the application fails to start for me.
With 1.22.3 (as used by the waf-core https://github.com/corazawaf/coraza/blob/main/go.mod#L3) it works without any config changes

Error:
Oct 06 16:43:57 lb01 coraza[79533]: {"level":"fatal","error":"initializing application '\\x00': invalid WAF config from string: failed to readfile: read /etc/coraza-spoa/coraza.conf: invalid argument","time":"2024-10-06T16:43:57+02:00","message":"Failed creating applications"}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same error with go 1.23.2 :
9:45PM FTL Failed creating applications error="initializing application '\x00': invalid WAF config from string: failed to readfile: read /opt/coraza-spoa/etc/coraza.conf: invalid argument"

Copy link
Contributor Author

@fionera fionera Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Will try to reproduce

@superstes
Copy link

superstes commented Oct 23, 2024

Had it running the last 20 days without any major issues: (with go 1.22.3)

  • Only a few errors per day in relation to the hits/requests
    image
  • Memory (100% = 4GB) and CPU (3 cores) utilization are stable:
    image

@dknight-bg
Copy link

I get an error when result processing is enabled ( response_check: true ) :

9:49PM PNC Error handling request error="response id is empty"

spoe-agent coraza-agent
    messages coraza-req coraza-res
    option var-prefix coraza
    option set-on-error error
    timeout hello      2s
    timeout idle       2m
    timeout processing 500ms
    use-backend coraza-spoa
    log global

spoe-message coraza-req
    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body
    event on-frontend-http-request

spoe-message coraza-res
    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response`

@fionera
Copy link
Contributor Author

fionera commented Oct 23, 2024

I get an error when result processing is enabled ( response_check: true ) :

9:49PM PNC Error handling request error="response id is empty"


spoe-agent coraza-agent

    messages coraza-req coraza-res

    option var-prefix coraza

    option set-on-error error

    timeout hello      2s

    timeout idle       2m

    timeout processing 500ms

    use-backend coraza-spoa

    log global



spoe-message coraza-req

    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body

    event on-frontend-http-request



spoe-message coraza-res

    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body

    event on-http-response`

The ID argument is using the wrong var. It would have to be txn.coraza.id

@dknight-bg
Copy link

I get an error when result processing is enabled ( response_check: true ) :
9:49PM PNC Error handling request error="response id is empty"

The ID argument is using the wrong var. It would have to be txn.coraza.id

Thank you, it works correctly with tnx.coraza.id parameter
I got confused by the sample configuration example/haproxy/coraza.cfg from this branch

@jptosso
Copy link
Member

jptosso commented Nov 7, 2024

Merged. Work will continue from here; v0.1 was created to keep the old version. This will be tagged as 1.0.0-experimental

@jptosso jptosso merged commit 577b335 into main Nov 7, 2024
0 of 5 checks passed
@jptosso jptosso deleted the rewrite branch November 7, 2024 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants