Reimplement Coraza-SPOA #103

fionera · 2024-02-18T21:45:44Z

Another go at updating coraza-spoa to a the up to date coraza version and a faster spop library. It also uses upstream e2e tests

Q: Why rewrite instead of incremental changes?
A: I didn't had time nor motivation to figure out why some things are how they are. It was just easier to do it like this.

TODOs:

Config files
Allow listening to a unix-socket
Correct logger setup
Figure out how to work with errors inside the handler
Fixup docker containers

Other PRs that are similar:

Closes #84 #85 #90 #88 #75

.github/workflows/container-image.yaml

jcchavezs · 2024-02-19T08:57:34Z

.github/workflows/test.yaml

+        with:
+          go-version: '1.21'
+
+      - name: setup environment


I'd really avoid going in this direction, this assumes a haproxy is running in local which is going to be a big friction to those willing to contribute. Personally I would never install haproxy locally to debug a problem so I suggest we spin up a haproxy instance using docker with testcontainers api (see https://pkg.go.dev/github.com/testcontainers/testcontainers-go) or just docker-compose so the test is selfcontained and does not assume I have something installed.

I explicitly did this without using containers (atleast inside the CI) as its faster to run. Locally and in production I would always recommend running without docker as the performance difference is huge.

(Also I dont have docker on my machine since its just very slow on macos)

I really prefer to have something that is portable. It is OK if we want CI to run haproxy locally but we also need to provide a way to run it with docker. Again here we prioritize user experience if I have to install locally haproxy to contribute I'd rather not contribute hence we use containers. For advanced users they can definitively use local haproxy.

jcchavezs · 2024-02-19T09:00:19Z

internal/e2e_test.go

+	"github.com/dropmorepackets/haproxy-go/pkg/testutil"
+)
+
+const directives = `


Please use go:embed for this, also in general we would like to keep a consistent http e2e suite by suing https://github.com/corazawaf/coraza/tree/main/http/e2e which is being used in coraza http handler, caddy and coraza-proxy-wasm. Running it is as easy as spin up an instance and run a go command, see https://github.com/corazawaf/coraza-caddy/blob/main/magefile.go#L74

Why should I use embed for a single string? Inside the coraza tests its done the same way.

But yeah I can migrate the e2e tests to run the command but I am not a fan of running commands and basically ignoring the go test framework. If the Upstream would properly expose the tests as a list we could run these in parallel and reduce testing times while also reducing the amount of external dependencies

I opened an issue on coraza itself to expose this and make it easier for other repos to use the e2e test suite.
corazawaf/coraza#1006

Waiting for a coraza release to fix this up

Maybe it is better you use master for now.

internal/application.go

jcchavezs

Looking good, I added a few comments and I also think we are missing some unit tests, specifically for functions around parsing request elements. Great work!

fionera · 2024-02-25T17:46:48Z

Sorry that I didn't update on this again but I am currently working on getting a talk done. Will try to get the other points done by the end of next week

fionera · 2024-03-11T00:58:25Z

I think this can be reviewed and tested now. There is still a fd leak when reloading and changing the logger, also when there is a reload with logger change and then an a reload that reverts this, the old file can be raced. There should probably a small wrapper around the logger to provide prevent multiple writers to the same file, but I guess this can wait.

...

fzipi · 2024-06-12T12:42:28Z

@fionera Can you solve the conflicts here?

fionera · 2024-06-14T07:56:04Z

Sure will do that

fionera · 2024-06-21T19:56:19Z

Updated :)

This allows us to not rely on the http header, but instead use a variable inside the transaction.

This does leak fds as the logging target never gets closed, this has to be fixed before release but this will work for now.

It is not correctly configured anyway and we have go specific linters enabled

I don't know why I didn't just close it after leaving the function. There is no other use of the fd...

fionera · 2024-06-21T20:05:09Z

And now without missing to remove an old file 🗡️

mmilitzer · 2024-09-29T09:07:12Z

Since we're having stability issues with the current main branch (See: #97) I've been following this rewrite closely hoping it would get merged and fix our issues.

But since merging is stalled since a while I got curious and went ahead with deploying the rewrite branch anyway and putting some of our production traffic against it for a test. But unfortunately, this didn't go well - I couldn't make any conclusions on the stability vs. main because the rewrite branch has a massive memory leak issue and so I couldn't keep it running it for long. So after less than two hours the coraza-spoa daemon's memory use grew above 4 GB and additionally it also used a lot more CPU (about twice as much or more) than the main branch version. And so I ended the experiment again.

My setup uses the CRS v4.7.0 plus a few custom exclusion rules to deal with false positives. Response checking is disabled. Config options and rules were the same when doing this comparison, so the only thing that differs is that one time I ran the coraza-spoa daemon binary built from the rewrite branch and the other time that from the main branch.

First, here's a graph of how the overall system's memory consumption developed after enabling the rewrite branch coraza-spoa daemon. There's an obvious memory leak which forced me to stop the experiment and downgrade to the main branch version again after ~2 hrs. And, after switching back to main, the memory use instantly became stable again:

And also for CPU usage there's a clear difference, so see the CPU consumption graph below that depicts the CPU usage of the coraza-spoa daemon process. While running the rewrite branch version the CPU consumption is consistently much higher and after downgrading to the main branch version it drops again to a much lower level:

What I further noticed is that I got very frequently messages like the following below in our /var/log/messages when running the rewrite branch version:

coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection
coraza-spoa[3659835]: 2024/09/28 16:13:02 write tcp 127.0.0.1:9000->127.0.0.1:50932: use of closed network connection

With the main branch coraza-spoa daemon we're also sometimes seeing such "broken pipe" log messages but they are very rare, so they happen just every couple of days. But while running the rewrite branch version this happened all the time, like there were some such messages logged almost every minute. So I'm wondering if there are also functional issues then (beyond just the resource consumption problems). Or if maybe these "broken pipe" errors also cause the memory leak (perhaps under these error conditions memory isn't properly released?).

But so unfortunately, I have no clue what the root cause of these issues might be. Perhaps, it's not even the code in the rewrite branch itself, so the rewrite branch also brings in a lot of newer dependencies - hence the memory leak could also be caused by a dependency (though less likely). So I can only share these observations from our first real-world test now and unfortunately it seems the rewrite branch isn't quite ready for general use yet. But hopefully these remaining issues can still be solved now...

csuka · 2024-09-29T09:28:28Z

Or if maybe these "broken pipe" errors also cause the memory leak.

Perhaps this could help you out to find the root cause of the issue:

"A "broken pipe" error typically indicates that a process was attempting to write data to another process or socket, but the receiving process was closed or became unavailable before the data transfer was complete. This can occur for various reasons, including:

Network Issues: If the communication was happening over a network, a drop in network connectivity or a closed connection can result in a broken pipe.
Process Termination: If a receiving process crashes, exits, or is terminated unexpectedly, any ongoing write operation to that process will fail with a broken pipe error.
Resource Limits: Resource limitations like memory or file descriptors reaching their maximum capacity can cause one process to close, resulting in a broken pipe.
Timeouts: The receiving process may have a timeout set and, upon expiration, could close the connection while data is still being sent."

fionera · 2024-09-29T11:35:16Z

Did you check with before the fixes I pushed a couple days ago? I found the resource leak which was because of a wrong usage of defer statements. The brokenn pipe issues are happening when haproxy decides to close the connection because of either too long evaluation times or too many requests. I am currently investivating why exactly haproxy does that and will keep you updated.

mmilitzer · 2024-09-29T12:08:00Z

Did you check with before the fixes I pushed a couple days ago?

Yes, I used the latest version available (commit 6ea3e43)

fionera · 2024-09-30T06:50:58Z

I cannot reproduce this behaviour on my side. Can you run the newest commit with the memory profile flag?

If we write it too early and eviction triggers, we will return it to the pool and a race occurs.

fionera · 2024-09-30T11:50:27Z

Did you check with before the fixes I pushed a couple days ago?

Yes, I used the latest version available (commit 6ea3e43)

Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)

mmilitzer · 2024-10-01T08:13:19Z

Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :)

I've tried the latest rewrite branch code and I have partially good news: The memory leak is gone, so I couldn't reproduce this anymore. Also the CPU usage was much better and on around the same level than the version from current main branch. So it looks like the memory leak previously was also eating CPU cycles and that's why CPU usage was higher. So the resource usage issues seem to be solved now, not worse than main branch anymore.

However, I'm still seeing "broken pipe" etc. output in /var/log/messages frequently which doesn't happen when running the code from the main branch:

2024-09-30T17:26:00.221389+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too 
2024-09-30T17:26:00.221466+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:26:00 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019387+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: write: broken pipe
2024-09-30T17:27:46.019485+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 disconnect frame with code 3: frame is too big
2024-09-30T17:27:46.019564+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:27:46 write tcp 127.0.0.1:9000->127.0.0.1:46636: use of closed network connection
2024-09-30T17:29:05.046278+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:05.046511+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:05 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683271+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 disconnect frame with code 3: frame is too big
2024-09-30T17:29:37.683348+00:00 web05 coraza-spoa[257058]: 2024/09/30 17:29:37 write tcp 127.0.0.1:9000->127.0.0.1:59350: write: broken pipe

So this is just a short snippet from the log but it's those same three errors ("frame is too big", "use of closed network connection" and "broken pipe") that repeat again and again. And "frame is too big" seems new, so I think I didn't see this before when I tested the earlier version that still had the memory leak.

NOTE: I tested this all still with go 1.21 because I don't have a newer version on AlmaLinux 9 and was too lazy to build the newer one from source. So I've reverted your latest commit d0ba9f8 and also changed the go version in go.mod back to 1.21 to make it work. And I don't know if that's a problem, so if there's anything why go >= 1.23.1 is really required...

superstes · 2024-10-06T14:49:19Z

go.mod

@@ -1,27 +1,30 @@
 module github.com/corazawaf/coraza-spoa

-go 1.19
+go 1.23.1


With go 1.23.1 the application fails to start for me.
With 1.22.3 (as used by the waf-core https://github.com/corazawaf/coraza/blob/main/go.mod#L3) it works without any config changes

Error:
Oct 06 16:43:57 lb01 coraza[79533]: {"level":"fatal","error":"initializing application '\\x00': invalid WAF config from string: failed to readfile: read /etc/coraza-spoa/coraza.conf: invalid argument","time":"2024-10-06T16:43:57+02:00","message":"Failed creating applications"}

same error with go 1.23.2 :
9:45PM FTL Failed creating applications error="initializing application '\x00': invalid WAF config from string: failed to readfile: read /opt/coraza-spoa/etc/coraza.conf: invalid argument"

Interesting. Will try to reproduce

superstes · 2024-10-23T09:14:16Z

Had it running the last 20 days without any major issues: (with go 1.22.3)

Only a few errors per day in relation to the hits/requests
Memory (100% = 4GB) and CPU (3 cores) utilization are stable:

dknight-bg · 2024-10-23T19:02:19Z

I get an error when result processing is enabled ( response_check: true ) :

9:49PM PNC Error handling request error="response id is empty"

spoe-agent coraza-agent
    messages coraza-req coraza-res
    option var-prefix coraza
    option set-on-error error
    timeout hello      2s
    timeout idle       2m
    timeout processing 500ms
    use-backend coraza-spoa
    log global

spoe-message coraza-req
    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body
    event on-frontend-http-request

spoe-message coraza-res
    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body
    event on-http-response`

fionera · 2024-10-23T19:26:36Z

I get an error when result processing is enabled ( response_check: true ) :

9:49PM PNC Error handling request error="response id is empty"


spoe-agent coraza-agent

    messages coraza-req coraza-res

    option var-prefix coraza

    option set-on-error error

    timeout hello      2s

    timeout idle       2m

    timeout processing 500ms

    use-backend coraza-spoa

    log global



spoe-message coraza-req

    args app=str(test_app) src-ip=src src-port=src_port dst-ip=dst dst-port=dst_port method=method path=path query=query version=req.ver headers=req.hdrs body=req.body

    event on-frontend-http-request



spoe-message coraza-res

    args app=str(test_app) id=var(txn.e2e.id) version=res.ver status=status headers=res.hdrs body=res.body

    event on-http-response`

The ID argument is using the wrong var. It would have to be txn.coraza.id

dknight-bg · 2024-10-24T07:03:59Z

I get an error when result processing is enabled ( response_check: true ) :
9:49PM PNC Error handling request error="response id is empty"

The ID argument is using the wrong var. It would have to be txn.coraza.id

Thank you, it works correctly with tnx.coraza.id parameter
I got confused by the sample configuration example/haproxy/coraza.cfg from this branch

jptosso · 2024-11-07T11:06:18Z

Merged. Work will continue from here; v0.1 was created to keep the old version. This will be tagged as 1.0.0-experimental

fionera marked this pull request as draft February 18, 2024 21:46

fionera requested a review from sts February 18, 2024 21:53

jptosso previously approved these changes Feb 19, 2024

View reviewed changes

jptosso self-requested a review February 19, 2024 08:23

jcchavezs reviewed Feb 19, 2024

View reviewed changes

.github/workflows/container-image.yaml Show resolved Hide resolved

jcchavezs reviewed Feb 19, 2024

View reviewed changes

internal/application.go Show resolved Hide resolved

jcchavezs reviewed Feb 19, 2024

View reviewed changes

internal/application.go Outdated Show resolved Hide resolved

jcchavezs reviewed Feb 19, 2024

View reviewed changes

fionera force-pushed the rewrite branch 3 times, most recently from ee4e7f7 to f0db161 Compare February 19, 2024 19:39

fionera marked this pull request as ready for review March 11, 2024 00:58

fionera requested a review from jcchavezs March 11, 2024 00:58

fionera force-pushed the rewrite branch from 5c84b9d to eec5451 Compare June 21, 2024 19:39

fionera added 6 commits June 21, 2024 22:04

feat: Initial commit after rewrite

ddbb135

feat: move id generation from haproxy to app

1054408

This allows us to not rely on the http header, but instead use a variable inside the transaction.

feat: implement config handling

5be96f8

This does leak fds as the logging target never gets closed, this has to be fixed before release but this will work for now.

ci: remove codeql runner

4639145

It is not correctly configured anyway and we have go specific linters enabled

feat: add built-in coreruleset

59f0463

fix: remove config file fd leak

8445726

I don't know why I didn't just close it after leaving the function. There is no other use of the fd...

fionera force-pushed the rewrite branch from eec5451 to 8445726 Compare June 21, 2024 20:04

DavidProdinger mentioned this pull request Sep 27, 2024

feat: add config validate functionality without starting the server #110

Open

Add profile flags

dc50899

fionera added 5 commits September 30, 2024 13:22

Skip processing if disabled

eda0e54

Write transaction only to cache after setup

220b4bf

If we write it too early and eviction triggers, we will return it to the pool and a race occurs.

chore: bump haproxy-go version

daf885a

chore: delete go.work

9934d40

ci: bump go version

d0ba9f8

fionera added 2 commits October 5, 2024 02:46

fix: correct transaction error and timeout handling

8d8db4e

feat: remove not implemented transaction limit

d7daf2e

superstes reviewed Oct 6, 2024

View reviewed changes

jptosso merged commit 577b335 into main Nov 7, 2024
0 of 5 checks passed

jptosso deleted the rewrite branch November 7, 2024 11:06

This was referenced Nov 11, 2024

draft: major refactor #90

Closed

Relies on official e2e #85

Closed

New application logging (zerolog) & error handling #75

Closed

docker/e2e/e2e.sh: use --max-time for curl #98

Closed

WilliamDEdwards mentioned this pull request Nov 19, 2024

Timeline to stable? #137

Open

fzipi mentioned this pull request Nov 23, 2024

fix: revert golang major upgrade #140

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reimplement Coraza-SPOA #103

Reimplement Coraza-SPOA #103

fionera commented Feb 18, 2024 •

edited

Loading

jcchavezs Feb 19, 2024 •

edited

Loading

fionera Feb 19, 2024

fionera Feb 19, 2024

jcchavezs Feb 20, 2024 •

edited

Loading

jcchavezs Feb 19, 2024

fionera Feb 19, 2024

fionera Feb 19, 2024

fionera Feb 19, 2024

fionera Mar 11, 2024

jcchavezs Apr 16, 2024

jcchavezs left a comment

fionera commented Feb 25, 2024

fionera commented Mar 11, 2024

fzipi commented Jun 12, 2024

fionera commented Jun 14, 2024

fionera commented Jun 21, 2024

fionera commented Jun 21, 2024

mmilitzer commented Sep 29, 2024

csuka commented Sep 29, 2024

fionera commented Sep 29, 2024

mmilitzer commented Sep 29, 2024

fionera commented Sep 30, 2024

fionera commented Sep 30, 2024

mmilitzer commented Oct 1, 2024

superstes Oct 6, 2024 •

edited

Loading

dknight-bg Oct 23, 2024

fionera Oct 23, 2024 •

edited

Loading

superstes commented Oct 23, 2024 •

edited

Loading

dknight-bg commented Oct 23, 2024

fionera commented Oct 23, 2024

dknight-bg commented Oct 24, 2024

jptosso commented Nov 7, 2024

Reimplement Coraza-SPOA #103

Reimplement Coraza-SPOA #103

Conversation

fionera commented Feb 18, 2024 • edited Loading

jcchavezs Feb 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcchavezs Feb 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcchavezs left a comment

Choose a reason for hiding this comment

fionera commented Feb 25, 2024

fionera commented Mar 11, 2024

fzipi commented Jun 12, 2024

fionera commented Jun 14, 2024

fionera commented Jun 21, 2024

fionera commented Jun 21, 2024

mmilitzer commented Sep 29, 2024

csuka commented Sep 29, 2024

fionera commented Sep 29, 2024

mmilitzer commented Sep 29, 2024

fionera commented Sep 30, 2024

fionera commented Sep 30, 2024

mmilitzer commented Oct 1, 2024

superstes Oct 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fionera Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

superstes commented Oct 23, 2024 • edited Loading

dknight-bg commented Oct 23, 2024

fionera commented Oct 23, 2024

dknight-bg commented Oct 24, 2024

jptosso commented Nov 7, 2024

fionera commented Feb 18, 2024 •

edited

Loading

jcchavezs Feb 19, 2024 •

edited

Loading

jcchavezs Feb 20, 2024 •

edited

Loading

superstes Oct 6, 2024 •

edited

Loading

fionera Oct 23, 2024 •

edited

Loading

superstes commented Oct 23, 2024 •

edited

Loading