Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make fsm ingoing channel finite #2862

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hedibouattour
Copy link

Currently we use an infinite channel (using a queue) for incoming messages, this results in explosion in memory consumption in the case of a route update storm. This change prevents this behavior by using a fixed size buffered channel.

this prevents memory consumption from explosing in the case of a route update storm
@serejkus
Copy link
Contributor

disclaimer: I'm not a project maintainer, so this comments are JFYI.

tl;dr: I think that this PR can't be merged because it introduces a breaking change for some usage scenarios. It would be better to make this behaviour optional with default to an infinite channel, and give gobgp user a config option to switch to a buffered channel of user-specified length.

This change can be a breaking one for some gobgp setups, specifically the one with lots of peers. All parsed peers' messages go to the single fsm goroutine. Buffered channels without explicit scheduling of messages (based on peer class, for example) can lead to a potential problem: some peers might push all their messages to the fsm loop, while the others may block on sending to the finite buffer channel. This can lead to filling tcp socket buffers and loosing bgp keepalive messages, resulting in resetting peer connection and resending full state. This is a metastable failure state (pdf) that won't heal itself.

The situation might get worse in case of a peer that went crazy and got into a restart loop, sending full state: without scheduling policy for messages it can become fatal. My colleagues got into gobgp OOM in exactly this scenario, so the problem with infinite channel is real. But I think that the solution is more complex than a buffered channel.

My suggestion is to leave the default behavior with unbuffered channels. Make a buffered channel enabled by an option from the config, giving the user the opportunity to select the buffer size.

@maxime-peim
Copy link

I would disagree with the previous comment. Indeed, gobgp would start dropping packets and keep alive, but pushing messages in a queue quicker than you can process them is not a great idea. The gobgp memory consumption would just grow forever, until it runs out of memory.
Putting back the pressure to the peers is actually better in my opinion.
GoBGP should handle more peers than it can, and surely the model of having one single goroutibe to threat all incoming message is not the best choice there.

@serejkus
Copy link
Contributor

@maxime-peim I do agree that unbuffered channels should go away. My point is that this exact change can replace OOM with cyclic session re-establishment, and both of these problems are equally bad. My proposals:

  • Short-term: make this buffer optional, with default to unbuffered behavior. Please, note, that you still can have buffered channels if you enable it in config.
  • Long-term: make buffered channel with scheduler for messages between the peer input and FSM goroutine. And, likely between FSM goroutine and output. It can mitigate the risk of working with a peer that started flooding updates.

There can also be a mid-term solution with a buffered channel plus moving adj-rib-in processing in the same goroutine as the peer input. It would allow to aggregate "flapping" updates and to not starve the FSM goroutine.

@fujita
Copy link
Member

fujita commented Dec 29, 2024

I think that it's difficult to choose the right size of the channel. If it's too small, it would be the bottleneck on a system with lots of memory.

My fundamental question is that how you can handle out-of-memory in Go. I prefer to remove the incoming channel. The rx go routine reads from a socket, creates a Path object, and directly inserts it to the table with its lock. In C, we can detect the failure to allocate a Path object and do something. How can we handle in Go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants