Add to check the liveness for connection with controller #921

Closed
dofmind opened this issue Aug 7, 2024 · 3 comments · Fixed by #920

@dofmind
Contributor

dofmind commented Aug 7, 2024

Please describe what you would like to see

#857 describes the following:

The bluechi-agent on the other side can detect a disconnect rather soon due to the Heartbeat feature. Such a periodic check of the connection status on an application layer could be used in the bluechi-controller as well. Based on the last seen timestamp, it could actively disconnect nodes.

However, bluechi-agent still takes quite a while to detect a disconnect from the bluechi-controller: about 60 seconds, even with the TCP KeepAlive options introduced by #674 applied.

For example, when I unplugged the cable at Aug 07 11:23:10, bluechi-agent did not detect the disconnection until Aug 07 11:24:08:

Aug 07 11:22:46 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:22:46 42dot-ak7 bluechi-agent[1167]: Connected to controller as 'ak7_master_main'
Aug 07 11:24:08 42dot-ak7 bluechi-agent[1167]: Disconnected from controller
Aug 07 11:24:08 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Trying to connect to controller (try 1)
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:24:14 42dot-ak7 bluechi-agent[1167]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Aug 07 11:24:14 42dot-ak7 bluechi-agent[1167]: Trying to connect to controller (try 2)
...
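
To put a number on the delay, the gap between the unplug time and the "Disconnected from controller" log line can be computed directly. This is only a small sketch: the helper name `delay_seconds` is made up for illustration, and the year is an assumption because the log timestamps do not carry one.

```python
from datetime import datetime

# Assumption: the log lines omit the year; 2024 is taken from the issue date.
YEAR = "2024"
FMT = "%Y %b %d %H:%M:%S"


def delay_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two syslog-style timestamps."""
    t0 = datetime.strptime(f"{YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{YEAR} {end}", FMT)
    return (t1 - t0).total_seconds()


# Cable unplugged vs. disconnect detected by bluechi-agent
print(delay_seconds("Aug 07 11:23:10", "Aug 07 11:24:08"))  # 58.0
```

So the agent needed 58 seconds to notice the disconnect in this run.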

Please describe the solution you'd like

If bluechi-controller emitted a periodic heartbeat signal to the agents, and each agent stored the current timestamp in a last seen timestamp property in its callback handler for that signal, bluechi-agent could check the liveness of its connection with the controller in the same way bluechi-controller does since #857.
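
The idea can be sketched roughly as follows. This is not BlueChi code: the class `ControllerLivenessMonitor`, its method names and the threshold value are hypothetical and only illustrate the "store last seen timestamp, compare against a threshold" pattern described above.

```python
import time
from typing import Callable


class ControllerLivenessMonitor:
    """Hypothetical helper: tracks when the controller was last seen."""

    def __init__(self, threshold_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s
        self._clock = clock
        # Treat the moment of connecting as the first "seen" event.
        self.last_seen = self._clock()

    def on_heartbeat(self) -> None:
        """Callback for the controller's heartbeat signal."""
        self.last_seen = self._clock()

    def is_alive(self) -> bool:
        """True while the last heartbeat is no older than the threshold."""
        return self._clock() - self.last_seen <= self.threshold_s


# Usage with a fake clock so the behaviour is easy to follow
now = [0.0]
monitor = ControllerLivenessMonitor(threshold_s=6.0, clock=lambda: now[0])
now[0] = 5.0
print(monitor.is_alive())  # True: last seen 5s ago, threshold is 6s
now[0] = 6.5
print(monitor.is_alive())  # False: no heartbeat for 6.5s
```

The agent would call `is_alive()` periodically; a monotonic clock is used by default so that wall-clock adjustments do not cause false disconnects.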

@engelmi
Member

engelmi commented Aug 8, 2024

Thanks for opening the issue! @dofmind

I can reproduce this and confirm that the bluechi-agent takes quite a while to detect the disconnect.
However, the reason for the periodic heartbeat check in the controller (#857) was to prevent "zombie agents", where the controller thinks an agent is still connected although it isn't, and therefore blocks any reconnect attempt from it. The agent, on the other hand, should be the passive part and wait for instructions. So I am wondering if there is a similar use case for the bluechi-agent? And what should the behavior of the agent be on detecting a disconnect? @dofmind

I assume the default behavior of the agent detecting the disconnect would be to actively trigger the disconnect and start with the reconnect loop (as it is done in #920). This would cause emitting a change signal for Status on the agent machine so that external applications could react on that.
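
A rough sketch of that behavior, for discussion only. The class `AgentConnection`, its callbacks and the status strings "online"/"offline" are assumptions made for this example and are not taken from #920.

```python
from typing import Callable


class AgentConnection:
    """Hypothetical model of the agent's connection state."""

    def __init__(self, try_connect: Callable[[], bool],
                 on_status_changed: Callable[[str], None]) -> None:
        self._try_connect = try_connect
        self._on_status_changed = on_status_changed
        self.status = "offline"

    def _set_status(self, status: str) -> None:
        # Notify listeners only on an actual change of Status.
        if status != self.status:
            self.status = status
            self._on_status_changed(status)

    def handle_liveness_failure(self) -> None:
        """Actively disconnect once the heartbeat check fails."""
        self._set_status("offline")

    def reconnect(self, max_tries: int) -> int:
        """Retry connecting; return the successful try number, or 0."""
        for attempt in range(1, max_tries + 1):
            if self._try_connect():
                self._set_status("online")
                return attempt
        return 0


# Usage: the first connect succeeds, later the first reconnect try fails
events = []
results = iter([True, False, True])
conn = AgentConnection(lambda: next(results), events.append)
conn.reconnect(max_tries=3)
conn.handle_liveness_failure()
conn.reconnect(max_tries=3)
print(events)  # ['online', 'offline', 'online']
```

An external application would subscribe where `on_status_changed` is passed in, which is where it could react to the agent going offline.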

@dofmind
Contributor Author

dofmind commented Aug 9, 2024

I assume the default behavior of the agent detecting the disconnect would be to actively trigger the disconnect and start with the reconnect loop (as it is done in #920). This would cause emitting a change signal for Status on the agent machine so that external applications could react on that.

I agree with you. If an agent machine stays disconnected for longer than the threshold time, the unit services running on it should be stopped, because the state manager will run them on another agent machine. The sooner the agent emits a change signal for Status, the sooner the external daemon on the agent machine can decide whether to stop the unit services running there.

@engelmi
Member

engelmi commented Aug 12, 2024

Makes sense to me. I think this would be a valuable feature similar to #857.
I just took a look at your PR #920 and left some comments - overall it looks great! @dofmind
In addition (and after merging #920), I'd also like to document the different heartbeats and thresholds in more detail, so I created issue #927.

@engelmi engelmi linked a pull request Aug 13, 2024 that will close this issue
@engelmi engelmi added this to the v0.9 milestone Sep 11, 2024