Add to check the liveness for connection with controller #921

dofmind · 2024-08-07T08:50:54Z

Please describe what you would like to see

#857 describes the following:

The bluechi-agent on the other side can detect a disconnect rather soon due to the Heartbeat feature. Such a periodic check of the connection status on an application layer could be used in the bluechi-controller as well. Based on the last seen timestamp, it could actively disconnect nodes.

However, bluechi-agent still takes quite a while to detect a disconnect from the bluechi-controller. (It takes about 60 seconds, even with the TCP KeepAlive options introduced by #674 applied.)

For example, when I unplugged the cable at Aug 07 11:23:10, bluechi-agent detected the disconnection at Aug 07 11:24:08:

Aug 07 11:22:46 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:22:46 42dot-ak7 bluechi-agent[1167]: Connected to controller as 'ak7_master_main'
Aug 07 11:24:08 42dot-ak7 bluechi-agent[1167]: Disconnected from controller
Aug 07 11:24:08 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Trying to connect to controller (try 1)
Aug 07 11:24:11 42dot-ak7 bluechi-agent[1167]: Connecting to controller on tcp:host=192.168.16.101,port=842
Aug 07 11:24:14 42dot-ak7 bluechi-agent[1167]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Aug 07 11:24:14 42dot-ak7 bluechi-agent[1167]: Trying to connect to controller (try 2)
...
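As a quick check of the delay in this report, the two timestamps can be subtracted. This is only arithmetic on the values quoted above. The log lines carry no year, so 2024 (the year of this report) is filled in purely so the timestamps parse.

```python
from datetime import datetime

# Timestamps copied from the report above. The year is an assumption added
# only for parsing; the journal lines themselves do not contain it.
FMT = "%Y %b %d %H:%M:%S"
unplugged = datetime.strptime("2024 Aug 07 11:23:10", FMT)
detected = datetime.strptime("2024 Aug 07 11:24:08", FMT)

# Time between pulling the cable and the agent logging the disconnect.
detection_delay = (detected - unplugged).total_seconds()
print(f"disconnect detected after {detection_delay:.0f} s")  # prints 58 s
```

That is 58 seconds, which matches the "about 60 seconds" mentioned above.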

Please describe the solution you'd like

If bluechi-controller emits a periodic heartbeat signal to the agents, and each agent stores the current timestamp in a last seen timestamp property in its callback handler for that signal, then bluechi-agent can check the liveness of its connection with the controller, as implemented in bluechi-controller in #857.
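A minimal sketch of the idea, in Python for brevity. It is not the BlueChi implementation, and every name in it (`ControllerLiveness`, `threshold_s`, `on_heartbeat`, `is_alive`) is hypothetical. It assumes the object is created when the connection is established, so the connect time counts as the first "last seen" value.

```python
import time


class ControllerLiveness:
    """Agent-side view of the controller's heartbeat (illustrative only)."""

    def __init__(self, threshold_s, clock=time.monotonic):
        # threshold_s: how long the agent tolerates silence (hypothetical knob).
        self.threshold_s = threshold_s
        self._clock = clock
        # Assumption: created at connect time, so start the timer now.
        self.last_seen = self._clock()

    def on_heartbeat(self):
        """Callback for the controller's heartbeat signal: store last seen."""
        self.last_seen = self._clock()

    def is_alive(self):
        """Periodic check: True while the last heartbeat is recent enough."""
        return self._clock() - self.last_seen <= self.threshold_s


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
liveness = ControllerLiveness(threshold_s=6.0, clock=lambda: now[0])
now[0] = 2.0
liveness.on_heartbeat()       # heartbeat received at t=2
now[0] = 7.0
print(liveness.is_alive())    # 5 s of silence, within the threshold
now[0] = 8.5
print(liveness.is_alive())    # 6.5 s of silence, treat as disconnected
```

When `is_alive()` turns false, the agent would actively drop the connection and enter its reconnect loop. A monotonic clock is used as the default so that wall-clock adjustments cannot fake or hide a timeout.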


engelmi · 2024-08-08T14:21:45Z

Thanks for opening the issue! @dofmind

I can reproduce this and confirm that the bluechi-agent takes quite a while to detect the disconnect.
However, the reason for the periodic heartbeat check in the controller (#857) was to prevent any "zombie agents", where the controller thinks the agent is still connected although it isn't, and therefore blocks any reconnect attempt from it. The agent on the other side should be the passive part and wait for instructions. So I am wondering if there is a similar use case for the bluechi-agent? And what should be the behavior of the agent on detecting a disconnect? @dofmind

I assume the default behavior of the agent detecting the disconnect would be to actively trigger the disconnect and start with the reconnect loop (as it is done in #920). This would cause a change signal for Status to be emitted on the agent machine so that external applications could react to that.

dofmind · 2024-08-09T09:25:29Z

I assume the default behavior of the agent detecting the disconnect would be to actively trigger the disconnect and start with the reconnect loop (as it is done in #920). This would cause a change signal for Status to be emitted on the agent machine so that external applications could react to that.

I agree with you. If an agent machine stays disconnected for longer than the threshold time, the unit services running on it should be stopped, because the state manager will run them on another agent machine. If the agent emits the change signal for Status sooner, the external daemon on the agent machine can decide sooner whether to stop the unit services running there.
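To make the decision logic concrete, here is a rough sketch of what such an external daemon could track. It is an illustration under assumptions, not an existing tool: `DisconnectWatcher`, `threshold_s`, and the `"online"` / `"offline"` status strings are placeholders, and the D-Bus subscription that would deliver the Status change is left out.

```python
class DisconnectWatcher:
    """Tracks how long the agent has been disconnected (illustrative only)."""

    def __init__(self, threshold_s):
        # threshold_s: allowed offline time before local units are stopped.
        self.threshold_s = threshold_s
        self.offline_since = None  # None while the agent is connected

    def on_status_changed(self, status, timestamp):
        """Handler for the agent's Status change signal (assumed values)."""
        if status == "offline":
            # Keep the first offline timestamp; repeats do not reset it.
            if self.offline_since is None:
                self.offline_since = timestamp
        else:
            self.offline_since = None

    def should_stop_units(self, now):
        """True once the agent has been offline for longer than the threshold."""
        if self.offline_since is None:
            return False
        return now - self.offline_since > self.threshold_s


watcher = DisconnectWatcher(threshold_s=30.0)
watcher.on_status_changed("offline", timestamp=100.0)
print(watcher.should_stop_units(now=120.0))  # 20 s offline, keep units running
print(watcher.should_stop_units(now=131.0))  # 31 s offline, stop the units
```

The sooner the agent reports the disconnect, the closer `offline_since` is to the real outage, so the threshold is measured from when the link actually broke rather than from a late detection.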

engelmi · 2024-08-12T14:03:09Z

Makes sense to me. I think this would be a valuable feature similar to #857.
I just took a look at your PR #920 and left some comments - overall it looks great! @dofmind
In addition (and after merging #920), I'd also like to document the different heartbeats and thresholds in more detail, so I created issue #927.

dofmind mentioned this issue Aug 7, 2024

Add to check the liveness for connection with controller #920

Merged

engelmi mentioned this issue Aug 12, 2024

Extend documentation about heartbeats and thresholds #927

Open

engelmi linked a pull request Aug 13, 2024 that will close this issue

Add to check the liveness for connection with controller #920

Merged

engelmi closed this as completed in #920 Aug 13, 2024

engelmi added this to the v0.9 milestone Sep 11, 2024
