change CKE to proceed rebooting immediately after draining of node is completed #707
server/strategy.go
if len(ops) > 0 {
	phaseReboot = true
}
Thank you for fixing phaseReboot!
I checked this variable carefully, and I am convinced that this function does not need to return phaseReboot.
This function should only determine ops from the given information.
The caller (DecideOps()) should determine the phase, which can be derived from the returned ops alone.
Could you change the return value of this function and change the caller as follows?
It's okay to change it in another PR.
https://github.com/cybozu-go/cke/blob/v1.28.0/server/strategy.go#L106-L113
// 11. Reboot nodes if reboot request has been arrived to the reboot queue, and the number of unreachable nodes is less than a threshold.
if ops := rebootOps(c, constraints, rebootArgs, nf); len(ops) > 0 {
	if !nf.EtcdIsGood() {
		log.Warn("cannot reboot nodes because etcd cluster is not responding and in-sync", nil)
		return nil, cke.PhaseRebootNodes
	}
	return ops, cke.PhaseRebootNodes
}
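To make the suggested split of responsibilities concrete, here is a minimal, self-contained sketch. It is not CKE's actual code: Phase, Operator, PhaseCompleted, and the simplified signatures of rebootOps and decideOps are hypothetical stand-ins chosen for illustration. The point is only that rebootOps returns operations and nothing else, while the caller derives the phase from whether any operations came back.

```go
package main

// Phase and Operator are hypothetical, simplified stand-ins for the
// phase and operator types used in CKE; they are assumptions for this sketch.
type Phase string

const (
	PhaseRebootNodes Phase = "reboot-nodes"
	PhaseCompleted   Phase = "completed"
)

type Operator string

// rebootOps only derives operations from the given information.
// It does not report a phase.
func rebootOps(queued []string) []Operator {
	ops := make([]Operator, 0, len(queued))
	for _, node := range queued {
		ops = append(ops, Operator("reboot:"+node))
	}
	return ops
}

// decideOps determines the phase purely from the returned ops,
// following the shape of the caller quoted in the review comment.
func decideOps(queued []string, etcdIsGood bool) ([]Operator, Phase) {
	if ops := rebootOps(queued); len(ops) > 0 {
		if !etcdIsGood {
			// Stay in the reboot phase, but do nothing until etcd recovers.
			return nil, PhaseRebootNodes
		}
		return ops, PhaseRebootNodes
	}
	return nil, PhaseCompleted
}
```

With this shape, a separate phaseReboot flag has nothing left to express: a non-empty result from rebootOps is itself the signal that the reboot phase applies.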
Signed-off-by: YZ775 <[email protected]>
LGTM
In the current CKE, even if max_concurrent_reboots is larger than 1, nodes are drained serially, and rebooting starts only after all drainable nodes have been drained. This behavior causes a long downtime for the first drained node.
This PR changes CKE to reboot each node immediately after its draining is finished.
In addition, this PR fixes the behavior of canceled reboot queue entries.
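As a rough illustration of why the old ordering hurts the first drained node, the following sketch compares its downtime under the two behaviors. It rests on simplifying assumptions that are not taken from CKE: every node takes the same fixed time to drain and to reboot, draining is strictly serial, and "downtime" means the time from the start of a node's drain until its reboot completes. The function names are hypothetical.

```go
package main

import "time"

// firstNodeDowntimeBatch models the old behavior: the first node is drained,
// then waits until all `nodes` nodes have been drained serially before rebooting.
func firstNodeDowntimeBatch(nodes int, drain, reboot time.Duration) time.Duration {
	return time.Duration(nodes)*drain + reboot
}

// firstNodeDowntimeImmediate models the new behavior: the first node
// reboots as soon as its own drain has finished.
func firstNodeDowntimeImmediate(drain, reboot time.Duration) time.Duration {
	return drain + reboot
}
```

Under these assumptions, with 3 nodes, a 2-minute drain, and a 5-minute reboot, the first node is down for 11 minutes with the old ordering and 7 minutes with the new one; the gap grows linearly with the number of nodes drained together.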