-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option for minimal reboot period #904
base: main
Are you sure you want to change the base?
Conversation
f398712
to
ac20ae6
Compare
The flag --min-reboot-period can be used to define the minimal duration between reboots of a node. Signed-off-by: leonnicolas <[email protected]>
ac20ae6
to
12e3818
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR.
- only warn on misconfiguration - use `node.Annotations` not `node.GetAnnotations()` - explicitly check `annotate-node` value before reading an annotation Signed-off-by: leonnicolas <[email protected]>
9138acf
to
3fb674f
Compare
Thanks @ckotzbauer for the super fast review! I pushed a commit according to your comments. |
Can I do anything about the failing e2e tests? |
cmd/kured/main.go
Outdated
} | ||
|
||
if lastSuccessfulRebootWithinMinRebootPeriod(node) { | ||
log.Infof("Last successful reboot within minimal reboot period") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be preferable to log the next time that a reboot will be allowed. So maybe we convert lastSuccessfulRebootWithinMinRebootPeriod() into a "getNextAllowedRebootTime()" func, and then do something like
nextRebootTime := getNextAllowedRebootTime(node)
if time.Now().Before(nextRebootTime) {
log.Infof("Not allowed to reboot until %s", nextRebootTime)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
example confuses me. Why add minRebootPeriod
to the value returned by getNextAllowedRebootTime
?
Either way returning the next allowed reboot time (without adding the minRebootPeriod) or returning the last successful reboot time (and adding minRebootPeriod) sound good to me.
But what should we do if we don't know the last reboot time? Right now we just return "you should reboot" when there is no last successful reboot time. Would the getNextAllowedRebootTime just return &time.Time{}
if we don't know about the last successful reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, copy/paste error in example (updated!).
In the instance where we aren't able to determine the last reboot time we can probably return time.Now(), and then update the statement to
if time.Now().Compare(nextRebootTime) < 0 {
log.Infof("Not allowed to reboot until %s", nextRebootTime)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you ok with returning and error explicitly? Like
func nextRebootTime(node) (*time.Time, error)
So we don't have to return time.Now() when we don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example we could do
if minRebootPeriod > 0 {
if t, err := nextAllowedReboot(node); err != nil {
log.Warnf("Failed to determine next allowed reboot time: %s", err.Error())
} else if diff := t.Sub(time.Now()); diff > 0 {
log.Infof("Reboot not allowed until %s (%s)", t.String(), diff.String())
}
}
if we don't care about logging the duration (in how much time can we reboot), we should use time.Now().Before(t)
instead, I guess. I am really bad with subtracting times in my head.
I see that it would work without returning an error. It would even take less code, but the users would never see the warning about a missing annotation or anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with an error, but in the case of the annotation not being found, that's not actually an error, as this is a new feature and plenty of users won't configure kured to use it. So silently ignoring that and returning time.Now() seems reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would only check the annotation if minRebootPeriod
is configured to be larger than 0.
Right, if you suddenly turn on the feature you would get one "wrong" error log entry before the first reboot. After the first reboot with the --minRebootPeriod
flag, a missing annotation would actually be a real error.
If we don't care so much about logging that error, I will totally agree to returning time.Now()
if something unexpected happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, if you suddenly turn on the feature you would get one "wrong" error log entry before the first reboot.
Could you clarify why this is so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You run cured without the --min-reboot-period
flag. Kured will not annotate the node with the weave.works/kured-last-successful-reboot
flag. Now you want to use the "new" feature, edit your manifest and roll out the new kured daemon set. Normally none of the pods will hold the lock from
Line 666 in d0bdc11
if holding(lock, &nodeMeta, concurrency > 1) { |
weave.works/kured-last-successful-reboot
annotation to its nodes. We wouldn't want that anyways because then restarting/rollilng out kured daemon set would falsify the last successful reboot time. Since the feature is enabled now, kured would check for the weave.works/kured-last-successful-reboot
annotation, but fail since it is not there yet. So if we log an error when the annotation is missing, we would see one error log for each tick until the first reboot. That the annotation is missing is expected before the first reboot. However, after the first reboot the missing annotation would be an unexpected error.
Only after the first reboot with the --min-reboot-period
flag, this feature would actually work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what is right here. Should we never log an error on missing annotations to avoid confusion before the first reboot to the cost of silencing real errors after the first reboot? Or the other way around? (I added a commit that would log the errors, happy to change it though)
If we should not reboot because the last successful reboot is within the `min-reboot-period`, log the soonest reboot time. Note, this will also log an error if we cannot determine the last successful reboot time. However this error is expected when running kured for the first time with `--min-reboot-period`. The `last-successful-reboot` annotation will be added to the node after the first reboot with this feature enabled and then only unexpected error should be logged. Signed-off-by: leonnicolas <[email protected]>
any blocker to have this merged? |
This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days). |
How can I help? |
@jackfrancis Can you maybe have a look again please? |
I read the comments here above and the code.
There are 4 cases I see (tell me if I am wrong): Case A: Everything is good, setting a fake value on the first kured boot is fine (value is small). Did I miss something? |
This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days). |
aa627d9
to
c4e56b9
Compare
Use the `RebootBlocker` interface to implement the min-reboot-period feature. Signed-off-by: leonnicolas <[email protected]>
This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days). |
The flag --min-reboot-period can be used to define the minimal duration
between reboots of a node.