Do not requeue for running Job #237
base: main
Conversation
Polling the Job status unnecessarily causes k8s API load and noisy logging. The caller of DoJob() should declare that it owns a Job via Owns(&batchv1.Job{}) in its SetupWithManager call. Then the caller will be automatically reconciled when the Job status changes. There is an edge case when the Job has just been created and is immediately read back from the API, causing a temporary NotFound. In this case, returning the NotFound error is simpler than asking for an explicit Requeue, and it will still trigger the automatic requeue.
I marked this as Draft as I need to ensure that all DoJob users declare that they own the Job it creates.
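For illustration, a minimal sketch of the ownership wiring described above, borrowing keystone-operator names as one possible caller (the reconciler and import paths here are only an example, not part of this PR):

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	keystonev1 "github.com/openstack-k8s-operators/keystone-operator/api/v1beta1"
)

// SetupWithManager declares that this controller owns the Jobs it creates,
// so the owner is reconciled automatically whenever a Job status changes
// and DoJob() no longer needs to poll or requeue.
func (r *KeystoneAPIReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&keystonev1.KeystoneAPI{}).
		Owns(&batchv1.Job{}). // watch owned Jobs and map them back to the owner
		Complete(r)
}
```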
We also still need a way for the caller to decide when the Job has finished successfully ...
Not sure I understand that; is there a case other than ...
Sorry, I wasn't clear. If we remove all the requeue returns from DoJob, then it either returns an error or returns err=nil. The latter can mean either that the Job is still running or that it has finished successfully. Most of the DBsync logic in the caller needs explicit knowledge of the case when the Job is finished, in order to store the hash in the status. https://github.com/openstack-k8s-operators/keystone-operator/blob/203185e59d8912e4a0fc3d169d09bb6a2b468ae8/controllers/keystoneapi_controller.go#L310-L330 I guess we don't want to store the Job hash while the Job is still running. Or do we?
Right, thanks for the explanation; I hadn't thought about that while checking this PR the first time. We could do one of the following: either return an additional bool for completed or not, or return a custom error on "still running" so that the caller can check the error type, similar to ... We could probably store the hash while the Job is still running to identify whether it changed during execution and recreate it with the new definition, but it might be better to wait for it to finish and then trigger a new one rather than cancel it in the middle.
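A minimal sketch of the custom-error variant mentioned above; ErrJobStillRunning and the caller snippet are hypothetical names, not existing lib-common API:

```go
import "errors"

// ErrJobStillRunning is a hypothetical sentinel error DoJob() could return
// while the Job has neither succeeded nor failed yet.
var ErrJobStillRunning = errors.New("job is still running")

// Hypothetical caller logic:
//
//	if err := dbSyncJob.DoJob(ctx, helper); err != nil {
//		if errors.Is(err, ErrJobStillRunning) {
//			// not a failure; the Job watch will retrigger the reconcile
//			return ctrl.Result{}, nil
//		}
//		return ctrl.Result{}, err
//	}
//	// err == nil now unambiguously means the Job finished successfully,
//	// so it is safe to store the Job hash in the status.
```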
```diff
@@ -39,15 +39,15 @@ func NewJob(
 	job *batchv1.Job,
 	jobType string,
 	preserve bool,
-	timeout time.Duration,
+	timeout time.Duration, // unused
```
Wondering if we should still support this, e.g. when 0 is passed it is not used, and anything higher is considered for requeue?
My goal is to get rid of the requeue within this helper. I can even drop the field from the Job struct, as I need to do adaptation PRs in the other operators anyhow, so I can do the signature change at the same time.
If we return the status of the Job to the caller, then the caller can decide to requeue or just return. Still, I believe we should never need to explicitly requeue based on the status of a k8s-managed resource, as we can watch it instead. (The situation can be different for non-k8s resource statuses, e.g. when we depend on something to happen in OpenStack or Galera itself.)
Should we proceed with this? It looks like a valid and nice-to-have fix.
Yes, we should. If you have time feel free to pick it up. This needs adaptation in all our service-operators.
From the DoJob perspective, if the Job is still running that is not an error condition. So I lean towards returning either a bool or even an enum{Active,Succeeded,Failed} together with the error, which can be nil.
Good point. I need to be careful about the re-creation of the Job if the hash changes in the middle of the Job execution. I agree that killing a Job in the middle can cause unwanted side effects. Bottom line, I have to work on this PR more and verify the integration of the change with our existing operators. Thanks for the feedback so far; I think I got my answer. At this point I only wanted to verify that the idea of this PR is not controversial. I will ping you again when I have a completed PR and some example integration changes with other operators.
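A rough sketch of the enum-style return discussed above; JobStatus, the constant names, and the caller snippet are hypothetical, and the final signature still needs to be agreed on:

```go
// JobStatus is a hypothetical result DoJob() could return alongside the
// error, mirroring the three phases of a batchv1 Job.
type JobStatus int

const (
	JobActive JobStatus = iota
	JobSucceeded
	JobFailed
)

// Hypothetical caller logic once DoJob() returns (JobStatus, error):
//
//	status, err := dbSyncJob.DoJob(ctx, helper)
//	if err != nil {
//		return ctrl.Result{}, err
//	}
//	switch status {
//	case JobSucceeded:
//		// safe to store the Job hash in the instance status now
//	case JobActive:
//		// no explicit requeue; the Owns(&batchv1.Job{}) watch retriggers us
//	case JobFailed:
//		// surface the failure via a condition instead of requeueing
//	}
```

Compared to a plain bool, the three-valued status also lets the caller distinguish a failed Job from one that is still running without inspecting the Job object itself.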