Running version 1.1.5 (though 1.1.2 also showed the same behavior). The 1.0.x versions I used previously did not produce this panic in the logs. Relevant log portions:
2019-03-12T16:46:33Z "1 pod(s) pending deletion, sleeping 8s"
2019-03-12T16:46:37Z "Draining node timeout reached"
2019-03-12T16:46:37Z "0 kube-dns pod(s) found"
2019-03-12T16:46:37Z "Done draining kube-dns from node"
2019-03-12T16:46:38Z "Node deleted"
2019-03-12T16:46:38Z "322 minute(s) to go before kill, keeping node"
2019-03-12T16:46:38Z "Sleeping for 640 seconds..."
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc000222000)
/usr/local/go/src/sync/waitgroup.go:132 +0xae
main.main()
/estafette-work/main.go:171 +0x956
This seems related to the case where the node the killer runs on is itself killed. The rest of the logs seems to indicate another killer process was spun up in the prior minute or two. Both processes then alternate messages like "1 pod(s) pending deletion, sleeping 9s".
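For reference, here is a minimal, hedged sketch of the kind of misuse that produces this exact panic message. It is not the project's actual main.go; it just shows that calling Add on a sync.WaitGroup while a previous Wait is still returning can trip the check at waitgroup.go's Wait. Because it depends on timing, it may not panic on every run:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		time.Sleep(10 * time.Millisecond)
		wg.Done() // counter hits zero, main's Wait starts returning
		// Reusing the WaitGroup before that Wait has fully returned races
		// with Wait's internal reset and can trigger:
		//   panic: sync: WaitGroup is reused before previous Wait has returned
		wg.Add(1)
		wg.Done()
	}()

	wg.Wait() // may panic here, matching the stack trace above
}
```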
The second newly-created killer pod (which ran 8s after the above process) has the expected notices that the node has already been deleted:
2019-03-12T16:46:46Z "Draining node timeout reached"
2019-03-12T16:46:46Z "0 kube-dns pod(s) found"
2019-03-12T16:46:46Z "Done draining kube-dns from node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error deleting node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error while processing node"
The new pod then continues on normally, and the old pod apparently dies.
One way to fix this would be to use the downward API to inject the node name as an environment variable, then add a condition in the kubernetes.DrainNode function so it never deletes its own pod (the estafette-gke-preemptible-killer pod). The node would still be deleted, and the pod would be re-scheduled on another node. A rough sketch is below.
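The sketch below is illustrative only: the NODE_NAME env var, the drainNode helper, and the pod filtering are assumptions for the sake of the example, not the project's actual kubernetes.DrainNode implementation. It assumes the pod spec injects the node name via the downward API (fieldPath: spec.nodeName) and relies on the fact that a pod's hostname defaults to its pod name:

```go
package main

import (
	"fmt"
	"os"
)

type pod struct {
	Name string
	Node string
}

// selfNodeName reads the node name injected via the downward API, e.g.:
//   env:
//   - name: NODE_NAME
//     valueFrom:
//       fieldRef:
//         fieldPath: spec.nodeName
func selfNodeName() string { return os.Getenv("NODE_NAME") }

// selfPodName uses the hostname, which defaults to the pod name.
func selfPodName() string {
	h, _ := os.Hostname()
	return h
}

// drainNode (hypothetical) deletes the pods scheduled on nodeName but skips
// this killer's own pod, so the process is not killed mid-drain. The node
// deletion can then proceed and the pod is re-scheduled elsewhere once the
// node is gone.
func drainNode(nodeName string, pods []pod) {
	for _, p := range pods {
		if nodeName == selfNodeName() && p.Name == selfPodName() {
			fmt.Printf("skipping own pod %s on node %s\n", p.Name, nodeName)
			continue
		}
		fmt.Printf("deleting pod %s on node %s\n", p.Name, nodeName)
		// ... call the Kubernetes API here to evict/delete the pod ...
	}
}

func main() {
	drainNode("gke-pool-node-1", []pod{
		{Name: "some-workload-abc", Node: "gke-pool-node-1"},
		{Name: "estafette-gke-preemptible-killer-xyz", Node: "gke-pool-node-1"},
	})
}
```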