Running version 1.1.5 (though 1.1.2 also showed the same behavior). The 1.0.x versions I used previously did not produce this panic in the logs. Relevant log portions:
2019-03-12T16:46:33Z "1 pod(s) pending deletion, sleeping 8s"
2019-03-12T16:46:37Z "Draining node timeout reached"
2019-03-12T16:46:37Z "0 kube-dns pod(s) found"
2019-03-12T16:46:37Z "Done draining kube-dns from node"
2019-03-12T16:46:38Z "Node deleted"
2019-03-12T16:46:38Z "322 minute(s) to go before kill, keeping node"
2019-03-12T16:46:38Z "Sleeping for 640 seconds..."
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc000222000)
/usr/local/go/src/sync/waitgroup.go:132 +0xae
main.main()
/estafette-work/main.go:171 +0x956
This seems related to the case where the node the killer runs on is itself killed. The rest of the logs seems to indicate another killer process was spun up in the prior minute or two. Both processes then alternate messages like "1 pod(s) pending deletion, sleeping 9s".
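For reference, here is a minimal, hedged sketch of the kind of misuse that produces this exact panic message. It is not the project's actual main.go; it just shows that calling Add on a sync.WaitGroup while a previous Wait is still returning can trip the check at waitgroup.go's Wait. Because it depends on timing, it may not panic on every run:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		time.Sleep(10 * time.Millisecond)
		wg.Done() // counter hits zero, main's Wait starts returning
		// Reusing the WaitGroup before that Wait has fully returned races
		// with Wait's internal reset and can trigger:
		//   panic: sync: WaitGroup is reused before previous Wait has returned
		wg.Add(1)
		wg.Done()
	}()

	wg.Wait() // may panic here, matching the stack trace above
}
```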
The second newly-created killer pod (which ran 8s after the above process) has the expected notices that the node has already been deleted:
2019-03-12T16:46:46Z "Draining node timeout reached"
2019-03-12T16:46:46Z "0 kube-dns pod(s) found"
2019-03-12T16:46:46Z "Done draining kube-dns from node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error deleting node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error while processing node"
The new pod then continues on normally, and the old pod apparently dies.
One way to fix this would be to use the downward API to inject the node name as an environment variable, then add a condition in the kubernetes.DrainNode function so it never deletes its own pod (the estafette-gke-preemptible-killer pod). The node would still be deleted, and the pod would be re-scheduled on another node. A rough sketch is below.
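The sketch below is illustrative only: the NODE_NAME env var, the drainNode helper, and the pod filtering are assumptions for the sake of the example, not the project's actual kubernetes.DrainNode implementation. It assumes the pod spec injects the node name via the downward API (fieldPath: spec.nodeName) and relies on the fact that a pod's hostname defaults to its pod name:

```go
package main

import (
	"fmt"
	"os"
)

type pod struct {
	Name string
	Node string
}

// selfNodeName reads the node name injected via the downward API, e.g.:
//   env:
//   - name: NODE_NAME
//     valueFrom:
//       fieldRef:
//         fieldPath: spec.nodeName
func selfNodeName() string { return os.Getenv("NODE_NAME") }

// selfPodName uses the hostname, which defaults to the pod name.
func selfPodName() string {
	h, _ := os.Hostname()
	return h
}

// drainNode (hypothetical) deletes the pods scheduled on nodeName but skips
// this killer's own pod, so the process is not killed mid-drain. The node
// deletion can then proceed and the pod is re-scheduled elsewhere once the
// node is gone.
func drainNode(nodeName string, pods []pod) {
	for _, p := range pods {
		if nodeName == selfNodeName() && p.Name == selfPodName() {
			fmt.Printf("skipping own pod %s on node %s\n", p.Name, nodeName)
			continue
		}
		fmt.Printf("deleting pod %s on node %s\n", p.Name, nodeName)
		// ... call the Kubernetes API here to evict/delete the pod ...
	}
}

func main() {
	drainNode("gke-pool-node-1", []pod{
		{Name: "some-workload-abc", Node: "gke-pool-node-1"},
		{Name: "estafette-gke-preemptible-killer-xyz", Node: "gke-pool-node-1"},
	})
}
```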