panic: sync: WaitGroup is reused before previous Wait has returned #15

Open
smartyjohn opened this issue Mar 12, 2019 · 1 comment

smartyjohn commented Mar 12, 2019

Running version 1.1.5 (though 1.1.2 also showed the same behavior). The prior 1.0.x versions I've used did not produce this panic in the logs. Relevant log portions:

2019-03-12T16:46:33Z "1 pod(s) pending deletion, sleeping 8s"
2019-03-12T16:46:37Z "Draining node timeout reached"
2019-03-12T16:46:37Z "0 kube-dns pod(s) found"
2019-03-12T16:46:37Z "Done draining kube-dns from node"
2019-03-12T16:46:38Z "Node deleted"
2019-03-12T16:46:38Z "322 minute(s) to go before kill, keeping node"
2019-03-12T16:46:38Z "Sleeping for 640 seconds..."
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc000222000)
 	/usr/local/go/src/sync/waitgroup.go:132 +0xae
main.main()
 	/estafette-work/main.go:171 +0x956
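
For context, this panic comes from Go's sync package: Wait panics if the WaitGroup is re-armed with a new Add while a previous Wait call is still in the middle of returning. A minimal sketch of the racy pattern (illustrative only, not the project's actual main.go):

package main

import "sync"

func main() {
	var wg sync.WaitGroup
	for {
		wg.Add(1)
		go func() {
			wg.Done() // counter drops to zero, which begins waking the Wait below
			wg.Add(1) // re-arming the WaitGroup before that Wait has fully returned
			wg.Done()
		}()
		// Can panic with "sync: WaitGroup is reused before previous Wait has returned"
		// if the Add above lands while this Wait is still waking up.
		wg.Wait()
	}
}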

It seems this may be related to the case where the node the killer itself runs on gets killed. The rest of the logs seem to indicate another killer process was spun up in the prior minute or two. Both processes then alternate messages like "1 pod(s) pending deletion, sleeping 9s".

The second, newly created killer pod (which ran 8s after the process above) shows the expected notices that the node has already been deleted:

2019-03-12T16:46:46Z "Draining node timeout reached"
2019-03-12T16:46:46Z "0 kube-dns pod(s) found"
2019-03-12T16:46:46Z "Done draining kube-dns from node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error deleting node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error while processing node"

The new pod then continues on normally, and the old pod apparently dies and is no more.

etiennetremel (Contributor) commented

One way to fix it would be to use the downward API to inject the node name as an environment variable, then add a condition in the kubernetes.DrainNode function to prevent it from deleting itself (the estafette-gke-preemptible-killer pod). The node would then still be deleted, and the pod rescheduled on another node.
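
A rough sketch of that idea, assuming the node name is injected as a NODE_NAME environment variable via the downward API (fieldRef: spec.nodeName); the function and pod-matching details below are illustrative, not the actual kubernetes.DrainNode code:

package main

import (
	"fmt"
	"os"
)

// drainNode stands in for kubernetes.DrainNode: it would delete the pods on
// the given node, but skips this process's own pod when the node being
// drained is the one the killer itself runs on.
func drainNode(node string, podsOnNode []string) {
	selfNode := os.Getenv("NODE_NAME") // downward API: fieldRef to spec.nodeName
	selfPod := os.Getenv("HOSTNAME")   // pod name normally matches the container hostname
	for _, pod := range podsOnNode {
		if node == selfNode && pod == selfPod {
			fmt.Printf("skipping own pod %s while draining node %s\n", pod, node)
			continue
		}
		fmt.Printf("deleting pod %s on node %s\n", pod, node)
	}
}

func main() {
	// Hypothetical node and pod names, for illustration only.
	drainNode("gke-pool-preemptible-abc123", []string{
		"kube-dns-xyz",
		"estafette-gke-preemptible-killer-abc",
	})
}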
