nodes do not get deleted #6
Comments
Can this be a timeout on the GCloud side? I wasn't able to see any outage during that period, but if this node doesn't get processed it should be picked up on the next loop, and if the error still persists, maybe there is more information in the logs right before this happens.
I'm seeing something similar happen. My guess is this might be happening because kube-dns is being killed before the GCloud client is used, so it fails to resolve the host name when authenticating.
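If the hostname lookup failure really is a transient kube-dns hiccup, wrapping the GCloud call in a small retry would let it recover without waiting for the next loop. This is only a sketch, not the controller's actual code; the attempt count, backoff, and the lookup used in `main` are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// retryOnDNSError retries op when the failure looks like a name-resolution
// problem, which could happen if kube-dns on this node is torn down just
// before the GCloud API is called. Attempt count and backoff are illustrative.
func retryOnDNSError(attempts int, op func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		var dnsErr *net.DNSError
		if !errors.As(lastErr, &dnsErr) {
			return lastErr // not a DNS problem, surface it immediately
		}
		time.Sleep(time.Duration(i+1) * 2 * time.Second)
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Stand-in for the authenticated GCloud call that fails to resolve its host.
	err := retryOnDNSError(3, func() error {
		_, err := net.LookupHost("compute.googleapis.com")
		return err
	})
	fmt.Println("result:", err)
}
```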
Although kube-dns, if present on the node, is actively deleted by https://github.com/estafette/estafette-gke-preemptible-killer/blob/master/main.go#L296, kube-dns runs highly available, so this shouldn't be an issue. However, it turns out that Kubernetes Engine, built to be resilient, isn't very resilient in the face of preemptions. The master doesn't update services with pods on a preempted node fast enough to stop sending traffic there. We've seen this through frequent kube-dns issues correlating with real preemptions by Google, not the ones issued by our preemptible-killer.
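For reference, a minimal client-go sketch of that behaviour, deleting the kube-dns pods on a node before its VM goes away, could look like the following. It is not the controller's actual implementation; it uses newer client-go signatures and assumes the standard `k8s-app=kube-dns` label:

```go
package preemption

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteKubeDNSPodsOnNode removes the kube-dns pods scheduled on the given
// node so DNS traffic fails over to replicas on other nodes before the VM
// disappears. The label selector assumes the standard GKE kube-dns labels.
func deleteKubeDNSPodsOnNode(ctx context.Context, client kubernetes.Interface, node string) error {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=kube-dns",
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return fmt.Errorf("listing kube-dns pods on node %s: %w", node, err)
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting pod %s: %w", pod.Name, err)
		}
	}
	return nil
}
```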
@JorritSalverda We're getting DNS errors intermittently on our GKE preemptibles (with preemptible-killer) when services in the cluster try to resolve other services in the same cluster.
@jstephens7 we've seen the same and have actually moved away from preemptibles for the time being. It's unrelated to this controller, but happens when a node really gets preempted by Google before this controller would do it instead. GKE doesn't handle preemption gracefully; it just kills the node at once. This leaves the Kubernetes master in the dark for a while until it discovers that the node is no longer available. In the meantime the iptables rules don't get updated and traffic still gets routed to the unavailable node. I would expect this scenario to be handled better, since you want Kubernetes to be resilient in the face of real node malfunction. For AWS there's actually a notifier that warns you a spot instance is going down, but GCP doesn't currently have such a thing. See https://learnk8s.io/blog/kubernetes-spot-instances for more info.
Seems like this could be a good solution: https://github.com/GoogleCloudPlatform/k8s-node-termination-handler
@JorritSalverda have you completely given up on preemptibles in production (because of this issue)? Just exploring the idea, so I would love to hear your feedback. And would @theallseingeye's suggestion mitigate this?
When deleting a node, I am experiencing this error:
I would say that my service account JSON is correctly uploaded to the pod, and the account has the proper permissions, so I don't know what is happening.
Hi @santinoncs, do you use the Helm chart? And what version? We run it with a service account with roles |
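One way to rule out credential problems separately from the controller is a tiny standalone check against the Compute Engine API. This is just a sketch assuming the google.golang.org/api/compute/v1 client; the key path, project and zone are placeholders, not values from this issue:

```go
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Path, project and zone are placeholders; use whatever the chart mounts
	// and the project/zone your cluster actually runs in.
	svc, err := compute.NewService(ctx, option.WithCredentialsFile("/keys/service-account.json"))
	if err != nil {
		log.Fatalf("could not create compute client: %v", err)
	}

	// A cheap read-only call to confirm the key parses and the granted roles
	// allow access to the Compute Engine API.
	if _, err := svc.Instances.List("my-project", "europe-west1-b").Context(ctx).Do(); err != nil {
		log.Fatalf("credential check failed: %v", err)
	}
	log.Println("service account can reach the Compute Engine API")
}
```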
Hi @tmirks, we did abandon preemptibles for a while since the pressure on europe-west1 mounted and preemptions became more commonplace. The fact that GKE wasn't aware of preemptions caused a lot of trouble with kube-dns requests getting sent to no-longer-existing pods. Now we're testing the k8s-node-termination-handler (see the Helm chart at https://github.com/estafette/k8s-node-termination-handler) together with this application, to ensure GKE is aware of preemptions and preemptions are less likely to happen all at once. Spreading preemptible nodes across zones should also help reduce the chance of mass preemptions.
It's already working now that I copy the ca-certificates file into the container.
Just FYI, GKE now handles node preemption gracefully, giving pods about 25 seconds to shut down. |
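That window only helps workloads that actually react to SIGTERM. A minimal sketch of a pod doing so is below; the 20-second shutdown timeout is just chosen to stay inside the roughly 25-second window mentioned above, not something prescribed by GKE:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for the SIGTERM Kubernetes sends when the node is being preempted.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Finish in-flight requests, staying well within the preemption window.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown did not complete: %v", err)
	}
}
```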
In one of our Kubernetes Engine clusters, nodes that should be deleted do not get removed properly. They're already disabled for scheduling and the pods are evicted, but then the following error is logged when the controller tries to delete the VM:
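For context on what that failing step does: deleting the underlying VM goes through the Compute Engine instances API. A rough sketch using the google.golang.org/api/compute/v1 client is shown below; the project, zone and instance names are placeholders and this is not the controller's actual code:

```go
package main

import (
	"context"
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// deleteInstance issues the Compute Engine delete call for the VM backing a
// drained node. Project, zone and instance are placeholders here.
func deleteInstance(ctx context.Context, project, zone, instance string) error {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return fmt.Errorf("creating compute client: %w", err)
	}
	op, err := svc.Instances.Delete(project, zone, instance).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("deleting instance %s: %w", instance, err)
	}
	fmt.Printf("delete operation %s started, status %s\n", op.Name, op.Status)
	return nil
}

func main() {
	if err := deleteInstance(context.Background(), "my-project", "europe-west1-b", "gke-node-to-delete"); err != nil {
		fmt.Println("error:", err)
	}
}
```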