
nodes do not get deleted #6

Open · JorritSalverda opened this issue Nov 20, 2017 · 12 comments

@JorritSalverda (Collaborator)

In one of our Kubernetes Engine clusters, nodes that should be deleted do not get removed properly. They're already disabled for scheduling and their pods are evicted, but then the following error is logged when the controller tries to delete the VM:

{
	"time":"2017-11-20T09:23:10Z",
	"severity":"error",
	"app":"estafette-gke-preemptible-killer",
	"version":"1.0.29",
	"error":"Delete https://www.googleapis.com/compute/v1/projects/***/zones/europe-west1-c/instances/gke-development-euro-auto-scaling-pre-33198d65-gq2m?alt=json: dial tcp: i/o timeout",
	"host":"gke-development-euro-auto-scaling-pre-33198d65-gq2m",
	"message":"Error while processing node"
}
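For context, the failing call is (roughly) the standard GCE instance delete. A minimal sketch, assuming the google.golang.org/api/compute/v1 client; project, zone, and instance names are placeholders, and the controller's actual wiring may differ:

package main

import (
	"context"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

// deleteInstance issues the same kind of GCE delete call the controller
// makes when it removes a drained node.
func deleteInstance(project, zone, name string) error {
	// Bound the call so network problems surface as a timeout instead of hanging.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	svc, err := compute.NewService(ctx)
	if err != nil {
		return err
	}
	// The "dial tcp: i/o timeout" in the log above is returned by this call.
	_, err = svc.Instances.Delete(project, zone, name).Context(ctx).Do()
	return err
}

func main() {
	if err := deleteInstance("my-project", "europe-west1-c", "gke-node-to-delete"); err != nil {
		log.Printf("Error while processing node: %v", err)
	}
}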
@etiennetremel (Contributor)

Can this be a timeout on the GCloud side? I wasn't able to see any outage during that period, but if this node doesn't get processed, it should be picked up on the next loop. If the error still persists, maybe there is more information in the logs right before this happens.

@ksuther commented Oct 21, 2018

I'm seeing something similar happen. My guess is this might be happening because kube-dns is being killed before the GCloud client is used, so it fails to resolve the host name when authenticating.

 jsonPayload: {
  app: "estafette-gke-preemptible-killer"
  error: "Delete https://www.googleapis.com/compute/v1/projects/path/to/instance?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 10.114.0.10:53: dial udp 10.114.0.10:53: connect: network is unreachable"
  host: "test-pool-cb8bed09-17s6"
  message: "Error deleting GCloud instance"
  version: "1.0.35"
 }
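If that guess is right, one possible mitigation (just a sketch of my own, not something the controller does today) would be to force an OAuth token fetch while kube-dns is still reachable; the oauth2 token source caches the token, so it can be reused after the node's DNS goes away:

package main

import (
	"context"
	"log"

	"golang.org/x/oauth2/google"
	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Resolve oauth2.googleapis.com and fetch a token while kube-dns is
	// still answering; the token source caches it until it expires.
	creds, err := google.FindDefaultCredentials(ctx, compute.ComputeScope)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := creds.TokenSource.Token(); err != nil {
		log.Fatal(err)
	}
	// ... now drain kube-dns, then call the Compute API with creds.TokenSource ...
}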

@JorritSalverda (Collaborator, Author)

Although kube-dns, if present on the node, is actively deleted by https://github.com/estafette/estafette-gke-preemptible-killer/blob/master/main.go#L296, kube-dns runs in HA, so this shouldn't be an issue.
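
Roughly, that drain amounts to evicting the kube-dns pods scheduled on the node. A simplified sketch using client-go (not the controller's exact code; it assumes GKE's k8s-app=kube-dns label):

package drain

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainKubeDNS evicts the kube-dns pods running on the given node; the
// remaining HA replicas on other nodes keep serving DNS in the meantime.
func drainKubeDNS(ctx context.Context, client kubernetes.Interface, node string) error {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=kube-dns",
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", node),
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}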

However, it turns out that Kubernetes Engine, although built to be resilient, isn't very resilient in the face of preemptions. The master doesn't update services with pods on a preempted node fast enough to stop sending traffic there. We've seen this through frequent kube-dns issues correlating with real preemptions by Google, not the ones issued by our preemptible-killer.

@jstephens7 commented Dec 4, 2018

@JorritSalverda We're intermittently getting DNS errors on our GKE preemptibles (with the preemptible-killer) when services in the cluster try to resolve other services in the same cluster.
EDIT: It should be noted that we're only having these intermittent connection issues on our preemptibles; the other nodes have no issues.
I'm asking out of ignorance:
What is the purpose of removing kube-dns from the node?
Would leaving kube-dns on the node remove the DNS issues?
And could you clarify your last statement: "We've seen this through frequent kube-dns issues correlating with real preemptions by Google, not the ones issued by our preemptible-killer."

@JorritSalverda (Collaborator, Author)

@jstephens7 we've seen the same and have actually moved away from preemptibles for the time being. It's unrelated to this controller, but happens when a node really gets preempted by Google before this controller would do it instead. GKE doesn't handle preemption gracefully; it just kills the node at once. This leaves the Kubernetes master in the dark for a while, until it discovers that the node is no longer available. In the meantime the iptables rules don't get updated and traffic still gets routed to the unavailable node. I would expect this scenario to be handled better, since you want Kubernetes to be resilient in the face of real node malfunction.

For AWS there's actually a notifier that warns you when a spot instance is going down, but GCP doesn't currently have such a thing. See https://learnk8s.io/blog/kubernetes-spot-instances for more info.

@theallseingeye

Seems like this could be a good solution: https://github.com/GoogleCloudPlatform/k8s-node-termination-handler
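
For reference, that handler reacts to the GCE metadata server flipping instance/preempted to TRUE. A minimal sketch of that detection (my own illustration, not the handler's actual code):

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// The GCE metadata server exposes instance/preempted, which flips to
	// TRUE when the VM is being preempted; wait_for_change=true turns the
	// request into a hanging GET that returns at that moment.
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	if string(body) == "TRUE" {
		log.Println("node is being preempted, start graceful pod shutdown")
	}
}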

@tmirks commented Jul 20, 2019

@JorritSalverda have you completely given up on preemptibles in production (because of this issue)? Just exploring the idea, so I'd love to hear your feedback.

And would @theallseingeye's suggestion mitigate this?

@santinoncs commented Oct 16, 2020

When deleting a node, I am experiencing this error:

INF Done draining kube-dns from node host=gke-xxxxx
ERR Error deleting GCloud instance error="Delete \"https://www.googleapis.com/compute/v1/projects/yyyyyy/zones/europe-west1-b/instances/gke-xxxxxx?alt=json\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": x509: certificate signed by unknown authority" host=gke-xxxxx
ERR Error while processing node error="Delete \"https://www.googleapis.com/compute/v1/projects/yyyyyy/zones/europe-west1-b/instances/gke-xxxxxx?alt=json\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": x509: certificate signed by unknown authority" host=gke-xxxx

I would say that my service account JSON is correctly uploaded to the pod, and the account has the proper permissions, so I don't know what is happening.

@JorritSalverda (Collaborator, Author)

Hi @santinoncs, do you use the Helm chart? And what version? We run it with a service account that has the compute.instanceAdmin.v1 role on the project the GKE cluster is in. That seems to work fine.

@JorritSalverda (Collaborator, Author)

Hi @tmirks, we did abandon preemptibles for a while, since the pressure on europe-west1 mounted and preemptions became more commonplace. The fact that GKE wasn't aware of preemptions caused a lot of trouble, with kube-dns requests getting sent to no-longer-existing pods. Now we're testing the k8s-node-termination-handler - see the Helm chart at https://github.com/estafette/k8s-node-termination-handler - together with this application, to ensure GKE is aware of preemptions and that preemptions are less likely to happen all at once. Spreading preemptible nodes across zones should also help reduce the chance of mass preemptions.

@santinoncs

Hi @santinoncs, do you use the Helm chart? And what version? We run it with a service account that has the compute.instanceAdmin.v1 role on the project the GKE cluster is in. That seems to work fine.

It's already working now that I copied the ca-certificates file into the container.
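
For anyone hitting the same x509 error: it typically means the container image ships without CA root certificates, so Go's TLS client can't verify googleapis.com. A hypothetical Dockerfile fix, assuming an Alpine-based image (or a scratch image built in a multi-stage build):

# Alpine-based image: install the CA bundle
RUN apk add --no-cache ca-certificates

# scratch/distroless-style image: copy the bundle from the build stage
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/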

@vikstrous2

Just FYI, GKE now handles node preemption gracefully, giving pods about 25 seconds to shut down.
