etcd controller shall not ignore ghost etcd member in health check #11231

mogliang · 2024-11-05T02:28:42Z

Environmental Info:
K3s Version:

Node(s) CPU architecture, OS, and Version:

Cluster Configuration:
k3s cluster with embeded etcd cluster

Describe the bug:
Currently, when there is a ghost etcd member (the member which don't belong to any k3s node), etcd controller choose to ignore it.

k3s/pkg/etcd/etcd.go

Lines 1115 to 1117 in 4adcdf8

    
           if member.Name != name { 
        
           	continue 
        
           }

This could be dangerous. Suppose there is a 3-controlplane nodes k3s cluster with embeded etcd, and 1 ghost member which may already offline (may caused by unclean node removal), then the cluster is actually in unhealthy state and cannot bear another cp node offline.

We are working on k3s cluster-api provider, when we do rollout update on k3s cluster, we encountered issue that node not removed cleanly and have etcd member left, and later it causes quorum loss in the middle of rollout.

Although it can be solved by leveraging annotation etcd.k3s.cattle.io/remove, it's better to have a 2nd guard to ensure no ghost etcd member eixists before doing any cluster managment operation (capi default controlplane implementation have this check). See previous discussion #9841

Steps To Reproduce:

create 3 controlplane node k3s cluster
shutdown 1 controlplane machine and delete that node by kubectl delete node

Expected behavior:
node EtcdIsVoter condition shall be false with message showing the ghost member

Actual behavior:
node EtcdIsVoter condition shows as true

Proposed fix
PullRequest #11232

Additional context / logs:

The text was updated successfully, but these errors were encountered:

brandond · 2024-11-05T04:07:34Z

It is normal for a node to be added to etcd before the kubelet comes up and creates a Kubernetes node object. Are you accounting for this in your error state? This should be handled by member promotion, and on the other side, the finalizer on node objects should prevent nodes from being deleted without being removed from etcd.

Can you provide steps to reproduce this?

mogliang · 2024-11-05T06:05:03Z

It is normal for a node to be added to etcd before the kubelet comes up and creates a Kubernetes node object. Are you accounting for this in your error state?

Good point, CAPI may consider this as abonormal state, and hold any mgmt operation until cluster to be stable (all etcd member mapped to node)

This should be handled by member promotion, and on the other side, the finalizer on node objects should prevent nodes from being deleted without being removed from etcd.
Can you provide steps to reproduce this?

It seems that the issue we were facing is due to stop VM before removing node, not etcd member leak. When we remove etcd first, and then stop VM and delete node, rollout seems to be ok.

Well, we do hope there is way for CAPI to know the etcd member list as another security check, but currently etcd controller doesn't surface such info.

Another idea is to expose the etcd member list as a node annotation, what do you think @brandond ?

brandond · 2024-11-05T23:01:24Z

It seems that the issue we were facing is due to stop VM before removing node, not etcd member leak. When we remove etcd first, and then stop VM and delete node, rollout seems to be ok.

You should be able to delete the node from the cluster (both the kubernetes node, and the etcd member) even when the node is offline. It may take a bit longer if you're waiting for the etcd controller lease to transition to another node to trigger the cluster member remove and cleanup of the finalizer on the node object... but it should succeed. Can you provide steps to reproduce whatever problem you're running into here?

Another idea is to expose the etcd member list as a node annotation

I'm not sure this makes sense to me. You'd have global etcd cluster state, stored as an annotation on individual node objects? If for some reason you need to check etcd cluster membership, I'd probably recommend just querying that directly from etcd (if you must), or via the supervisor API - which is available anonymously if accessed via localhost:

k3s/pkg/etcd/etcd.go

Lines 701 to 703 in 917761c

    
           ir := r.Path("/db/info").Subrouter() 
        
           ir.Use(auth.IsLocalOrHasRole(e.config, version.Program+":server")) 
        
           ir.Handle("", e.infoHandler())

root@systemd-node-1:/# curl -kS https://127.0.0.1:6443/db/info
{"members":[{"ID":4682922252190157995,"name":"systemd-node-1-103abf0d","peerURLs":["https://172.17.0.4:2380"],"clientURLs":["https://172.17.0.4:2379"]}]}

mogliang · 2024-11-06T02:02:53Z

You should be able to delete the node from the cluster (both the kubernetes node, and the etcd member) even when the node is offline. It may take a bit longer if you're waiting for the etcd controller lease to transition to another node to trigger the cluster member remove and cleanup of the finalizer on the node object... but it should succeed. Can you provide steps to reproduce whatever problem you're running into here?

steps:

create a 2 cp nodes cluster.
stop 1 VM
delete node

problem is: If stop VM first, then cluster will lose quorum directly, and there is no chance for deleting node. That's the reason why we need delete etcd member first before stop VM.
But if there is ghost member exists for some reason, then delete etcd member first may not be safe as well. However, i'm not able to reproduce the condtion which has ghost member now.

I'm not sure this makes sense to me. You'd have global etcd cluster state, stored as an annotation on individual node objects? If for some reason you need to check etcd cluster membership, I'd probably recommend just querying that directly from etcd (if you must), or via the supervisor API - which is available anonymously if accessed via localhost:

Indeed, that seems not quite right to surface etcd cluster level info on node.
Access etcd directly is the solution we take right now for capi k3s provider, see ref k3s-io/cluster-api-k3s#75, but this is also not an ideal way. it requires manage etcd ca cert on mgmt cluster, and creating etcd proxy pod on target cluster, which is too heavy, besides, it bypasses k3s etcd controller to manage etcd cluster, that seems not quite right from design perspective.
Well, if etcd controller can expose etcd membership info, we shall be able to avoid the etcd direct access from capi mgmt cluster.

brandond · 2024-11-06T03:02:54Z

create a 2 cp nodes cluster

You should never have a cluster with 2 etcd nodes. As you noted, loss of either node will cause a total cluster outage, and you'll need to reset the etcd cluster to a single node to recover. This is what the --cluster-reset flag is for.

that's the reason why we need delete etcd member first before stop VM.

You can either delete the node object, or add the pre-delete annotation to the node to remove it from the etcd cluster without deleting the node resource itself. This annotation is how Rancher prepares nodes for deletion.

mogliang · 2024-11-07T02:58:26Z

You should never have a cluster with 2 etcd nodes. As you noted, loss of either node will cause a total cluster outage, and you'll need to reset the etcd cluster to a single node to recover. This is what the --cluster-reset flag is for.

2 cp node cluster is the intermediate state when we test scale down from N cp nodes to 1 by capi.

You can either delete the node object, or add the pre-delete annotation to the node to remove it from the etcd cluster without deleting the node resource itself. This annotation is how Rancher prepares nodes for deletion.

Thanks, we are using etcd.k3s.cattle.io/remove annotation. What else does pre-deletion do than removing etcd member?

brandond · 2024-11-07T03:13:20Z

That's all it does.

If you're scaling down to 1, you might consider just running a cluster-reset instead of stepping through all the intermediate node counts one by one.

github-actions · 2024-12-22T20:04:41Z

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-project-automation bot added this to K3s Development Nov 5, 2024

github-project-automation bot moved this to New in K3s Development Nov 5, 2024

mogliang mentioned this issue Nov 5, 2024

etcd controller detect ghost member and update EtcdIsVoter condition #11232

Draft

caroline-suse-rancher moved this from New to In Triage in K3s Development Nov 14, 2024

github-actions bot added the status/stale label Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd controller shall not ignore ghost etcd member in health check #11231

etcd controller shall not ignore ghost etcd member in health check #11231

mogliang commented Nov 5, 2024 •

edited

Loading

brandond commented Nov 5, 2024

mogliang commented Nov 5, 2024

brandond commented Nov 5, 2024

mogliang commented Nov 6, 2024

brandond commented Nov 6, 2024 •

edited

Loading

mogliang commented Nov 7, 2024

brandond commented Nov 7, 2024

github-actions bot commented Dec 22, 2024

etcd controller shall not ignore ghost etcd member in health check #11231

etcd controller shall not ignore ghost etcd member in health check #11231

Comments

mogliang commented Nov 5, 2024 • edited Loading

brandond commented Nov 5, 2024

mogliang commented Nov 5, 2024

brandond commented Nov 5, 2024

mogliang commented Nov 6, 2024

brandond commented Nov 6, 2024 • edited Loading

mogliang commented Nov 7, 2024

brandond commented Nov 7, 2024

github-actions bot commented Dec 22, 2024

mogliang commented Nov 5, 2024 •

edited

Loading

brandond commented Nov 6, 2024 •

edited

Loading