
Deployment fails when etcd servers are not members of kube_control_plane #11682

Open

jctoussaint opened this issue Nov 2, 2024 · 14 comments · May be fixed by #11789

Labels: kind/bug (Categorizes issue or PR as related to a bug.) · triage/not-reproducible (Indicates an issue can not be reproduced as described.)

Comments

jctoussaint commented Nov 2, 2024

What happened?

The task Gen_certs | Gather node certs fails with this message:

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

Neither on k8s-worker1 nor on k8s-etcd1 do the files node-k8s-worker1.pem and node-k8s-worker1-key.pem exist. (The French tar error above means "cannot stat: No such file or directory".)
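
(A quick manual check on k8s-etcd1 for the files the failing tar command expects; just a sketch using the paths from the error above, not part of the playbook:)

# on k8s-etcd1: confirm whether the worker's node certs were generated
ls -l /etc/ssl/etcd/ssl/node-k8s-worker1.pem /etc/ssl/etcd/ssl/node-k8s-worker1-key.pem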

What did you expect to happen?

The files node-k8s-worker1.pem and node-k8s-worker1-key.pem should have been generated on k8s-etcd1.

How can we reproduce it (as minimally and precisely as possible)?

Use an inventory with 3 dedicated etcd servers that are not members of kube_control_plane.

Deploy with this command:

source ~/ansible-kubespray/bin/activate
cd kubespray
ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

OS

Linux 6.1.0-26-amd64 x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

Version of Ansible

ansible [core 2.16.12]
config file = /home/me/kubespray/ansible.cfg
configured module search path = ['/home/me/kubespray/library']
ansible python module location = /home/me/ansible-kubespray/lib/python3.11/site-packages/ansible
ansible collection location = /home/me/.ansible/collections:/usr/share/ansible/collections
executable location = /home/me/ansible-kubespray/bin/ansible
python version = 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] (/home/me/ansible-kubespray/bin/python3)
jinja version = 3.1.4
libyaml = True

Version of Python

Python 3.11.2

Version of Kubespray (commit)

e5bdb3b

Network plugin used

cilium

Full inventory with variables

[all]
k8s-mst1    ansible_host=192.168.0.11
k8s-mst2    ansible_host=192.168.0.12
k8s-etcd1   ansible_host=192.168.0.21 etcd_member_name=etcd1
k8s-etcd2   ansible_host=192.168.0.22 etcd_member_name=etcd2
k8s-etcd3   ansible_host=192.168.0.23 etcd_member_name=etcd3
k8s-worker1 ansible_host=192.168.0.31
k8s-worker2 ansible_host=192.168.0.32
k8s-worker3 ansible_host=192.168.0.33

[kube_control_plane]
k8s-mst1
k8s-mst2

[etcd]
k8s-etcd1
k8s-etcd2
k8s-etcd3

[kube_node]
k8s-worker1
k8s-worker2
k8s-worker3

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

Output of ansible run

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

Anything else we need to know

I fixed this issue like this:

  1. Create the worker certificates on k8s-etcd1 (a looped version of these calls is sketched after this list):
# on k8s-etcd1
HOSTS=k8s-worker1 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker2 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker3 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
  2. Deploy only etcd (with --tags=etcd):
ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True' --tags=etcd
  3. Restart the deployment without --tags=etcd.
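
(Looped version of the three make-ssl-etcd.sh calls from step 1; just a convenience sketch, equivalent to the per-host commands above:)

# on k8s-etcd1: generate the missing node certs for each worker, one host at a time
for h in k8s-worker1 k8s-worker2 k8s-worker3; do
  HOSTS="$h" /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
done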
jctoussaint added the kind/bug label on Nov 2, 2024

VannTen commented Nov 9, 2024

Is this reproducible with a setup like this?

[kube_control_plane]
node-1

[etcd]
node-1
node-2
node-3

[kube_node]
node-1
node-2
node-3
node-4


(This is the node-etcd-client setup which is tested in CI, so if it does not catch this kind of thing we need to tweak it.)

jctoussaint (Author) commented

I'll test it.

But I think it will work, because node-1 is in both kube_control_plane and etcd.

jctoussaint (Author) commented

It worked on the first try:

PLAY RECAP *****************************************************************************************************************
k8s-test1                  : ok=697  changed=154  unreachable=0    failed=0    skipped=1084 rescued=0    ignored=3   
k8s-test2                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test3                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test4                  : ok=512  changed=104  unreachable=0    failed=0    skipped=669  rescued=0    ignored=1   


VannTen commented Nov 13, 2024 via email


jctoussaint commented Nov 17, 2024

Something like this?

[all]
k8s-test1   ansible_host=192.168.0.31
k8s-test2   ansible_host=192.168.0.32 etcd_member_name=etcd1
k8s-test3   ansible_host=192.168.0.33 etcd_member_name=etcd2
k8s-test4   ansible_host=192.168.0.34 etcd_member_name=etcd3

[kube_control_plane]
k8s-test1

[etcd]
k8s-test2
k8s-test3
k8s-test4

[kube_node]
k8s-test2
k8s-test3
k8s-test4

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr


VannTen commented Nov 17, 2024 via email

jctoussaint (Author) commented

OK, I'll try it.

jctoussaint (Author) commented

(Btw, explicit k8s_cluster is no longer required, it's dynamically defined to the union of control-plane and node)

I tried, but I think there is an issue if k8s_cluster does not exist:

TASK [kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames] *****************************************
fatal: [k8s-test2 -> localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'k8s_cluster'. 'dict object' has no attribute 'k8s_cluster'\n\nThe error appears to be in '/home/me/kubespray/roles/kubespray-defaults/tasks/no_proxy.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Set no_proxy to all assigned cluster IPs and hostnames\n  ^ here\n"}

I'll try restoring the k8s_cluster group.
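
(The group I restore is the same one used in the inventories above:)

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr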

jctoussaint (Author) commented

I think you got it: it fails:

TASK [etcd : Gen_certs | Gather node certs] ********************************************************************************
ok: [k8s-test1 -> k8s-test2(192.168.0.32)]
fatal: [k8s-test3 -> k8s-test2(192.168.0.32)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-test3.pem node-k8s-test3-key.pem | base64 --wrap=0", "delta": "0:00:00.007001", "end": "2024-11-17 13:02:35.550815", "msg": "non-zero return code", "rc": 2, "start": "2024-11-17 13:02:35.543814", "stderr": "tar: node-k8s-test3.pem : stat impossible: Aucun fichier ou dossier de ce type\ntar: node-k8s-test3-key.pem : stat impossible: Aucun fichier ou dossier de ce type\ntar: Arrêt avec code d'échec à cause des erreurs précédentes", "stderr_lines": ["tar: node-k8s-test3.pem : stat impossible: Aucun fichier ou dossier de ce type", "tar: node-k8s-test3-key.pem : stat impossible: Aucun fichier ou dossier de ce type", "tar: Arrêt avec code d'échec à cause des erreurs précédentes"], "stdout": "H4sIAA....AKAAA"]}

... with this inventory:

[all]
k8s-test1   ansible_host=192.168.0.31
k8s-test2   ansible_host=192.168.0.32
k8s-test3   ansible_host=192.168.0.33

[kube_control_plane]
k8s-test1

[etcd]
k8s-test2

[kube_node]
k8s-test3

[all:vars]
network_plugin=calico

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
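
(To double-check outside Ansible, the failing command from the log above can be re-run by hand on the delegated host; a sketch reusing the exact command from the error:)

# on k8s-test2: re-run the command from the failing Gen_certs task manually
set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-test3.pem node-k8s-test3-key.pem | base64 --wrap=0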


VannTen commented Nov 17, 2024 via email

VannTen linked a pull request (#11789) on Dec 12, 2024 that will close this issue

VannTen commented Dec 12, 2024

Hmm, this is a bit weird: I apparently can't reproduce this on master (and I don't see what could have fixed it 🤔)
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/8622090173


VannTen commented Dec 12, 2024

Do you still have the issue if you use that inventory with the latest master?


VannTen commented Dec 12, 2024

(Or latest release-2.26, for that matter; I can't reproduce it at the tip of that branch either 😞)


VannTen commented Dec 22, 2024 via email

k8s-ci-robot added the triage/not-reproducible label on Dec 22, 2024