
Deployment fails when etcd servers are not members of kube_control_plane #11682

Open

jctoussaint opened this issue Nov 2, 2024 · 14 comments · May be fixed by #11789

Labels: kind/bug (Categorizes issue or PR as related to a bug.) · triage/not-reproducible (Indicates an issue can not be reproduced as described.)

Comments

jctoussaint commented Nov 2, 2024

What happened?

The task Gen_certs | Gather node certs fails with this message:

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

Neither on k8s-worker1 nor on k8s-etcd1 do the files node-k8s-worker1.pem and node-k8s-worker1-key.pem exist. (The French tar error above means "cannot stat: No such file or directory".)
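
(A quick manual check on k8s-etcd1 for the files the failing tar command expects; just a sketch using the paths from the error above, not part of the playbook:)

# on k8s-etcd1: confirm whether the worker's node certs were generated
ls -l /etc/ssl/etcd/ssl/node-k8s-worker1.pem /etc/ssl/etcd/ssl/node-k8s-worker1-key.pem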

What did you expect to happen?

The files node-k8s-worker1.pem and node-k8s-worker1-key.pem should have been generated on k8s-etcd1.

How can we reproduce it (as minimally and precisely as possible)?

Use an inventory with 3 dedicated etcd servers that are not members of kube_control_plane.

Deploy with this command:

source ~/ansible-kubespray/bin/activate
cd kubespray
ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

OS

Linux 6.1.0-26-amd64 x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

Version of Ansible

ansible [core 2.16.12]
config file = /home/me/kubespray/ansible.cfg
configured module search path = ['/home/me/kubespray/library']
ansible python module location = /home/me/ansible-kubespray/lib/python3.11/site-packages/ansible
ansible collection location = /home/me/.ansible/collections:/usr/share/ansible/collections
executable location = /home/me/ansible-kubespray/bin/ansible
python version = 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] (/home/me/ansible-kubespray/bin/python3)
jinja version = 3.1.4
libyaml = True

Version of Python

Python 3.11.2

Version of Kubespray (commit)

e5bdb3b

Network plugin used

cilium

Full inventory with variables

[all]
k8s-mst1    ansible_host=192.168.0.11
k8s-mst2    ansible_host=192.168.0.12
k8s-etcd1   ansible_host=192.168.0.21 etcd_member_name=etcd1
k8s-etcd2   ansible_host=192.168.0.22 etcd_member_name=etcd2
k8s-etcd3   ansible_host=192.168.0.23 etcd_member_name=etcd3
k8s-worker1 ansible_host=192.168.0.31
k8s-worker2 ansible_host=192.168.0.32
k8s-worker3 ansible_host=192.168.0.33

[kube_control_plane]
k8s-mst1
k8s-mst2

[etcd]
k8s-etcd1
k8s-etcd2
k8s-etcd3

[kube_node]
k8s-worker1
k8s-worker2
k8s-worker3

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

Output of ansible run

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

Anything else we need to know

I fixed this issue like this:

  1. Create the worker certificates on k8s-etcd1 (a looped version of these calls is sketched after this list):
# on k8s-etcd1
HOSTS=k8s-worker1 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker2 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker3 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
  2. Deploy only etcd (with --tags=etcd):
ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True' --tags=etcd
  3. Restart the deployment without --tags=etcd.
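
(Looped version of the three make-ssl-etcd.sh calls from step 1; just a convenience sketch, equivalent to the per-host commands above:)

# on k8s-etcd1: generate the missing node certs for each worker, one host at a time
for h in k8s-worker1 k8s-worker2 k8s-worker3; do
  HOSTS="$h" /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
done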
jctoussaint added the kind/bug label on Nov 2, 2024

VannTen commented Nov 9, 2024

Is this reproducible with a setup like this?

[kube_control_plane]
node-1

[etcd]
node-1
node-2
node-3

[kube_node]
node-1
node-2
node-3
node-4


(This is the node-etcd-client setup which is tested in CI, so if it does not catch this kind of thing we need to tweak it.)

jctoussaint (Author) commented

I'll test it.

But I think it will work, because node-1 is in both kube_control_plane and etcd.

jctoussaint (Author) commented

It worked on the first try:

PLAY RECAP *****************************************************************************************************************
k8s-test1                  : ok=697  changed=154  unreachable=0    failed=0    skipped=1084 rescued=0    ignored=3   
k8s-test2                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test3                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test4                  : ok=512  changed=104  unreachable=0    failed=0    skipped=669  rescued=0    ignored=1   


VannTen commented Nov 13, 2024 via email


jctoussaint commented Nov 17, 2024

Something like this?

[all]
k8s-test1   ansible_host=192.168.0.31
k8s-test2   ansible_host=192.168.0.32 etcd_member_name=etcd1
k8s-test3   ansible_host=192.168.0.33 etcd_member_name=etcd2
k8s-test4   ansible_host=192.168.0.34 etcd_member_name=etcd3

[kube_control_plane]
k8s-test1

[etcd]
k8s-test2
k8s-test3
k8s-test4

[kube_node]
k8s-test2
k8s-test3
k8s-test4

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr


VannTen commented Nov 17, 2024 via email

jctoussaint (Author) commented

OK, I'll try it.

jctoussaint (Author) commented

(Btw, explicit k8s_cluster is no longer required, it's dynamically defined to the union of control-plane and node)

I tried, but I think there is an issue if k8s_cluster does not exist:

TASK [kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames] *****************************************
fatal: [k8s-test2 -> localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'k8s_cluster'. 'dict object' has no attribute 'k8s_cluster'\n\nThe error appears to be in '/home/me/kubespray/roles/kubespray-defaults/tasks/no_proxy.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Set no_proxy to all assigned cluster IPs and hostnames\n  ^ here\n"}

I'll try restoring the k8s_cluster group.
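
(The group I restore is the same one used in the inventories above:)

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr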

jctoussaint (Author) commented

I think you got it: it fails:

TASK [etcd : Gen_certs | Gather node certs] ********************************************************************************
ok: [k8s-test1 -> k8s-test2(192.168.0.32)]
fatal: [k8s-test3 -> k8s-test2(192.168.0.32)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-test3.pem node-k8s-test3-key.pem | base64 --wrap=0", "delta": "0:00:00.007001", "end": "2024-11-17 13:02:35.550815", "msg": "non-zero return code", "rc": 2, "start": "2024-11-17 13:02:35.543814", "stderr": "tar: node-k8s-test3.pem : stat impossible: Aucun fichier ou dossier de ce type\ntar: node-k8s-test3-key.pem : stat impossible: Aucun fichier ou dossier de ce type\ntar: Arrêt avec code d'échec à cause des erreurs précédentes", "stderr_lines": ["tar: node-k8s-test3.pem : stat impossible: Aucun fichier ou dossier de ce type", "tar: node-k8s-test3-key.pem : stat impossible: Aucun fichier ou dossier de ce type", "tar: Arrêt avec code d'échec à cause des erreurs précédentes"], "stdout": "H4sIAA....AKAAA"]}

... with this inventory:

[all]
k8s-test1   ansible_host=192.168.0.31
k8s-test2   ansible_host=192.168.0.32
k8s-test3   ansible_host=192.168.0.33

[kube_control_plane]
k8s-test1

[etcd]
k8s-test2

[kube_node]
k8s-test3

[all:vars]
network_plugin=calico

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
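
(To double-check outside Ansible, the failing command from the log above can be re-run by hand on the delegated host; a sketch reusing the exact command from the error:)

# on k8s-test2: re-run the command from the failing Gen_certs task manually
set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-test3.pem node-k8s-test3-key.pem | base64 --wrap=0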


VannTen commented Nov 17, 2024 via email

VannTen linked a pull request (#11789) on Dec 12, 2024 that will close this issue

VannTen commented Dec 12, 2024

Hmm, this is a bit weird: I apparently can't reproduce this on master (and I don't see what could have fixed it 🤔)
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/8622090173


VannTen commented Dec 12, 2024

Do you still have the issue if you use that inventory with the latest master?


VannTen commented Dec 12, 2024

(Or latest release-2.26, for that matter; I can't reproduce it at the tip of that branch either 😞)


VannTen commented Dec 22, 2024 via email

k8s-ci-robot added the triage/not-reproducible label on Dec 22, 2024