Address CNI installation race condition #11747
Comments
I also found cases of nodes that were broken due to wrong permissions on the calico-ipam binary, separately from the permissions on the calico binary. The error is the same either way. Counting this, all 3 nodes that I recently built in a cluster with scale.yml were broken. So it seems pretty serious; the race condition is not necessarily rare, possibly depending on the environment.
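To make the symptom concrete, here is a small sketch of how one might spot affected plugin binaries. On a real node you would inspect `/opt/cni/bin` directly; here a temp directory stands in so the snippet is runnable anywhere, and the broken permissions are simulated:

```shell
# Sketch: detect CNI plugin binaries that lost their execute bit.
# A temp dir stands in for /opt/cni/bin; the 0644 mode simulates the
# broken state described above.
set -eu
bindir="$(mktemp -d)"                               # stand-in for /opt/cni/bin
printf 'x' > "$bindir/calico";      chmod 0755 "$bindir/calico"       # healthy
printf 'x' > "$bindir/calico-ipam"; chmod 0644 "$bindir/calico-ipam"  # broken

# List plugin binaries missing any execute bit (empty output = healthy):
find "$bindir" -maxdepth 1 -type f ! -perm -111
```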
---
I thought this might have been caused by #10407, but actually we experienced this problem in Kubespray v2.22.2, which does not have that change.

---
So we have multiple things:
---
Pretty short on bandwidth right now, but I think we should go towards installing the CNI plugins in Kubespray, not by Calico:

```yaml
initContainers:
  - command:
      - /opt/cni/bin/install
    env:
      - name: CNI_NETWORK_CONFIG
        valueFrom:
          configMapKeyRef:
            key: cni_network_config
            name: calico-config
      - name: CNI_CONF_NAME
        value: 10-calico.conflist
      - name: UPDATE_CNI_BINARIES
        value: "true"
      - name: SLEEP
        value: "false"
      - name: ETCD_ENDPOINTS
        valueFrom:
          configMapKeyRef:
            key: etcd_endpoints
            name: calico-config
    envFrom:
      - configMapRef:
          name: kubernetes-services-endpoint
          optional: true
    image: quay.io/calico/cni:v3.27.4
    imagePullPolicy: IfNotPresent
    name: install-cni
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
      - mountPath: /host/etc/cni/net.d
        name: cni-net-dir
      - mountPath: /host/opt/cni/bin
        name: cni-bin-dir
```

This feels wrong somehow 🤔
Potential pitfall with the "kubespray install the cni" solution:
---
I agree a DaemonSet with hostPath to install binaries is a bit odd. But the alternative method in the Calico documentation (the YAML manifest) has that same daemonset. You can also use Helm: https://docs.tigera.io/calico/latest/getting-started/kubernetes/helm
As much as I'd like to switch to Helm or anything else that would spare us from maintaining manifests in Kubespray directly, I don't think it's happening right now
(the helm role has a bad interface (I wrote it :( ) which is not very well suited to Kubespray, using role parameters instead of the inventory).
On second thought, the path of least pain for fixing this quickly might be to follow upstream and let the calico daemonset install its CNI plugin.
While it still **is** odd, it's not the end of the world either.
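If the daemonset becomes the only writer to `/opt/cni/bin`, the Ansible side would need to stop extracting the generic plugins tarball for Calico. A rough sketch of what that gating could look like (the task layout and `cni_tarball_path` variable are assumptions for illustration, not actual Kubespray code; `kube_network_plugin` is the existing inventory variable):

```yaml
# Hypothetical Ansible sketch: only extract the generic CNI plugins
# tarball when the network plugin does not ship its own installer, so
# the calico daemonset remains the single writer to /opt/cni/bin.
# "cni_tarball_path" is an assumed variable name.
- name: Extract CNI plugins tarball
  unarchive:
    src: "{{ cni_tarball_path }}"
    dest: /opt/cni/bin
    mode: "0755"
    owner: root
    remote_src: true
  when: kube_network_plugin != 'calico'
```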
---
What happened?
The issue is described in #10928; however, it was closed with a workaround that does not really resolve the underlying issue. In particular, that PR does not change the default behaviour, so scale.yml will still deploy broken nodes.
I think there are a few things to sort out to get to the bottom of this.
- It is not clear why the directory creation task uses `recurse: true`. (I vaguely recall very old versions of Ansible might have needed `recurse: true` to create parent directories, i.e. `/opt/cni` in this case? But modern versions of Ansible take care of that automatically when `state: directory`.)
- It is not clear why Kubespray creates `/opt/cni/bin` at all. The Calico daemonset uses a hostPath volumeMount of /opt/cni/bin, and will create that directory if it is missing: https://github.com/kubernetes-sigs/kubespray/blob/master/roles/network_plugin/calico/templates/calico-node.yml.j2#L448 The comment says this is "Used to install CNI". But then Ansible extracts a tarball on top of that location afterwards: https://github.com/kubernetes-sigs/kubespray/blob/v2.26.0/roles/network_plugin/cni/tasks/main.yml#L13 Seems messy. Is the CNI installed by the daemonset via hostPath, or by the tarball via Ansible, or both?
- From `grep hostPath -A 1 -r kubespray/roles/network_plugin | grep /opt/cni/bin` it looks like most or all of the CNI plugins are installed via hostPath, so I'm not sure how specific this is to Calico, or whether it could affect other CNIs too.

Maybe someone who knows more about the CNI installation could help indicate how to solve this.
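The "two writers to the same directory" problem above can be sketched in a few lines of shell. This is only a simulation of the suspected mechanism (a temp dir stands in for `/opt/cni/bin`, and the 0644 mode is an assumed stand-in for whatever permissions the second writer applies): whichever writer runs last determines the final file mode, so a correctly installed 0755 binary can end up non-executable.

```shell
# Sketch of the suspected race: writer 1 installs a plugin binary with
# mode 0755; writer 2 later overwrites it with non-executable
# permissions. Last writer wins.
set -eu
bindir="$(mktemp -d)"                 # stand-in for /opt/cni/bin

# Writer 1: e.g. the calico install-cni container, copying with 0755.
printf 'calico' > "$bindir/calico"
chmod 0755 "$bindir/calico"

# Writer 2: e.g. a later tarball extraction recreating the same file
# with different permissions (0644 here simulates the broken mode).
printf 'calico' > "$bindir/calico.tmp"
chmod 0644 "$bindir/calico.tmp"
mv "$bindir/calico.tmp" "$bindir/calico"

[ -x "$bindir/calico" ] || echo "calico lost its execute bit"
```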
What did you expect to happen?
#10928 would be fixed.
How can we reproduce it (as minimally and precisely as possible)?
See #10928
OS
Almalinux 8
Version of Ansible
N/A
Version of Python
N/A
Version of Kubespray (commit)
v2.22.2
Network plugin used
calico
Full inventory with variables
N/A
Command used to invoke ansible
See #10928
Output of ansible run
See #10928
Anything else we need to know
No response