
Address CNI installation race condition #11747

Open
rptaylor opened this issue Nov 25, 2024 · 8 comments · May be fixed by #11780
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

rptaylor (Contributor) commented Nov 25, 2024

What happened?

The issue is described in #10928; however, it was closed with a workaround that does not really resolve the underlying issue. In particular, that PR does not change the default behaviour, so scale.yml will still deploy broken nodes.

I think there are a few things to sort out to get to the bottom of this.

  1. This says "make sure /opt/cni/bin exists". But if it is only meant to create the directory, it should not have recurse: true. (I vaguely recall that very old versions of Ansible might have needed recurse: true to create parent directories, i.e. /opt/cni in this case, but modern versions of Ansible take care of that automatically with state: directory.) See the sketch after this list.
  2. The bigger question, however, is why Ansible is touching /opt/cni/bin at all. The Calico daemonset uses a hostPath volumeMount of /opt/cni/bin and will create that directory if it is missing: https://github.com/kubernetes-sigs/kubespray/blob/master/roles/network_plugin/calico/templates/calico-node.yml.j2#L448 The comment there says this is "Used to install CNI". But then Ansible extracts a tarball on top of that location afterwards: https://github.com/kubernetes-sigs/kubespray/blob/v2.26.0/roles/network_plugin/cni/tasks/main.yml#L13 That seems messy. Is the CNI installed by the daemonset via hostPath, by Ansible via tarball, or both?
  3. Based on grep hostPath -A 1 -r kubespray/roles/network_plugin | grep /opt/cni/bin, it looks like most or all of the CNI plugins are installed via hostPath, so I'm not sure how specific this is to Calico; it could affect other CNIs too.
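
As a minimal sketch of what item 1 suggests (my own illustration; the owner/mode values are assumptions, not necessarily Kubespray's actual ones): with state: directory, a modern Ansible file task creates parent directories on its own, so recurse: true is not needed just to ensure the directory exists, and dropping it avoids re-chowning binaries that a CNI daemonset may be writing at the same time.

- name: Ensure /opt/cni/bin exists
  ansible.builtin.file:
    path: /opt/cni/bin   # parent /opt/cni is created automatically
    state: directory
    owner: root          # illustrative values, not necessarily Kubespray's
    group: root
    mode: "0755"
    # deliberately no "recurse: true"; that would also reset ownership and
    # permissions on anything already installed inside the directory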

Maybe someone who knows more about the CNI installation could suggest how to solve this.

What did you expect to happen?

#10928 would be fixed.

How can we reproduce it (as minimally and precisely as possible)?

See #10928

OS

AlmaLinux 8

Version of Ansible

N/A

Version of Python

N/A

Version of Kubespray (commit)

v2.22.2

Network plugin used

calico

Full inventory with variables

N/A

Command used to invoke ansible

See #10928

Output of ansible run

See #10928

Anything else we need to know

No response

rptaylor added the kind/bug label Nov 25, 2024
rptaylor (Contributor Author) commented:

I also found cases of nodes that were broken due to the wrong permissions on the calico-ipam binary, separately from the permissions on the calico binary. The error is the same either way. Including this, all 3 nodes that I recently built in a cluster with scale.yml were broken. So it seems pretty serious; the race condition is not necessarily rare at all, possibly depending on the environment somehow.

-rwsr-xr-x. 1 kube root 59136424 Nov 21 13:23 /opt/cni/bin/calico-ipam
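
(The "rws" in the mode above is a setuid bit, which presumably should not be on a CNI binary, and the owner is kube rather than root.) As a one-off remediation sketch for already-broken nodes (my own, not a Kubespray task; the target owner and mode are assumptions about what a healthy node looks like):

- name: Reset permissions on Calico CNI binaries (remediation sketch)
  ansible.builtin.file:
    path: "/opt/cni/bin/{{ item }}"
    owner: root     # assumed correct owner; the broken node shows "kube"
    group: root
    mode: "0755"    # clears the stray setuid bit shown above
  loop:
    - calico
    - calico-ipam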

rptaylor (Contributor Author) commented Nov 26, 2024

I thought this might have been caused by #10407, but actually we experienced this problem in Kubespray v2.22.2, which does not include that change.

rptaylor changed the title from "Address CNI race condition in scale.yml" to "Address CNI race condition" Nov 26, 2024
rptaylor changed the title from "Address CNI race condition" to "Address CNI installation race condition" Nov 26, 2024
VannTen (Contributor) commented Nov 29, 2024

So we have multiple things:

  1. The recurse is wrong, and the whole owner handling seems pretty dubious when looking at the git history.
  2. Who should be responsible for installing the cni-plugins? -> IMO, as the CNI plugins are shared between network plugins (AFAIU), this should be Kubespray's responsibility. I assume, though, that our Calico manifests are pulled from upstream (== Calico), so maybe we should see what they think. -> The Calico CNI is not the same thing, and the same goes for other network plugins' CNIs (but not all? 🤔)

VannTen (Contributor) commented Dec 5, 2024

Pretty short on bandwidth right now, but I think we should go towards installing the CNI plugins in Kubespray, not via Calico.

 initContainers:
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: UPDATE_CNI_BINARIES
          value: "true"
        - name: SLEEP
          value: "false"
        - name: ETCD_ENDPOINTS
          valueFrom:
            configMapKeyRef:
              key: etcd_endpoints
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: quay.io/calico/cni:v3.27.4
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir

This feels wrong somehow 🤔
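
For context, here is roughly what the "Kubespray installs the CNI plugins" direction could look like on the Ansible side. This is only my illustration, loosely modeled on the existing roles/network_plugin/cni tasks; the variable names are assumptions, not necessarily Kubespray's real ones.

- name: Ensure CNI bin directory exists
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    mode: "0755"

- name: Install reference CNI plugins from the upstream tarball
  ansible.builtin.unarchive:
    # variable names below are illustrative assumptions
    src: "{{ local_release_dir }}/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
    dest: /opt/cni/bin
    remote_src: true
    owner: root
    group: root
    mode: "0755"

If I read the env vars in the snippet above correctly, the Calico install initContainer could then presumably run with UPDATE_CNI_BINARIES set to "false", so it only writes the conflist and never touches the binaries, removing the overlap between the daemonset and Ansible.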

VannTen (Contributor) commented Dec 5, 2024

Potential pitfall with the "Kubespray installs the CNI" solution:

  • version mismatch? (sketched below)
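
If we wanted to surface that early, a hypothetical guard could assert compatibility at deploy time (my own sketch; calico_min_cni_version is a made-up variable for illustration):

- name: Fail fast on CNI plugin version mismatch
  ansible.builtin.assert:
    that:
      - cni_version is version(calico_min_cni_version, ">=")
    fail_msg: >-
      cni-plugins {{ cni_version }} is older than the minimum
      {{ calico_min_cni_version }} this Calico release expects.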

rptaylor (Contributor Author) commented Dec 5, 2024

I agree that a daemonset with a hostPath mount to install binaries is a bit odd.
The Calico-recommended way is to use the operator: https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises

Actually, the alternative method in the Calico documentation (the YAML manifest) has that same daemonset.

You can also use Helm: https://docs.tigera.io/calico/latest/getting-started/kubernetes/helm
Personally I think that way sounds best. There is already a helm role in Kubespray.
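
For illustration, driving the Helm route from a playbook could look roughly like this (a sketch assuming the kubernetes.core collection is available; the chart version and namespace are placeholders, with v3.27.4 matching the image tag in the daemonset snippet above):

- name: Install Calico via the tigera-operator Helm chart (sketch)
  kubernetes.core.helm:
    name: calico
    chart_ref: tigera-operator
    chart_repo_url: https://docs.tigera.io/calico/charts
    chart_version: v3.27.4      # placeholder; pin to the desired Calico release
    release_namespace: tigera-operator
    create_namespace: true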

VannTen (Contributor) commented Dec 6, 2024 via email

VannTen linked pull request #11780 on Dec 10, 2024 that will close this issue