
Address CNI installation race condition #11747

Open
rptaylor opened this issue Nov 25, 2024 · 8 comments · May be fixed by #11780
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

rptaylor (Contributor) commented Nov 25, 2024

What happened?

The issue is described in #10928; however, it was closed with a workaround that does not really resolve the underlying issue. In particular, that PR does not change the default behaviour, so scale.yml will still deploy broken nodes.

I think there are a few things to sort out to get to the bottom of this.

  1. This says "make sure /opt/cni/bin exists". But if it is only meant to create the directory, it should not have recurse: true. (I vaguely recall that very old versions of Ansible might have needed recurse: true to create parent directories, i.e. /opt/cni in this case, but modern versions of Ansible take care of that automatically with state: directory.) See the sketch after this list.
  2. The bigger question, however, is why Ansible is touching /opt/cni/bin at all. The Calico daemonset uses a hostPath volumeMount of /opt/cni/bin and will create that directory if it is missing: https://github.com/kubernetes-sigs/kubespray/blob/master/roles/network_plugin/calico/templates/calico-node.yml.j2#L448 The comment there says this is "Used to install CNI". But then Ansible extracts a tarball on top of that location afterwards: https://github.com/kubernetes-sigs/kubespray/blob/v2.26.0/roles/network_plugin/cni/tasks/main.yml#L13 That seems messy. Is the CNI installed by the daemonset via hostPath, by Ansible via tarball, or both?
  3. Based on grep hostPath -A 1 -r kubespray/roles/network_plugin | grep /opt/cni/bin, it looks like most or all of the CNI plugins are installed via hostPath, so I'm not sure how specific this is to Calico; it could affect other CNIs too.
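
As a minimal sketch of what item 1 suggests (my own illustration; the owner/mode values are assumptions, not necessarily Kubespray's actual ones): with state: directory, a modern Ansible file task creates parent directories on its own, so recurse: true is not needed just to ensure the directory exists, and dropping it avoids re-chowning binaries that a CNI daemonset may be writing at the same time.

- name: Ensure /opt/cni/bin exists
  ansible.builtin.file:
    path: /opt/cni/bin   # parent /opt/cni is created automatically
    state: directory
    owner: root          # illustrative values, not necessarily Kubespray's
    group: root
    mode: "0755"
    # deliberately no "recurse: true"; that would also reset ownership and
    # permissions on anything already installed inside the directory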

Maybe someone who knows more about the CNI installation could suggest how to solve this.

What did you expect to happen?

#10928 would be fixed.

How can we reproduce it (as minimally and precisely as possible)?

See #10928

OS

AlmaLinux 8

Version of Ansible

N/A

Version of Python

N/A

Version of Kubespray (commit)

v2.22.2

Network plugin used

calico

Full inventory with variables

N/A

Command used to invoke ansible

See #10928

Output of ansible run

See #10928

Anything else we need to know

No response

rptaylor added the kind/bug label Nov 25, 2024
rptaylor (Contributor Author) commented:

I also found cases of nodes that were broken due to the wrong permissions on the calico-ipam binary, separately from the permissions on the calico binary. The error is the same either way. Including this, all 3 nodes that I recently built in a cluster with scale.yml were broken. So it seems pretty serious; the race condition is not necessarily rare at all, possibly depending on the environment somehow.

-rwsr-xr-x. 1 kube root 59136424 Nov 21 13:23 /opt/cni/bin/calico-ipam
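
(The "rws" in the mode above is a setuid bit, which presumably should not be on a CNI binary, and the owner is kube rather than root.) As a one-off remediation sketch for already-broken nodes (my own, not a Kubespray task; the target owner and mode are assumptions about what a healthy node looks like):

- name: Reset permissions on Calico CNI binaries (remediation sketch)
  ansible.builtin.file:
    path: "/opt/cni/bin/{{ item }}"
    owner: root     # assumed correct owner; the broken node shows "kube"
    group: root
    mode: "0755"    # clears the stray setuid bit shown above
  loop:
    - calico
    - calico-ipam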

rptaylor (Contributor Author) commented Nov 26, 2024

I thought this might have been caused by #10407, but actually we experienced this problem in Kubespray v2.22.2, which does not include that change.

rptaylor changed the title from "Address CNI race condition in scale.yml" to "Address CNI race condition" Nov 26, 2024
rptaylor changed the title from "Address CNI race condition" to "Address CNI installation race condition" Nov 26, 2024
VannTen (Contributor) commented Nov 29, 2024

So we have multiple things:

  1. The recurse is wrong, and the whole owner handling seems pretty dubious when looking at the git history.
  2. Who should be responsible for installing the cni-plugins? -> IMO, as the CNI plugins are shared between network plugins (AFAIU), this should be Kubespray's responsibility. I assume, though, that our Calico manifests are pulled from upstream (== Calico), so maybe we should see what they think. -> The Calico CNI is not the same thing, and the same goes for other network plugins' CNIs (but not all? 🤔)

VannTen (Contributor) commented Dec 5, 2024

Pretty short on bandwidth right now, but I think we should go towards installing the CNI plugins in Kubespray, not via Calico.

 initContainers:
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: UPDATE_CNI_BINARIES
          value: "true"
        - name: SLEEP
          value: "false"
        - name: ETCD_ENDPOINTS
          valueFrom:
            configMapKeyRef:
              key: etcd_endpoints
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: quay.io/calico/cni:v3.27.4
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir

This feels wrong somehow 🤔
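
For context, here is roughly what the "Kubespray installs the CNI plugins" direction could look like on the Ansible side. This is only my illustration, loosely modeled on the existing roles/network_plugin/cni tasks; the variable names are assumptions, not necessarily Kubespray's real ones.

- name: Ensure CNI bin directory exists
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    mode: "0755"

- name: Install reference CNI plugins from the upstream tarball
  ansible.builtin.unarchive:
    # variable names below are illustrative assumptions
    src: "{{ local_release_dir }}/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
    dest: /opt/cni/bin
    remote_src: true
    owner: root
    group: root
    mode: "0755"

If I read the env vars in the snippet above correctly, the Calico install initContainer could then presumably run with UPDATE_CNI_BINARIES set to "false", so it only writes the conflist and never touches the binaries, removing the overlap between the daemonset and Ansible.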

VannTen (Contributor) commented Dec 5, 2024

Potential pitfall with the "Kubespray installs the CNI" solution:

  • version mismatch? (sketched below)
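
If we wanted to surface that early, a hypothetical guard could assert compatibility at deploy time (my own sketch; calico_min_cni_version is a made-up variable for illustration):

- name: Fail fast on CNI plugin version mismatch
  ansible.builtin.assert:
    that:
      - cni_version is version(calico_min_cni_version, ">=")
    fail_msg: >-
      cni-plugins {{ cni_version }} is older than the minimum
      {{ calico_min_cni_version }} this Calico release expects.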

rptaylor (Contributor Author) commented Dec 5, 2024

I agree that a daemonset with a hostPath mount to install binaries is a bit odd.
The Calico-recommended way is to use the operator: https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises

Actually, the alternative method in the Calico documentation (the YAML manifest) has that same daemonset.

You can also use Helm: https://docs.tigera.io/calico/latest/getting-started/kubernetes/helm
Personally I think that way sounds best. There is already a helm role in Kubespray.
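
For illustration, driving the Helm route from a playbook could look roughly like this (a sketch assuming the kubernetes.core collection is available; the chart version and namespace are placeholders, with v3.27.4 matching the image tag in the daemonset snippet above):

- name: Install Calico via the tigera-operator Helm chart (sketch)
  kubernetes.core.helm:
    name: calico
    chart_ref: tigera-operator
    chart_repo_url: https://docs.tigera.io/calico/charts
    chart_version: v3.27.4      # placeholder; pin to the desired Calico release
    release_namespace: tigera-operator
    create_namespace: true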

VannTen (Contributor) commented Dec 6, 2024 via email

VannTen linked pull request #11780 on Dec 10, 2024 that will close this issue