Cilium 1.16.5 breaks external DNS resolution with forwardKubeDNSToHost enabled #10002

Open
kenlasko opened this issue Dec 20, 2024 · 4 comments

kenlasko commented Dec 20, 2024

Bug Report

After upgrading my two Talos clusters to Cilium 1.16.5, I immediately started having external DNS resolution issues on one cluster. CoreDNS started throwing these errors, and things quickly started going sideways:

[INFO] 10.244.0.38:41485 - 15314 "A IN hooks.slack.com. udp 33 false 512" - - 0 2.000258485s
[ERROR] plugin/errors: 2 hooks.slack.com. A: read udp 10.244.0.27:54684->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.38:41485 - 48588 "AAAA IN hooks.slack.com. udp 33 false 512" - - 0 2.000263845s
[ERROR] plugin/errors: 2 hooks.slack.com. AAAA: read udp 10.244.0.27:34832->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.38:41485 - 15314 "A IN hooks.slack.com. udp 33 false 512" - - 0 2.001438855s 

Reverting to 1.16.4 made the problem go away. I posted this on the Cilium issues board as #36737, where other Talos users started chiming in with similar stories.

sfackler noted:

The Talos dns-resolve-cache logs show that it is receiving the requests and resolving them successfully, so it seems like the response just isn't making it back to the CoreDNS pod.

I did some digging around the Talos DNS docs and noticed that the cluster with issues was created with Talos 1.8.0 or higher, while the other one was created long before 1.8.0. As such, forwardKubeDNSToHost was enabled by default on the problem cluster, while the other cluster does not have it enabled.

I patched the problem cluster with:

machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false

After restarting CoreDNS, the problem immediately went away.

Since forwardKubeDNSToHost is now enabled by default, I suspect others may come across this issue, so it's probably best to get to the bottom of it. I'm unsure whether it's a Talos problem or a Cilium problem.

Environment

  • Talos version: 1.9.0
  • Kubernetes version: 1.32.0
  • Platform: ARM64 and AMD64
@smira
Member

smira commented Dec 20, 2024

This is certainly a Cilium issue: Cilium decides not to deliver a packet which is perfectly valid.

@kenlasko
Author

As per cilium/cilium#36737 (comment), Cilium now uses BPF Host Routing in 1.16.5, which conflicts with forwardKubeDNSToHost in Talos. Setting bpf.hostLegacyRouting=true in your Cilium values.yaml reverts to the behaviour of 1.16.4 and earlier. This eliminates the need to disable forwardKubeDNSToHost in Talos.
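For anyone applying this through Helm, a minimal sketch of the values.yaml fragment is below. It assumes Cilium is managed with the official Helm chart and that all other values stay as they are; only the bpf.hostLegacyRouting key comes from the linked comment.

```yaml
# values.yaml (fragment) for the Cilium Helm chart.
# Assumption: the rest of your existing values are unchanged.
bpf:
  # Route through the host network stack instead of BPF host routing,
  # matching the behaviour seen with 1.16.4 and earlier in this setup.
  hostLegacyRouting: true
```

The Cilium agent pods most likely need to be restarted after the change for it to take effect.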

Not sure who's really at fault here or what should be done next.

@smira
Member

smira commented Dec 24, 2024

So once again, as with many similar issues reported before, the problem only occurs with a non-default Cilium setup.

First of all, even the latest Cilium CLI defaults to Cilium v1.16.4.

Second, with more or less defaults:

cilium install \
    --set=ipam.mode=kubernetes \
    --set=kubeProxyReplacement=true \
    --set=securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set=securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set=cgroup.autoMount.enabled=false \
    --set=cgroup.hostRoot=/sys/fs/cgroup \
    --set=k8sServiceHost=localhost \
    --set=k8sServicePort=7445 \
    --version=v1.16.5

The issue isn't there.

One way to trigger it is to enable additional non-default Cilium settings; the one I found is --set=bpf.masquerade=true.
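For reproducing this with a values file rather than --set flags, the install command above translates roughly to the following values.yaml sketch, with the triggering setting added at the end. It assumes a Helm-style values file; the Cilium version (v1.16.5 here) is still passed separately.

```yaml
# values.yaml sketch: the --set flags from the command above, plus the
# non-default setting reported to trigger the issue.
ipam:
  mode: kubernetes
kubeProxyReplacement: true
securityContext:
  capabilities:
    ciliumAgent: [CHOWN, KILL, NET_ADMIN, NET_RAW, IPC_LOCK, SYS_ADMIN, SYS_RESOURCE, DAC_OVERRIDE, FOWNER, SETGID, SETUID]
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
k8sServiceHost: localhost
k8sServicePort: 7445
bpf:
  masquerade: true  # non-default; without this line the issue did not appear
```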

So please specify your configuration when reporting issues.

Third, this is not Talos-specific, e.g. cilium/cilium#36761.

If there's something we could help Cilium with, we would be happy to, but the Talos setup is perfectly valid.

@PhilipSchmid

PhilipSchmid commented Jan 6, 2025

@smira I guess we could close this issue, no?

IMO, it's a known incompatibility between two product-specific optimizations (Cilium BPF routing and Talos' Host DNS), and a workaround is known. It will also soon be documented in the Cilium docs: cilium/cilium#36852
