Caution: (almost) lost all control plane nodes during a nightly OS update due to SELinux-related boot failures
#1583
-
Yes, I have the same feeling. We had similar problems after automated upgrades on two clusters (test & production) some time ago, so we disabled MicroOS upgrades completely and now do them semi-automatically from time to time. But of course that's neither how it should be, nor what the README recommends ;). Would it be possible to use the "Immutable Server Release" instead of Tumbleweed? Although SUSE says Tumbleweed contains the latest tested stable software, it obviously doesn't have the quality level required for production use.
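For reference, a minimal sketch of the semi-automatic approach, assuming a stock MicroOS node where the nightly updates are driven by `transactional-update.timer` (kube-hetzner may expose its own toggle for this, so check the module variables first):

```bash
# Stop the automatic nightly updates (assumes the stock MicroOS timer).
systemctl disable --now transactional-update.timer

# Later, when convenient, upgrade manually in a new snapshot and reboot.
transactional-update dup
systemctl reboot
```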
-
I think it's not so easy to switch to arbitrary ISOs/OSes, because a lot of kube-hetzner internals probably depend on MicroOS specifics. And I have to admit that after some research it looks like there is no such thing as an "Immutable Server Release". There is a link of sorts on the MicroOS portal page, but in fact I could not find any real code for this distribution. What does exist is a "Slowroll" release, which sounds promising but is still experimental. At the moment it can't be used as a replacement for kube-hetzner image snapshot generation, because it lacks a qcow2 artifact and a "ContainerHost" build, which is what kube-hetzner currently uses from the Tumbleweed release. But there is a recipe for switching a running Tumbleweed node over to the "Slowroll" release. I might try that.
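As far as I can tell, that recipe boils down to pointing the existing repositories at Slowroll and then doing a distribution upgrade. The snippet below is only a rough sketch of that idea, not a verified procedure: repository names/URLs differ between nodes and should be checked first, and on MicroOS the upgrade itself would have to run inside `transactional-update` rather than directly via `zypper`.

```bash
# Rough sketch: repoint existing Tumbleweed repos at Slowroll, then dist-upgrade.
for repo in /etc/zypp/repos.d/*.repo; do
    sed -i -e 's/tumbleweed/slowroll/gI' "$repo"   # adjust repo URLs/aliases
done
zypper --gpg-auto-import-keys refresh   # fetch metadata from the new repos
zypper dup                              # full distribution upgrade onto Slowroll
```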
-
This post serves as a post-mortem of today’s events. I came dangerously close to losing all control plane nodes and had, in fact, already resigned myself to the loss at one point.
I’m unsure if this issue might also affect others here, but I suspect it could, as I didn’t apply any custom configurations to my control plane nodes.
Background
In line with the module's configuration, `kured` was running to apply updates to nodes and reboot them as necessary. During the night, `kured` updated the nodes and initiated reboots. However, it became evident that the nodes were failing to boot properly, getting stuck in "emergency mode" (as observed via the Hetzner console). This issue affected 2 out of 3 control nodes, leaving the cluster headless: all services were down, `kubectl` access was unavailable due to the API being offline, and the recoverability of etcd was uncertain at that time.
Boot issue
Because the boot log scrolls by too quickly and there is no way to scroll back, I first had to record a video of the console to see what was causing the failure.
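As a less painful alternative to filming the console, the previous boot's log can usually be read afterwards from the rescue system by pointing `journalctl` at the node's persistent journal. A sketch, assuming persistent journaling is enabled and the relevant filesystems are mounted under `/mnt` (the device name is illustrative, and on MicroOS `/var` typically lives on a separate btrfs subvolume that may need mounting as well):

```bash
# From the Hetzner rescue system (device name is illustrative).
mount /dev/sda3 /mnt                                 # mount the node's root fs
journalctl -D /mnt/var/log/journal --list-boots      # find the failed boot
journalctl -D /mnt/var/log/journal -b -1 -p err      # errors from the previous boot
```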
Solution
I booted into the rescue system, chrooted into the affected nodes, and started investigating the issue. It turned out that SELinux was attempting relabeling actions, which seemed to be causing the problem. After setting `SELINUX=disabled` in `/etc/selinux/config` and rebooting, the nodes booted normally again.
Fortunately, I was able to restart the `k3s` server, and everything recovered successfully. I was also lucky not to encounter any volume disruptions or other problems related to HA consensus, even though two worker nodes were affected as well.
I'm still a bit shaken by how something like this could happen. I assume `SELINUX=enforcing` had been set all along without causing any issues, yet a routine OS update resulted in a persistent boot failure. This feels quite alarming to me.
Since I haven't identified the root cause yet, I currently suspect the issue is related to the MicroOS update to version 20241202. If that's the case, it could potentially affect other users as well unless action is taken.
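For anyone who lands in the same spot, the steps above look roughly like this from the Hetzner rescue system. This is a sketch under assumptions only: the device name, the btrfs subvolume layout of the MicroOS image and the read-only root all need to be checked on the actual node.

```bash
# From the Hetzner rescue system (device name is illustrative).
mount /dev/sda3 /mnt                  # MicroOS root (btrfs, default subvolume)
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash

# Inside the chroot: the root is normally read-only on MicroOS,
# so it may need a remount before /etc can be edited.
mount -o remount,rw /
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
exit

umount -R /mnt && reboot
```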
On a more positive note (despite the overall oddity), I was able to re-enable `SELINUX=enforcing` after resolving the relabeling issues on some nodes and performing a clean boot. This doesn't make much sense to me, given that the same issue occurred consistently across all nodes that updated and rebooted between yesterday and today. 👀
Other nodes still fail, though, and must run with SELinux disabled for now.
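On the nodes where it worked, re-enabling enforcement was essentially the reverse of the workaround plus a full relabel, so the file contexts are consistent before the next boot. Again just a sketch, not a verified fix for this specific failure:

```bash
# Switch SELinux back to enforcing in the config.
sed -i 's/^SELINUX=disabled/SELINUX=enforcing/' /etc/selinux/config

# Request a full relabel on the next boot. On MicroOS the read-only root may
# require doing this via transactional-update or a manual `restorecon -R /`.
touch /.autorelabel
reboot
```

After the node comes back, `getenforce` should report `Enforcing` again.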