Caution: (almost) lost all control plane nodes during a nightly OS update due to SELinux-related boot failures
#1583
-
Yes, I have the same feeling. We had similar problems after automated upgrades on two clusters (test & production) some time ago, so we disabled MicroOS upgrades completely and now do them semi-automatically from time to time. But of course that's neither how it should be, nor what the README recommends ;). Would it be possible to use the "Immutable Server Release" instead of Tumbleweed? Although SUSE says Tumbleweed contains the latest tested stable software, it obviously doesn't have the quality level required for production use.
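For reference, a minimal sketch of the semi-automatic approach, assuming a stock MicroOS node where the nightly updates are driven by `transactional-update.timer` (kube-hetzner may expose its own toggle for this, so check the module variables first):

```bash
# Stop the automatic nightly updates (assumes the stock MicroOS timer).
systemctl disable --now transactional-update.timer

# Later, when convenient, upgrade manually in a new snapshot and reboot.
transactional-update dup
systemctl reboot
```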
-
I think it's not so easy to switch to arbitrary ISOs/OSes, because a lot of kube-hetzner internals probably depend on MicroOS specifics. And I have to admit that after some research it looks like there is no such thing as an "Immutable Server Release". There is a link of sorts on the MicroOS portal page, but in fact I could not find any real code for this distribution. What does exist is a "Slowroll" release, which sounds promising but is still experimental. At the moment it can't be used as a replacement for kube-hetzner image snapshot generation, because it lacks a qcow2 artifact and a "ContainerHost" build, which is what kube-hetzner currently uses from the Tumbleweed release. But there is a recipe for switching a running Tumbleweed node over to the "Slowroll" release. I might try that.
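As far as I can tell, that recipe boils down to pointing the existing repositories at Slowroll and then doing a distribution upgrade. The snippet below is only a rough sketch of that idea, not a verified procedure: repository names/URLs differ between nodes and should be checked first, and on MicroOS the upgrade itself would have to run inside `transactional-update` rather than directly via `zypper`.

```bash
# Rough sketch: repoint existing Tumbleweed repos at Slowroll, then dist-upgrade.
for repo in /etc/zypp/repos.d/*.repo; do
    sed -i -e 's/tumbleweed/slowroll/gI' "$repo"   # adjust repo URLs/aliases
done
zypper --gpg-auto-import-keys refresh   # fetch metadata from the new repos
zypper dup                              # full distribution upgrade onto Slowroll
```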
-
This post serves as a post-mortem of today’s events. I came dangerously close to losing all control plane nodes and had, in fact, already resigned myself to the loss at one point.
I’m unsure if this issue might also affect others here, but I suspect it could, as I didn’t apply any custom configurations to my control plane nodes.
Background
In line with the module's configuration, `kured` was running to apply updates to nodes and reboot them as necessary. During the night, `kured` updated the nodes and initiated reboots. However, it became evident that the nodes were failing to boot properly, getting stuck in "emergency mode" (as observed via the Hetzner console). This issue affected 2 out of 3 control nodes, leaving the cluster headless: all services were down, `kubectl` access was unavailable due to the API being offline, and the recoverability of etcd was uncertain at that time.
Boot issue
Because the boot log scrolls by too quickly and there is no way to scroll back, I first had to record a video of the console to see what was causing the failure.
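As a less painful alternative to filming the console, the previous boot's log can usually be read afterwards from the rescue system by pointing `journalctl` at the node's persistent journal. A sketch, assuming persistent journaling is enabled and the relevant filesystems are mounted under `/mnt` (the device name is illustrative, and on MicroOS `/var` typically lives on a separate btrfs subvolume that may need mounting as well):

```bash
# From the Hetzner rescue system (device name is illustrative).
mount /dev/sda3 /mnt                                 # mount the node's root fs
journalctl -D /mnt/var/log/journal --list-boots      # find the failed boot
journalctl -D /mnt/var/log/journal -b -1 -p err      # errors from the previous boot
```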
Solution
I booted into the rescue system, chrooted into the affected nodes, and started investigating the issue. It turned out that SELinux was attempting relabeling actions, which seemed to be causing the problem. After setting `SELINUX=disabled` in `/etc/selinux/config` and rebooting, the nodes booted normally again.
Fortunately, I was able to restart the `k3s` server, and everything recovered successfully. I was also lucky not to encounter any volume disruptions or other problems related to HA consensus, even though two worker nodes were affected as well.
I'm still a bit shaken by how something like this could happen. I assume `SELINUX=enforcing` had been set all along without causing any issues, yet a routine OS update resulted in a persistent boot failure. This feels quite alarming to me.
Since I haven't identified the root cause yet, I currently suspect the issue is related to the MicroOS update to version 20241202. If that's the case, it could potentially affect other users as well unless action is taken.
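For anyone who lands in the same spot, the steps above look roughly like this from the Hetzner rescue system. This is a sketch under assumptions only: the device name, the btrfs subvolume layout of the MicroOS image and the read-only root all need to be checked on the actual node.

```bash
# From the Hetzner rescue system (device name is illustrative).
mount /dev/sda3 /mnt                  # MicroOS root (btrfs, default subvolume)
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash

# Inside the chroot: the root is normally read-only on MicroOS,
# so it may need a remount before /etc can be edited.
mount -o remount,rw /
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
exit

umount -R /mnt && reboot
```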
On a more positive note (despite the overall oddity), I was able to re-enable `SELINUX=enforcing` after resolving the relabeling issues on some nodes and performing a clean boot. This doesn't make much sense to me, given that the same issue occurred consistently across all nodes that updated and rebooted between yesterday and today. 👀
Other nodes still fail, though, and must run with SELinux disabled for now.
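On the nodes where it worked, re-enabling enforcement was essentially the reverse of the workaround plus a full relabel, so the file contexts are consistent before the next boot. Again just a sketch, not a verified fix for this specific failure:

```bash
# Switch SELinux back to enforcing in the config.
sed -i 's/^SELINUX=disabled/SELINUX=enforcing/' /etc/selinux/config

# Request a full relabel on the next boot. On MicroOS the read-only root may
# require doing this via transactional-update or a manual `restorecon -R /`.
touch /.autorelabel
reboot
```

After the node comes back, `getenforce` should report `Enforcing` again.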