Install systemd unit to run cleanup on shutdown or reboot #10431

Open
agracey opened this issue Jun 28, 2024 · 2 comments
@agracey

agracey commented Jun 28, 2024

Is your feature request related to a problem? Please describe.

When rebooting a node, the system shows

[FAILED] Failed unmounting /etc
[FAILED] Failed unmounting /var

for a few minutes before timing out and rebooting.

This is a problem for edge use-cases where you need to limit downtime of single-node clusters.

Describe the solution you'd like

I would like the k3s installer to include a systemd unit file that runs k3s-killall.sh when a shutdown or reboot is requested. This has been discussed as a workaround previously in: #7362 (comment)
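
For illustration, a minimal sketch of the kind of unit being asked for (the unit name and ordering are assumptions for this example, not something the installer ships today; k3s-killall.sh is referenced at its default install path):

[Unit]
Description=Run k3s-killall.sh on shutdown or reboot
# Ordered After=k3s.service, so at shutdown this unit is stopped (and its
# ExecStop runs) before k3s.service itself is taken down.
After=k3s.service

[Service]
Type=oneshot
# Stay "active" after boot so that ExecStop is invoked on shutdown/reboot.
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/local/bin/k3s-killall.sh

[Install]
WantedBy=multi-user.target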

Describe alternatives you've considered

  • Just waiting means an unpredictable outage
  • Running k3s-killall.sh manually means that any tooling that manages the system needs to know about it (which is error-prone)
  • Writing the service yourself means that some users won't see the documentation and will raise new issues
@pocketbroadcast

We experienced the same issue when rebooting our nodes, and we agree there should be an upstream solution (or at least a proposal) to fix this.

k3s-killall.sh (as its name suggests) seems problematic to me, since it SIGKILLs the container processes.
This could potentially lead to data loss or inconsistencies.

  1. We modified the k3s-killall.sh script to send SIGTERM first, wait for a grace period, and only SIGKILL afterwards, to increase the chances of a "clean" shutdown (see the sketch just below this list). While this simple approach works in our case, we feel it's not the intended Kubernetes way to take a node offline for planned maintenance.

  2. That's why we started experimenting with draining the node on shutdown and uncordoning it again after reboot.
    Unfortunately, that occasionally led to issues with the kube-scheduler when stopping k3s after draining the node.
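
For reference, a minimal sketch of the SIGTERM-then-SIGKILL change described in point 1 (the process pattern and grace period below are placeholders for this example; the real k3s-killall.sh identifies processes differently):

# Ask the container processes to terminate cleanly first.
GRACE_PERIOD=30
pkill -TERM -f containerd-shim || true

# Wait up to the grace period for them to exit on their own.
for _ in $(seq "$GRACE_PERIOD"); do
  pgrep -f containerd-shim > /dev/null || break
  sleep 1
done

# Force-kill anything that is still running.
pkill -KILL -f containerd-shim || true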

(screenshot of the occasional kube-scheduler errors mentioned in point 2 omitted)

[Unit]
Description=K3s Container Startup and Cleanup Handling
After=k3s.service

[Service]
Type=oneshot
# RemainAfterExit=yes keeps the unit "active" after boot, so its ExecStop
# command is run when the system shuts down or reboots.
RemainAfterExit=yes

ExecStart=/usr/local/bin/k3s-node-management.sh start
ExecStop=/usr/local/bin/k3s-node-management.sh stop

TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

k3s-node-management.sh, stripped down to the essential parts (boilerplate that waits for the kube API to become ready is omitted):

...
# NODE_NAME and KUBECTL are expected to be set in the omitted setup above.
case "$1" in
  start)
    # Invoked via ExecStart after boot: put the node back into scheduling.
    echo "Uncordon node $NODE_NAME..."
    ${KUBECTL} uncordon "$NODE_NAME" && echo "Node $NODE_NAME uncordoned successfully."
    ;;

  stop)
    # Invoked via ExecStop at shutdown: evacuate pods first.
    # --disable-eviction deletes pods directly instead of going through the
    # eviction API, which bypasses PodDisruptionBudgets.
    echo "Draining node $NODE_NAME..."
    ${KUBECTL} drain "$NODE_NAME" --disable-eviction --ignore-daemonsets --delete-emptydir-data --force
    ;;
esac
...

Note: the code shared here should not be considered stable or tested!
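
For completeness, a unit like the one above would be installed and enabled in the usual systemd way (the unit file name here is an assumption, chosen to match the script name):

# Assuming the unit was saved as /etc/systemd/system/k3s-node-management.service
sudo systemctl daemon-reload
sudo systemctl enable --now k3s-node-management.service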

@brandond
Member

fwiw, that error isn't really an error. Some Kubernetes components react better than others to their context being cancelled by a shutdown signal, and will log odd errors while exiting. Usually you don't see these because when you're running them in a container (as other distros do), the container is exiting anyway.
