-
@mysticaltech can I hire you to look at this cluster and fix any issues on it?
-
Hi all,
Maybe someone can help out here. I set up a new cluster last week and deployed some sites on it. Now every night, and sometimes during the day, my sites go down. When looking at the nodes, I see the agents just have a status of NotReady, but I can't figure out why.
When I look at the journal at the moment things go down, I see:
Mar 22 05:01:39 nubos-agent-large-crq systemd[1]: var-lib-rancher-k3s-agent-containerd-tmpmounts-containerd\x2dmount1334700066.mount: Deactivated successfully.
Mar 22 05:01:40 nubos-agent-large-crq systemd[1]: Started libcontainer container cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: cri-containerd-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.scope: Deactivated successfully.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: cri-containerd-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.scope: Consumed 2.288s CPU time.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b-rootfs.mount: Deactivated successfull>
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: I0322 05:01:44.418746 1161 scope.go:117] "RemoveContainer" containerID="afde5fb59e8311eafc5a871089baace12b88e4bdf41af7027bc797d7425e781b"
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: I0322 05:01:44.419245 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: E0322 05:01:44.419683 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:01:57 nubos-agent-large-crq k3s[1161]: I0322 05:01:57.798206 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:01:57 nubos-agent-large-crq k3s[1161]: E0322 05:01:57.799144 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:06 nubos-agent-large-crq systemd[1]: run-containerd-runc-k8s.io-24f7c012e8465a91cfb70d4f02123d0e66cf2ff08830090c6e8a53e48d5b5f41-runc.JiioCG.mount: Deactivated successfully.
Mar 22 05:02:10 nubos-agent-large-crq k3s[1161]: I0322 05:02:10.791455 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:10 nubos-agent-large-crq k3s[1161]: E0322 05:02:10.791995 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:12 nubos-agent-large-crq systemd[1]: run-containerd-runc-k8s.io-1e57026adedfbd56fac1a30f4429031aaeca13c9b9ce327f93385148372dc9a5-runc.hKfLAl.mount: Deactivated successfully.
Mar 22 05:02:22 nubos-agent-large-crq k3s[1161]: I0322 05:02:22.791864 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:22 nubos-agent-large-crq k3s[1161]: E0322 05:02:22.792590 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:37 nubos-agent-large-crq k3s[1161]: I0322 05:02:37.792351 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:37 nubos-agent-large-crq k3s[1161]: E0322 05:02:37.794466 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:49 nubos-agent-large-crq k3s[1161]: I0322 05:02:49.791939 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
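As far as I can tell, the log itself only shows an application pod (the queue-worker) in CrashLoopBackOff, which on its own shouldn't take a node to NotReady, so I assume the node's own conditions and resources need checking at the moment it flips. These are the kinds of checks I plan to run next to narrow it down; node, pod and namespace names below are placeholders:
kubectl describe node nubos-agent-large-crq                # Conditions section: MemoryPressure, DiskPressure, and the kubelet's NotReady reason
kubectl get events -A --sort-by=.lastTimestamp             # cluster events around the outage window
kubectl logs -n <namespace> <queue-worker-pod> --previous  # why the queue-worker keeps crashing
journalctl -u k3s-agent --since "05:00" --until "05:10"    # kubelet/containerd view on the agent (the unit is k3s on the control plane nodes)
free -m && df -h /var/lib/rancher                          # quick check for memory or disk exhaustion on the node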
I also noticed that pods with a PV attached sometimes won't release that storage for the new version of the pod, so the new pod stays waiting on the storage and never becomes active. For now I've removed all production sites from the cluster again, as it's really unreliable to work with.
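On the PV point: as far as I understand, Hetzner Cloud volumes (like most CSI block volumes) are ReadWriteOnce, so during a RollingUpdate the new pod can't attach the volume while the old pod still holds it and the rollout just hangs. A rough sketch of what I'd try, assuming the workload is a Deployment; names and namespaces are placeholders:
kubectl get volumeattachments                    # shows which node still holds the volume
kubectl describe pod <new-pod> -n <namespace>    # events should show a Multi-Attach / FailedAttachVolume error if this is the cause
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'   # tear the old pod down before starting the new one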
Current setup:
- 3x controller (CAX11)
- 3x agent (CPX21)