-
@mysticaltech can I hire you to look at this cluster and fix any issues on it?
-
Hi all,
Maybe someone can help out here. I set up a new cluster last week and deployed some sites on it. Now every night, and sometimes during the day, my sites go down. When looking at the nodes, I see the agents just have a status of NotReady, but I can't figure out why.
When I look at the journal at the moment things go down, I see:
Mar 22 05:01:39 nubos-agent-large-crq systemd[1]: var-lib-rancher-k3s-agent-containerd-tmpmounts-containerd\x2dmount1334700066.mount: Deactivated successfully.
Mar 22 05:01:40 nubos-agent-large-crq systemd[1]: Started libcontainer container cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: cri-containerd-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.scope: Deactivated successfully.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: cri-containerd-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b.scope: Consumed 2.288s CPU time.
Mar 22 05:01:43 nubos-agent-large-crq systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b-rootfs.mount: Deactivated successfull>
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: I0322 05:01:44.418746 1161 scope.go:117] "RemoveContainer" containerID="afde5fb59e8311eafc5a871089baace12b88e4bdf41af7027bc797d7425e781b"
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: I0322 05:01:44.419245 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:01:44 nubos-agent-large-crq k3s[1161]: E0322 05:01:44.419683 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:01:57 nubos-agent-large-crq k3s[1161]: I0322 05:01:57.798206 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:01:57 nubos-agent-large-crq k3s[1161]: E0322 05:01:57.799144 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:06 nubos-agent-large-crq systemd[1]: run-containerd-runc-k8s.io-24f7c012e8465a91cfb70d4f02123d0e66cf2ff08830090c6e8a53e48d5b5f41-runc.JiioCG.mount: Deactivated successfully.
Mar 22 05:02:10 nubos-agent-large-crq k3s[1161]: I0322 05:02:10.791455 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:10 nubos-agent-large-crq k3s[1161]: E0322 05:02:10.791995 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:12 nubos-agent-large-crq systemd[1]: run-containerd-runc-k8s.io-1e57026adedfbd56fac1a30f4429031aaeca13c9b9ce327f93385148372dc9a5-runc.hKfLAl.mount: Deactivated successfully.
Mar 22 05:02:22 nubos-agent-large-crq k3s[1161]: I0322 05:02:22.791864 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:22 nubos-agent-large-crq k3s[1161]: E0322 05:02:22.792590 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:37 nubos-agent-large-crq k3s[1161]: I0322 05:02:37.792351 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
Mar 22 05:02:37 nubos-agent-large-crq k3s[1161]: E0322 05:02:37.794466 1161 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: >
Mar 22 05:02:49 nubos-agent-large-crq k3s[1161]: I0322 05:02:49.791939 1161 scope.go:117] "RemoveContainer" containerID="cfbebc38609238815a781ea8f1e70495c202e1245862ce5423de3970927ea85b"
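As far as I can tell, the log itself only shows an application pod (the queue-worker) in CrashLoopBackOff, which on its own shouldn't take a node to NotReady, so I assume the node's own conditions and resources need checking at the moment it flips. These are the kinds of checks I plan to run next to narrow it down; node, pod and namespace names below are placeholders:
kubectl describe node nubos-agent-large-crq                # Conditions section: MemoryPressure, DiskPressure, and the kubelet's NotReady reason
kubectl get events -A --sort-by=.lastTimestamp             # cluster events around the outage window
kubectl logs -n <namespace> <queue-worker-pod> --previous  # why the queue-worker keeps crashing
journalctl -u k3s-agent --since "05:00" --until "05:10"    # kubelet/containerd view on the agent (the unit is k3s on the control plane nodes)
free -m && df -h /var/lib/rancher                          # quick check for memory or disk exhaustion on the node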
I also noticed that pods with a PV attached sometimes won't release that storage for the new version of the pod, so the new pod stays waiting on the storage and never becomes active. For now I've removed all production sites from the cluster again, as it's really unreliable to work with.
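On the PV point: as far as I understand, Hetzner Cloud volumes (like most CSI block volumes) are ReadWriteOnce, so during a RollingUpdate the new pod can't attach the volume while the old pod still holds it and the rollout just hangs. A rough sketch of what I'd try, assuming the workload is a Deployment; names and namespaces are placeholders:
kubectl get volumeattachments                    # shows which node still holds the volume
kubectl describe pod <new-pod> -n <namespace>    # events should show a Multi-Attach / FailedAttachVolume error if this is the cause
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'   # tear the old pod down before starting the new one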
Current setup:
- 3x controller (CAX11)
- 3x agent (CPX21)