Pods get restarted in a loop per (suspected) Prometheus failure parsing Apache logs #248

kot0dama · 2024-12-10T06:49:45Z

Bug Description

WordPress pods are being restarted in a loop on one of our deployments, with liveness checks failing.

We are not sure of the root cause, but we gathered some evidence of Prometheus-related issues.

The first one seems to be Promtail failing on parse errors while scraping the Apache log files (as far as I understand these log lines), which then triggers Pebble to restart all services:

2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.434149788Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:26:23.877Z [promtail] level=error ts=2024-12-10T05:26:23.873142101Z caller=logfmt.go:139 component=file_pipeline component=stage type=logfmt msg="failed to decode logfmt" err="logfmt syntax error at pos 412 on line 1: invalid quoted value"
2024-12-10T05:26:49.662Z [pebble] Exiting on terminated signal.
2024-12-10T05:26:49.685Z [pebble] Stopping all running services.

The second Prometheus-related issue seems to be a red herring (at least for the pod restarts), as it is followed by various other events rather than just a stop:

2024-12-10T06:04:35.188Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log ops 2.14.0 up and running.
2024-12-10T06:04:35.468Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log Invalid Prometheus alert rules folder at /var/lib/juju/agents/unit-wordpress-k8s-0/charm/src/prometheus_alert_rules: directory does not exist
2024-12-10T06:04:35.477Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log Emitting Juju event update_status.

Sadly I was not able to access the log files before the pod was restarted.
We believe this issue could lie in the monitoring configuration for this charm, maybe loki rules or similar?

Thank you

To Reproduce

Not sure how to reproduce; we suspect some log lines do not match the expected format.
Perhaps redeploying an application with the same versions as described below and sending many different kinds of requests would trigger the bug.
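
To narrow down which log lines might trip the logfmt stage, a rough pre-check along these lines could help. This is only a sketch under assumptions: it is not Promtail's decoder, the function names are hypothetical, and the set of accepted backslash escapes (JSON-style) is a guess that may not match the decoder actually in use. It only flags quoted values that are unterminated or contain an unexpected escape.

```python
# Escape characters accepted after a backslash in this sketch (JSON-style).
# Assumption: the real decoder may accept a different set.
ALLOWED_ESCAPES = set('"\\/bfnrtu')


def suspect_quoted_values(line):
    """Return (position, reason) pairs for quoted values that look malformed.

    Positions are 0-based offsets of the opening quote within the line.
    """
    problems = []
    i = 0
    n = len(line)
    while i < n:
        # In logfmt a quoted value follows the key directly: key="value".
        if line[i] == '=' and i + 1 < n and line[i + 1] == '"':
            start = i + 1
            j = start + 1
            bad_escape = False
            closed = False
            while j < n:
                ch = line[j]
                if ch == '\\':
                    # A backslash must be followed by an allowed character.
                    if j + 1 >= n or line[j + 1] not in ALLOWED_ESCAPES:
                        bad_escape = True
                    j += 2
                    continue
                if ch == '"':
                    closed = True
                    break
                j += 1
            if not closed:
                problems.append((start, "unterminated quoted value"))
                break
            if bad_escape:
                problems.append((start, "suspicious escape in quoted value"))
            i = j + 1
        else:
            i += 1
    return problems


def scan_file(path):
    """Yield (line_number, problems) for each suspicious line in a log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            problems = suspect_quoted_values(line.rstrip("\n"))
            if problems:
                yield number, problems
```

Running `scan_file` against a copy of a file from /var/log/apache2 would list candidate lines to compare with the position reported in the Promtail error. A clean result would not rule out other parse failures.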

Environment

This charm runs on a Juju 2.9.49 controller with:

App                       Version  Status  Scale  Charm                     Channel        Rev  Address     Exposed  Message
nginx-ingress-integrator  25.3.0   active      1  nginx-ingress-integrator  latest/stable   81  REDACTED     no       
wordpress-k8s             6.4.3    active      2  wordpress-k8s             latest/edge    114  REDACTED     no

Relevant log output

2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.433939586Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.434149788Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:26:23.877Z [promtail] level=error ts=2024-12-10T05:26:23.873142101Z caller=logfmt.go:139 component=file_pipeline component=stage type=logfmt msg="failed to decode logfmt" err="logfmt syntax error at pos 412 on line 1: invalid quoted value"
2024-12-10T05:26:49.662Z [pebble] Exiting on terminated signal.
2024-12-10T05:26:49.685Z [pebble] Stopping all running services.

https://pastebin.canonical.com/p/m3pKR5kKd5/



Additional context

_No response_

amandahla · 2024-12-13T15:06:37Z

@alithethird Can you have a look, please? Thanks.

weiiwang01 · 2024-12-17T12:09:33Z

Although the Promtail parsing failure is interesting, Promtail should not serve as a readiness or liveness check for the charm, so even if Promtail fails, it shouldn't trigger a pod restart. I am also not sure that a parsing failure on a single line would cause Promtail to fail entirely. It is more likely that WordPress's own checks are failing. Could you try upgrading WordPress to revision 114 and increasing the charm's health_check_timeout_seconds configuration to see whether that reduces the restarts?

alithethird self-assigned this Dec 16, 2024
