Pods get restarted in a loop per (suspected) Prometheus failure parsing Apache logs #248

Open
kot0dama opened this issue Dec 10, 2024 · 2 comments
@kot0dama

Bug Description

WordPress pods are being restarted in a loop on one of our deployments, with liveness checks failing.

We are not sure of the root cause, but we gathered some evidence pointing to Prometheus-related issues.

The first seems to be promtail hitting a parse error while (as far as I understand these log lines) scraping the Apache log files, after which pebble stops all services:

2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.434149788Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:26:23.877Z [promtail] level=error ts=2024-12-10T05:26:23.873142101Z caller=logfmt.go:139 component=file_pipeline component=stage type=logfmt msg="failed to decode logfmt" err="logfmt syntax error at pos 412 on line 1: invalid quoted value"
2024-12-10T05:26:49.662Z [pebble] Exiting on terminated signal.
2024-12-10T05:26:49.685Z [pebble] Stopping all running services.

The second Prometheus-related issue seems to be a red herring (at least for the pod restarts), since it is followed by various other events rather than a stop:

2024-12-10T06:04:35.188Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log ops 2.14.0 up and running.
2024-12-10T06:04:35.468Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log Invalid Prometheus alert rules folder at /var/lib/juju/agents/unit-wordpress-k8s-0/charm/src/prometheus_alert_rules: directory does not exist
2024-12-10T06:04:35.477Z [container-agent] 2024-12-10 06:04:35 DEBUG juju-log Emitting Juju event update_status.

Unfortunately, I was not able to access the log files before the pod was restarted.
We suspect the issue lies in the monitoring configuration for this charm, possibly the Loki rules or similar.

Thank you

To Reproduce

We are not sure how to reproduce this; we suspect that some log lines do not match the expected format.
Redeploying the application with the same versions as described below and sending many different kinds of requests might trigger the bug.
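
To narrow down which log lines might not match the expected format, a copy of the Apache log could be scanned offline. The following is a minimal Python sketch under stated assumptions: `find_logfmt_error` and `report` are hypothetical helpers, and the rules they apply are a simplified reading of logfmt (space-separated `key=value` pairs, with optional double-quoted values), not Promtail's actual parser. It may therefore flag lines differently than Promtail does.

```python
from typing import Iterable, List, Optional, Tuple


def find_logfmt_error(line: str) -> Optional[str]:
    """Return a short error description, or None if the line scans cleanly.

    Simplified rules (an assumption, not Promtail's real parser):
    pairs are separated by spaces, a pair is `key` or `key=value`,
    and a value is either bare or double-quoted with backslash escapes.
    """
    i, n = 0, len(line)
    while i < n:
        if line[i] == " ":  # skip separators between pairs
            i += 1
            continue
        start = i
        while i < n and line[i] not in ' ="':  # scan the key
            i += 1
        if i == start or (i < n and line[i] == '"'):
            return f"unexpected character {line[i]!r} at pos {i + 1}"
        if i >= n or line[i] == " ":
            continue  # bare key without a value
        i += 1  # step over '='
        if i < n and line[i] == '"':
            opened_at = i + 1
            i += 1
            closed = False
            while i < n:
                if line[i] == "\\":  # skip the escaped character
                    i += 2
                    continue
                if line[i] == '"':
                    closed = True
                    i += 1
                    break
                i += 1
            if not closed:
                return f"unterminated quoted value starting at pos {opened_at}"
            if i < n and line[i] != " ":
                return f"unexpected character {line[i]!r} at pos {i + 1}"
        else:
            while i < n and line[i] != " ":  # scan a bare value
                if line[i] == '"':
                    return f"unexpected character '\"' at pos {i + 1}"
                i += 1
    return None


def report(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, error) for every line that fails the scan."""
    problems = []
    for number, line in enumerate(lines, 1):
        error = find_logfmt_error(line.rstrip("\n"))
        if error is not None:
            problems.append((number, error))
    return problems
```

Passing an open file object to `report` returns the 1-based line numbers worth inspecting by hand, which could then be compared with the position reported in the Promtail error.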

Environment

This charm runs on a Juju 2.9.49 controller with:

App                       Version  Status  Scale  Charm                     Channel        Rev  Address     Exposed  Message
nginx-ingress-integrator  25.3.0   active      1  nginx-ingress-integrator  latest/stable   81  REDACTED     no       
wordpress-k8s             6.4.3    active      2  wordpress-k8s             latest/edge    114  REDACTED     no       

Relevant log output

2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.433939586Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:23:10.443Z [promtail] level=info ts=2024-12-10T05:23:10.434149788Z caller=filetarget.go:252 msg="watching new directory" directory=/var/log/apache2
2024-12-10T05:26:23.877Z [promtail] level=error ts=2024-12-10T05:26:23.873142101Z caller=logfmt.go:139 component=file_pipeline component=stage type=logfmt msg="failed to decode logfmt" err="logfmt syntax error at pos 412 on line 1: invalid quoted value"
2024-12-10T05:26:49.662Z [pebble] Exiting on terminated signal.
2024-12-10T05:26:49.685Z [pebble] Stopping all running services.

https://pastebin.canonical.com/p/m3pKR5kKd5/



Additional context

_No response_
@amandahla
Contributor

@alithethird Can you have a look, please? Thanks.

@alithethird self-assigned this on Dec 16, 2024
@weiiwang01
Collaborator

Although the Promtail parsing failure is interesting, Promtail should not be part of the readiness or liveness checks for the charm, so even if Promtail fails, it should not trigger a pod restart. I am also not sure that a parsing failure on a single line would cause Promtail to fail entirely. It is more likely that WordPress's own checks are failing. Could you try upgrading WordPress to revision 114 and increasing the charm's health_check_timeout_seconds configuration to see whether that reduces the number of restarts?
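
One way to check whether slow responses from WordPress could explain failing checks is to time a series of requests against the site. The following is a minimal Python sketch under stated assumptions: `time_request` and `summarize` are hypothetical helpers, the URL and threshold are placeholders to be supplied by whoever runs it, and this is not the probe the charm itself uses.

```python
import time
import urllib.request
from typing import Dict, List


def time_request(url: str, timeout: float) -> float:
    """Fetch url once and return the elapsed time in seconds.

    Raises an exception (for example URLError or a timeout) on failure.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


def summarize(samples: List[float], threshold: float) -> Dict[str, float]:
    """Count how many samples took longer than threshold seconds."""
    slow = [s for s in samples if s > threshold]
    return {
        "count": len(samples),
        "slow": len(slow),
        "max": max(samples) if samples else 0.0,
    }
```

Collecting a few dozen samples with `time_request` and passing them to `summarize` with the currently configured timeout as the threshold would show how often a response takes longer than the check allows.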
