99% CPU utilization after 8 Hours #2103

Open
Blitzritze opened this issue Oct 31, 2024 · 13 comments

@Blitzritze

Versions

Core
Version is v5.18.3-550-g112b961 (Latest: null)
Branch is development
Hash is 112b9617 (Latest: 112b9617)
Web
Version is v5.21-1012-gc689fdc7 (Latest: null)
Branch is development
Hash is c689fdc7 (Latest: c689fdc7)
FTL
Version is vDev-3dd756d (Latest: null)
Branch is development
Hash is 3dd756d (Latest: 3dd756d)

Platform

  • OS and version: Debian 12
  • Platform: Proxmox, unprivileged container with 1 vCore, 1 GiB RAM, 5 GiB disk
  • Hardware: MB/CPU: ASRock N100M, RAM: Team Electronic Elite 32 GB DDR4 3200

Expected behavior

Normal operation of DNS blocking without CPU usage climbing to 100% and beyond

Actual behavior / bug

Every 8 hours, Proxmox shows 99% CPU utilization for the Pi-hole container. Inside the container, htop shows about 400% CPU utilization, which comes from FTL.

Steps to reproduce

Wait 8 hours and see

Debug Token

Screenshots

Proxmox:
Image

Pi-hole container:
Image

@DL6ER
Copy link
Member

DL6ER commented Oct 31, 2024

this comes from FTL

it'd be very helpful if you could enable tree view in htop and also enable custom thread names, like this:
Image

Thereafter, you will be able to see which part of Pi-hole is consuming the CPU, which will help us a lot in finding the root cause.

Image
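
For reference, both options live in htop's setup screen (F2 → Display options). If you prefer editing the config file, they can also be set in ~/.config/htop/htoprc (key names as in recent htop versions; adjust if yours differ):

tree_view=1
show_thread_names=1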

@DL6ER DL6ER transferred this issue from pi-hole/pi-hole Oct 31, 2024
@Blitzritze
Author

I changed the settings in htop.
Strangely enough, nothing happened last night; I'll keep an eye on it...
I installed Pi-hole v6 the day before yesterday and since then it has happened several times.

@Blitzritze
Author

It happened again.
I ran pihole -d again while the problem was present.
URL: https://tricorder.pi-hole.net/as41yWRB/
Image

@DL6ER
Member

DL6ER commented Nov 3, 2024

Oh, this is interesting. So 400% CPU is consumed in civetweb-worker, which is one of the worker threads of the integrated web API.

My next suggestion would be trying to identify which API endpoint is causing this. For this, please run

sudo pihole-FTL --config debug.api true

Once it happens again, have another look at htop and identify the PID of the thread causing this - it's 18524 in your example above:
Image

We can then search the FTL log file (with debug.api enabled) to identify which API request is taking so long by running

grep "T18524" /var/log/pihole/FTL.log | tail

You'll obviously have to replace the numeric part of T18524 with the PID you have identified above. What is this output?
You can, of course, remove any IP addresses from the output before posting it here, just make it clear when you do so, e.g. ----removed----.
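
If you'd rather mask the addresses automatically, a generic sed filter along these lines should do (just a sketch, adapt the pattern as needed):

grep "T18524" /var/log/pihole/FTL.log | tail | sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/----removed----/g'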

@Blitzritze
Author

Blitzritze commented Nov 4, 2024

I have noticed that when the problem occurs, the web interface can no longer be accessed.

FTL Log:
Image

Edit:
It happened again. 10.0.1.1 is my Docker VM. On this VM I have Nginx Proxy Manager.
Image

Nginx settings: no custom locations or advanced settings.
Image
Image

@DL6ER
Member

DL6ER commented Nov 4, 2024

Sorry, I didn't see that you made an edit later, adding more content. This is still (or even more) puzzling, as the logs themselves do not show anything being processed at the time of these massive CPU spikes, making me think this may be some strange ill interaction between FTL and your nginx proxy. Does the latter have any logs for us? Is there also notable CPU load on nginx when this is happening?

Also check the lines of /var/log/pihole/webserver.log at that time; this file becomes an access log when API debug logging is enabled. This, together with the nginx logs, hopefully sheds some light on the matter. My current hypothesis is that there may be some massive redirection going on here, keeping the proxy and FTL busy.

Edit: No need to enable any additional logging; debug.api already enables the access log, so you already have the log for the time when it happened.
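
A quick way to compare both sides right after a spike could look like this (the Nginx Proxy Manager path is an assumption based on its usual /data/logs layout, adjust to your setup):

tail -n 50 /var/log/pihole/webserver.log
tail -n 50 /data/logs/proxy-host-*_access.log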

@Blitzritze
Author

Blitzritze commented Nov 7, 2024

I tried to gather all the information. I hope this is all you need.
Today, 07.11.24 at 06:31, the error occurred.

Webserver log:
Image

Nginx proxy host 14 error log:
Image

Nginx proxy host 14 access log:
Image

10.0.0.22 -> Notebook
10.0.2.1 -> Pi-hole
10.5.0.1 -> Mobile phone
10.0.1.1 -> Docker (with nginx etc.)
10.10.0.4 -> Amazon Echo Dot

@DL6ER
Member

DL6ER commented Nov 7, 2024

Today, 07.11.24 at 06:31, the error occurred.

What timezone is this? The logs are in UTC but nothing extends to later than 05:03 UTC there.

@Blitzritze
Author

Today, 07.11.24 at 06:31, the error occurred.

What timezone is this? The logs are in UTC but nothing extends to later than 05:03 UTC there.

Sorry, it's "Timezone in Berlin (GMT+1), Central European Standard Time".

@DL6ER
Member

DL6ER commented Nov 9, 2024

I'm still somewhat confused - 06:31 Berlin should be 05:31 UTC, but neither of your logs shows anything after 05:22 UTC... How is your particular setup configured? Could you put a tcpdump process next to the traffic between nginx and pihole-FTL on the HTTP port to see what is really happening? I wonder if the issue may be something stupid like an endless loop of null bytes being transferred, ultimately causing the webserver to stay busy the whole time doing nonsense.
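
Something along these lines on the Pi-hole container should be enough (the interface name is an assumption, adjust to your setup; 10.0.1.1 being the nginx VM from your list above):

sudo tcpdump -i eth0 -w /tmp/ftl-http.pcap 'host 10.0.1.1 and (port 80 or port 443)'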

@Blitzritze
Author

I ran a tcpdump in both containers. When I specified port 80 or 443, nothing was recorded, so I recorded all the traffic between the respective “partners”, but not much came out of it.
nginx.dmp
pihole.dmp

@DL6ER
Member

DL6ER commented Nov 11, 2024

Okay, so let me collect what we already know:

  • There is an nginx proxy interacting with the FTL webserver; the FTL webserver is not contacted by any client directly.
  • Both webserver threads are busy at 200% CPU each.
  • The webserver is not busy because it is processing something in the FTL API or serving a file.
  • There seems to be no traffic to or from the FTL or nginx servers related to this issue.

Currently, my best guess is that the external webserver we are embedding (civetweb) has a bug causing some endless loop somewhere.

I'd like you to attach the gdb debugger to your pihole-FTL process; see the instructions for v6 here while we are still in the beta. It will allow you to interrupt pihole-FTL when you notice this issue (using Ctrl + C) and then see what the two webserver threads are doing at this very moment.
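
For reference, attaching usually looks something like this (a minimal sketch, assuming gdb is installed inside the container; the linked v6 instructions take precedence):

sudo gdb -p "$(pidof pihole-FTL)"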

Once you have interrupted pihole-FTL, please enter info threads. It should look like:

^C
Thread 1 "pihole-FTL" received signal SIGINT, Interrupt.
0x000070ef9191b4cd in __GI___poll (fds=0x61d3cc680870, nfds=7, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      in ../sysdeps/unix/sysv/linux/poll.c

(gdb) info threads <-----------------

to get a listing of the currently active threads.

The output will look like

  Id   Target Id                                           Frame 
* 1    Thread 0x70ef91a77740 (LWP 31708) "pihole-FTL"      0x000070ef9191b4cd in __GI___poll (fds=0x61d3cc680870, nfds=7, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  2    Thread 0x70ef874006c0 (LWP 31761) "civetweb-worker" 0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:57
  3    Thread 0x70ef87e006c0 (LWP 31756) "civetweb-worker" 0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150cc) at ./nptl/futex-internal.c:57
[...]

Look for the threads where you see the massive CPU spike; they would be IDs 2 and 3 in the example above.

You can then switch context away from ID 1 (notice the * in the output above) using thread followed by the ID, like

(gdb) thread 2

and run backtrace to see where the code is currently busy:

(gdb) backtrace
#0  0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x70ef916150c8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, 
    private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x000070ef9189b7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x70ef91615040, cond=0x70ef916150a0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x70ef916150a0, mutex=mutex@entry=0x70ef91615040) at ./nptl/pthread_cond_wait.c:627
#5  0x000061d3cb6c54db in consume_socket (thread_index=<optimized out>, counter_was_preincremented=<optimized out>, sp=<optimized out>, ctx=0x70ef91615010)
    at /workspace/FTL/src/webserver/civetweb/civetweb.c:20022
#6  worker_thread_run (conn=0x61d3d1456bb0) at /workspace/FTL/src/webserver/civetweb/civetweb.c:20170
#7  worker_thread (thread_func_param=0x61d3d1456bb0) at /workspace/FTL/src/webserver/civetweb/civetweb.c:20345
#8  0x000070ef9189ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#9  0x000070ef91929c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

It is this backtrace output, or, even better, the entire output you have seen in the debugger, that we need to proceed here. Hopefully, this will give us an idea of what is going on. You can leave the debugger using quit at the end.
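
As a shortcut, gdb can also dump the backtraces of all threads in one go; the full output of this is just as useful:

(gdb) thread apply all backtrace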


This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.

github-actions bot added the stale label on Dec 20, 2024