Large number of zombie processes not reaped by NCPA #1223
Comments
Thank you for reporting this. I'll begin investigating as soon as I can. |
We have also seen the same thing on two Ubuntu hosts. We got to 200+ on both boxes, stopping the service didn't fix it, and we had to reboot the VMs. It only recently started happening. |
We've fixed this issue by downgrading to 3.1.0, so it looks like this was introduced in 3.1.1. |
Do you have any checks that run scripts? We removed a check that was running a script to check postfix, and so far it hasn't occurred since. So if you are running a script as one of your checks, that might narrow down where the issue was introduced in 3.1.1. |
@timcanty Is there any chance you could tell me what your script was doing? I have been unable to replicate this issue so far and could use any hints about what the cause could be. |
Hi ne-bbahn, we haven't had any issues on the two machines that were doing this since we stopped running the script. The script had been in place for a good few years without issue.
# possible states
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
#default options
postfix_dir=/var/spool/postfix
warning_active=100
critical_active=2000
warning_deferred=500
critical_deferred=1000
warning_other=1
critical_other=100
usage () {
echo "$0 [--dir|-d postfix_dir] [--wa|-w warning_active] [--ca|-c critical_active] [--wd warning_deferred] [--cd critical_deferred] [--wo warning_other] [--co critical_other]" 1>&2
}
if [ -z $# ]; then
echo "Error : need argument!" 1>&2
usage
exit $STATE_UNKNOWN
fi
while test -n "$1"; do
case "$1" in
--dir|-d ) postfix_dir=$2
shift;;
--wa|-w ) warning_active=$2
shift;;
--ca|-c ) critical_active=$2
shift;;
--wd ) warning_deferred=$2
shift;;
--cd ) critical_deferred=$2
shift;;
--wo ) warning_other=$2
shift;;
--co ) critical_other=$2
shift;;
*) echo "Wrong arguments!" 1>&2
usage
exit $STATE_UNKNOWN ;;
esac
shift
done
queue=$(/usr/bin/mailq | tail -n 1)
# queue empty = ok
if [ "$queue" = "Mail queue is empty" ]; then
perfdata="'req'=0;;; 'size'=0KB;;; 'active'=0;$warning_active;$critical_active; 'bounce'=0;$warning_other;$critical_other; 'corrupt'=0;$warning_other;$critical_other; 'deferred'=0;$warning_deferred;$critical_deferred; "
output="$queue"
echo "OK - ${output} | ${perfdata}"
exit $STATE_OK
else
queue_req=$(echo $queue | cut -d ' ' -f 5)
queue_size=$(echo $queue | cut -d ' ' -f 2) # in KB
queue_active=$(find $postfix_dir/active -type f | wc -l)
queue_bounce=$(find $postfix_dir/bounce -type f | wc -l)
queue_corrupt=$(find $postfix_dir/corrupt -type f | wc -l)
queue_deferred=$(find $postfix_dir/deferred -type f | wc -l)
#queue_maildrop=$(find $postfix_dir/maildrop -type f | wc -l)
perfdata="'req'=$queue_req;;; 'size'=${queue_size}KB;;; 'active'=$queue_active;$warning_active;$critical_active; 'bounce'=$queue_bounce;$warning_other;$warning_other; 'corrupt'=$queue_corrupt;$warning_other; 'deferred'=0;$warnin>
fi
returnCrit=0
returnWarn=0
errorString=""
#Check critical and warning state for each queue
if [ $queue_active -ge $critical_active ]; then
returnCrit=1
errorString="$errorString - CRIT $queue_active > $critical_active actives"
elif [ $queue_active -ge $warning_active ]; then
returnWarn=1
errorString="$errorString - WARN $queue_active > $warning_active actives"
fi
if [ $queue_bounce -ge $critical_other ]; then
returnCrit=1
errorString="$errorString - CRIT $queue_bounce > $critical_other bounce"
elif [ $queue_bounce -ge $warning_other ]; then
returnWarn=1
errorString="$errorString - WARN $queue_bounce > $warning_other bounce"
fi
if [ $queue_corrupt -ge $critical_other ]; then
returnCrit=1
errorString="$errorString - CRIT $queue_corrupt > $critical_other corrupt"
elif [ $queue_corrupt -ge $warning_other ]; then
returnWarn=1
errorString="$errorString - WARN $queue_corrupt > $warning_other corrupt"
fi
if [ $queue_deferred -ge $critical_deferred ]; then
returnCrit=1
errorString="$errorString - CRIT $queue_deferred > $critical_deferred deferred"
elif [ $queue_deferred -ge $warning_deferred ]; then
returnWarn=1
errorString="$errorString - WARN $queue_deferred > $warning_deferred deferred"
fi
output="$queue_req request(s) ($queue_size kB)"
if [ $returnCrit = 0 ] && [ $returnWarn = 0 ] ; then
echo "OK - ${output} | ${perfdata}"
returnCode=$STATE_OK
elif [ $returnCrit = 0 ] && [ $returnWarn = 1 ] ; then
echo "WARNING - ${output} ${errorString} | ${perfdata}"
returnCode=$STATE_WARNING
else
echo "CRITICAL - ${output} ${errorString} | ${perfdata}"
returnCode=$STATE_CRITICAL
fi
exit $returnCode
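
For anyone reading the script, the two `cut` calls depend on the shape of the last line that `mailq` prints. A minimal sketch of that parsing step, assuming the Postfix summary line looks like `-- 12 Kbytes in 3 Requests.` (the sample line and the function name `parse_mailq_summary` are illustrative, not taken from the reporter's hosts):

```shell
# Hypothetical helper: split a mailq summary line the same way the script does.
# Assumes the format "-- <size> Kbytes in <count> Requests."
parse_mailq_summary() {
    summary=$1
    size_kb=$(echo "$summary" | cut -d ' ' -f 2)   # field 2: queue size in KB
    requests=$(echo "$summary" | cut -d ' ' -f 5)  # field 5: number of queued requests
    echo "requests=$requests size_kb=$size_kb"
}

parse_mailq_summary "-- 12 Kbytes in 3 Requests."
```

If the summary line has a different shape on a given system, the field numbers would need adjusting.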
|
Are you sure it's not just taking a long time to execute the plugin? The way the API currently works, every time someone makes a check, it regenerates all of the endpoints (yes, I know this is incredibly moronic. I didn't write the code and I hope to rewrite it soon), which can make the plugins that are running over a long time stack up if the API is continually referenced. |
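
One way to test the "slow plugin" theory is to run the plugin by hand under a time limit and see whether it ever hits it. A minimal sketch, assuming GNU coreutils `timeout` is available; the wrapper name `run_with_timeout` and the example path are hypothetical, and this is not how NCPA itself runs plugins:

```shell
STATE_UNKNOWN=3

# Hypothetical wrapper: run a command with a time limit in seconds.
# GNU timeout exits with status 124 when the limit is reached.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "UNKNOWN - command exceeded ${limit}s"
        return "$STATE_UNKNOWN"
    fi
    return "$status"
}

# Example (path is a placeholder):
#   time run_with_timeout 30 /path/to/check_postfix_queue.sh
```

If the plugin regularly gets close to the limit, overlapping runs stacking up becomes more plausible.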
Of course it's a possibility. It is strange that it only started being an issue after the last update of NCPA. To be honest, though, for our use case we can just stop monitoring the queue length. Not sure whether @DrewStratford, the original poster, was making use of any plugins, though? Sorry, I feel like I have hijacked this thread a little. |
@DrewStratford Do you have any log output pertaining to plugin commands timing out? |
I had an instance of NCPA that got into a state where it had a large number of zombie processes that were not being reaped.
I've attached some logs, which seem to suggest some python threads not exiting properly.
ncpa-logs.txt
NCPA version is 3.1.1
OS is Ubuntu 22.04.5 LTS.
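
For others trying to confirm the same symptom, a rough sketch of how the zombies could be counted per parent process, assuming a Linux `ps` that supports `-eo stat=,ppid=` (the function name `count_zombies_by_parent` is hypothetical). A zombie stays in the process table until its parent waits on it, so the parent PID shows which process is failing to reap:

```shell
# Hypothetical helper: reads "STAT PPID" pairs on stdin and prints
# "<parent pid> <zombie count>", with the largest count first.
count_zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -k2,2nr
}

# Example usage on a live system:
#   ps -eo stat=,ppid= | count_zombies_by_parent
```

If one of the parent PIDs belongs to NCPA and its count keeps growing, that would match the behaviour described here.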