Large number of zombie processes not reaped by NCPA #1223

Open
DrewStratford opened this issue Nov 13, 2024 · 9 comments

Comments

@DrewStratford

I had an instance of NCPA that got into a state where it had a large number of zombie processes that were not being reaped.

[screenshot: process list showing the accumulated zombie processes]

I've attached some logs, which seem to suggest some Python threads are not exiting properly.
ncpa-logs.txt

NCPA version is 3.1.1
OS is Ubuntu 22.04.5 LTS.
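In case it helps others confirm the same build-up, here is a minimal sketch for counting defunct processes and grouping them by parent. It assumes psutil is available (it's the library NCPA bundles for process data); the expectation is that the zombies all hang off the ncpa service process.

import psutil

# List zombie (defunct) processes and group them by parent PID, so it is easy
# to see whether they all belong to the NCPA agent process.
zombies = [p for p in psutil.process_iter(['pid', 'ppid', 'status'])
           if p.info['status'] == psutil.STATUS_ZOMBIE]

by_parent = {}
for p in zombies:
    by_parent.setdefault(p.info['ppid'], []).append(p.info['pid'])

print("total zombies:", len(zombies))
for ppid, pids in by_parent.items():
    try:
        parent_name = psutil.Process(ppid).name()
    except psutil.NoSuchProcess:
        parent_name = "?"
    print(f"parent {ppid} ({parent_name}): {len(pids)} zombie children")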

@ne-bbahn
Contributor

Thank you for reporting this. I'll begin investigating as soon as I can.

@timcanty

We have also had two Ubuntu hosts do the same. We got to about 200+ zombie processes on both boxes, stopping the service didn't fix it, and we had to reboot the VMs. It only started happening recently.

@DrewStratford
Author

We've fixed this issue by downgrading to 3.1.0, so it looks like this was introduced in 3.1.1.

@timcanty

Do you have any checks running scripts? We removed a check that was running a script to monitor Postfix, and so far it hasn't occurred since. So if you are running a script as one of your checks, that might narrow down where the issue was introduced in 3.1.1.

@ne-bbahn
Contributor

ne-bbahn commented Dec 18, 2024

@timcanty Is there any chance you could tell me what your script was doing? I have been unable to replicate this issue so far and could use any hints about what the cause might be.
Anything else about your system that might help pinpoint the source of the issue would be appreciated. The same goes for @DrewStratford.

@timcanty

Hi ne-bbahn, we haven't had any issues on the two machines that were doing this since we stopped running the script; the script had been in place for a good few years without issue.
The script itself was monitoring Postfix queue status, and it might not be the most efficient script.

#!/bin/bash

# possible states
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

#default options
postfix_dir=/var/spool/postfix
warning_active=100
critical_active=2000
warning_deferred=500
critical_deferred=1000
warning_other=1
critical_other=100

usage () {
echo "$0 [-dir postfix_dir] [-wa warning_active] [-ca critical_active] [-wd warning_deferred] [-cd critical_deferred] [-wo warning_other] [-co critical_other]" 1>&2
}

if [ $# -eq 0 ]; then
        echo "Error : need argument!" 1>&2
        usage
        exit $STATE_UNKNOWN
fi

while test -n "$1"; do
    case "$1" in
        --dir|-d ) postfix_dir=$2
                                shift;;
        --wa|-w ) warning_active=$2
                                shift;;
        --ca|-c ) critical_active=$2
                                shift;;
        --wd ) warning_deferred=$2
                                shift;;
        --cd ) critical_deferred=$2
                                shift;;
        --wo ) warning_other=$2
                                shift;;
        --co ) warning_other=$2
                                shift;;
                *) echo "Wrong arguments!" 1>&2
                   usage
           exit $STATE_UNKNOWN ;;
    esac
    shift
done
queue=$(/usr/bin/mailq | tail -n 1)
# queue empty = ok
if [ "$queue" = "Mail queue is empty" ]; then
        perfdata="'req'=0;;; 'size'=0KB;;; 'active'=0;$warning_active;$critical_active; 'bounce'=0;$warning_other;$warning_other; 'corrupt'=0;$warning_other;$warning_other; 'deferred'=0;$warning_deferred;$critical_deferred; "
        output="$queue"
        echo "OK - ${output} | ${perfdata}"
        exit $STATE_OK
else
        queue_req=$(echo $queue | cut -d ' ' -f 5)
        queue_size=$(echo $queue | cut -d ' ' -f 2)     # in KB
        queue_active=$(find $postfix_dir/active -type f | wc -l)
        queue_bounce=$(find $postfix_dir/bounce -type f | wc -l)
        queue_corrupt=$(find $postfix_dir/corrupt -type f | wc -l)
        queue_deferred=$(find $postfix_dir/deferred -type f | wc -l)
        #queue_maildrop=$(find $postfix_dir/maildrop -type f | wc -l)
        perfdata="'req'=$queue_req;;; 'size'=${queue_size}KB;;; 'active'=$queue_active;$warning_active;$critical_active; 'bounce'=$queue_bounce;$warning_other;$warning_other; 'corrupt'=$queue_corrupt;$warning_other; 'deferred'=0;$warnin>
fi

returnCrit=0
returnWarn=0
errorString=""
#Check critical and warning state for each queue
if [ $queue_active -ge $critical_active ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_active > $critical_active actives"
elif [ $queue_active -ge $warning_active ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_active > $warning_active actives"
fi
if [ $queue_bounce -ge $critical_other ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_bounce > $critical_other bounce"
elif [ $queue_bounce -ge $warning_other ]; then
    returnWarn=1
        errorString="$errorString - CRIT $queue_bounce > $warning_other bounce"
fi
if [ $queue_corrupt -ge $critical_other ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_corrupt > $critical_other corrupt"
elif [ $queue_corrupt -ge $warning_other ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_corrupt > $warning_other corrupt"
fi
if [ $queue_deferred -ge $critical_deferred ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_deferred > $critical_deferred deferred"
elif [ $queue_deferred -ge $warning_deferred ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_deferred > $warning_deferred deferred"
fi
output="$queue_req request(s) ($queue_size kB)"
if [ $returnCrit = 0 ] && [ $returnWarn = 0 ] ; then
        echo "OK - ${output} | ${perfdata}"
        returnCode=$STATE_OK
elif [ $returnCrit = 0 ] && [ $returnWarn = 1 ] ; then
        echo "WARNING - ${output} ${errorString} | ${perfdata}"
        returnCode=$STATE_WARNING
else
        echo "CRITICAL - ${output} ${errorString} | ${perfdata}"
        returnCode=$STATE_CRITICAL
fi

exit $returnCode

@ne-bbahn
Contributor

Are you sure it's not just taking a long time to execute the plugin? The way the API currently works, every time someone makes a check it regenerates all of the endpoints (yes, I know this is incredibly moronic; I didn't write the code and I hope to rewrite it soon), which can make long-running plugins stack up if the API is continually referenced.
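For context, the usual way this failure mode shows up in Python is a plugin subprocess that gets launched but never reaped. The sketch below is purely illustrative (the plugin path is invented) and is not NCPA's actual launch code; it just shows how an abandoned, timed-out child ends up as a zombie.

import subprocess

# Illustrative sketch only -- not NCPA's code; the plugin path is made up.
# If the launching thread gives up on a slow plugin without ever reaping it,
# the child keeps its process-table entry (shown as <defunct>) after it exits,
# because nothing collected its exit status.
proc = subprocess.Popen(["/usr/local/nagios/libexec/check_postfix.sh"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=30)   # communicate() reaps the child
except subprocess.TimeoutExpired:
    # Without the kill() + wait() below, the timed-out child is abandoned and
    # becomes a zombie once it finally exits -- which is what stacks up if the
    # API is hit repeatedly while plugins are still running.
    proc.kill()
    proc.wait()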

@timcanty

Of course it's a possibility; it is strange that it only started being an issue after the last NCPA update. However, to be honest, for our use case we can just stop monitoring the queue length. I'm not sure whether @DrewStratford, the original poster, was making use of any plugins though? Sorry, I feel like I have hijacked this thread a little.

@Bahnerbd

@DrewStratford Do you have any log output pertaining to plugin commands timing out?
