Large number of zombie processes not reaped by NCPA #1223

Open
DrewStratford opened this issue Nov 13, 2024 · 9 comments

Comments

@DrewStratford

I had an instance of NCPA that got into a state where it had a large number of zombie processes that were not being reaped.

[screenshot: process list showing the accumulated zombie processes]

I've attached some logs, which seem to suggest some Python threads are not exiting properly.
ncpa-logs.txt

NCPA version is 3.1.1
OS is Ubuntu 22.04.5 LTS.
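In case it helps others confirm the same build-up, here is a minimal sketch for counting defunct processes and grouping them by parent. It assumes psutil is available (it's the library NCPA bundles for process data); the expectation is that the zombies all hang off the ncpa service process.

import psutil

# List zombie (defunct) processes and group them by parent PID, so it is easy
# to see whether they all belong to the NCPA agent process.
zombies = [p for p in psutil.process_iter(['pid', 'ppid', 'status'])
           if p.info['status'] == psutil.STATUS_ZOMBIE]

by_parent = {}
for p in zombies:
    by_parent.setdefault(p.info['ppid'], []).append(p.info['pid'])

print("total zombies:", len(zombies))
for ppid, pids in by_parent.items():
    try:
        parent_name = psutil.Process(ppid).name()
    except psutil.NoSuchProcess:
        parent_name = "?"
    print(f"parent {ppid} ({parent_name}): {len(pids)} zombie children")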

@ne-bbahn
Contributor

Thank you for reporting this. I'll begin investigating as soon as I can.

@timcanty

We have also had two Ubuntu hosts do the same. We got to about 200+ zombie processes on both boxes, stopping the service didn't fix it, and we had to reboot the VMs. It only started happening recently.

@DrewStratford
Author

We've fixed this issue by downgrading to 3.1.0, so it looks like this was introduced in 3.1.1.

@timcanty

Do you have any checks running scripts? We removed a check that was running a script to monitor Postfix, and so far it hasn't occurred since. So if you are running a script as one of your checks, that might narrow down where the issue was introduced in 3.1.1.

@ne-bbahn
Contributor

ne-bbahn commented Dec 18, 2024

@timcanty Is there any chance you could tell me what your script was doing? I have been unable to replicate this issue so far and could use any hints about what the cause might be.
Anything else about your system that might help pinpoint the source of the issue would be appreciated. The same goes for @DrewStratford.

@timcanty

Hi ne-bbahn, we haven't had any issues on the two machines that were doing this since we stopped running the script; the script had been in place for a good few years without issue.
The script itself was monitoring Postfix queue status, and it might not be the most efficient script.

#!/bin/bash

# possible states
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

#default options
postfix_dir=/var/spool/postfix
warning_active=100
critical_active=2000
warning_deferred=500
critical_deferred=1000
warning_other=1
critical_other=100

usage () {
echo "$0 [-dir postfix_dir] [-wa warning_active] [-ca critical_active] [-wd warning_deferred] [-cd critical_deferred] [-wo warning_other] [-co critical_other]" 1>&2
}

if [ $# -eq 0 ]; then
        echo "Error : need argument!" 1>&2
        usage
        exit $STATE_UNKNOWN
fi

while test -n "$1"; do
    case "$1" in
        --dir|-d ) postfix_dir=$2
                                shift;;
        --wa|-w ) warning_active=$2
                                shift;;
        --ca|-c ) critical_active=$2
                                shift;;
        --wd ) warning_deferred=$2
                                shift;;
        --cd ) critical_deferred=$2
                                shift;;
        --wo ) warning_other=$2
                                shift;;
        --co ) warning_other=$2
                                shift;;
                *) echo "Wrong arguments!" 1>&2
                   usage
           exit $STATE_UNKNOWN ;;
    esac
    shift
done
queue=$(/usr/bin/mailq | tail -n 1)
# queue empty = ok
if [ "$queue" = "Mail queue is empty" ]; then
        perfdata="'req'=0;;; 'size'=0KB;;; 'active'=0;$warning_active;$critical_active; 'bounce'=0;$warning_other;$warning_other; 'corrupt'=0;$warning_other;$warning_other; 'deferred'=0;$warning_deferred;$critical_deferred; "
        output="$queue"
        echo "OK - ${output} | ${perfdata}"
        exit $STATE_OK
else
        queue_req=$(echo $queue | cut -d ' ' -f 5)
        queue_size=$(echo $queue | cut -d ' ' -f 2)     # in KB
        queue_active=$(find $postfix_dir/active -type f | wc -l)
        queue_bounce=$(find $postfix_dir/bounce -type f | wc -l)
        queue_corrupt=$(find $postfix_dir/corrupt -type f | wc -l)
        queue_deferred=$(find $postfix_dir/deferred -type f | wc -l)
        #queue_maildrop=$(find $postfix_dir/maildrop -type f | wc -l)
        perfdata="'req'=$queue_req;;; 'size'=${queue_size}KB;;; 'active'=$queue_active;$warning_active;$critical_active; 'bounce'=$queue_bounce;$warning_other;$warning_other; 'corrupt'=$queue_corrupt;$warning_other; 'deferred'=0;$warnin>
fi

returnCrit=0
returnWarn=0
errorString=""
#Check critical and warning state for each queue
if [ $queue_active -ge $critical_active ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_active > $critical_active actives"
elif [ $queue_active -ge $warning_active ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_active > $warning_active actives"
fi
if [ $queue_bounce -ge $critical_other ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_bounce > $critical_other bounce"
elif [ $queue_bounce -ge $warning_other ]; then
    returnWarn=1
        errorString="$errorString - CRIT $queue_bounce > $warning_other bounce"
fi
if [ $queue_corrupt -ge $critical_other ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_corrupt > $critical_other corrupt"
elif [ $queue_corrupt -ge $warning_other ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_corrupt > $warning_other corrupt"
fi
if [ $queue_deferred -ge $critical_deferred ]; then
    returnCrit=1
        errorString="$errorString - CRIT $queue_deferred > $critical_deferred deferred"
elif [ $queue_deferred -ge $warning_deferred ]; then
    returnWarn=1
        errorString="$errorString - WARN $queue_deferred > $warning_deferred deferred"
fi
output="$queue_req request(s) ($queue_size kB)"
if [ $returnCrit = 0 ] && [ $returnWarn = 0 ] ; then
        echo "OK - ${output} | ${perfdata}"
        returnCode=$STATE_OK
elif [ $returnCrit = 0 ] && [ $returnWarn = 1 ] ; then
        echo "WARNING - ${output} ${errorString} | ${perfdata}"
        returnCode=$STATE_WARNING
else
        echo "CRITICAL - ${output} ${errorString} | ${perfdata}"
        returnCode=$STATE_CRITICAL
fi

exit $returnCode

@ne-bbahn
Contributor

Are you sure it's not just taking a long time to execute the plugin? The way the API currently works, every time someone makes a check it regenerates all of the endpoints (yes, I know this is incredibly moronic; I didn't write the code and I hope to rewrite it soon), which can make long-running plugins stack up if the API is continually referenced.
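For context, the usual way this failure mode shows up in Python is a plugin subprocess that gets launched but never reaped. The sketch below is purely illustrative (the plugin path is invented) and is not NCPA's actual launch code; it just shows how an abandoned, timed-out child ends up as a zombie.

import subprocess

# Illustrative sketch only -- not NCPA's code; the plugin path is made up.
# If the launching thread gives up on a slow plugin without ever reaping it,
# the child keeps its process-table entry (shown as <defunct>) after it exits,
# because nothing collected its exit status.
proc = subprocess.Popen(["/usr/local/nagios/libexec/check_postfix.sh"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=30)   # communicate() reaps the child
except subprocess.TimeoutExpired:
    # Without the kill() + wait() below, the timed-out child is abandoned and
    # becomes a zombie once it finally exits -- which is what stacks up if the
    # API is hit repeatedly while plugins are still running.
    proc.kill()
    proc.wait()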

@timcanty

Of course it's a possibility; it is strange that it only started being an issue after the last NCPA update. However, to be honest, for our use case we can just stop monitoring the queue length. I'm not sure whether @DrewStratford, the original poster, was making use of any plugins though? Sorry, I feel like I have hijacked this thread a little.

@Bahnerbd

@DrewStratford Do you have any log output pertaining to plugin commands timing out?
