Nagios plugins for Ceph

A collection of nagios plugins to monitor a Ceph cluster.

Authentication

Ceph is normally configured to use cephx to authenticate its clients.

To run check_ceph_health or the other plugins as the nagios user, create a dedicated keyring:

root# ceph auth get-or-create client.nagios mon 'allow r' > ceph.client.nagios.keyring

And use this keyring with the plugin:

nagios$ ./check_ceph_health --id nagios --keyring ceph.client.nagios.keyring

check_ceph_health

The check_ceph_health nagios plugin monitors the ceph cluster and reports its health. The result can be filtered to only look at certain health checks.

Usage

usage: check_ceph_health [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-n NAME] [-i ID] [-k KEYRING] [-w WHITELIST] [-d]

'ceph health' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  --check CHECK         regexp of which check(s) to check (luminous+) Can be
                        inverted, e.g. '^((?!PG_DEGRADED|OBJECT_MISPLACED).)*$'
  -w, --whitelist REGEXP
                        whitelist regexp for ceph health warnings
  -d, --detail          exec 'ceph health detail'
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_health --name client.nagios --keyring ceph.client.nagios.keyring
HEALTH WARNING: 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery 4448/28924462 degraded (0.015%); 2/9857830 unfound (0.000%);
nagios$ echo $?
1
nagios$
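Nagios reads a plugin's exit status as 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), which is why `echo $?` prints 1 for the warning above. A minimal sketch of a wrapper that runs a plugin and decodes its status follows; `run_plugin` and `STATUS_NAMES` are illustrative names and are not part of this repository.

```python
import subprocess

# Standard Nagios plugin exit statuses.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv):
    """Run a plugin command line and return (status_name, first_output_line)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Any exit status outside 0-3 is treated as UNKNOWN here (an assumption).
    return STATUS_NAMES.get(proc.returncode, "UNKNOWN"), first_line
```

Calling `run_plugin(["./check_ceph_health", "--id", "nagios", "--keyring", "ceph.client.nagios.keyring"])` would return the status name together with the first line the plugin printed.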

nagios$ ./check_ceph_health --id nagios --whitelist 'requests.are.blocked(\s)*32.sec'

nagios$ ./check_ceph_health --id nagios
WARNING: MON_CLOCK_SKEW( clock skew detected on mon.a )
OBJECT_MISPLACED( 1937172/695961284 objects misplaced (0.278%) )
PG_DEGRADED( Degraded data redundancy: 98/695961284 objects degraded (0.000%), 1 pg degraded )

nagios$ ./check_ceph_health --id nagios --check 'PG_DEGRADED|OBJECT_MISPLACED'
WARNING: OBJECT_MISPLACED( 1937172/695961284 objects misplaced (0.278%) )
PG_DEGRADED( Degraded data redundancy: 98/695961284 objects degraded (0.000%), 1 pg degraded )

nagios$ ./check_ceph_health --id nagios --check '^((?!PG_DEGRADED|OBJECT_MISPLACED).)*$'
WARNING: MON_CLOCK_SKEW( clock skew detected on mon.a )
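The effect of a `--check` expression can be tried out with Python's `re` module before it is put into a Nagios command definition. The sketch below assumes the regexp is searched against each health check name; how the plugin applies it internally is not shown in this README, and `select_checks` is a hypothetical helper.

```python
import re


def select_checks(check_names, pattern):
    """Return the check names selected by the regexp (assumed re.search semantics)."""
    regex = re.compile(pattern)
    return [name for name in check_names if regex.search(name)]


checks = ["MON_CLOCK_SKEW", "OBJECT_MISPLACED", "PG_DEGRADED"]

# Plain alternation keeps only the named checks.
print(select_checks(checks, "PG_DEGRADED|OBJECT_MISPLACED"))
# -> ['OBJECT_MISPLACED', 'PG_DEGRADED']

# The negative lookahead inverts the selection.
print(select_checks(checks, "^((?!PG_DEGRADED|OBJECT_MISPLACED).)*$"))
# -> ['MON_CLOCK_SKEW']
```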

check_ceph_mon

The check_ceph_mon nagios plugin monitors an individual mon daemon, reporting its status.

Possible results are OK (up) and WARN (missing).

Usage

usage: check_ceph_mon [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                      [-k KEYRING] [-V] [-I MONID]

'ceph quorum_status' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -I MONID, --monid MONID
                        mon ID to be checked for availability

Example

nagios$ ./check_ceph_mon -I node1
MON OK

nagios$ ./check_ceph_mon --monid node2
MON WARN: no mon 'node2' found in quorum
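The idea behind the check can be sketched as a lookup in the `quorum_names` list of the `ceph quorum_status` JSON output. This is an illustration under that assumption, not the plugin's actual code; `check_mon` is a hypothetical name and the sample JSON is trimmed to the fields the sketch uses.

```python
import json


def check_mon(quorum_status_json, monid):
    """Return (exit_code, message) for one mon, given `ceph quorum_status` JSON text."""
    status = json.loads(quorum_status_json)
    if monid in status.get("quorum_names", []):
        return 0, "MON OK"
    # Exit status 1 is the Nagios WARNING state.
    return 1, "MON WARN: no mon '%s' found in quorum" % monid


# Trimmed, hand-written sample; real output contains many more fields.
sample = json.dumps({"quorum": [0], "quorum_names": ["node1"]})
print(check_mon(sample, "node1"))
print(check_mon(sample, "node2"))
```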

check_ceph_osd

The check_ceph_osd nagios plugin monitors an individual osd daemon or host, reporting its status.

Possible results are OK (up) and WARN (down or missing).

Usage

usage: check_ceph_osd [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                     [-k KEYRING] [-V] -H HOST [-I OSDID] [-o]

'ceph osd' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -H HOST, --host HOST  osd host
  -I OSDID, --osdid OSDID
                        osd id
  -o, --out             check osds that are set OUT

Example

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 0
OSD OK

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 0
OSD WARN: OSD.0 is down at 172.17.0.2

nagios$ ./check_ceph_osd -H 172.17.0.2 -I 100
OSD WARN: no OSD.100 found at host 172.17.0.2

nagios$ ./check_ceph_osd -H 172.17.0.2
OSD WARN: Down OSD on 172.17.0.2: osd.0

check_ceph_rgw

The check_ceph_rgw nagios plugin monitors a ceph rados gateway, reporting its status and bucket usage.

Possible results are OK (up) and WARN (down or missing).

Usage

usage: check_ceph_rgw [-h] [-d] [-B] [-e EXE] [-c CONF] [-i ID] [-V]

'radosgw-admin bucket stats' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -d, --detail          output perf data for all buckets
  -B, --byte            output perf data in Byte instead of KB
  -e EXE, --exe EXE     radosgw-admin executable [/usr/bin/radosgw-admin]
  -c CONF, --conf CONF  alternative ceph conf file
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_rgw
RGW OK: 4 buckets, 102276 KB total | /=102276KB

nagios$ ./check_ceph_rgw --detail --byte
RGW OK: 4 buckets, 102276 KB total | /=104730624B bucket-test1=151552B bucket-test0=12288B bucket-test2=104566784B bucket-test=0B
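The part after the `|` is Nagios performance data in `label=value[unit]` form, with `/` used as the label for the total. A minimal sketch of how such a line could be assembled from per-bucket sizes in bytes is shown below; `format_rgw_output` is a hypothetical helper, and the floor division to KB is an assumption chosen because it reproduces the numbers in the examples above.

```python
def format_rgw_output(bucket_bytes, detail=False, byte=False):
    """Build an 'RGW OK' line with perfdata from a {bucket_name: size_in_bytes} dict."""
    total_bytes = sum(bucket_bytes.values())
    unit = "B" if byte else "KB"

    def scaled(size):
        # Perfdata is reported in bytes or in whole KB.
        return size if byte else size // 1024

    perfdata = ["/=%d%s" % (scaled(total_bytes), unit)]
    if detail:
        perfdata += ["%s=%d%s" % (name, scaled(size), unit)
                     for name, size in bucket_bytes.items()]
    # The human-readable total stays in KB in both modes.
    return "RGW OK: %d buckets, %d KB total | %s" % (
        len(bucket_bytes), total_bytes // 1024, " ".join(perfdata))


buckets = {"bucket-test1": 151552, "bucket-test0": 12288,
           "bucket-test2": 104566784, "bucket-test": 0}
print(format_rgw_output(buckets))
print(format_rgw_output(buckets, detail=True, byte=True))
```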

check_ceph_rgw_api

The check_ceph_rgw_api nagios plugin monitors a ceph rados gateway, reporting its status and bucket usage.

Difference with `check_ceph_rgw`:

check_ceph_rgw connects to the cluster, while check_ceph_rgw_api connects to radosgw directly via the admin API. You can check each radosgw instance individually, a single endpoint behind a proxy/balancer, or both.

Possible results

OK - bucket info received from radosgw;
WARNING - connected, but wrong admin entry or usage caps;
UNKNOWN - can't connect to proxy/balancer or radosgw directly;

Requirements

Install the requests-aws Python library:

pip install requests-aws

Configure admin entry point (default is 'admin'):

rgw admin entry = "admin"

Enable admin API (default is enabled):

rgw enable apis = "s3, admin"

Add the buckets=read capability to the user that performs the checks; see the Admin Guide for more details.
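To see how the admin entry point and the server URL fit together, here is a small sketch that builds a bucket statistics URL in the style of the radosgw Admin Ops API (`GET /{admin entry}/bucket` with `format=json` and `stats=True`). Whether the plugin issues exactly this request is an assumption; `bucket_stats_url` is a hypothetical helper. The real request must also be signed with the S3 access and secret keys, which is what the requests-aws library is for and which is not shown here.

```python
from urllib.parse import urlencode, urljoin


def bucket_stats_url(host, admin_entry="admin"):
    """Return the admin API URL for bucket statistics on the given server URL."""
    # Ensure a trailing slash so urljoin appends to the host instead of replacing a path.
    base = host if host.endswith("/") else host + "/"
    query = urlencode({"format": "json", "stats": "True"})
    return urljoin(base, "%s/bucket?%s" % (admin_entry, query))


print(bucket_stats_url("https://objects.dreamhost.com/"))
# -> https://objects.dreamhost.com/admin/bucket?format=json&stats=True
```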

Usage

usage: check_ceph_rgw_api [-h] -H HOST [-k] [-e ADMIN_ENTRY] -a ACCESS_KEY -s
                      SECRET_KEY [-d] [-b] [-v]

'radosgw api bucket stats' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  Server URL for the radosgw api (example:
                        http://objects.dreamhost.com/)
  -k, --insecure        Allow insecure server connections when using SSL
  -e ADMIN_ENTRY, --admin_entry ADMIN_ENTRY
                        The entry point for an admin request URL [default is
                        'admin']
  -a ACCESS_KEY, --access_key ACCESS_KEY
                        S3 access key
  -s SECRET_KEY, --secret_key SECRET_KEY
                        S3 secret key
  -d, --detail          output perf data for all buckets
  -b, --byte            output perf data in Byte instead of KB
  -v, --version         show version and exit

Example

nagios$ ./check_ceph_rgw_api -H https://objects.dreamhost.com/ -a JXUABTZZYHAFLCMF9VYV -s jjP8RDD0R156atS6ACSy2vNdJLdEPM0TJQ5jD1pw
RGW OK: 1 buckets, 7696 KB total | /=7696KB

nagios$ ./check_ceph_rgw_api -H objects.dreamhost.com -a JXUABTZZYHAFLCMF9VYV -s jjP8RDD0R156atS6ACSy2vNdJLdEPM0TJQ5jD1pw --detail --byte
RGW OK: 1 buckets, 7696 KB total | /=7880704B k0ste=7880704B

check_ceph_df

The check_ceph_df nagios plugin monitors a ceph cluster, reporting its RAW capacity usage as a percentage, or the usage of a specific pool.

Possible results are OK, WARN and CRITICAL.

Usage

usage: check_ceph_df [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID] [-n NAME]
                     [-k KEYRING] [-d] [-W WARN] [-C CRITICAL] [-V]

'ceph df' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor address[:port]
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -p POOL, --pool POOL  ceph pool name
  -d, --detail          show pool details on warn and critical
  -W WARN, --warn WARN  warn above this percent RAW USED
  -C CRITICAL, --critical CRITICAL
                        critical alert above this percent RAW USED
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 29.12 -C 30.22 -d
RAW usage 28.36%

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 26.14 -C 30
WARNING: global RAW usage of 28.36% is above 26.14% (783G of 1093G free)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 60 -C 70 -p hdd
CRITICAL: Pool 'hdd' usage of 71.71% is above 70.0% (9703G used)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 60 -C 70 -p nvme
CRITICAL: Pool 'nvme' usage of 76.08% is above 70.0% (223G used)

nagios$ ./check_ceph_df -i nagios -k /etc/ceph/ceph.client.nagios.keyring -W 26.14 -C 30 -d
WARNING: global RAW usage of 28.36% is above 26.14% (783G of 1093G free)

 POOLS:
     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
     rbd                 0      96137M      8.59          348G       24441
     cephfs_data         1      61785M      5.52          348G       99940
     cephfs_metadata     2      40380k         0          348G        8037
     libvirt-pool        3         145         0          348G           2
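The threshold logic in the examples above can be sketched as follows. This is an illustration only: `classify_usage` is a hypothetical helper, and the strict "above" comparison is an assumption based on the help text.

```python
def classify_usage(percent_used, warn, critical):
    """Map a usage percentage to a Nagios (exit_code, label) pair."""
    # Check the critical threshold first so it wins when both are exceeded.
    if percent_used > critical:
        return 2, "CRITICAL"
    if percent_used > warn:
        return 1, "WARNING"
    return 0, "OK"


print(classify_usage(28.36, 29.12, 30.22))  # -> (0, 'OK')
print(classify_usage(28.36, 26.14, 30))     # -> (1, 'WARNING')
print(classify_usage(71.71, 60, 70))        # -> (2, 'CRITICAL')
```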

check_ceph_mds

The check_ceph_mds nagios plugin monitors an individual mds daemon, reporting its status.

Possible results are OK, WARN (laggy) and ERROR (not found).

Usage

usage: check_ceph_mds [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                      [-k KEYRING] [-V] -n NAME -f FILESYSTEM

'ceph mds stat' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit
  -n NAME, --name NAME  mds daemon name
  -f FILESYSTEM, --filesystem FILESYSTEM
                        mds filesystem name

Example

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-1
MDS OK: MDS 'ceph-mds-1' is up:active

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-2
MDS OK: MDS 'ceph-mds-2' is up:standby

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-1
MDS WARN: MDS 'ceph-mds-1' is up:active (laggy or crashed)

nagios$ ./check_ceph_mds -f cephfs -n ceph-mds-3
MDS ERROR: MDS 'ceph-mds-3' is not found (offline?)

check_ceph_mgr

The check_ceph_mgr nagios plugin monitors the mgr.

Usage

usage: check_ceph_mgr [-h] [-e EXE] [-c CONF] [-m MONADDRESS] [-i ID]
                    [-n NAME] [-k KEYRING] [-V]

'ceph mgr dump' nagios plugin.

optional arguments:
  -h, --help            show this help message and exit
  -e EXE, --exe EXE     ceph executable [/usr/bin/ceph]
  -c CONF, --conf CONF  alternative ceph conf file
  -m MONADDRESS, --monaddress MONADDRESS
                        ceph monitor to use for queries (address[:port])
  -i ID, --id ID        ceph client id
  -n NAME, --name NAME  ceph client name
  -k KEYRING, --keyring KEYRING
                        ceph client keyring file
  -V, --version         show version and exit

Example

nagios$ ./check_ceph_mgr
MGR OK: active: zhdk0013, standbys: zhdk0009, zhdk0025
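A status line like the one above can be derived from the `active_name` and `standbys` fields of the `ceph mgr dump` JSON output. The sketch below assumes those two fields and only covers the OK case, since this README does not show the plugin's output when no mgr is active; `summarize_mgr` is a hypothetical helper and the sample JSON is trimmed and hand-written.

```python
import json


def summarize_mgr(mgr_dump_json):
    """Return 'active: <name>, standbys: <names>' from `ceph mgr dump` JSON text."""
    dump = json.loads(mgr_dump_json)
    standbys = [standby["name"] for standby in dump.get("standbys", [])]
    return "active: %s, standbys: %s" % (dump["active_name"], ", ".join(standbys))


# Trimmed sample; real output contains many more fields.
sample = json.dumps({
    "active_name": "zhdk0013",
    "standbys": [{"name": "zhdk0009"}, {"name": "zhdk0025"}],
})
print("MGR OK: " + summarize_mgr(sample))
# -> MGR OK: active: zhdk0013, standbys: zhdk0009, zhdk0025
```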

check_ceph_osd_db

The check_ceph_osd_db nagios plugin checks the percentage usage of the BlueStore DB for an OSD and reports critical if it is above the threshold.

check_ceph_osd_frag

The check_ceph_osd_frag nagios plugin checks the percentage fragmentation of the BlueStore DB for an OSD and reports critical if it is above the threshold. See the Red Hat documentation for more details.
