Fix Nagios alerts for Jenkins disk

Description

Nagios did not alert about disk space running out on Jenkins, which eventually caused an outage this morning.

Activity

Nadav Goldin November 10, 2016 at 8:17 PM

missed that & good to know. thanks for the fix.

Former user November 10, 2016 at 3:52 PM

Patch merged, notifications are fixed now

Former user November 10, 2016 at 2:19 PM

Patch created to fix the config:
https://gerrit.ovirt.org/66395

Former user November 10, 2016 at 2:14 PM

Just as a test - running the plugin directly on the Jenkins partition with 43% free and a 50% free-space warning threshold:

current config - integers - no warning

  # /usr/lib64/nagios/plugins/check_disk -w 50 -c 10 /var/lib/data
  DISK OK - free space: /var/lib/data 518040 MB (43% inode=99%);| /var/lib/data=659359MB;1177350;1177390;0;1177400

correct config - percents - warning as expected

  # /usr/lib64/nagios/plugins/check_disk -w 50% -c 10% /var/lib/data
  DISK WARNING - free space: /var/lib/data 518039 MB (43% inode=99%);| /var/lib/data=659360MB;588700;1059660;0;1177400

for reference - config with 600000 as threshold (600GB free) - warning displayed as only 518GB is free

  # /usr/lib64/nagios/plugins/check_disk -w 600000 -c 10 /var/lib/data
  DISK WARNING - free space: /var/lib/data 518030 MB (43% inode=99%);| /var/lib/data=659369MB;577400;1177390;0;1177400
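
The warn/crit numbers in the perfdata above are expressed as used space, so they appear to be "total size minus the free-space threshold". A minimal Python sketch of that arithmetic, assuming the 1177400 MB total from the perfdata; `perfdata_threshold_mb` is a hypothetical helper for illustration and not part of check_disk:

```python
def perfdata_threshold_mb(threshold: str, total_mb: int) -> int:
    """Turn a free-space threshold into the used-space value shown in perfdata.

    Assumption: a plain integer means "MB free", a trailing % means
    "percent of the partition free".
    """
    if threshold.endswith("%"):
        free_limit_mb = total_mb * float(threshold[:-1]) / 100
    else:
        free_limit_mb = float(threshold)
    return round(total_mb - free_limit_mb)


TOTAL_MB = 1177400  # last perfdata field in the plugin output

# The three -w/-c pairs used in the test runs.
for warn, crit in [("50", "10"), ("50%", "10%"), ("600000", "10")]:
    print(
        f"-w {warn} -c {crit}:",
        perfdata_threshold_mb(warn, TOTAL_MB),
        perfdata_threshold_mb(crit, TOTAL_MB),
    )
```

With the integer config the warning level sits at 1177350 MB used, i.e. only 50 MB short of a completely full partition, which is why no warning was raised at 43% free.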

Former user November 10, 2016 at 2:09 PM

Looks like there's some misconfiguration of the service as Nagios was reporting "OK" state for the last few months even though the disk was filling up:

08.11.2016
[1478563200] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 37949 MB (3% inode=97%):
09.11.2016
[1478649600] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 33181 MB (2% inode=97%):
10.11.2016
[1478736000] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 19441 MB (1% inode=95%):
[1478738385] SERVICE ALERT: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;CRITICAL;SOFT;1;DISK CRITICAL - free space: /var/lib/data 0 MB (0% inode=0%):

Nagios checks check_data_disk on the Jenkins server via NRPE, where it's configured as follows:

command[check_data_disk]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 /var/lib/data

This is not correct according to the manual:
https://www.monitoring-plugins.org/doc/man/check_disk.html

If the warning/critical thresholds are plain integers, they are interpreted as an amount of free space in units (MB by default), not as a percentage; a percent sign must be appended to fix this.
These configs are puppet managed I assume, may know more - I'll check them out and add a patch to fix monitoring plugin definitions.
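
To illustrate why the service stayed "OK" for so long, here is a simplified Python model of the threshold comparison, applied to the free-space values from the log excerpt. Assumptions: `disk_state` is a hypothetical stand-in for the plugin's logic, the 1177400 MB total is taken from the perfdata of the manual test runs, and the "20%"/"10%" pair is only a guess at what the corrected thresholds look like. The real plugin may compute the free percentage slightly differently.

```python
def disk_state(free_mb: float, total_mb: float, warn: str, crit: str) -> str:
    """Return OK/WARNING/CRITICAL for a given amount of free space."""

    def below(threshold: str) -> bool:
        # "N%" compares the free percentage, a bare "N" compares free MB.
        if threshold.endswith("%"):
            return 100 * free_mb / total_mb < float(threshold[:-1])
        return free_mb < float(threshold)

    if below(crit):
        return "CRITICAL"
    if below(warn):
        return "WARNING"
    return "OK"


TOTAL_MB = 1177400  # assumed partition size, from the perfdata

# Free MB reported on 08.11, 09.11, 10.11 and at the time of the outage.
for free_mb in (37949, 33181, 19441, 0):
    print(
        free_mb,
        disk_state(free_mb, TOTAL_MB, "20", "10"),    # deployed config
        disk_state(free_mb, TOTAL_MB, "20%", "10%"),  # assumed corrected config
    )
```

Under this model the integer thresholds only trip once fewer than 20 MB remain, which matches the log: "OK" at 19441 MB free and then straight to "CRITICAL" at 0 MB.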

Fixed

Details

Created November 10, 2016 at 1:48 PM
Updated November 29, 2016 at 12:18 PM
Resolved November 10, 2016 at 3:52 PM