fix nagios alerts for Jenkins disk
Description
Activity

Nadav Goldin November 10, 2016 at 8:17 PM
missed that & good to know. thanks for the fix.

Former user November 10, 2016 at 3:52 PM
Patch merged, notifications are fixed now

Former user November 10, 2016 at 2:19 PM
Patch created to fix the config:
https://gerrit.ovirt.org/66395

Former user November 10, 2016 at 2:14 PM (edited)
Just as a test - running the plugin directly on the Jenkins partition with 43% free and a 50% free space warning threshold:
current config - integers - no warning
/usr/lib64/nagios/plugins/check_disk -w 50 -c 10 /var/lib/data
DISK OK - free space: /var/lib/data 518040 MB (43% inode=99%);| /var/lib/data=659359MB;1177350;1177390;0;1177400
correct config - percents - warning as expected
/usr/lib64/nagios/plugins/check_disk -w 50% -c 10% /var/lib/data
DISK WARNING - free space: /var/lib/data 518039 MB (43% inode=99%);| /var/lib/data=659360MB;588700;1059660;0;1177400
for reference - config with 600000 as threshold (600GB free) - warning displayed as only 518GB is free
/usr/lib64/nagios/plugins/check_disk -w 600000 -c 10 /var/lib/data
DISK WARNING - free space: /var/lib/data 518030 MB (43% inode=99%);| /var/lib/data=659369MB;577400;1177390;0;1177400
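The warning/critical numbers in the performance data of the three runs above show how each threshold was interpreted. A minimal sketch of that arithmetic follows. It assumes the perfdata thresholds are expressed as used space (total minus the required free space), which is inferred from the outputs in this comment and not taken from the plugin source. The function name `used_threshold_mb` is hypothetical.

```python
def used_threshold_mb(total_mb, threshold):
    """Convert a check_disk style free-space threshold into the used-space
    value that appears in the perfdata (assumption: used = total - free)."""
    text = threshold.strip()
    if text.endswith("%"):
        # Percent form: required free space is a share of the whole partition.
        free_limit_mb = total_mb * int(text[:-1]) // 100
    else:
        # Plain integer form: required free space in units (MB in these runs).
        free_limit_mb = int(text)
    return total_mb - free_limit_mb


TOTAL_MB = 1177400  # max value reported in the perfdata above

# -w 50 -c 10: warning only when less than 50 MB is free
print(used_threshold_mb(TOTAL_MB, "50"), used_threshold_mb(TOTAL_MB, "10"))
# -w 50% -c 10%: warning when less than half of the partition is free
print(used_threshold_mb(TOTAL_MB, "50%"), used_threshold_mb(TOTAL_MB, "10%"))
# -w 600000: warning when less than 600000 MB is free
print(used_threshold_mb(TOTAL_MB, "600000"))
```

The results match the perfdata fields above: 1177350 and 1177390 for the integer config, 588700 and 1059660 for the percent config, and 577400 for the 600000 threshold.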

Former user November 10, 2016 at 2:09 PM
Looks like there's some misconfiguration of the service as Nagios was reporting "OK" state for the last few months even though the disk was filling up:
08.11.2016
[1478563200] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 37949 MB (3% inode=97%):
09.11.2016
[1478649600] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 33181 MB (2% inode=97%):
10.11.2016
[1478736000] CURRENT SERVICE STATE: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;OK;HARD;1;DISK OK - free space: /var/lib/data 19441 MB (1% inode=95%):
[1478738385] SERVICE ALERT: jenkins.phx.ovirt.org;jenkins.phx.ovirt.org /var/lib/data disk;CRITICAL;SOFT;1;DISK CRITICAL - free space: /var/lib/data 0 MB (0% inode=0%):
Nagios checks check_data_disk on the Jenkins server via NRPE, where it's configured as follows:
command[check_data_disk]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 /var/lib/data
This is not correct according to the manual:
https://www.monitoring-plugins.org/doc/man/check_disk.html
If the warning/critical thresholds are plain integers, check_disk treats them as absolute amounts of free space in units (megabytes by default), not as percentages. A percent sign must be appended to each threshold to fix this.
I assume these configs are managed by Puppet, may know more - I'll check them out and add a patch to fix the monitoring plugin definitions.
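To illustrate why the state stayed "OK" in the log above, here is a rough sketch of the threshold semantics as described in the manual (alert when free space is less than the threshold). It is a simplified model and not the plugin's actual code. The helper names are hypothetical, the total of 1177400 MB is taken from the perfdata elsewhere in this ticket, and the percent calculation ignores details such as reserved blocks.

```python
def parse_threshold(text):
    """Return (value, is_percent) for a threshold such as '20' or '20%'."""
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]), True
    return float(text), False


def below(free_mb, total_mb, threshold):
    """True if free space is under the threshold (MB or percent of total)."""
    value, is_percent = parse_threshold(threshold)
    if is_percent:
        return free_mb / total_mb * 100 < value
    return free_mb < value


def classify(free_mb, total_mb, warn, crit):
    """Simplified model of the resulting Nagios state."""
    if below(free_mb, total_mb, crit):
        return "CRITICAL"
    if below(free_mb, total_mb, warn):
        return "WARNING"
    return "OK"


TOTAL_MB = 1177400  # assumption: partition size from the perfdata in this ticket

# Free space values from the log above, checked against both config variants.
for free_mb in (37949, 33181, 19441, 0):
    as_configured = classify(free_mb, TOTAL_MB, "20", "10")
    with_percent = classify(free_mb, TOTAL_MB, "20%", "10%")
    print(free_mb, as_configured, with_percent)
```

With "-w 20 -c 10" the model only leaves "OK" once less than 20 MB is free, which matches the log going straight from OK to CRITICAL at 0 MB. With "-w 20% -c 10%" all four samples would already be CRITICAL.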
Details
Assignee: Former user (Deactivated)
Reporter: Former user (Deactivated)
Priority: Medium
Nagios did not alert about disk space running out on Jenkins, which eventually caused an outage this morning.