Nagios ran out of memory

Description

There was a flood of Nagios notifications this morning which looked like false positives.
While investigating I stumbled upon numerous OOM conditions in the logs, so the memory allocated to the VM should be increased.
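Kernel OOM-killer events typically leave distinctive "Out of memory" lines in the kernel log. A minimal sketch of confirming them, using a hypothetical log excerpt (the sample lines below are illustrative, not taken from this ticket):

```shell
# Write a hypothetical kernel log excerpt to a temp file (sample data only):
printf '%s\n' \
  'Aug 07 04:05:01 monitoring kernel: Out of memory: Kill process 1234 (nagios) score 567 or sacrifice child' \
  'Aug 07 04:05:01 monitoring kernel: Killed process 1234 (nagios) total-vm:2097152kB' \
  > /tmp/sample-kern.log

# Count lines recording an OOM-killer invocation (case-insensitive):
grep -ci 'out of memory' /tmp/sample-kern.log
```

On a live host the same pattern would normally be run against `dmesg` output or `journalctl -k`, assuming journal/ring-buffer access.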

Activity


Marc Dequènes (Duck) August 10, 2018 at 5:11 AM

First, I increased the RAM (doubled it, in this case), so it should be fine now.

As for the network condition, there was a planned outage scheduled by the link provider, and it was not expected to have such an impact. I also don't know why the link redundancy did not help, so we are discussing the issue; more detailed information will follow on the Community Cage ML (to which you should already be subscribed).

Former user August 7, 2018 at 7:35 AM

As for the reasons for the flood of false positives, there is no direct indication in the logs, but it seems the VM lost networking completely, as everything, including the MailMan VM (located in the same OSAS cage), was reported as down:

Aug 07 04:10:23 monitoring.ovirt.org nagios[13103]: HOST ALERT: alterway02.ovirt.org;DOWN;SOFT;1;CRITICAL - Network Unreachable (89.31.150.216)
Aug 07 04:11:09 monitoring.ovirt.org nagios[13103]: HOST ALERT: engine-phx.ovirt.org;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
Aug 07 04:11:21 monitoring.ovirt.org nagios[13103]: HOST ALERT: lists.ovirt.org;DOWN;SOFT;1;check_ping: Invalid hostname/address - lists.ovirt.org
Aug 07 04:12:20 monitoring.ovirt.org nagios[13103]: HOST ALERT: gerrit.ovirt.org;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
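The alert portion of these Nagios log lines is semicolon-separated: host, state, state type (SOFT/HARD), attempt number, then the plugin output. A quick sketch of splitting out those fields from one of the lines above:

```shell
# Parse the semicolon-separated HOST ALERT fields with awk:
# host;state;state_type;attempt;plugin_output
echo 'alterway02.ovirt.org;DOWN;SOFT;1;CRITICAL - Network Unreachable (89.31.150.216)' |
  awk -F';' '{print "host="$1, "state="$2, "type="$3, "attempt="$4}'
```

The SOFT state with attempt 1 means Nagios had not yet exhausted its retry checks when these alerts were logged.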

Was there any recorded outage at OSAS that could have caused that?

Fixed

Details

Created August 7, 2018 at 7:17 AM
Updated September 2, 2018 at 3:50 PM
Resolved August 10, 2018 at 5:11 AM