[monitoring] Add monitoring for critical services

Description

Add monitoring of the production servers, including:

- artifactory.ovirt.org/ovirt-mirror access
- proxy access
- jenkins
- mailman
- any other service that affects the environment
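Checks such as "artifactory.ovirt.org/ovirt-mirror access" and "proxy access" boil down to fetching a URL, directly or through the proxy, and mapping the result to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is an illustrative sketch only, not the stock check_http plugin that would normally be used: the http:// scheme, the proxy URL, the timeout and the status-to-state policy are all assumptions.

```python
# Sketch of a Nagios-style HTTP reachability check (not the stock check_http).
# Assumptions: http:// scheme for the target, optional proxy URL, 10 s timeout,
# and the hypothetical status-code-to-state policy in classify_status().
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_status(status):
    """Map an HTTP status code to a Nagios state (hypothetical policy)."""
    if 200 <= status < 400:
        return OK
    if 400 <= status < 500:
        return WARNING
    return CRITICAL


def check_url(url, proxy=None, timeout=10):
    """Fetch url, optionally via an HTTP proxy, and return (state, message)."""
    handlers = []
    if proxy:
        # Route both http and https requests through the given proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            status = resp.getcode()
    except urllib.error.HTTPError as exc:
        # A non-2xx answer still means the server is reachable.
        status = exc.code
    except OSError as exc:
        # URLError, DNS failures, refused connections and timeouts.
        return CRITICAL, f"CRITICAL - {url} unreachable: {exc}"
    state = classify_status(status)
    return state, f"{LABELS[state]} - {url} returned HTTP {status}"
```

A plugin wrapper would print the message and exit with the state, e.g. after `state, message = check_url("http://artifactory.ovirt.org/ovirt-mirror", proxy="http://PROXY_HOST:3128")`, where PROXY_HOST is a placeholder.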

Make sure the infra@ovirt.org list is notified on each failure.
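In a stock Nagios setup, defining infra@ovirt.org as a contact that uses the sample notify-service-by-email command would cover this requirement. As a sketch of what such a notification command does, the Python below builds the alert mail. It assumes Nagios passes host, service, state and plugin output from macros such as $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$, that a mail relay listens on localhost, and it uses a hypothetical sender address.

```python
# Sketch of a notification command that mails infra@ovirt.org on failures.
# Assumptions: arguments come from Nagios macros ($HOSTNAME$, $SERVICEDESC$,
# $SERVICESTATE$, $SERVICEOUTPUT$, $NOTIFICATIONTYPE$); the sender address
# and the SMTP relay on localhost are hypothetical.
import smtplib
from email.message import EmailMessage

RECIPIENT = "infra@ovirt.org"
SENDER = "nagios@monitoring.phx.ovirt.org"  # hypothetical sender address


def build_alert(host, service, state, output, notification_type="PROBLEM"):
    """Return an EmailMessage describing one service state change."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    # Subject loosely modelled on the Nagios sample notification format.
    msg["Subject"] = f"** {notification_type} Service Alert: {host}/{service} is {state} **"
    msg.set_content(
        f"Notification type: {notification_type}\n"
        f"Host: {host}\n"
        f"Service: {service}\n"
        f"State: {state}\n"
        f"Info: {output}\n"
    )
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Hand the message to the SMTP relay assumed to run on smtp_host."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```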

General steps:
1. Create a new VM in the production DC: monitoring.phx.ovirt.org
2. Install EL7.
3. Choose a monitoring system (we might run several on it); we want to start with Nagios/Icinga.
4. Puppetize the server.
5. Start adding monitoring for all hypervisors, VMs, disk space, etc.
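Disk space on hypervisors and VMs would normally be covered by the stock check_disk plugin, typically run locally on each host (for example via NRPE). Purely to illustrate the plugin contract (a status line, optional performance data after `|`, and an exit code), a minimal Python version might look like this; the 80%/90% thresholds are assumptions.

```python
# Sketch of a Nagios-style disk space check; the stock check_disk plugin
# would normally be used instead. Thresholds (80%/90%) are assumptions.
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def disk_state(used_pct, warn=80.0, crit=90.0):
    """Map a used-space percentage to a Nagios state."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, message) for the filesystem that contains path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: filesystem reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    state = disk_state(used_pct, warn, crit)
    free_gib = usage.free / 2**30
    # Text before "|" is the status line; after it is performance data
    # in the plugin format label=value[UOM];warn;crit.
    return state, (
        f"DISK {LABELS[state]} - {path} {used_pct:.1f}% used, {free_gib:.1f} GiB free"
        f" | used_pct={used_pct:.1f}%;{warn};{crit}"
    )
```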

It's best to open a Google Sheet and document which servers/services we are monitoring.

But let's start with creating and installing the VM.

Activity

Evgheni Dereveanchin
March 10, 2017, 4:08 PM

Updated list of all services in PHX that require monitoring:

VM                  needs monitoring  is monitored  services
artifactory                                         http,disk
backup                                              disk
foreman                                             dns,dhcp,disk
gerrit-staging      -                 -             http,gerrit,postgres,disk
glance                                              http,disk,mysql
graphite                                            http,disk
gw01                                                dns,dhcp,openvpn,disk
engine                                              http,ovirt-engine,postgres,disk
jenkins                                             http,jenkins,disk
jenkins-staging     -                               http,jenkins,disk
lists                                               http,postfix,courier,disk
mail                                                http,postfix,courier,disk
mirrors                                             http,disk
monitoring                                          http,nagios,disk (not PROD yet)
openshift                                           http,origin-master,origin-node,disk
openshift-staging   -                 -             http,origin-master,origin-node,disk
proxy                                               squid,disk (not all monitored)
resources                                           http,disk
stats               -                 -             http,disk (not PROD yet)
templates                                           http,disk
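To keep the Nagios configuration in sync with a list like this one (or the Google Sheet suggested in the description), service definitions could be generated from it. A sketch, assuming the VM names work as Nagios host_name values, that a generic-service template exists (as in the Nagios sample configuration), and a hypothetical mapping from service keywords to check commands. Services without a mapping are returned separately so they can be tracked as gaps.

```python
# Sketch: generate Nagios "define service" blocks from the PHX inventory.
# Assumptions: VM names are valid Nagios host_name values, a
# "generic-service" template exists (as in the Nagios sample config), and
# the keyword -> check_command mapping below is hypothetical.

# A few rows from the table above: VM -> services to monitor.
INVENTORY = {
    "artifactory": ["http", "disk"],
    "foreman": ["dns", "dhcp", "disk"],
    "jenkins": ["http", "jenkins", "disk"],
    "lists": ["http", "postfix", "courier", "disk"],
    "proxy": ["squid", "disk"],
}

# Hypothetical mapping; each command must exist in the monitoring server's config.
CHECK_COMMANDS = {
    "http": "check_http",
    "disk": "check_nrpe!check_disk",
    "dns": "check_dns",
    "dhcp": "check_dhcp",
    "squid": "check_tcp!3128",
    "jenkins": "check_http!-p 8080",
}


def service_definitions(inventory, commands):
    """Return (blocks, missing).

    blocks:  one Nagios service definition per mapped VM/service pair.
    missing: (VM, service) pairs that have no check command yet.
    """
    blocks, missing = [], []
    for host in sorted(inventory):
        for svc in inventory[host]:
            command = commands.get(svc)
            if command is None:
                missing.append((host, svc))
                continue
            blocks.append(
                "define service {\n"
                "    use                  generic-service\n"
                f"    host_name            {host}\n"
                f"    service_description  {svc}\n"
                f"    check_command        {command}\n"
                "}\n"
            )
    return blocks, missing
```

Writing `"\n".join(blocks)` to a file in the Nagios object configuration directory would add the checks, while `missing` doubles as a to-do list (here postfix and courier on lists).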

There are also physical servers that need to be monitored. Hosts need to be added to monitoring, as do the hosts in the internal VLAN, by setting up passive checks and forwarding them from the PHX Nagios.
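On the receiving side, a passive result is one PROCESS_SERVICE_CHECK_RESULT line written to the Nagios external command file. The sketch below only shows that format; in a real forwarding setup the results would typically travel through an add-on such as NSCA rather than a direct file write. It assumes the receiving Nagios has check_external_commands and accept_passive_service_checks enabled, and the command file path is the usual EPEL location, which should be checked against command_file in nagios.cfg. Host and service names in the test are placeholders.

```python
# Sketch: submit a passive service check result to Nagios.
# Assumptions: check_external_commands and accept_passive_service_checks are
# enabled, and COMMAND_FILE matches command_file in nagios.cfg (verify it).
import time

COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"  # verify against nagios.cfg


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1, 2 or 3 (UNKNOWN)")
    ts = int(time.time() if timestamp is None else timestamp)
    # Keep the output on one line; semicolons are replaced as a precaution
    # because they delimit the command's fields.
    clean = output.replace(";", ",").replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean}\n"


def submit(line, command_file=COMMAND_FILE):
    """Append the command to Nagios' command pipe (a FIFO on disk)."""
    with open(command_file, "w") as fifo:
        fifo.write(line)
```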

Evgheni Dereveanchin
March 10, 2017, 5:09 PM

Added basic health monitoring for storage to verify that both hosts, as well as the cluster IP, respond to ping.
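The storage check described here reduces to pinging each storage host plus the cluster IP. A sketch, assuming Linux iputils ping (`-c` count, `-W` timeout in seconds) and placeholder addresses. Reporting a single node outage as WARNING and a cluster IP outage as CRITICAL is a hypothetical severity choice, not something this ticket specifies. The `probe` parameter lets the logic be exercised without sending packets.

```python
# Sketch: storage health check, both nodes and the cluster IP must answer ping.
# Assumptions: Linux iputils ping options, placeholder addresses, and the
# hypothetical severity policy in check_storage().
import subprocess

OK, WARNING, CRITICAL = 0, 1, 2


def ping(address, timeout=2):
    """Return True if address answers one ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_storage(nodes, cluster_ip, probe=ping):
    """CRITICAL if the cluster IP is down, WARNING if only a node is down."""
    down = [n for n in nodes if not probe(n)]
    if not probe(cluster_ip):
        return CRITICAL, f"CRITICAL - cluster IP {cluster_ip} not responding"
    if down:
        return WARNING, "WARNING - node(s) not responding: " + ", ".join(down)
    return OK, f"OK - {len(nodes)} node(s) and cluster IP {cluster_ip} respond"
```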

Eyal Edri
March 12, 2017, 7:18 AM

Unblocking: let's add what we can to the existing Nagios, since reinstalling the alterway server for it might take more time.

Eyal Edri
June 10, 2018, 3:46 PM

This should probably be converted into an Epic with smaller tasks for the specific services that still lack monitoring.

Eyal Edri
December 15, 2018, 2:40 PM

Please use this Epic to attach tickets about monitoring the remaining critical services. Do we need to update the list with what we are monitoring today?

Assignee

Evgheni Dereveanchin

Reporter

Eyal Edri

Blocked By

Pending reinstall of the alterway server

Components

Priority

High