Jenkins outage 26.07.2017

Description

The Jenkins VM was paused due to I/O errors on the storage. I resumed the VM and it is operational again; opening this ticket to investigate the cause of the outage.

Activity

Evgheni Dereveanchin
July 26, 2017, 9:33 PM

Engine audit log has the following entries before the outage:

Jul 26, 2017 7:57:16 PM

Host ovirt-srv01 has network interface which exceeded the defined threshold [95%] (eno2: transmit rate[0%], receive rate [97%])

Jul 26, 2017 9:27:14 PM

Host ovirt-srv01 has network interface which exceeded the defined threshold [95%] (eno2: transmit rate[0%], receive rate [97%])

Jul 26, 2017 9:48:20 PM

VM jenkins-phx-ovirt-org is not responding.

Jul 26, 2017 9:54:22 PM

VM jenkins-phx-ovirt-org is not responding.

Jul 26, 2017 9:54:37 PM

VM jenkins-phx-ovirt-org has been paused.

Jul 26, 2017 9:54:37 PM

VM jenkins-phx-ovirt-org has been paused due to storage I/O problem.
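To read the audit entries above as a timeline, here is a minimal Python sketch that parses the timestamps and measures how long the VM was reported as not responding before it was paused. The entries are copied from this ticket; the network message is abbreviated, and the names `AUDIT`, `parse_audit_ts` and `unresponsive_s` are illustrative only. It assumes an English locale for the month and AM/PM parsing.

```python
from datetime import datetime

# (timestamp, message) pairs copied from the engine audit log in this ticket.
# The network threshold message is abbreviated here for readability.
AUDIT = [
    ("Jul 26, 2017 7:57:16 PM", "eno2 receive rate [97%] exceeded threshold [95%]"),
    ("Jul 26, 2017 9:27:14 PM", "eno2 receive rate [97%] exceeded threshold [95%]"),
    ("Jul 26, 2017 9:48:20 PM", "VM jenkins-phx-ovirt-org is not responding."),
    ("Jul 26, 2017 9:54:22 PM", "VM jenkins-phx-ovirt-org is not responding."),
    ("Jul 26, 2017 9:54:37 PM", "VM jenkins-phx-ovirt-org has been paused."),
]


def parse_audit_ts(text):
    """Parse an audit log timestamp such as 'Jul 26, 2017 9:48:20 PM'."""
    return datetime.strptime(text, "%b %d, %Y %I:%M:%S %p")


events = [(parse_audit_ts(ts), msg) for ts, msg in AUDIT]

# First "not responding" report and the first "paused" report.
first_unresponsive = next(t for t, m in events if "not responding" in m)
paused_at = next(t for t, m in events if "has been paused" in m)

unresponsive_s = int((paused_at - first_unresponsive).total_seconds())
print(f"VM unresponsive for {unresponsive_s}s before being paused")
```

This only orders the engine-side events; it says nothing about the cause of the pause.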

The host's vdsm.log has lines like this:

Jul 26 19:47:54 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainG
Jul 26 19:47:54 ovirt-srv01.ovirt.org libvirtd[1675]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
Jul 26 19:48:11 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainG
Jul 26 19:48:11 ovirt-srv01.ovirt.org libvirtd[1675]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
Jul 26 19:48:41 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainG
Jul 26 19:48:41 ovirt-srv01.ovirt.org libvirtd[1675]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
...
Jul 26 20:26:37 ovirt-srv01.ovirt.org vdsm[1278]: vdsm vds.dispatcher WARN unhandled write event
Jul 26 20:26:37 ovirt-srv01.ovirt.org vdsm[1278]: vdsm vds.dispatcher WARN unhandled write event
Jul 26 20:26:37 ovirt-srv01.ovirt.org vdsm[1278]: vdsm vds.dispatcher WARN unhandled write event

This is the only VM that got paused; I need to dig deeper to confirm why. Possible causes include storage slowness, network load on the host, or a libvirt issue.

Evgheni Dereveanchin
July 27, 2017, 12:02 PM

The log of the qemu domain shows no I/O errors, nor does dmesg. Also, this was the sole VM that got paused.
Lines from journalctl noted previously may explain the cause:

Jul 26 19:47:54 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (43s, 0s)
Jul 26 19:47:54 ovirt-srv01.ovirt.org libvirtd[1675]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
...
Jul 26 19:53:54 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (403s, 0s)
Jul 26 19:53:54 ovirt-srv01.ovirt.org libvirtd[1675]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
Jul 26 19:54:10 ovirt-srv01.ovirt.org kernel: nfs: server 66.187.230.61 not responding, timed out

Symptoms are close to this bug: the NFS server did not respond to a request, and the resulting timeout caused I/O to fail and the VM to be paused.
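As a rough way to check this reading of the journal, the Python sketch below extracts how long the libvirt query job had been holding the state change lock and lists NFS timeout messages. The sample lines are copied from this ticket. The regular expressions, the helper names (`parse_ts`, `scan_journal`) and the year 2017 (journal timestamps omit it) are assumptions made for illustration, and month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

YEAR = 2017  # assumption: the journal omits the year; taken from the ticket date

# Lines copied from the journalctl excerpt in this ticket.
SAMPLE = """\
Jul 26 19:47:54 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (43s, 0s)
Jul 26 19:53:54 ovirt-srv01.ovirt.org libvirtd[1675]: Cannot start job (query, none) for domain jenkins-phx-ovirt-org; current job is (query, none) owned by (1802 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (403s, 0s)
Jul 26 19:54:10 ovirt-srv01.ovirt.org kernel: nfs: server 66.187.230.61 not responding, timed out
"""

JOB_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) \S+ libvirtd\[\d+\]: Cannot start job .* "
    r"for domain (?P<domain>\S+); .* owned by \(\d+ (?P<owner>\w+), .*\) "
    r"for \((?P<held>\d+)s, \d+s\)$"
)
NFS_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) \S+ kernel: nfs: server (?P<server>\S+) not responding"
)


def parse_ts(ts):
    """Parse a journal timestamp such as 'Jul 26 19:47:54' using the assumed year."""
    return datetime.strptime(f"{YEAR} {ts}", "%Y %b %d %H:%M:%S")


def scan_journal(lines):
    """Return (blocked job reports, NFS timeout reports) found in the lines."""
    jobs, nfs = [], []
    for line in lines:
        m = JOB_RE.match(line)
        if m:
            held = int(m["held"])
            jobs.append({
                "domain": m["domain"],
                "owner": m["owner"],
                "held_s": held,
                # When the lock-holding job must have started.
                "started": parse_ts(m["ts"]) - timedelta(seconds=held),
            })
            continue
        m = NFS_RE.match(line)
        if m:
            nfs.append((m["ts"], m["server"]))
    return jobs, nfs


jobs, nfs = scan_journal(SAMPLE.splitlines())
for job in jobs:
    print(f"{job['owner']} held the lock on {job['domain']} "
          f"for {job['held_s']}s (since {job['started']:%H:%M:%S})")
for ts, server in nfs:
    print(f"NFS timeout at {ts}: server {server}")
```

Both job reports point back to the same start time, which suggests a single query job stayed stuck for the whole interval until the NFS timeout was logged.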

As no other VMs or sanlock processes got stuck, it must have been a temporary glitch that caused the monitoring thread to hang on this particular VM. This is the first time we have seen this issue, so closing for now.

Assignee

Evgheni Dereveanchin

Reporter

Evgheni Dereveanchin

Blocked By

None

Components

Priority

High