VMs on iSCSI paused due to lack of storage space
Description
The Backup VM was paused due to lack of storage space while there are still hundreds of gigabytes available on the iSCSI storage domain. Opening this ticket to investigate, as attempts to unpause the VM do not help.
Activity

Former user November 16, 2017 at 10:21 AM (edited)
I've seen this two more times when applying updates to VMs. The last time it happened to staging-shift-master03.phx.ovirt.org on ovirt-srv03.
This looks like a VDSM bug, so I will update to 4.1.7 first before opening a BZ.

Former user October 13, 2017 at 2:43 PM
As we need to have periodic backups running and I couldn't reproduce the issue, I restarted the backup server and it came up just fine. Writing to the data disk worked as well, and I did see extend requests sent through mailbox-hsm.
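For the record, the write test was nothing fancy, just something along these lines from inside the guest (a rough sketch; the /data mount point below is hypothetical):
# Write ~2 GiB of non-zero data to force new qcow2 allocation on the data disk.
# "/data" is a placeholder for wherever the backup VM mounts its data disk.
import os

CHUNK = b"\xa5" * (4 * 1024 * 1024)   # 4 MiB of non-zero data
path = "/data/extend-test.bin"        # hypothetical mount point of virtio-disk1

with open(path, "wb") as f:
    for _ in range(512):              # ~2 GiB total, should need at least one extend
        f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())

os.remove(path)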
I'm not sure what exactly caused this issue, and it will definitely be hard to reproduce: the VM had an uptime of 170 days and survived multiple live migrations during host upgrades. I also moved its disks from NFS to iSCSI a few months ago, but the 4.1.4 -> 4.1.6 upgrade seems to have triggered the issue in some form. We need to watch whether any other important VMs behave this way, and once the Jenkins and Resources VMs are moved off the old NFS storage, power cycle them to make sure we don't see this glitch again.
Closing for now.

Former user October 13, 2017 at 1:53 PM
Here's the VDSM log of the VM getting paused:
2017-10-13 02:51:04,152+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') abnormal vm stop device virtio-disk1 error enospc (vm:4211)
2017-10-13 02:51:04,153+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') CPU stopped: onIOError (vm:5093)
2017-10-13 02:51:04,154+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') No VM drives were extended (vm:4218)
2017-10-13 02:51:04,155+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') CPU stopped: onSuspend (vm:5093)
When trying to resume it, the same thing happens:
2017-10-13 13:10:25,380+0000 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:539)
2017-10-13 13:10:25,486+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') abnormal vm stop device virtio-disk1 error enospc (vm:4211)
2017-10-13 13:10:25,868+0000 INFO (jsonrpc/4) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') CPU running: continue (vm:5093)
2017-10-13 13:10:25,870+0000 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call VM.cont succeeded in 0.45 seconds (__init__:539)
2017-10-13 13:10:25,888+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') CPU stopped: onIOError (vm:5093)
2017-10-13 13:10:25,889+0000 INFO (libvirt/events) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') No VM drives were extended (vm:4218)
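For reference, this is roughly how I'm pulling these events out of the log (a quick sketch; the path is the stock vdsm.log location and the match strings are copied from the excerpts above):
# Scan vdsm.log for ENOSPC pauses and extension requests for this VM.
VM_ID = "c1ad06ec-208c-4bec-a525-38a96772f32a"
PATTERNS = ("error enospc", "Requesting extension for volume")

with open("/var/log/vdsm/vdsm.log") as log:
    for line in log:
        if VM_ID in line and any(p in line for p in PATTERNS):
            print(line.rstrip())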
The symptoms look similar to bz1461536, which is marked as closed. The thin disk needed an extension during tonight's backup, but the extension was not performed. The last extend request was logged on Tuesday, before the upgrade to 4.1.6 (OVIRT-1590) was performed:
2017-10-10 02:47:35,731+0000 INFO (periodic/11) [virt.vm] (vmId='c1ad06ec-208c-4bec-a525-38a96772f32a') Requesting extension for volume 5074d539-75a7-46b5-b0f7-74ff98f0c3fc on domain a04ef900-8e2d-4c39-bd66-8d8cf8549d35 (apparent: 465332862976, capacity: 697932185600, allocated: 464933879808, physical: 465332862976) (vm:909)
I'm not sure what's going on here; maybe the logging format changed in 4.1.6. Going by the helpful guide in the BZ describing which log lines should be present, there is no communication through the HSM mailbox after the upgrade, and no extension requests are logged on the hypervisor where the VM is running.
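For what it's worth, the numbers in that last extend request add up as expected (rough math below; the ~512 MiB watermark is from memory of the VDSM defaults, not from our config):
# Figures in bytes, copied from the 2017-10-10 extend request above.
MiB = 1024 ** 2
GiB = 1024 ** 3

capacity  = 697932185600   # virtual size of the thin disk, exactly 650 GiB
allocated = 464933879808   # data written so far, ~433.0 GiB
physical  = 465332862976   # current LV size, ~433.4 GiB (same as "apparent")

print("free space left in the LV:  %.1f MiB" % ((physical - allocated) / MiB))  # ~380.5 MiB
print("room left before full size: %.1f GiB" % ((capacity - physical) / GiB))   # ~216.6 GiB
So the volume really was close to running out when that request went out; after the upgrade nothing like it shows up at all.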
Software versions installed:
libvirt-python-3.2.0-3.el7.x86_64
vdsm-4.19.31-1.el7.centos.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
I will open a new BZ, as this does look like a serious bug at this point and may affect other VMs we have running.
Details
Assignee: Former user (Deactivated)
Reporter: Former user (Deactivated)
Priority: High