Jenkins snapshot creation failed
Description
Activity
Former user July 8, 2016 at 2:38 PM
We checked the logs and identified the root cause and possible improvements, so I am closing this case.

Nadav Goldin June 26, 2016 at 7:04 AM
The VM did not have 'ovirt-guest-agent' installed; maybe this also contributed to the failure?
I have installed it now and will send a puppet patch to update.
Former user June 24, 2016 at 9:25 AM
The memory dump file is 33498670080 bytes (about 31 GiB) in size.
It took 11 minutes to copy, which means the average speed was
around 50 megabytes per second. As the logs show storage
latency errors on other hosts during this time, it means
the storage was overwhelmed again - just not by builds,
but by this single sustained write during snapshotting.
Similar messages can be seen while snapshotting artifactory
earlier the same day, but as that VM has less RAM it managed
to dump its RAM within 3 minutes and succeeded.
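The arithmetic behind that estimate can be sketched as follows. This is only a rough check: the dump size is the value reported by the host, the copy time is the rounded "11 minutes", and the 3-minute figure is the Engine-side timeout mentioned elsewhere in this ticket. The variable and function names are made up for illustration.

```python
# Back-of-the-envelope check of the memory dump throughput.
DUMP_BYTES = 33_498_670_080      # memory dump size reported by the host
COPY_SECONDS = 11 * 60           # "11 minutes", rounded
ENGINE_TIMEOUT_SECONDS = 3 * 60  # Engine-side timeout mentioned in this ticket


def throughput_bytes_per_s(size_bytes: int, seconds: float) -> float:
    """Average write rate for a transfer of size_bytes taking seconds."""
    return size_bytes / seconds


rate = throughput_bytes_per_s(DUMP_BYTES, COPY_SECONDS)

# How much RAM could be dumped before the timeout at the same average rate.
max_bytes_in_timeout = rate * ENGINE_TIMEOUT_SECONDS

print(f"dump size: {DUMP_BYTES / 2**30:.1f} GiB")
print(f"average write rate: {rate / 1e6:.1f} MB/s ({rate / 2**20:.1f} MiB/s)")
print(f"fits in timeout at this rate: {max_bytes_in_timeout / 2**30:.1f} GiB")
```

At this average rate only a VM with well under 10 GiB of RAM would finish dumping inside the 3-minute window, which is consistent with the smaller artifactory VM succeeding and the Jenkins VM not.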
Former user June 24, 2016 at 9:13 AM
Here are the host logs. As suspected, I did not find any errors: the VM was suspended at 09:06:47 MST and resumed at 09:17:26 MST, with Thread-7164775 returning successfully after copying 33498670080 bytes of RAM to the storage domain.
Thread-7164775::DEBUG::2016-06-23 09:06:45,911::BindingXMLRPC::1133::vds::(wrapper) client [66.187.230.60]::call vmSnapshot with ('e7a7b735-0310-4f88-9ed9-4fed85835a01', [{'baseVolumeID': 'f37836c6-4bbe-4c8d-abf4-275cf461262e', 'domainID': 'ba023ff2-4e0e-4a32-86f3-923414206667', 'volumeID': '3b105e9b-53fe-4452-be71-2ac2182ecfec', 'imageID': '140adf46-fce4-4dba-980d-37d91416b12b'}], 'ba023ff2-4e0e-4a32-86f3-923414206667,00000002-0002-0002-0002-000000000150,2beb0ee6-b70b-4f48-bdd9-d89650383d61,daef68b9-5967-4047-9b17-1f55b68e5d8a,3580f2a1-a55a-47d0-9e67-627afbc0f2da,6c20093d-a5f3-407a-8986-ca26a488cb20') {}
...
Thread-7164775::DEBUG::2016-06-23 09:06:47,459::vm::4432::vm.Vm::(snapshot) vmId=`e7a7b735-0310-4f88-9ed9-4fed85835a01`::<domainsnapshot>
<disks>
<disk name="vda" snapshot="external" type="file">
<source file="/rhev/data-center/00000002-0002-0002-0002-000000000150/ba023ff2-4e0e-4a32-86f3-923414206667/images/140adf46-fce4-4dba-980d-37d91416b12b/3b105e9b-53fe-4452-be71-2ac2182ecfec" type="file"/>
</disk>
</disks>
<memory file="/rhev/data-center/00000002-0002-0002-0002-000000000150/ba023ff2-4e0e-4a32-86f3-923414206667/images/2beb0ee6-b70b-4f48-bdd9-d89650383d61/daef68b9-5967-4047-9b17-1f55b68e5d8a" snapshot="external"/>
</domainsnapshot>
...
libvirtEventLoop::DEBUG::2016-06-23 09:06:47,645::vm::5571::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`e7a7b735-0310-4f88-9ed9-4fed85835a01`::event Suspended detail 0 opaque None
...
Thread-7164775::DEBUG::2016-06-23 09:17:26,338::outOfProcess::169::Storage.oop::(padToBlockSize) Truncating file /rhev/data-center/00000002-0002-0002-0002-000000000150/ba023ff2-4e0e-4a32-86f3-923414206667/images/2beb0ee6-b70b-4f48-bdd9-d89650383d61/daef68b9-5967-4047-9b17-1f55b68e5d8a to 33498670080 bytes
...
libvirtEventLoop::DEBUG::2016-06-23 09:17:26,317::vm::5571::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`e7a7b735-0310-4f88-9ed9-4fed85835a01`::event Resumed detail 0 opaque None
...
Thread-7164775::DEBUG::2016-06-23 09:17:26,450::BindingXMLRPC::1140::vds::(wrapper) return vmSnapshot with {'status': {'message': 'Done', 'code': 0}, 'quiesce': False}
On Engine the process timed out after 3 minutes, while in reality it took 11 minutes. This suggests the snapshot is likely completely healthy. I'll take a sosreport from the host in case we need to investigate further, so we can check its logs for more clues.
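For anyone repeating this check on another host, the pause window can be pulled out of the log mechanically. The following is a minimal sketch, not a vdsm tool: it assumes only that each lifecycle line carries a `YYYY-MM-DD HH:MM:SS,mmm` timestamp and the text `event Suspended` or `event Resumed`, as in the excerpt above. The sample lines are abbreviated copies of those two entries, and the function and constant names are hypothetical.

```python
import re
from datetime import datetime

# Abbreviated copies of the two lifecycle entries from the host log above.
LOG_LINES = [
    "libvirtEventLoop 2016-06-23 09:06:47,645 vm.Vm event Suspended detail 0 opaque None",
    "libvirtEventLoop 2016-06-23 09:17:26,317 vm.Vm event Resumed detail 0 opaque None",
]

ENGINE_TIMEOUT_SECONDS = 3 * 60  # Engine-side timeout reported in this ticket

# Match the timestamp and the lifecycle event name, ignoring everything else.
EVENT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r".*event (?P<event>Suspended|Resumed)\b"
)


def lifecycle_events(lines):
    """Return {event name: timestamp} for Suspended/Resumed entries."""
    events = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events[match.group("event")] = datetime.strptime(
                match.group("ts"), "%Y-%m-%d %H:%M:%S,%f"
            )
    return events


events = lifecycle_events(LOG_LINES)
pause_seconds = (events["Resumed"] - events["Suspended"]).total_seconds()

print(f"VM paused for {pause_seconds:.1f} s")
print(f"exceeds Engine timeout: {pause_seconds > ENGINE_TIMEOUT_SECONDS}")
```

The regular expression deliberately ignores the separators between log fields, so it does not depend on the exact delimiter layout of the pasted lines.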
Former user June 24, 2016 at 8:29 AM
we're working on this with Anton in and so far the results are good. I'll prepare more hosts to make use of SSDs on them and we can proceed with Prod migration next week after deciding the safest way to do so. In any case, as the VM was already rebooted and came up, there should be no issues with the disk image chain.
issued a live snapshot creation on the Jenkins VM to prepare it for cluster move. This failed and it's not really clear why. Relevant event logs below, suggesting that the hypervisor started dumping VM memory to the snapshot which caused a storage slowdown.