Hypervisors flapping states in PHX engine
Description
Activity

Former user June 23, 2017 at 2:30 PM
I updated the Engine to 4.1.2 today. Closing the ticket for now; if the issue recurs, we can re-open it and file a bug. ovirt-srv02 was also upgraded to 4.1.2 and is back in production.

Former user June 22, 2017 at 2:57 PM
Looks like VDSM is unable to acquire a lock on ovirt-srv02, which is causing the HA Agent startup failures. I am now evacuating VMs from it so I can reboot the machine and install updates.

Former user June 22, 2017 at 11:07 AM
Thanks for the info. Looks like a good reason: the Engine was soft-fencing nodes whenever it thought it couldn't connect to them.
Strangely enough, the agent is still unable to initialize the domain monitor and keeps failing. Not sure what can be done here other than shutting down the engine VM and hoping that another host can get the lock.

Martin Sivak June 22, 2017 at 9:33 AM
Yep, try it again. It only means VDSM took too long to respond, as it sometimes does after a restart.

Former user June 22, 2017 at 8:50 AM
When I started looking into this, all hosts had gone NonResponsive. An engine service restart seems to have fixed it, so this looks like some kind of connectivity bug.
I will schedule an engine update to see whether that fixes the issue.
Currently, one of the HE hosts has its agent turned off, so the engine VM is in a strange state. Is it safe to just start the HA agent given these log messages?
MainThread::INFO::2017-06-22 08:33:52,978::hosted_engine::848::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::ERROR::2017-06-22 08:33:52,981::hosted_engine::822::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=d623f2f4-1e41-43f4-a202-6ee810fe3324, host_id=2): timeout during domain acquisition
MainThread::WARNING::2017-06-22 08:33:52,982::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=d623f2f4-1e41-43f4-a202-6ee810fe3324, host_id=2): timeout during domain acquisition
MainThread::WARNING::2017-06-22 08:33:52,982::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 443, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 823, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=d623f2f4-1e41-43f4-a202-6ee810fe3324, host_id=2): timeout during domain acquisition
MainThread::ERROR::2017-06-22 08:33:52,983::hosted_engine::485::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::WARNING::2017-06-22 08:33:56,008::hosted_engine::755::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2017-06-22 08:33:56,013::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
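
For anyone triaging a longer agent log, a rough sketch of how the excerpt above could be summarized by level. This is not part of oVirt; the names `LOG_LINE`, `parse_line`, `summarize` and `SAMPLE` are made up for illustration, and the field layout is inferred only from the lines pasted above. The pattern accepts the function name written either as `::(func)` or as `:func)`, and skips traceback lines.

```python
import re
from collections import Counter
from datetime import datetime

# Field layout inferred from the pasted excerpt (an assumption, not an official spec):
# thread::LEVEL::timestamp::module::lineno::logger::(func) message
LOG_LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::"
    r"(?P<logger>[\w.]+)::?\(?(?P<func>\w+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for a log record, or None for other lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # traceback or continuation line
    record = match.groupdict()
    record["lineno"] = int(record["lineno"])
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return record


def summarize(lines):
    """Count records per level and collect the ERROR records in order."""
    records = [r for r in map(parse_line, lines) if r is not None]
    levels = Counter(r["level"] for r in records)
    errors = [r for r in records if r["level"] == "ERROR"]
    return levels, errors


# A few lines taken from the excerpt above, in both separator forms.
LOGGER = "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine"
SAMPLE = [
    "MainThread::INFO::2017-06-22 08:33:52,978::hosted_engine::848::"
    + LOGGER + "::(_get_domain_monitor_status) VDSM domain monitor status: PENDING",
    "MainThread::ERROR::2017-06-22 08:33:52,981::hosted_engine::822::"
    + LOGGER + ":_initialize_domain_monitor) Failed to start monitoring domain "
    "(sd_uuid=d623f2f4-1e41-43f4-a202-6ee810fe3324, host_id=2): "
    "timeout during domain acquisition",
    "Traceback (most recent call last):",
    "MainThread::ERROR::2017-06-22 08:33:52,983::hosted_engine::485::"
    + LOGGER + "::(start_monitoring) Shutting down the agent because of "
    "3 failures in a row!",
]

if __name__ == "__main__":
    levels, errors = summarize(SAMPLE)
    print(dict(levels))
    for err in errors:
        print(err["timestamp"], err["func"], err["message"])
```

Running the same functions over the full agent log on the affected host would show whether the domain acquisition timeouts cluster around the times the Engine marked hosts NonResponsive.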
Details
Assignee: Former user (Deactivated)
Reporter: Former user (Deactivated)
Priority: High
Hosts in PHX seem to be going into the Non Responsive state for short periods of time. Since we use local storage, this does not affect VMs, but it is not normal and needs to be fixed.