OST jobs fail with "Address already in use"

Description

Evgheni,
Was there any change recently to Lago slaves?

On Fri, Oct 20, 2017 at 11:05 AM, Piotr Kliczewski <
piotr.kliczewski@gmail.com> wrote:

> I attempted to run manual OST twice and both failed with below issue.
> Can someone take a look?
>
> Thanks,
> Piotr
>
> 2017-10-20 07:59:12,485::log_utils.py::__exit__::607::ovirtlago.prefix::DEBUG::
> File "/usr/lib/python2.7/site-packages/lago/log_utils.py", line 636,
> in wrapper
> return func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/ovirtlago/reposetup.py", line
> 111, in wrapper
> with utils.repo_server_context(args[0]):
> File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
> return self.gen.next()
> File "/usr/lib/python2.7/site-packages/ovirtlago/utils.py", line
> 100, in repo_server_context
> root_dir=prefix.paths.internal_repo(),
> File "/usr/lib/python2.7/site-packages/ovirtlago/utils.py", line 76,
> in _create_http_server
> generate_request_handler(root_dir),
> File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
> self.server_bind()
> File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
> SocketServer.TCPServer.server_bind(self)
> File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
> self.socket.bind(self.server_address)
> File "/usr/lib64/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
>
> 2017-10-20 07:59:12,485::cmd.py::do_run::365::root::ERROR::Error
> occured, aborting
> Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/ovirtlago/cmd.py", line 362, in
> do_run
> self.cli_plugins[args.ovirtverb].do_run(args)
> File "/usr/lib/python2.7/site-packages/lago/plugins/cli.py", line
> 184, in do_run
> self._do_run(**vars(args))
> File "/usr/lib/python2.7/site-packages/lago/utils.py", line 501, in
> wrapper
> return func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/lago/utils.py", line 512, in
> wrapper
> return func(*args, prefix=prefix, **kwargs)
> File "/usr/lib/python2.7/site-packages/ovirtlago/cmd.py", line 166,
> in do_deploy
> prefix.deploy()
> File "/usr/lib/python2.7/site-packages/lago/log_utils.py", line 636,
> in wrapper
> return func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/ovirtlago/reposetup.py", line
> 111, in wrapper
> with utils.repo_server_context(args[0]):
> File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
> return self.gen.next()
> File "/usr/lib/python2.7/site-packages/ovirtlago/utils.py", line
> 100, in repo_server_context
> root_dir=prefix.paths.internal_repo(),
> File "/usr/lib/python2.7/site-packages/ovirtlago/utils.py", line 76,
> in _create_http_server
> generate_request_handler(root_dir),
> File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
> self.server_bind()
> File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
> SocketServer.TCPServer.server_bind(self)
> File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
> self.socket.bind(self.server_address)
> File "/usr/lib64/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 98] Address already in use
> _______________________________________________
> Infra mailing list
> Infra@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
>
>
>

Eyal edri

MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D

Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)

Activity

Eyal Edri October 31, 2017 at 2:41 PM

Not sure if there is anything else to do here other than solving it on the Lago side, and we have a ticket there.
Feel free to close if we found the source (the networking suite) and showed the maintainer how to use lago serve.

Former user October 31, 2017 at 11:38 AM

Not sure about the recommended cleanup - may be able to say more, but I did the following:
1) netstat -nlp | grep 8585
This should show the Python process occupying the port, with its PID in the last column.
2) kill <PID>
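
The check in step 1 can also be scripted. The following is a minimal Python 3 sketch and is not part of Lago: `port_in_use` is a hypothetical helper that tries to bind the port on all IPv4 interfaces and reports whether the bind fails with EADDRINUSE (Errno 98 on Linux, the error in the traceback). It only detects the conflict; finding the PID to kill still needs netstat or a similar tool.

```python
import errno
import socket


def port_in_use(port, host=""):
    """Return True if binding (host, port) fails with EADDRINUSE.

    An empty host means "all IPv4 interfaces", so a listener bound to
    any local address on that port is detected.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other error is unexpected; do not hide it
    return False


if __name__ == "__main__":
    # 8585 is the port mentioned in this ticket.
    print("port 8585 in use:", port_in_use(8585))
```

Running this before a job starts would show whether a leftover server from a previous run is still holding the port.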

Eyal Edri October 30, 2017 at 9:27 AM

This is now happening for me locally, even after cleaning the networks and running lago destroy. What is the recommended cleanup action to resolve this?

Former user October 23, 2017 at 12:24 PM

I've seen the issue at least on the following bare metals on Friday:

ovirt-srv17
ovirt-srv18
ovirt-srv21
ovirt-srv22
ovirt-srv23

Gal Ben Haim October 23, 2017 at 8:03 AM

This issue occurs when "lago ovirt serve" (which starts the repo server) is
called as a subprocess and the caller does not make sure to kill it when it
is no longer needed (or on failure).

In the past, VDSM's check patch was coded like this, but we fixed it. Could
be that the same bug exists in another suite.
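
As an illustration of the cleanup described above, here is a minimal Python 3 sketch of a wrapper that always stops the child process, including when the calling code fails. `managed_process` is a hypothetical helper, not a Lago API, and the sleeping child below is only a stand-in for the repo server command a suite would actually run.

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def managed_process(cmd, stop_timeout=10):
    """Start cmd and make sure it is stopped when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Runs on normal exit and on exceptions alike.
        if proc.poll() is None:  # child is still running
            proc.terminate()
            try:
                proc.wait(timeout=stop_timeout)
            except subprocess.TimeoutExpired:
                proc.kill()  # child ignored SIGTERM; force it
                proc.wait()


if __name__ == "__main__":
    # Stand-in for the repo server; a suite would pass its own command.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    with managed_process(sleeper) as proc:
        print("child running:", proc.poll() is None)
    print("child stopped:", proc.poll() is not None)
```

The point of the `finally` block is that the port is released even if the test code inside the `with` block raises.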

Evgheni, can you specify a slave that had this issue?

Gal Ben Haim
RHV DevOps

Barak Korren October 22, 2017 at 7:49 AM
Edited

I've created issue #27 on lago-ost-plugin to track improving the 'lago ovirt serve' mechanism.

Barak Korren October 22, 2017 at 6:52 AM

> They also had leftovers of ovirt-master_change-queue-tester in the Jenkins work directory, so this may be the job causing the issue.

No, the last job to run on a slave always leaves its $WORKSPACE behind on that slave, so that if it runs there again, some things are already cached for it.

We need to check the OST cleanup code and the jobs that previously ran on the slaves to see why the 'lago serve' process on the port was not killed. We should probably also modify how 'lago serve' works so that it's less likely to interfere with other Lago environments trying to run on the same node, and less likely to be left behind.
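
One possible direction for the second point, shown only as a sketch and not as how 'lago serve' works or was later changed: let the OS pick a free port instead of using a fixed one, so that two environments on the same node do not compete for the same number. `start_repo_server` and `stop_repo_server` are hypothetical names, and the sketch leaves out how the chosen port would be passed on to the VMs.

```python
import functools
import http.server
import threading


def start_repo_server(root_dir, host="127.0.0.1"):
    """Serve root_dir over HTTP on a free port picked by the OS."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=root_dir
    )
    # Port 0 asks the kernel for any currently free port.
    server = http.server.HTTPServer((host, 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def stop_repo_server(server, thread):
    """Stop serving and release the listening socket."""
    server.shutdown()
    server.server_close()
    thread.join()


if __name__ == "__main__":
    first = start_repo_server(".")
    second = start_repo_server(".")
    # server_address[1] holds the port the kernel actually assigned.
    print("ports:", first[0].server_address[1], second[0].server_address[1])
    stop_repo_server(*first)
    stop_repo_server(*second)
```

Running the server in a daemon thread of the calling process also means it cannot outlive that process, which addresses the "left behind" part.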

Former user October 20, 2017 at 3:07 PM
Edited

I rebooted ovirt-srv21 which was failing manual tests and started a new build on it. It finished successfully and nothing was listening to port 8585 when I logged in to check after the job finished. I went through all of the bare metals and a few of them had port 8585 still occupied. They also had leftovers of ovirt-master_change-queue-tester in the Jenkins work directory, so this may be the job causing the issue.

Former user October 20, 2017 at 12:46 PM

I agree with Barak - I checked the slave that was failing and there was a
process still listening on port 8585.
The slave was put offline, but attempting to run the job on a
different one caused the exact same error.
As more slaves are affected, this may be a Lago bug. No changes were made on
slaves this week.

Regards,
Evgheni Dereveanchin

Barak Korren October 20, 2017 at 8:48 AM

Looks like there might be a lago localrepo process left running on the
slave from a previous run.

Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted

Fixed

Details

Created October 20, 2017 at 8:28 AM
Updated February 28, 2018 at 3:33 PM
Resolved February 28, 2018 at 2:25 PM