Failing KubeVirt CI

Description

Hi,

I am working on fixing the issues in the KubeVirt e2e test suites. This
task is directly related to the unstable CI, which fails due to unknown errors.
The progress is reported in the CNV Trello:
https://trello.com/c/HNXcMEQu/161-epic-improve-ci

I am creating this issue since KubeVirt experiences random timeouts on
random tests most of the time when the test suites run.
From the outside, the issue shows up as timeouts in different parts of the tests.
Sometimes the tests fail in the setup phase, again due to a random timeout.
The example in the link below timed out on a network connection to
localhost.

[check-patch.k8s-1.11.0-dev.el7.x86_64]
requests.exceptions.ReadTimeout:
UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
(read timeout=60)

An example of a failing test suite is here:
https://jenkins.ovirt.org/job/kubevirt_kubevirt_standard-check-pr/1916/consoleText

The list of errors related to the failing CI can be found in my notes:
https://docs.google.com/document/d/1_ll1DOMHgCRHn_Df9i4uvtRFyMK-bDCHEeGfJFTjvjU/edit#heading=h.vcfoo8hi48ul

I am not sure whether KubeVirt has already shared the resource requirements, so
here is a short summary.
Resources for the KubeVirt e2e tests:

  • at least 12GB of RAM - we start 3 nodes (3 Docker containers), each
    requiring 4GB of RAM

  • exposed /dev/kvm to enable native virtualization

  • cached images, since these are used to build the test cluster (a short
    pre-pull sketch follows after this list):

  • kubevirtci/os-3.10.0-crio:latest

  • kubevirtci/os-3.10.0-multus:latest

  • kubevirtci/os-3.10.0:latest

  • kubevirtci/k8s-1.10.4:latest

  • kubevirtci/k8s-multus-1.11.1:latest

  • kubevirtci/k8s-1.11.0:latest
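
To make the caching requirement concrete, here is a rough sketch of how a
slave could be prepared. The plain docker pull from the default registry and
the host-level /dev/kvm check are only illustrations; the actual caching
mechanism on your side is of course up to you:

    # Check that native virtualization is available on the slave
    [ -c /dev/kvm ] || { echo "/dev/kvm is not exposed" >&2; exit 1; }

    # Pre-pull the kubevirtci images so cluster bring-up does not have to
    # download them (assumes the default Docker registry)
    for image in \
        kubevirtci/os-3.10.0-crio:latest \
        kubevirtci/os-3.10.0-multus:latest \
        kubevirtci/os-3.10.0:latest \
        kubevirtci/k8s-1.10.4:latest \
        kubevirtci/k8s-multus-1.11.1:latest \
        kubevirtci/k8s-1.11.0:latest; do
        docker pull "${image}"
    done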

How can we overcome this? Can we work together to build suitable
requirements for running the tests so that they pass every time?

Kind regards,
Petr Kotas

Activity

Petr Kotas September 17, 2018 at 3:44 PM

I have created the issue.

The failures in the STDCI scripts are only partially responsible for our
failures. Most of them are due to unknown timeouts.
For this reason I would like to see the live load on the test machine, if
that can be done.

Thanks in advance.
Best,
Petr

Barak Korren September 17, 2018 at 2:18 PM

OK, I see it's failing in the docker-cleanup script... hmm, we'll need to debug that...

Can you please open a specific ticket on that and include logs and any other specific information that can help us figure out why it may be failing there... (What containers might be on the machine that it's failing to remove...)

But that is not what is causing all the failures, right? We already fixed a couple of issues with that script...

Petr Kotas September 17, 2018 at 2:00 PM

I would like to see the machine resources, the CPU and RAM usage, to understand how the tests behave live. I am not sure whether this is available in Blue Ocean.

WRT Docker, as I already pointed out in the logs, the issue occurs way before our setup even kicks in.
Here is the direct link for that.
It seems that the Jenkins project_setup.sh fails somehow. Again, this is not our code; it is part of the standard CI located here.
It seems that the project setup was doing its job and then randomly failed due to a networking issue. I have no idea why.

Also, I do not think the issue is due to the proxy, as the failures are totally random on random tests.
So I am guessing something more hidden is failing.

Barak Korren September 17, 2018 at 8:31 AM

Please be more specific about what you mean by monitoring access. You should already be able to monitor the job as it runs in multiple ways: via Blue Ocean, the old Jenkins UI or the STDCI UI.

WRT the Docker connection issues - where is the Docker command being launched? If it's launched from inside the containers that Kubevirt-CI creates, then it's really not something I can control (AFAIK it sets up its own networking inside a special "dnsmasq" container).

If the thing that fails tries to connect directly from the STDCI environment to some service, it may have to do with the fact that we have a proxy configured in the environment. You need to make sure the connections you're trying to set up are not being routed via the proxy, by either setting the 'no_proxy' env var or unsetting the 'http_proxy' env var. I added code to do this in Kubevirt's test.sh a while ago, but maybe someone removed it.
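
Something along these lines in test.sh should be enough (the exact list of local endpoints to exclude is a guess on my side, adjust as needed):

    # Option 1: drop the proxy configuration entirely for this script
    unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

    # Option 2: keep the proxy but make sure local endpoints bypass it
    export no_proxy="localhost,127.0.0.1"
    export NO_PROXY="${no_proxy}"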

Petr Kotas September 17, 2018 at 8:13 AM

Hi Barak,

I understand and agree that the mailing list is a great place to share the
knowledge. I will write a summary there once we come up with a solution for
this issue.
And we need to figure out the solution swiftly, as the TLV holidays are approaching.

The issues I have described have been recurring for almost three months and are
blocking us from progressing with our work.
We are already fixing the issue from our side and are working on
additional fixes to provide even more stable tests.

The other part is to be sure we are not crashing the CI.
Would you be able to give me monitoring access so I can see whether there
are any race conditions, or whether we deplete some resources?

Regarding the timeouts: we are not relying on them. The timeout you have
seen in the logs is from your infrastructure.
It signals that there is a networking issue and Docker cannot connect to
localhost, which is weird.

So please, can I have monitoring access? And can you please check whether
the network has any issues?

Thank you for your help! I appreciate it.

Best,
Petr

Barak Korren September 16, 2018 at 6:08 AM

I think the best place to discuss Kubevirt issues is on the Kubevirt-related mailing lists, where other Kubevirt developers can see the discussion.

To your questions:

I am working on fixing the issues in the KubeVirt e2e test suites. This
task is directly related to the unstable CI, which fails due to unknown errors.
The progress is reported in the CNV Trello:
https://trello.com/c/HNXcMEQu/161-epic-improve-ci

I am creating this issue since KubeVirt experiences random timeouts on
random tests most of the time when the test suites run.
From the outside, the issue shows up as timeouts in different parts of the tests.
Sometimes the tests fail in the setup phase, again due to a random timeout.
The example in the link below timed out on a network connection to
localhost.

[check-patch.k8s-1.11.0-dev.el7.x86_64]
requests.exceptions.ReadTimeout:
UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
(read timeout=60)

It's generally a bad idea to rely too much on timeouts in a test suite like this. We've seen such issues over and over again in OST as well. It's probably best to remove all such timeout definitions and just have an overall timeout set for the entire test suite.
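
For example, the whole suite could be wrapped in a single budget from the CI script; the 180m value and the make target below are just placeholders for whatever the real entry point and limit should be:

    # One overall budget for the whole e2e run instead of many small timeouts
    timeout --signal=TERM 180m make functest
    rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "e2e suite exceeded the overall 3h limit" >&2
    fi
    exit "${rc}"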

An example of a failing test suite is here:
https://jenkins.ovirt.org/job/kubevirt_kubevirt_standard-check-pr/1916/consoleText

The list of errors related to the failing CI can be found in my notes:
https://docs.google.com/document/d/1_ll1DOMHgCRHn_Df9i4uvtRFyMK-bDCHEeGfJFTjvjU/edit#heading=h.vcfoo8hi48ul

I am not sure whether KubeVirt has already shared the resource requirements, so
here is a short summary.
Resources for the KubeVirt e2e tests:

  • at least 12GB of RAM - we start 3 nodes (3 Docker containers), each
    requiring 4GB of RAM
  • exposed /dev/kvm to enable native virtualization
  • cached images, since these are used to build the test cluster:
  • kubevirtci/os-3.10.0-crio:latest
  • kubevirtci/os-3.10.0-multus:latest
  • kubevirtci/os-3.10.0:latest
  • kubevirtci/k8s-1.10.4:latest
  • kubevirtci/k8s-multus-1.11.1:latest
  • kubevirtci/k8s-1.11.0:latest

How can we overcome this? Can we work together to build suitable
requirements for running the tests so that they pass every time?

To my knowledge the existing setup meets all the requirements you specify above.

We have 3 physical hosts that are used to run Kubevirt tests. Each host has 128GB of RAM and runs 7 containers, where each container runs its own Libvirt, Docker and Systemd so that it looks like its own host to the tests running inside. The number of containers per host was calculated so that each container has a little over 16GB of RAM for itself, so we should have more than enough for Kubevirt. Also, in our measurements Kubevirt's CI tests took way less than 12GB, and were somewhere around 8GB.

All the images that start with 'kubevirtci' are cached by the system.

WRT /dev/kvm - we do have it exposed in the containers we run, but I think that is irrelevant, since AFAIK Kubevirt-CI runs qemu on its own inside its own container, so the /dev/kvm device file simply needs to exist inside that container.
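
For reference, this is roughly how the device can be handed to a container and verified from inside it (the centos:7 image here is only an example):

    # Expose the host's /dev/kvm to a container and check it is visible inside
    docker run --rm --device=/dev/kvm:/dev/kvm centos:7 ls -l /dev/kvm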

Details

Resolution: Fixed

Created September 14, 2018 at 2:41 PM
Updated August 29, 2019 at 2:12 PM
Resolved December 15, 2018 at 3:54 PM