How does stdci prevent regressions and proactively monitor the cluster?

Description

We want to go one step further with KubeVirt and, sooner or later, merge only
when the tests are green (automatically).
We therefore want to make sure that this CI system is the right system for us
and that it can be properly scaled, developed, and operated.
Apart from requirements such as automatic test re-runs and a merge pool, the
stability and QoS of the CI system are of interest to us.

Some examples:

  • Sometimes jobs break with a system error shown in the logs. Is that
    monitored and worked on?

  • Sometimes errors like "out-of-disk-space" show up. Is disk utilization,
    for example, proactively handled?

  • We had one issue where the docker installation was broken in a
    build slot, so all jobs there stopped quickly. As a consequence, all
    following builds were scheduled there too. Is something like that monitored?

  • We repeatedly have issues connecting to Jenkins. It is extremely slow
    (not just Blue-Ocean-slow, really slow). Are such things monitored, alarms
    raised, and countermeasures taken?

  • This has not happened for a while, but bare-metal machines without
    kvm-nesting were repeatedly added to the cluster. Are there measures in
    place to prevent such regressions, where the same issue happens multiple
    times?

  • How is the flexibility of the project ensured? Is it also tested and
    maintained in a sane fashion to allow proper evolution over time? Automated
    tests? Offline testing of changes? And so on ...
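
To illustrate the kind of proactive check asked about above (disk utilization
and missing kvm-nesting), here is a minimal sketch of a pre-flight check that
could run on a build slot before it accepts jobs. It is not taken from stdci.
The function names and the 90% threshold are assumptions made for
illustration. The only system facts relied on are that the Linux kvm_intel and
kvm_amd modules expose a "nested" parameter under /sys/module.

```python
import shutil
from pathlib import Path

# Hypothetical threshold, chosen for illustration only.
MAX_DISK_USED_FRACTION = 0.90


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def parse_nested_flag(text):
    """Interpret the content of a kvm 'nested' module parameter file.

    kvm_intel typically reports Y/N, kvm_amd typically reports 1/0.
    """
    return text.strip() in ("1", "Y", "y")


def nested_kvm_enabled(sysfs_root="/sys/module"):
    """Return True if a loaded kvm module reports nesting as enabled."""
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(sysfs_root) / module / "parameters" / "nested"
        if param.is_file():
            return parse_nested_flag(param.read_text())
    # No kvm module parameter found: treat the host as not nesting-capable.
    return False


def preflight(path="/", sysfs_root="/sys/module"):
    """Return a list of problems; an empty list means the slot looks usable."""
    problems = []
    if disk_used_fraction(path) > MAX_DISK_USED_FRACTION:
        problems.append("disk utilization above threshold")
    if not nested_kvm_enabled(sysfs_root):
        problems.append("nested KVM not enabled")
    return problems
```

A slot whose preflight() result is non-empty could be taken offline instead of
receiving further builds, which would also address the case where one broken
slot attracts all following jobs.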

Activity

Eyal Edri December 24, 2018 at 10:35 AM

I believe we have solved some of the issues here and some are in progress, for example:
We've identified the source of the slowness in the UI: it's a memory leak in the SSE plugin that Blue Ocean uses,
please add a link to the ticket that refers to that.
We've also applied JVM improvements to the master and limited the session timeout (it was unlimited until now).
Also, we're working on splitting the KubeVirt Jenkins so that it is independent and not shared with oVirt; this is tracked on another ticket (links can be added).

We are also planning to add monitoring, hopefully soon; please add a link to the card for it.
As for the flexibility of the project, we're doing our best with the very limited resources we have and the number of developers available to contribute.
Having said that, we have staging systems and we try to add tests for any new code that we introduce, including testing on staging.

Details

Created November 26, 2018 at 11:06 AM
Updated December 24, 2018 at 10:35 AM