How does stdci prevent regressions and proactively monitor the cluster?
Description
Activity
Eyal Edri December 24, 2018 at 10:35 AM
@Former user I believe we've solved some of the issues here, and some are in progress, e.g.:
We've identified the source of the UI slowness: it's a memory leak in the SSE plugin that Blue Ocean uses.
@Former user please add a link to the ticket that refers to that.
We've also applied JVM improvements to the master and limited the session timeout (it was unlimited until now).
We're also working on splitting the KubeVirt Jenkins into an independent instance that is no longer shared with oVirt; this is tracked in another ticket, and @Barak Korren can add links.
We're also planning to add monitoring, hopefully soon; @Former user please add a link to the card for it.
As for the flexibility of the project, we're doing our best with the very limited resources and the small number of developers available to contribute.
Having said that, we have staging systems, and we try to add tests for any new code we introduce, including testing on staging.
We want to go one step further with KubeVirt and, sooner or later, merge only when the tests are green (automatically); a sketch of what such a gate could look like follows below.
Therefore we want to ensure that this CI system is the right system for us and that it can be properly scaled, developed, and operated.
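As an illustration only, here is a minimal sketch of such a merge gate, assuming GitHub as the code host and using its combined commit-status API; the repository name, ref, and token handling are placeholders, not the project's actual setup:

```python
# Minimal merge-gate sketch: merge only when the combined commit status is
# green. Repo, ref, and token below are assumptions for illustration.
import os
import requests

GITHUB_API = "https://api.github.com"
REPO = "kubevirt/kubevirt"          # placeholder repository
TOKEN = os.environ["GITHUB_TOKEN"]  # assumption: token provided via env var


def is_green(ref: str) -> bool:
    """Return True only if the combined status for `ref` is 'success'."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{REPO}/commits/{ref}/status",
        headers={"Authorization": f"token {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["state"] == "success"


if __name__ == "__main__":
    if is_green("master"):
        print("Green: a merge bot could proceed.")
    else:
        print("Not green: hold the merge.")
```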
Apart from requirements like automatically re-running tests and merge pools, the stability and QoS of the CI system are important to us.
Some examples:
- Sometimes jobs break with a system error shown in the logs. Is that monitored and worked on?
- Sometimes things like "out of disk space" show up. Is disk utilization handled proactively? (covered by the first sketch after this list)
- We had one issue where the Docker installation was broken in a build slot and all jobs there failed fast; as a consequence, all following builds were scheduled there too. Is something like that monitored? (also covered by the first sketch after this list)
- We repeatedly have issues connecting to Jenkins. It is extremely slow (not just Blue-Ocean-slow, really slow). Are such things monitored, alarms raised, and countermeasures taken? (see the second sketch after this list)
- That has not happened for a while, but bare-metal machines without KVM nesting were repeatedly added to the cluster. Are there measures in place that prevent such regressions, where the same issue happens multiple times? (the first sketch after this list checks for this too)
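To make the expectation concrete, here is a rough sketch of a per-node health probe covering disk utilization, a broken Docker installation, and missing KVM nesting; the threshold is an assumption, and a real probe would feed an alerting system or take the node offline rather than just print:

```python
# Per-node health probe sketch: disk usage, Docker daemon, and KVM nesting.
# Threshold and reporting are placeholders for a real alerting integration.
import shutil
import subprocess
from pathlib import Path

DISK_ALERT_THRESHOLD = 0.90  # assumption: alert above 90% utilization


def disk_ok(path: str = "/") -> bool:
    """Check that disk utilization on `path` is below the alert threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < DISK_ALERT_THRESHOLD


def docker_ok() -> bool:
    """`docker info` exits non-zero when the daemon is broken or unreachable,
    which is exactly the failure mode that made jobs on one slot fail fast."""
    try:
        result = subprocess.run(["docker", "info"], capture_output=True, timeout=60)
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def kvm_nesting_ok() -> bool:
    """Nested virtualization is exposed as a kernel module parameter:
    kvm_intel reports 'Y' (or '1' on newer kernels), kvm_amd reports '1'."""
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(f"/sys/module/{module}/parameters/nested")
        if param.exists():
            return param.read_text().strip() in ("Y", "1")
    return False  # no KVM module loaded at all


if __name__ == "__main__":
    for name, ok in (("disk", disk_ok()), ("docker", docker_ok()),
                     ("kvm-nesting", kvm_nesting_ok())):
        print(f"{name}: {'OK' if ok else 'FAILING'}")
```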
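And a sketch of a responsiveness probe for the Jenkins master itself, run periodically; the URL and latency threshold are placeholders, and the alarms would go to whatever alerting system is in place:

```python
# Jenkins responsiveness probe sketch: time a simple API request and raise
# an alarm (here: just print) when the master is slow or unreachable.
import time
import requests

JENKINS_URL = "https://jenkins.example.org/api/json"  # placeholder URL
LATENCY_ALERT_SECONDS = 5.0                           # assumed threshold


def probe() -> None:
    start = time.monotonic()
    try:
        resp = requests.get(JENKINS_URL, timeout=30)
    except requests.RequestException as exc:
        print(f"ALERT: Jenkins unreachable: {exc}")
        return
    elapsed = time.monotonic() - start
    if resp.status_code != 200:
        print(f"ALERT: Jenkins returned HTTP {resp.status_code}")
    elif elapsed > LATENCY_ALERT_SECONDS:
        print(f"ALERT: Jenkins took {elapsed:.1f}s to respond")
    else:
        print(f"OK: Jenkins responded in {elapsed:.1f}s")


if __name__ == "__main__":
    probe()
```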
How is the flexibility of the project ensured? Is it also tested and maintained in a sane fashion to allow proper evolution over time? Automated tests? Offline testing of changes? And so on...