vdsm project CQ failure: 004_basic_sanity.preview_snapshot_with_memory
Description
Activity
Eyal Edri August 21, 2018 at 2:29 PM
No replies from developers, closing.
If it reappears, another two-week period will be set to obtain debugging details or an RCA from the developers.
Eyal Edri August 21, 2018 at 2:28 PM
@Dafna Ron we can't keep chasing this.
For now, let's set a deadline of two weeks to get replies on races from developers (eventually we'll shorten it to one week).
If there isn't a maintainer who answers or provides info within a week, please close the ticket.
Dafna Ron July 11, 2018 at 11:40 AM
@Doron Fediuck @Eyal Edri, @Gal Ben Haim and I were discussing this yesterday.
The majority of the currently failing tests that are marked as "race" fail because of lock and task cleanup in the engine. These tests should not fail if they are modified to query tasks/locked objects before the command is run, and to fail explicitly on query timeout.
The Network team fixed their tests, which were failing on similar issues, and I am happy to say it reduced this sort of failure for the network suite.
The majority of these failures today are in storage related tests which @Tal Nisan has been notified and has assigned @Daniel Erez to look at.
However, if I continue to see these tests failing sporadically (and I mean a test that keeps failing sporadically over a period of a month), I may have to skip them until they are fixed.
We can improve by deciding what an acceptable failure rate is and over what period we should measure it before skipping a test, and by agreeing with developers on test ownership (e.g. a point of contact on each team for OST).
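A minimal sketch of the polling approach described above: wait for engine tasks and locks to clear before running the command, and fail with a clear timeout message instead of hitting the race later. The helper name and the engine-query callback are hypothetical, not actual OST helpers.

```python
import time

def wait_until(predicate, timeout=120, interval=2):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout, so the test can fail
    with an explicit "query timed out" assertion rather than sending a
    command while the engine still holds locks or running tasks.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Hypothetical usage before a storage command:
#   assert wait_until(lambda: not engine.tasks_running(vm_id)), \
#       "timed out waiting for engine tasks to finish"
```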
Eyal Edri July 11, 2018 at 11:04 AM
We need to try to formalize some process around this. I'm not talking about one-time failures; these are tests that fail every once in a while.
These are not "environment" issues: each run is an isolated Lago run, so the chance of that is very low.
We also filter out any infra-related issues, so we only open a ticket and mark it as 'race' when a real code race happens.
If developers need more info/logs/insights, they should ask for it. Otherwise, I think we'll just start disabling tests that fail too often. This is taking a heavy toll in monitoring and manual work for the team, and we need to improve the current handling of race failures.
Doron Fediuck July 11, 2018 at 10:47 AM
A race by definition may or may not materialize. If it's consistent, that's an easy thing to fix.
If it's not, we need more insight than just the end result: for example, whether there is an environmental issue.
We have a failed vdsm test in the basic master suite:
004_basic_sanity.preview_snapshot_with_memory
It seems like a regression.
2018-06-12 07:31:08,080-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-27) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
2018-06-12 07:31:08,087-04 ERROR [org.ovirt.engine.api.restapi.resource.validation.ValidationExceptionMapper] (default task-27) [] Input validation failed while processing 'POST' request for path '/vms/ec6dcfb7-dda9-4c58-b048-87b08ffcbfbe/previewsnapshot'.
2018-06-12 07:31:08,087-04 ERROR [org.ovirt.engine.api.restapi.resource.validation.ValidationExceptionMapper] (default task-27) [] Exception: org.ovirt.api.metamodel.server.ValidationException: Parameter 'snapshot.id' is mandatory but was not provided.
    at org.ovirt.engine.api.resource.VmResourceHelper.validatePreviewSnapshot(VmResourceHelper.java:75) [restapi-definition.jar:]
    at org.ovirt.engine.api.resource.VmResource.doPreviewSnapshot(VmResource.java:333) [restapi-definition.jar:]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_171]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_171]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_171]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_171]
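The traceback suggests the test sent the previewsnapshot request with a null snapshot id, which is consistent with the test looking up the snapshot before the engine finished creating it. A hedged sketch of a defensive lookup that would surface the race as a clear test failure instead of a 400 from the REST API; the `find_snapshot` callback is hypothetical, not part of OST or the oVirt SDK.

```python
import time

def get_snapshot_id_or_fail(find_snapshot, retries=10, delay=2):
    """Return the snapshot id, retrying while the engine may still be
    creating it. Raises a descriptive error instead of passing None on
    to the REST API, which would otherwise produce the
    "Parameter 'snapshot.id' is mandatory" validation error in the log.
    """
    for _ in range(retries):
        snapshot = find_snapshot()
        if snapshot is not None and getattr(snapshot, "id", None):
            return snapshot.id
        time.sleep(delay)
    raise RuntimeError("snapshot not visible in engine after retries")
```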
Change 90532,7 (vdsm) is probably the reason behind recent system test failures
in the "ovirt-master" change queue and needs to be fixed.
This change has been removed from the testing queue. Artifacts built from this
change will not be released until it is fixed.
For further details about the change see:
https://gerrit.ovirt.org/#/c/90532/7
For failed test results see:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8169/