Fix Jenkins slave connection dying on vdsm check_merged jobs

Description

Something in the vdsm build_artifacts job makes the Jenkins slave disconnect while the job is running. As a result, the cleanup scripts do not run on the slave, leaving it dirty enough to make the next job on that slave fail.

An example can be seen here:
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/692/console

Relevant log lines:

21:49:00 Ran 44 tests in 1231.988s
21:49:00
21:49:00 OK
21:49:00 + return 0
21:49:00 sh: [13086: 1 (255)] tcsetattr: Inappropriate ioctl for device
21:49:00 Took 2464 seconds
21:49:00 ===================================
21:49:00 logout
21:49:01 Slave went offline during the build
21:49:01 ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
21:49:01 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
21:49:01 Caused by: java.io.EOFException
21:49:01 at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
21:49:01 at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
21:49:01 at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
21:49:01 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
21:49:01 at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
21:49:01 at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
21:49:01 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
21:49:01
21:49:01 Build step 'Execute shell' marked build as failure
21:49:01 Performing Post build task...
21:49:01 Match found for :.* : True
21:49:01 Logical operation result is TRUE
21:49:01 Running script : #!/bin/bash -x
21:49:01 echo "shell-scripts/mock_cleanup.sh"
... SNIP ...
21:49:01 Exception when executing the batch command : no workspace from node hudson.slaves.DumbSlave[fc24-vm06.phx.ovirt.org] which is computer hudson.slaves.SlaveComputer@30863c81 and has channel null
21:49:01 Build step 'Post build task' marked build as failure
21:49:02 ERROR: Step ‘Archive the artifacts’ failed: no workspace for vdsm_master_check-merged-el7-x86_64 #692
21:49:02 ERROR: Failed to evaluate groovy script.
21:49:02 java.lang.NullPointerException: Cannot invoke method child() on null object
21:49:02 at org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:77)
21:49:02 at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:45)
21:49:02 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
21:49:02 at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:32)
21:49:02 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
21:49:02 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
21:49:02 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
21:49:02 at Script1.run(Script1.groovy:2)
21:49:02 at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
21:49:02 at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
21:49:02 at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
21:49:02 at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript.evaluate(SecureGroovyScript.java:166)
21:49:02 at org.jvnet.hudson.plugins.groovypostbuild.GroovyPostbuildRecorder.perform(GroovyPostbuildRecorder.java:361)
21:49:02 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
21:49:02 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
21:49:02 at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:723)
21:49:02 at hudson.model.Build$BuildExecution.post2(Build.java:185)
21:49:02 at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:668)
21:49:02 at hudson.model.Run.execute(Run.java:1763)
21:49:02 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
21:49:02 at hudson.model.ResourceController.execute(ResourceController.java:98)
21:49:02 at hudson.model.Executor.run(Executor.java:410)
21:49:02 Build step 'Groovy Postbuild' marked build as failure
21:49:02 Started calculate disk usage of build
21:49:02 Finished Calculation of disk usage of build in 0 seconds
21:49:02 Finished: FAILURE

Activity

Eyal Edri January 26, 2017 at 1:37 PM

The issue was in check-merged script.

Eyal Edri January 26, 2017 at 1:36 PM

Sorry, I just read the response.
Closing this for now; please re-open if anything else is needed from infra.

Eyal Edri January 26, 2017 at 1:36 PM

Maybe this is due to the Java auto-updating in Puppet?

danken December 26, 2016 at 12:43 PM

This is indeed due to our buggy check-merged script, which mistakenly called `kill 0`.
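
For context on why a stray `kill 0` is so destructive: with a PID of 0, `kill` signals every process in the caller's process group, not just the caller. If the job script shares a process group with the process that holds the slave connection, that process presumably receives the signal too, which would match the "Slave went offline" symptom in the log. The sketch below is only an illustration in Python, not the actual check-merged fix; the helper name `run_isolated` is hypothetical. It runs a shell command in its own session so that a `kill 0` inside it cannot reach the calling process.

```python
import signal
import subprocess


def run_isolated(cmd):
    """Run cmd in a new session (and therefore a new process group).

    `kill 0` signals the whole process group of the caller, so giving the
    command a group of its own keeps the signal away from this process.
    """
    proc = subprocess.run(cmd, start_new_session=True, timeout=30)
    return proc.returncode


# The shell terminates itself (and only its own group) with SIGTERM,
# so the echo is never reached and this Python process keeps running.
rc = run_isolated(["sh", "-c", "kill 0; echo 'not reached'"])

# subprocess reports death-by-signal as a negative return code.
killed_by_sigterm = (rc == -signal.SIGTERM)
print("child return code:", rc, "killed by SIGTERM:", killed_by_sigterm)
```

Without `start_new_session=True` the child would inherit the caller's process group and the same command would also signal the calling process.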

Barak Korren December 22, 2016 at 4:25 PM

Tested to see if the behaviour would be different when running on an EL7 host (right now it's running in an EL7 chroot on a Fedora host):
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/777

Same results.

I'm beginning to suspect it is something the tests are doing; the same hosts run Lago fine for other things.
Thinks it may be this patch:
https://gerrit.ovirt.org/#/c/68078/

He sent an email to devel, investigation will continue.

Done

Details

Created December 14, 2016 at 8:10 AM
Updated May 25, 2017 at 11:31 AM
Resolved January 26, 2017 at 1:37 PM
