durable-task-plugin aborting jobs
Description
Activity

Former user April 26, 2018 at 2:52 PM
All VMs running on gluster were shut down and will be replaced with systems running on local storage hosts. The timeout issue is no longer present in job logs.

Former user April 24, 2018 at 4:13 PM (edited)
Looking at another VM while it's running jenkins_standard-check-patch, it does look like the I/O subsystem is at the bottom of this:
$ top
top - 16:06:37 up 4 days, 4:16, 1 user, load average: 7.11, 3.98, 1.74
Tasks: 115 total, 1 running, 77 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 1.3 sy, 0.2 ni, 0.0 id, 97.7 wa, 0.2 hi, 0.2 si, 0.0 st
KiB Mem : 8165284 total, 2605380 free, 420176 used, 5139728 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 7398388 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1653 root 39 19 600748 115444 21808 D 1.7 1.4 1:56.81 dnf
1445 jenkins+ 20 0 4571304 171356 22184 S 0.7 2.1 32:50.29 java
1887 jenkins+ 20 0 57484 4048 3508 R 0.7 0.0 0:00.18 top
1 root 20 0 223764 11820 8344 S 0.3 0.1 0:23.76 systemd
Load average is above 7 while I/O wait is at 97.7% with processes entering D state periodically. The hypervisor running this VM is not loaded.
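For reference, a minimal sketch of the commands I'd use to confirm this on the next occurrence (iostat needs the sysstat package installed; the awk filter just keeps processes whose state starts with D):
$ ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'    # processes in uninterruptible sleep and what they wait on
$ iostat -dx 1 3    # per-device utilization and await, 3 samples at 1-second intervals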
The VMs attached to jenkins-staging are mostly running on top of our staging HyperConverged setup, so it's likely some new I/O bottleneck. I'll replace them with VMs running on local-storage hosts, verify gluster health and report bugs to HyperConverged if needed.
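For the gluster health check, the basic commands would be something like the following (a sketch; <volname> is a placeholder for the staging volume name):
$ gluster volume status
$ gluster volume heal <volname> info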

Former user April 24, 2018 at 12:30 PM (edited)
The job that often fails with this error is:
http://jenkins-staging.phx.ovirt.org/job/jenkins_standard-check-patch/
All of the recent jobs seem to be affected, not just the ones that ran after the software versions related to were installed.
Looking at a slave running jenkins_standard-check-patch, it does seem to have a lock file, yet the file is not being updated properly:
[jenkins-staging@vm0177 ~]$ date
Tue Apr 24 12:22:46 UTC 2018
[jenkins-staging@vm0177 ~]$ ls -l /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/
total 8
drwxrwxr-x. 2 jenkins-staging jenkins-staging 4096 Apr 24 12:19 durable-49607d40
drwx------. 2 jenkins-staging jenkins-staging 4096 Apr 24 12:15 secretFiles
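To confirm whether the control directory is actually being touched while the job runs, something like this should show the mtimes advancing (a sketch; the exact file names inside durable-49607d40 depend on the plugin version, e.g. jenkins-log.txt / jenkins-result.txt in the versions I've looked at):
$ watch -n 1 'stat -c "%y %n" /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/durable-49607d40/*'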
pstree snippet from the same node:
java(1274)─┬─sh(17042)───script.sh(17046)───bash(17048)─┬─bash(17180)───mock(17188)───sh(17206)───rpmdb(17220)
           │                                            └─tee(17181)
           ├─sh(17426)─┬─script.sh(17429)───sudo(17488)───rm(17489)
           │           └─sh(17428)───sleep(17630)
I assume PID 17428 is the one responsible for updating the file, but strace was not installed, so I couldn't verify what it was doing before the job completed. On successful job runs I sometimes see the failing yum step running for more than 10 minutes. I will investigate whether this is caused by gluster slowness.
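For the next run, a sketch of what I'd do once strace is available (the PID is whatever the current heartbeat sh process is, 17428 in the pstree above):
$ sudo dnf install -y strace
$ sudo strace -f -tt -T -p 17428    # follow forks, print timestamps and time spent in each syscall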

Former user April 23, 2018 at 12:38 PM
The plugin runs its own heartbeat, which touches the file in question every 3 seconds:
https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L143
If this hasn't changed, then maybe something else is chowning the file and preventing the plugin from writing to it? Also, do we see this on bare-metal slaves with RAM disks as well, or only on VMs?
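A quick way to check both theories on a slave while a job is running (a sketch; the path is the control directory from the earlier comment): stat shows the owner/permissions and whether the mtime advances roughly every 3 seconds, and lsof shows which processes still have files there open.
$ stat -c '%U:%G %a %y %n' /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/durable-*/*
$ lsof +D /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/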
Details
It seems that the durable-task-plugin is aborting jobs that run a shell script after (by default) 15 seconds of inactivity. It considers a shell script inactive if it doesn't write any output to STDOUT.
I haven't yet been able to determine the root cause, or why we don't see the same issue in production.
Maybe we should increase HEARTBEAT_CHECK_INTERVAL as suggested in the error message. But why don't we see this error in the production Jenkins, only in staging?
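If we do decide to raise the timeout, it is controlled by a system property named after the plugin class, so something along these lines in the master's service configuration should work (a sketch; the 300-second value and the /etc/sysconfig/jenkins location are assumptions to double-check against the plugin version we run):
# /etc/sysconfig/jenkins (assumption: RPM-based master; keep any existing options on the line)
JENKINS_JAVA_OPTIONS="-Djava.awt.headless=true -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300"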