durable-task-plugin aborting jobs

Description

It seems that the durable-task-plugin is aborting jobs that run a shell script after the default 15 seconds of inactivity. It considers a shell script inactive if it doesn't write any output to STDOUT.

I haven't yet been able to determine the root cause, or why we don't see the same issue in production.

Maybe we should increase HEARTBEAT_CHECK_INTERVAL as suggested in the error message. But why don't we see this error in production Jenkins, only in staging?
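For reference, this is roughly how the interval could be raised on the master, via the system property the error message points at (the file location and the 300-second value below are just an example, not what we currently run with):

# /etc/sysconfig/jenkins (assumed location; adjust to however the master's JVM options are managed)
# Raise the durable-task heartbeat check interval from the default 15s to 300s;
# keep any options already present in this variable.
JENKINS_JAVA_OPTIONS="-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300"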

Activity

Former user April 26, 2018 at 2:52 PM

All VMs running on gluster were shut down and will be replaced with systems running on local storage hosts. The timeout issue is no longer present in job logs.

Former user April 24, 2018 at 4:13 PM
Edited

Looking at another VM while it's running jenkins_standard-check-patch, it does look like the I/O subsystem is at the bottom of this:

$ top
top - 16:06:37 up 4 days,  4:16,  1 user,  load average: 7.11, 3.98, 1.74
Tasks: 115 total,   1 running,  77 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  1.3 sy,  0.2 ni,  0.0 id, 97.7 wa,  0.2 hi,  0.2 si,  0.0 st
KiB Mem :  8165284 total,  2605380 free,   420176 used,  5139728 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7398388 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1653 root      39  19  600748 115444  21808 D   1.7  1.4   1:56.81 dnf
 1445 jenkins+  20   0 4571304 171356  22184 S   0.7  2.1  32:50.29 java
 1887 jenkins+  20   0   57484   4048   3508 R   0.7  0.0   0:00.18 top
    1 root      20   0  223764  11820   8344 S   0.3  0.1   0:23.76 systemd

Load average is above 7 while I/O wait is at 97.7% with processes entering D state periodically. The hypervisor running this VM is not loaded.
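To make this easier to confirm next time it happens, a couple of standard commands should be enough to show whether the disk is the culprit (iostat needs the sysstat package installed):

iostat -x 5 3                                  # per-device %util and await; sustained %util near 100 confirms the I/O bottleneck
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'  # processes stuck in uninterruptible (D) sleep and what they are waiting on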

The VMs attached to jenkins-staging are mostly running on top of our staging HyperConverged setup, so it's likely some new I/O bottleneck. I'll replace them with VMs running on local-storage hosts, verify gluster health and report bugs to HyperConverged if needed.
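For the gluster health check I expect something along these lines to be sufficient (VOLNAME is a placeholder for the actual staging volume name):

gluster volume status
gluster volume heal VOLNAME info
gluster volume heal VOLNAME info split-brain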

Former user April 24, 2018 at 12:30 PM
Edited

The job that often fails with this error is:
http://jenkins-staging.phx.ovirt.org/job/jenkins_standard-check-patch/

All of the recent jobs seem to be affected, not just the ones run after the related software versions were installed.

Looking at a slave running jenkins_standard-check-patch, it does seem to have a lock file, yet the file isn't being updated properly:

[jenkins-staging@vm0177 ~]$ date
Tue Apr 24 12:22:46 UTC 2018
[jenkins-staging@vm0177 ~]$ ls -l /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/
total 8
drwxrwxr-x. 2 jenkins-staging jenkins-staging 4096 Apr 24 12:19 durable-49607d40
drwx------. 2 jenkins-staging jenkins-staging 4096 Apr 24 12:15 secretFiles
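To see whether the heartbeat actually advances, the control directory's timestamps can be watched while a job runs (a glob is used because the durable-* name changes per run):

watch -n 5 'date; ls -l --full-time /home/jenkins-staging/workspace/jenkins_standard-check-patch@tmp/durable-*/'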

pstree snippet from the same node:

java(1274)─┬─sh(17042)───script.sh(17046)───bash(17048)─┬─bash(17180)───mock(17188)───sh(17206)───rpmdb(17220)
           │                                            └─tee(17181)
           ├─sh(17426)─┬─script.sh(17429)───sudo(17488)───rm(17489)
           │           └─sh(17428)───sleep(17630)

I assume PID 17428 is the one responsible for updating the file, but strace wasn't installed, so I couldn't verify what it was doing before the job completed. On successful job runs I sometimes see the failing yum step running for more than 10 minutes. I'll investigate whether this is induced by gluster slowness.
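For the next occurrence, something like this should show whether that shell is blocked on I/O while touching the control file (strace has to be installed first; the PID is the one from the pstree above and will differ on the next run):

sudo dnf install -y strace                         # or yum install strace, depending on the slave
sudo strace -f -tt -p 17428 -e trace=file,write    # follow children, timestamp each syscall, limit to file-related calls and writes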

Former user April 23, 2018 at 12:38 PM

The plugin has its own heartbeat task, which touches the file in question every 3 seconds:
https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L143

If that doesn't change, then maybe something else is chowning it and not letting the plugin write to it? Also, do we see this on bare-metal slaves with RAM disks as well, or only on VMs?
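One way to rule out the chown theory would be to stat the control files on an affected node while the job is running (the path below is a sketch of the usual workspace@tmp layout, not taken from a specific failure):

stat /path/to/workspace/JOB_NAME@tmp/durable-*/*   # shows owner, mode and the access/modify/change timestamps of each control file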

Fixed

Details

Created April 22, 2018 at 12:32 PM
Updated May 1, 2018 at 7:15 AM
Resolved April 26, 2018 at 2:52 PM