Several OpenShift bare-metal nodes in PHX are marked NotReady with "PLEG is not healthy" errors:
origin-node: I0225 13:13:56.574031 38916 kubelet.go:1779] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h14m43.440447302s ago; threshold is 3m0s]
We need to confirm the root cause, since this destabilizes the CI system and reduces its capacity.
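A quick sketch for spotting the affected nodes (assumes `oc` login to the PHX cluster). Sample `oc get nodes --no-headers` output is inlined here so the filter is self-contained; node names are made up:

```shell
# In practice: oc get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
printf 'node-a.phx Ready compute\nnode-b.phx NotReady compute\n' |
  awk '$2 == "NotReady" {print $1}'
# → node-b.phx
# Then, on an affected node, check how long PLEG has been stuck:
#   journalctl -u origin-node | grep "PLEG is not healthy"
```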
Long queues on the staging Jenkins are to be expected given the small number of VM slaves we have there.
WRT containerized slaves: by the time they show up in the Jenkins GUI they are always offline. That means nothing, since the K8s plugin allocates new slaves on the fly.
I think the current issues you are seeing are due to the new containerized-slave CI coverage we added recently (basically, code changes to the CI system are now tested on containerized slaves as well; we added this after a bunch of issues on them were caused by a change to global_setup.sh).
Please check the K8s plugin configuration on the staging Jenkins; it may be that the configuration we have there is bad and is causing the K8s plugin to run in endless loops trying to allocate pods that constantly fail.
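One way to spot such a loop is to count agent pods stuck failing in the staging namespace; a steadily growing count would point at the plugin recreating pods that never come up. A self-contained sketch with made-up sample `oc get pods` output:

```shell
# In practice: oc get pods -n <staging-namespace> --no-headers | <this awk>
printf 'slave-abc 0/1 Error 3 2m\nslave-def 1/1 Running 0 1m\nslave-ghi 0/1 CrashLoopBackOff 5 4m\n' |
  awk '$3 == "Error" || $3 == "CrashLoopBackOff"' | wc -l
# → 2
```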
I looked at the K8s plugin config on the staging Jenkins; it looks pretty much the same but uses a different version of the image. Changing it to the one from prod didn't change much, so I'm stuck here. If this keeps putting extra load on OpenShift and we have no way to fix it, I'll have to go ahead and disable the whole staging namespace to stabilize things.
We need to figure out why staging is failing to allocate pods, and do that before production starts behaving the same way...
Please look at this more closely and make it work rather than disabling staging. You can see the K8s plugin logs from the Jenkins UI (we have a specialized filter set up to enable watching just those logs).
Anyway, I've had a brief look: it seems the plugin is managing to allocate pods, but they do not connect back over JNLP. It may be that the JNLP thread has died on the staging master.
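A minimal way to rule out a dead JNLP listener is to check whether the master's JNLP port accepts TCP connections at all. The host and port below are assumptions (50000 is a common JNLP default; the real value is set in Jenkins' global security configuration):

```shell
host=jenkins-staging; port=50000   # assumed host/port, adjust to your setup
if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$host/$port" 2>/dev/null; then
  echo "JNLP port reachable"
else
  echo "JNLP port closed"
fi
```

If the port is closed while Jenkins itself is up, restarting the master (or re-enabling the fixed TCP port in the security settings) is the usual next step.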
It also seems that most jobs can be aborted, since they are running for patches we have already merged.
Thanks, with just 1 pod trying to start it was easy to find the fault: it turned out to be a typo in the pod spec. Specifically, CI_RUNTIME_UNAME was set to jenkins2 for some reason; setting it back to "jenkins" restored proper pod operation.
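For reference, the fix amounts to this kind of change in the plugin's pod template; only the CI_RUNTIME_UNAME entry comes from the ticket, the container name and image are illustrative:

```yaml
containers:
- name: jnlp                      # illustrative container name
  image: jenkins-slave:staging    # illustrative image
  env:
  - name: CI_RUNTIME_UNAME
    value: "jenkins"              # was "jenkins2"; pods never came up
```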
We're also monitoring file descriptors with Prometheus now, so an alert should trigger once the count goes above 40k. Looking at the graph after removing kdump, the value never went higher than 20k.
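A sketch of what such an alerting rule could look like; the metric name assumes node_exporter's `node_filefd_allocated`, and the group/alert names and labels are illustrative, not taken from our actual config:

```yaml
groups:
- name: fd-usage
  rules:
  - alert: FileDescriptorsHigh        # illustrative alert name
    expr: node_filefd_allocated > 40000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Allocated file descriptors above 40k on {{ $labels.instance }}"
```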