OpenShift nodes NotReady: PLEG is not healthy

Description

Several OpenShift bare-metal nodes in PHX are marked as NotReady with "PLEG is not healthy" errors:

origin-node[38916]: I0225 13:13:56.574031 38916 kubelet.go:1779] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h14m43.440447302s ago; threshold is 3m0s]

Need to confirm the reason since this destabilizes the CI system and decreases its capacity.
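For triage, the age and threshold in the kubelet message above can be pulled out and compared programmatically. The following is a minimal sketch, assuming the message format is exactly as in the log line quoted here; the function names are hypothetical, and the duration parser only handles the h, m, s and ms units, not every unit Go can print.

```python
import re

# Matches the kubelet message format quoted in this ticket (an assumption:
# other kubelet versions may word it differently).
PLEG_RE = re.compile(
    r"pleg was last seen active (?P<age>\S+) ago; threshold is (?P<threshold>[^\]\s]+)"
)
# One Go-style duration component, e.g. "4h", "14m", "43.44s", "250ms".
UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s)")
FULL_RE = re.compile(r"(?:\d+(?:\.\d+)?(?:h|ms|m|s))+")
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_go_duration(text):
    """Convert a duration such as '4h14m43.44s' to seconds (h/m/s/ms only)."""
    if not FULL_RE.fullmatch(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * UNIT_SECONDS[unit] for num, unit in UNIT_RE.findall(text))


def pleg_status(line):
    """Return (age_seconds, threshold_seconds, exceeded) or None if no match."""
    match = PLEG_RE.search(line)
    if match is None:
        return None
    age = parse_go_duration(match.group("age"))
    threshold = parse_go_duration(match.group("threshold"))
    return age, threshold, age > threshold


if __name__ == "__main__":
    sample = (
        "origin-node[38916]: I0225 13:13:56.574031 38916 kubelet.go:1779] "
        "skipping pod synchronization - [PLEG is not healthy: pleg was last "
        "seen active 4h14m43.440447302s ago; threshold is 3m0s]"
    )
    print(pleg_status(sample))
```

Running this over the node journal would give a quick list of how far past the threshold each node has drifted.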

Activity

Former user February 27, 2020 at 5:28 PM

Thanks, with just 1 pod trying to start it was easy to find the fault: it turned out to be a typo in the pod spec. Specifically, CI_RUNTIME_UNAME was set to "jenkins2" for some reason. Setting it back to "jenkins" restored proper pod operation.
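A check like the one below could catch this kind of typo before a pod template is deployed. It is only a sketch: it assumes the pod spec has been loaded into a dict with the standard spec.containers[].env[] layout, and the function name and the container name "jnlp" in the example are hypothetical.

```python
def find_env_mismatches(pod, name, expected):
    """Return (container_name, actual_value) for each container whose
    environment variable `name` is set to something other than `expected`."""
    mismatches = []
    for container in pod.get("spec", {}).get("containers", []):
        for env in container.get("env", []):
            if env.get("name") == name and env.get("value") != expected:
                mismatches.append((container.get("name"), env.get("value")))
    return mismatches


if __name__ == "__main__":
    # Hypothetical pod spec reproducing the typo described in this comment.
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "jnlp",
                    "env": [{"name": "CI_RUNTIME_UNAME", "value": "jenkins2"}],
                }
            ]
        }
    }
    print(find_env_mismatches(pod, "CI_RUNTIME_UNAME", "jenkins"))
```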

We’re also monitoring file descriptors using Prometheus now, so an alert should trigger once the count goes above 40k. Looking at the graph after removing kdump, the value never went higher than 20k.
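The same threshold can be spot-checked by hand on a node. The sketch below assumes the monitored value is the node-wide count of allocated file handles as reported in /proc/sys/fs/file-nr; the ticket does not say which Prometheus metric is actually used, and the function names are hypothetical.

```python
FD_ALERT_THRESHOLD = 40_000  # the "above 40k" limit mentioned in this comment


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr content: allocated, unused, maximum."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_alert(allocated, threshold=FD_ALERT_THRESHOLD):
    """True when the allocated handle count is strictly above the threshold."""
    return allocated > threshold


if __name__ == "__main__":
    # Linux only: read the live counters from procfs.
    with open("/proc/sys/fs/file-nr") as handle:
        allocated, _unused, maximum = parse_file_nr(handle.read())
    print(f"allocated={allocated} max={maximum} alert={fd_alert(allocated)}")
```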

Barak Korren February 26, 2020 at 3:40 AM

I aborted all the jobs for merged patches and old patchsets, the only one still running atm is this one which is testing patchset #3 of patch #107167.

See if you can make it succeed in allocating a container and continuing to run.

Barak Korren February 26, 2020 at 3:31 AM

We need to figure out why staging is failing to allocate PODs, and do that before production ends up behaving the same way...

Please look at this more closely and make it work rather than disabling staging. You can see the logs of the K8s plugin from the Jenkins UI (we have a specialized filter set up to enable watching just those logs).

Anyway, I've had a brief look. It seems the plugin is managing to allocate PODs, but they do not connect back over JNLP. It may be the case that the JNLP thread has died on the staging master.

It also seems that most jobs can also be aborted since they are running for patches we have already merged.

Former user February 25, 2020 at 6:44 PM

I looked at the K8s plugin config on the staging Jenkins. It looks pretty much the same but uses a different version of the image. Changing it to the one from prod didn’t change much, so I’m stuck here. If this does not work and keeps putting extra load on OpenShift while we have no way to fix it, I’ll have to go ahead and disable the whole staging namespace to stabilize things.

Barak Korren February 25, 2020 at 6:05 PM

Long queues on the staging Jenkins are to be expected given the small number of VM slaves we have there.

WRT containerized slaves: by the time they show up in the Jenkins GUI they are always offline. It means nothing, since the K8s plugin allocates new slaves on the fly.

I think the current issues you are seeing are due to the new containerized slave CI coverage we added recently (Basically code changes to the CI system are now tested on containerized slaves as well. We added this after we had a bunch of issues on them caused by a change to global_setup.sh).

Please check the K8s plugin configuration on the staging Jenkins. It may be the case that the configuration we have there is bad and it's causing the K8s plugin to run in endless loops trying to allocate PODs that constantly fail.

Fixed

Details
Assignee
Former user(Deactivated)
Reporter
Former user(Deactivated)
Priority
High

Created February 25, 2020 at 1:33 PM

Updated February 27, 2020 at 5:28 PM

Resolved February 27, 2020 at 5:28 PM
