VDSM CI fails on all new FC28 slaves because of the 'mock' GID

Description

We see all check-patch jobs for VDSM failing when running on the new FC28 slaves, with an error like the following:

The reason for this issue is that mock tries to give the 'mock' group the same GID inside and outside of the mock environment. On the new slaves the GID of the 'mock' group is 1000. Since 1000 is the first GID you get if you run 'groupadd', the probability of it colliding with a GID created by a package inside the mock environment is quite high.

In the case of the VDSM CI, GID 1000 is taken by the 'openvswitch' package for the 'hugetlbfs' group. This on its own is a bug, because no package should be taking GIDs of 1000 or above. Having said that, the CI should not break on this, so we need to move the mock GID below 1000, to a number that is unlikely to collide with others.
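
The collision described above can be illustrated with a minimal sketch. It assumes the group databases of the host and of the mock environment are available as text in /etc/group format. The function names and the sample entries are hypothetical and only mirror the case in this ticket.

```python
def parse_groups(text):
    """Map GID -> group name from /etc/group-style text (name:passwd:gid:members)."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _passwd, gid, _members = line.split(":", 3)
        groups[int(gid)] = name
    return groups


def gid_collisions(host_text, chroot_text):
    """Return (gid, host_name, chroot_name) for GIDs that map to different names."""
    host = parse_groups(host_text)
    chroot = parse_groups(chroot_text)
    return sorted(
        (gid, host[gid], chroot[gid])
        for gid in host.keys() & chroot.keys()
        if host[gid] != chroot[gid]
    )


# Hypothetical sample data: 'mock' got GID 1000 on the slave, while a package
# inside the mock environment used the same GID for 'hugetlbfs'.
host_groups = "root:x:0:\nmock:x:1000:jenkins,jenkins-staging\n"
chroot_groups = "root:x:0:\nhugetlbfs:x:1000:\n"

print(gid_collisions(host_groups, chroot_groups))
```

With the 'mock' group on a low GID such as 135, the same check would report no collision for this sample.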

In the meantime all fc28 slaves will be disabled.

Activity

Former user June 18, 2018 at 2:54 PM

Re-triggered some VDSM check-patch runs that failed before. They still fail, but now because of failing tests, which may be a sign of an actual bug:

before: https://jenkins.ovirt.org/job/vdsm_master_check-patch-fc28-x86_64/8/console
after: https://jenkins.ovirt.org/job/vdsm_master_check-patch-fc28-x86_64/23/console

I believe we can close this ticket and log a new one if a different issue is discovered.

Former user June 18, 2018 at 12:46 PM

Here are the default cloud-init stages:
https://github.com/cloud-init/cloud-init/blob/master/config/cloud.cfg.tmpl

I moved users-groups from cloud_init_modules to cloud_final_modules, after package-update-upgrade-install, and this results in the correct GID mapping. All fc28 machines have been rebuilt. I'll do one more verification round before putting them online.
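
The reordering described above can be sketched as follows. This is only an illustration on a plain dictionary: the stage lists are heavily trimmed to the two modules named in this ticket, and move_module is a hypothetical helper, not part of cloud-init.

```python
import copy


def move_module(config, name, src, dst, after):
    """Return a copy of config with `name` moved from list `src`
    into list `dst`, placed right after the entry `after`."""
    new = copy.deepcopy(config)
    new[src].remove(name)
    position = new[dst].index(after) + 1
    new[dst].insert(position, name)
    return new


# Trimmed stand-in for the stage lists in cloud.cfg (assumption: real
# lists contain many more modules, which would be left untouched).
cloud_cfg = {
    "cloud_init_modules": ["users-groups"],
    "cloud_final_modules": ["package-update-upgrade-install"],
}

fixed = move_module(
    cloud_cfg,
    "users-groups",
    src="cloud_init_modules",
    dst="cloud_final_modules",
    after="package-update-upgrade-install",
)
print(fixed["cloud_final_modules"])
```

Running users-groups only after the package installation lets the 'mock' package create its group first, so the later group handling finds the group already present instead of creating it with the first free GID.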

Barak Korren June 18, 2018 at 10:51 AM

Created to add preventative measures so we'll catch issues like this earlier next time.

Former user June 18, 2018 at 8:57 AM

I looked into this together with Barak. GID 135 seems to be the default one for mock:
https://bugzilla.redhat.com/show_bug.cgi?id=928063

Cloud-init logs show that users and groups are defined before software is installed:
2018-06-07 09:30:18,355 - util.py[DEBUG]: Running command ['groupadd', 'mock'] with allowed return codes [0] (shell=False, capture=True)
2018-06-07 09:30:18,394 - __init__.py[INFO]: Created new group mock
2018-06-07 09:30:18,395 - __init__.py[DEBUG]: created group 'mock' for user 'jenkins'
2018-06-07 09:30:18,395 - _init_.py[DEBUG]: Adding user jenkins
...
2018-06-07 09:31:20,519 - util.py[DEBUG]: Running command ['dnf', '-y', 'install', 'ovirt-guest-agent-common', 'java-1.8.0-openjdk-headless', 'kernel-core', 'gdbm', 'glibc', 'systemd', 'git', 'mock', 'PyYAML'] with allowed return codes [0] (shell=False, capture=False)

Changing the order of cloud_init_modules should fix this.
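
The ordering seen in the log excerpt above can also be checked mechanically. The following is a minimal sketch that assumes the log keeps the timestamp format shown in this comment; first_timestamp is a hypothetical helper.

```python
from datetime import datetime


def first_timestamp(log_lines, needle):
    """Return the timestamp of the first log line containing `needle`, or None."""
    for line in log_lines:
        if needle in line:
            stamp = line.split(" - ", 1)[0]
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return None


# The two relevant lines, copied from the cloud-init log quoted in this ticket.
log = [
    "2018-06-07 09:30:18,355 - util.py[DEBUG]: Running command ['groupadd', 'mock'] "
    "with allowed return codes [0] (shell=False, capture=True)",
    "2018-06-07 09:31:20,519 - util.py[DEBUG]: Running command ['dnf', '-y', 'install', "
    "'ovirt-guest-agent-common', 'java-1.8.0-openjdk-headless', 'kernel-core', 'gdbm', "
    "'glibc', 'systemd', 'git', 'mock', 'PyYAML'] with allowed return codes [0] "
    "(shell=False, capture=False)",
]

group_created = first_timestamp(log, "'groupadd', 'mock'")
packages_installed = first_timestamp(log, "'dnf', '-y', 'install'")

# True means the group was created before the 'mock' package was installed.
print(group_created < packages_installed)
```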

Former user June 18, 2018 at 8:34 AM

Looking at some of our systems:

system  distro  GID (mock entry in /etc/group)
vm0055  fc28    /etc/group:mock:x:1000:jenkins,jenkins-staging
vm0150  fc27    /etc/group:mock:x:1000:jenkins,jenkins-staging
vm0003  el7     /etc/group:mock:x:135:jenkins
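
A survey like the one above could be automated along these lines. This is a minimal sketch that assumes the entries are collected in the same grep-style form as shown; the threshold of 1000 reflects the boundary discussed in this ticket, and mock_gid is a hypothetical helper.

```python
SYSTEM_GID_LIMIT = 1000  # assumption: the 'mock' GID is expected to stay below this


def mock_gid(grep_line):
    """Extract the GID from a line like '/etc/group:mock:x:135:jenkins'."""
    _path, name, _passwd, gid, _members = grep_line.split(":", 4)
    if name != "mock":
        raise ValueError(f"not a mock group entry: {grep_line!r}")
    return int(gid)


# Entries as recorded in this ticket.
observed = {
    "vm0055": "/etc/group:mock:x:1000:jenkins,jenkins-staging",
    "vm0150": "/etc/group:mock:x:1000:jenkins,jenkins-staging",
    "vm0003": "/etc/group:mock:x:135:jenkins",
}

suspect = sorted(
    host for host, line in observed.items() if mock_gid(line) >= SYSTEM_GID_LIMIT
)
print(suspect)
```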

We need to check the order in which cloud-init runs its steps, to verify whether the mock group membership assignment is performed before the mock package is installed; the package install should assign a lower GID. Here's the cloud-init config for reference:
https://gerrit-staging.phx.ovirt.org/gitweb?p=ederevea-infra-test.git;a=blob;f=cloud-init/fedora/fedora.yaml;h=1981e4fcba147c2854c83f1a4ed9f12a7364e31b;hb=refs/heads/master

The packages section is defined before users, but the two may be run in a different order.

Fixed

Details

Created June 18, 2018 at 8:18 AM
Updated September 2, 2018 at 3:50 PM
Resolved June 19, 2018 at 10:44 AM