"systemctl docker resatert" can get stuck forever

Description

It seems that in some situations the "systemctl restart docker" command can get stuck forever.

This seems to be related to a storage setup issue, but may not be.

The issue was detected on an FC24 slave:
vm0064.workers-phx.ovirt.org

See linked tickets for more details and implications.

Activity

Former user June 7, 2017 at 11:47 AM

Hanging "systemctl docker restart" is what caused this ticket in the first place. If we don't do it then we'll either get rid of the issue or it will move into some other part.
That's why I want to check if we really need it for anything. In my opinion dropping containers and images should be enough to have a "clean start" for the next run and "systemctl start docker" should be enough to ensure the daemon is running when the job starts.

We have libvirt cleanup steps in mock_cleanup.sh and they don't restart the daemon; can we try the same approach for docker as well? Docker should come back up after a restart, and it does so most of the time. After all, mock_cleanup.sh restarts docker after each job, even if the job didn't involve containers, and we have only had a few reports of a stuck docker so far, so it isn't a very common issue from what I understand. If we restart less often, there will hopefully be an even lower chance of hitting this.
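
A minimal sketch of the proposed flow, assuming the steps live in a shell script like mock_cleanup.sh (the commands below are illustrative, not the actual CI code):

    # Post-job cleanup: drop containers and images, but do not touch the daemon.
    docker ps -aq | xargs -r docker rm -f      # remove all containers, running or stopped
    docker images -q | xargs -r docker rmi -f  # remove all images

    # Job start: only make sure the daemon is up.
    # 'systemctl start' is a no-op when docker is already active, unlike 'restart'.
    systemctl start docker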

Former user June 7, 2017 at 10:18 AM
Edited

We're restarting docker during the cleanup of every build to make sure we get a 'clean start' for the next build.
In addition, we're restarting docker at the beginning of every build to make sure that the daemon is up.
Do you suspect that the restart might be causing this issue?
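
A hedged sketch of the current flow as described above (the real mock_cleanup.sh is not quoted in this ticket, so the exact commands are assumptions):

    # Post-build cleanup (mock_cleanup.sh): remove leftover images, then restart for a 'clean start'.
    docker images -q | xargs -r docker rmi -f
    systemctl restart docker   # the step that occasionally hangs forever

    # Beginning of every build: restart again to make sure the daemon is up.
    systemctl restart docker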

Former user June 7, 2017 at 8:54 AM

Is there a problem with using the standard "docker rmi" command to delete images during post-job cleanup?

Thanks for the info. I see that in mock_cleanup.sh we remove images using the usual "docker rmi" command and then the daemon is restarted. Is there a real need for a restart? Did we have problems without this step?

Former user June 7, 2017 at 7:32 AM

Between jobs we remove some container images and restart the docker service. Do you think that may have some effect on this issue?

Former user June 6, 2017 at 3:05 PM

I checked vm0064 and it looks like the rootfs device /dev/vda3 was mounted at /var/lib/docker/devicemapper.

This apparently caused the following error from docker on startup:
devmapper: Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored.

/etc/sysconfig/docker-storage-setup does not contain anything, so I'm not sure whether docker-storage-setup had anything to do with it. This mount likely locked up docker and could eventually corrupt the OS, since the root filesystem ended up mounted inside docker's work directory.

The filesystem wouldn't unmount (on the kernel side it is still the rootfs), so the fix was to reboot the VM.
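
A hedged sketch of a check that could flag this state automatically before a job runs (the device and path come from this comment; the check itself is an assumption, not something the CI currently does):

    # Fail fast if the rootfs device is also mounted somewhere under /var/lib/docker.
    rootdev=$(findmnt -n -o SOURCE /)
    if findmnt -n -o TARGET "$rootdev" | grep -q '^/var/lib/docker/'; then
        echo "ERROR: rootfs device $rootdev is mounted inside /var/lib/docker, reboot needed" >&2
        exit 1
    fi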

To move forward it would be good to know what we do on the docker side between job runs that may cause this. Do you have any details?

Barak Korren May 31, 2017 at 8:03 AM

Stack trace of the stuck docker process; hopefully we can find a way to automatically recover from this:
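
The trace itself is attached to the ticket rather than quoted here. A hedged sketch of how such traces can be collected from a stuck daemon (assuming the daemon process is named dockerd; the Go daemon dumps its goroutine stacks to its log on SIGUSR1):

    # Kernel-side stack of the hung daemon, useful when it is blocked in the kernel (e.g. on a mount):
    cat /proc/"$(pidof dockerd)"/stack

    # Ask the daemon itself to dump all goroutine stack traces, then read them from the journal:
    kill -USR1 "$(pidof dockerd)"
    journalctl -u docker -n 200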

Details

Resolution: Fixed

Created May 31, 2017 at 7:43 AM
Updated November 1, 2017 at 10:03 AM
Resolved October 3, 2017 at 12:35 PM