Docker hangs after upgrading to docker 1.13.1-90 or 1.13.1-91
Environment
- docker-1.13.1-90.git07f3374
- docker-1.13.1-91.git07f3374
Issue
- After upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374, the docker service will not start.
- The Docker daemon will not start, and the journal shows the following error: level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container
- After enabling the Live Restore feature in Docker and restarting, the daemon hangs on docker ps calls.
- After upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374, docker commands hang indefinitely.
- OpenShift nodes report PLEG is not healthy and node NotReady after upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374.
Resolution
This issue has been fixed in docker-1.13.1-94.gitb2f74b2, which was released in RHSA-2019:0487. Update to docker-1.13.1-94.gitb2f74b2 or later to prevent this issue.
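To see where an installed package falls relative to the releases named in this article, the release number can be pulled out of the package string. The following is a minimal sketch, not an official tool: the function name classify_docker_release is hypothetical, and it assumes the release field begins with a plain integer, as in docker-1.13.1-90.git07f3374. Releases that the article does not mention are reported as not-listed rather than guessed at.

```shell
#!/bin/sh
# Classify a docker name-version-release string against the releases
# named in this article. Hypothetical helper; assumes the release field
# starts with a plain integer (for example 90 in docker-1.13.1-90.git07f3374).
classify_docker_release() {
    nvr=$1
    # Only docker-1.13.1 packages are covered by this article.
    case $nvr in
        docker-1.13.1-*) rel=${nvr#docker-1.13.1-} ;;
        *) echo "unknown"; return 0 ;;
    esac
    # Keep only the part of the release field before the first dot.
    rel=${rel%%.*}
    case $rel in
        ''|*[!0-9]*) echo "unknown"; return 0 ;;
    esac
    if [ "$rel" -eq 90 ] || [ "$rel" -eq 91 ]; then
        echo "affected"
    elif [ "$rel" -ge 94 ]; then
        echo "fixed"
    else
        # The article makes no statement about this release.
        echo "not-listed"
    fi
}
```

On a host with the package installed, it could be called as classify_docker_release "$(rpm -q docker)".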
Root Cause
The problem can occur under the following circumstances.
1. A container is started by the currently running instance of containerd.
2. The currently running instance of containerd terminates and a new instance of containerd is started during one of these scenarios:
   - Scenario 2a) If the Live Restore feature is enabled and the docker service gets restarted, for example via systemctl restart docker.
   - Scenario 2b) containerd becomes unresponsive and gets killed/restarted by dockerd. This scenario is independent of the Live Restore feature.
3. The container that was started in step 1 by the previously running instance of containerd now terminates. Due to a flaw in the RHEL 7 docker packages mentioned under 'Environment', a crucial function in containerd gets blocked indefinitely at the time when it tries to clean up the terminated container.
At this stage, containerd is no longer able to handle requests from dockerd. A docker ps command will show the terminated container in state running because containerd is unable to report the termination of the container back to dockerd. If an attempt is made to kill that container, the request will be passed to containerd but will remain pending indefinitely. Subsequent docker ps commands will hang because the container is locked by the pending kill request.
Dockerd communicates with containerd via an AF_UNIX type socket connection. Dockerd periodically sends a health check request to containerd. If containerd fails to respond, dockerd kills the currently running instance of containerd and starts a new one. This is indicated by killing and restarting containerd messages in the journal log.
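The restart messages described above can be counted from journal output to see how often dockerd has had to replace containerd. This is a rough sketch: the function name count_containerd_restarts is hypothetical, and the match string is taken from the wording in this article, so the exact text in a given journal may differ.

```shell
#!/bin/sh
# Count log lines that contain the containerd restart message.
# Hypothetical helper; reads log text on stdin and prints the number of
# matching lines. The match string is assumed from this article's wording.
count_containerd_restarts() {
    # grep -c prints the count but exits 1 when it is 0, so mask that status.
    grep -c 'killing and restarting containerd' || true
}
```

One possible use is journalctl -u docker | count_containerd_restarts, where the unit name docker matches the systemctl restart docker command mentioned earlier.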
Diagnostic Steps
- Run docker ps and monitor for immediate hang
- Check node status: oc get nodes
- Check OpenShift node logs for PLEG health issues: journalctl -u atomic-openshift-node | grep 'PLEG is not healthy'
- Look in journal logs for warning messages that the client is out of sync: journalctl -u dockerd-current | grep 'level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container'
- How to automatically gather a coredump right after an OpenShift PLEG issue
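Because a hung docker ps never returns, it can help to run the check under a time limit so that a script does not block. The sketch below is one way to do that, under stated assumptions: the function name probe_hang is hypothetical, and it relies on timeout from GNU coreutils, which exits with status 124 when the limit expires.

```shell
#!/bin/sh
# Run a command under a time limit and report whether it hung.
# Hypothetical helper; relies on timeout(1) from GNU coreutils, which
# exits with status 124 when the time limit expires.
probe_hang() {
    limit=$1
    shift
    status=0
    # Discard the command's output; only its exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "error"
    fi
}
```

For example, probe_hang 10 docker ps prints hang if docker ps has not returned within 10 seconds. The 10 second limit is an arbitrary choice and may need adjusting on heavily loaded nodes.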