Docker hangs after upgrading to docker 1.13.1-90 or 1.13.1-91

Solution Verified - Updated

Environment

  • docker-1.13.1-90.git07f3374
  • docker-1.13.1-91.git07f3374

Issue

  • After upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374, docker service will not start
  • Docker daemon will not start; the journal shows level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container" messages.
  • After enabling the Live Restore feature in Docker and restarting, the daemon hangs on docker ps calls.
  • After upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374, docker commands hang indefinitely.
  • OpenShift nodes report "PLEG is not healthy" and become NotReady after upgrading to docker-1.13.1-90.git07f3374 or docker-1.13.1-91.git07f3374.

Resolution

This issue has been fixed in docker-1.13.1-94.gitb2f74b2, which was released in RHSA-2019:0487. Update to docker-1.13.1-94.gitb2f74b2 or later to prevent this issue.
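To check a host against the versions named in this article, the installed package string can be classified with a small helper. The following is a minimal sketch and not part of the original solution: the function name docker_build_status and the "unlisted" label are made up for illustration, and releases this article does not mention (for example -92 or -93) are deliberately not judged.

```shell
# Classify a docker name-version-release string against this article.
# Hypothetical helper: only -90 and -91 are treated as affected and
# -94 or later as fixed, because those are the only builds the article names.
docker_build_status() {
    # Pull NN out of "docker-1.13.1-NN.<rest>"; empty if the string has another shape.
    rel=$(printf '%s\n' "$1" | sed -n 's/^docker-1\.13\.1-\([0-9][0-9]*\)\..*$/\1/p')
    case "$rel" in
        90|91) echo "affected" ;;
        '')    echo "unlisted" ;;
        *)     if [ "$rel" -ge 94 ]; then echo "fixed"; else echo "unlisted"; fi ;;
    esac
}
```

For example, docker_build_status "$(rpm -q docker)" prints affected for the two builds listed under Environment and fixed for -94 or later.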

Root Cause

The problem can occur under the following circumstances.

  1. A container is started by the currently running instance of containerd.

  2. The currently running instance of containerd terminates and a new instance of containerd is started during one of these scenarios:

    • Scenario 2a) If the Live Restore feature is enabled and the docker service gets restarted, for example via systemctl restart docker.

    • Scenario 2b) Containerd becomes unresponsive and gets killed/restarted by dockerd [1]. This scenario is independent of the Live Restore feature.

  3. The container that was started in step 1 by the previously running instance of containerd now terminates. Due to a flaw in the RHEL7 docker packages mentioned under 'Environment', a crucial function in containerd gets blocked indefinitely at the time when it tries to clean up the terminated container.

At this stage, containerd is no longer able to handle requests from dockerd. A docker ps command will show the terminated container in state running because containerd is unable to report the termination of the container back to dockerd. If an attempt is made to kill that container, the request will be passed to containerd but will remain pending indefinitely. Subsequent docker ps commands will hang because the container is locked by the pending kill request.
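Scenario 2a only applies when Live Restore is enabled, so it can be useful to know whether a host has it turned on. The sketch below assumes the setting is written as "live-restore": true in a daemon.json-style file (the upstream Docker location is /etc/docker/daemon.json); a host that enables it through daemon command-line options instead is not covered. The function name live_restore_enabled is hypothetical.

```shell
# Succeed (exit 0) if the given daemon.json-style file enables Live Restore.
# Assumption: the option appears as "live-restore": true in that file.
live_restore_enabled() {
    file="${1:-/etc/docker/daemon.json}"
    # A missing or unreadable file is treated as "not enabled here".
    [ -r "$file" ] || return 1
    grep -Eq '"live-restore"[[:space:]]*:[[:space:]]*true' "$file"
}
```

A result of "not enabled" does not rule the problem out, because scenario 2b is independent of Live Restore.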

[1] Dockerd communicates with containerd via an AF_UNIX type socket connection. Dockerd periodically sends a health check request to containerd. If containerd fails to respond, dockerd kills the currently running instance of containerd and starts a new one. This is indicated by "killing and restarting containerd" messages in the journal log.
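To see whether scenario 2b has occurred on a host, the journal can be searched for the restart messages described above. This is a minimal sketch: the helper name count_signature is hypothetical, and the search string is taken from the wording of this article, so the exact text in a given journal may differ.

```shell
# Count lines on stdin that contain a fixed string.
count_signature() {
    # -F: match a fixed string, -c: print the number of matching lines.
    # grep exits 1 when the count is 0, so mask the status.
    grep -cF -- "$1" || true
}
```

For example, journalctl --no-pager -b | count_signature 'killing and restarting containerd' prints how many such lines were logged since the last boot.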

Diagnostic Steps

  • Run docker ps and check whether it hangs immediately.
  • Check node status: oc get nodes
  • Check OpenShift node logs for PLEG health issues: journalctl -u atomic-openshift-node | grep 'PLEG is not healthy'
  • Look in journal logs for warning messages that the client is out of sync: journalctl -u dockerd-current | grep 'level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container'
  • See also: How to automatically gather a coredump right after an OpenShift PLEG issue
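Because an affected daemon can make docker ps block indefinitely, the first diagnostic step is easier to run under a time limit. The sketch below assumes GNU coreutils timeout, which exits with status 124 when the limit is reached; the function name probe_hang and the 10 second limit in the example are illustrative choices, not values from this article.

```shell
# Run a command under a time limit and classify the outcome.
probe_hang() {
    limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "hung" ;;    # GNU timeout reports 124 when the limit was hit
        *)   echo "failed" ;;  # the command ended on its own with an error
    esac
}
```

For example, probe_hang 10 docker ps prints hung if the command has not returned after 10 seconds, which matches the symptom described under Issue.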
