"PLEG is not healthy" errors on OpenShift nodes.
Environment
- OpenShift Container Platform 3
Issue
- OpenShift node becomes NotReady and events show that PLEG has become unresponsive:
  Ready False Mon, 13 Nov 2017 13:41:40 +0530 Sun, 05 Nov 2017 18:22:28 +0530 KubeletNotReady PLEG is not healthy: pleg was last seen active 3m20s ago; threshold is 3m0s
- PLEG is not healthy: pleg was last seen active 3m9.71452035s ago; threshold is 3m0s
- GenericPLEG: Unable to retrieve pods: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
- Node appears to be hung: stuck in NotReady status with PLEG errors and pods in Unknown status
Resolution
- For OCP 4.x, a separate solution addresses the PLEG issue: see Openshift nodes NotReady due to "PLEG is not healthy" issue
When a node enters a NotReady state due to an unhealthy PLEG, confirm that a container list command returns on the node.
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
If this request times out, gather a docker core dump and the host's logs. Open a support case if a root cause analysis is required.
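If the list call hangs, one way to capture daemon state before restarting anything is to send SIGUSR1 to dockerd, which makes it log goroutine stack traces. This is a sketch only; `dump_docker_stacks` is a hypothetical helper name, and the exact dump location varies by docker build:

```shell
# Hypothetical helper: ask dockerd to dump goroutine stack traces
# (written to the daemon log, or under /var/run/docker on newer builds).
dump_docker_stacks() {
    pid=$(pidof dockerd 2>/dev/null)
    if [ -z "$pid" ]; then
        echo "dockerd is not running"
        return 1
    fi
    kill -USR1 "$pid"
    # The traces land in the daemon log:
    journalctl -u docker --no-pager --since "-5 min" | tail -n 50
}
dump_docker_stacks || true
```

Attach the resulting traces to the support case along with the core and host logs.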
To get the node back to a working state while the issue is being investigated, a restart of docker is usually all that is needed.
# systemctl restart docker atomic-openshift-node
If the version of docker running is below docker-1.13.1-94, update docker to the latest 1.13 release, as multiple bug fixes in it have addressed docker hanging issues.
The PLEG issue is also fixed in docker-1.13.1-209, released on Feb 14, 2022 (RHBA-2022:0526).
# atomic-openshift-docker-excluder unexclude
# yum upgrade docker -y
# atomic-openshift-docker-excluder exclude
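After the upgrade, a quick way to confirm the node is on a fixed build (assuming the RPM-based install this article targets):

```shell
# Verify the installed docker build is at or above the fixed releases
# (docker-1.13.1-94, or docker-1.13.1-209 for the Feb 2022 errata).
rpm -q docker 2>/dev/null || echo "docker rpm not found"
# The running daemon's version (requires the daemon to be up):
docker version --format '{{.Server.Version}}' 2>/dev/null \
    || echo "docker daemon not responding"
```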
If this does not resolve the issue, the docker hang could be caused by problems with the docker storage. The fix is to upgrade docker to the latest 1.13 release and recreate the docker storage. Note that docker 1.13 is only supported with OpenShift 3.9+.
Drain Node
# oc adm drain <node>
Stop Services
# systemctl stop docker atomic-openshift-node
Upgrade docker
# atomic-openshift-docker-excluder unexclude
# yum upgrade docker -y
# atomic-openshift-docker-excluder exclude
Reset storage
# docker-storage-setup --reset
# rm -rf /var/lib/docker/*
# rm -rf /var/lib/dockershim/*
# rm -rf /var/lib/cni/networks/openshift-sdn/*
Set up storage again
# docker-storage-setup
# systemctl start docker atomic-openshift-node
Test running a container
# docker run --rm -it --entrypoint=echo registry.access.redhat.com/rhel7 hello
Mark Node as schedulable
# oc adm uncordon <node>
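After uncordoning, confirm the node actually reports Ready again. A small helper sketch (`node_ready` is a hypothetical name; assumes `oc` is logged in with access to the cluster):

```shell
# Print the Ready condition ("True"/"False") for a node.
node_ready() {
    oc get node "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
# Usage: node_ready <node>
# Or watch the node continuously: oc get node <node> -w
```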
Further PLEG issues can result from the node being overloaded or over-scheduled, running more than the maximum supported 250 pods per node.
- Max number of pods can be enforced by setting max-pods under kubeletArguments in node-config.yaml.
- Max pods per core should also be set, using pods-per-core under kubeletArguments in node-config.yaml.
- If the node is overloaded and high CPU load is seen on the node, CPU limits should be set on all pods.
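As a sketch, the two kubeletArguments look like this in node-config.yaml (values are illustrative; 250 is the documented per-node maximum, and 10 is the default pods-per-core):

```yaml
# node-config.yaml (fragment): enforce pod-density limits
kubeletArguments:
  max-pods:
    - "250"          # hard cap on pods per node
  pods-per-core:
    - "10"           # cap scaled by the node's core count
```

The effective limit is the lower of the two values for the node's core count. Restart the node service for the change to take effect.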
Root Cause
See Understanding: PLEG is not healthy for more details around this function.
PLEG is a simple loop that does the following:
- list pod sandbox containers
- list pod containers
- wait 1s
- return to step 1.
The first two steps result in a list call from OpenShift to docker and can be replicated with curl making a request to the docker socket:
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
PLEG is marked unhealthy if the interval between any two of these loop iterations exceeds 3 minutes. An unhealthy PLEG marks the node NotReady so that the scheduler stops placing additional pods on the node until the machine stabilizes.
Pods on a NotReady node are still given the amount of time specified by the pod-eviction-timeout parameter (default 5m) set in master-controllers. After that time has passed (and only if the issue persists), the node is evacuated to prevent a major outage.
On the other hand, if the node recovers from the condition that marked it NotReady (such as the docker daemon becoming stable again), the controller sets it back to Ready status.
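The relist loop described above can be approximated from the shell to watch for the latency spikes that trip the 3-minute threshold. This is a sketch only; the real PLEG calls the runtime API directly rather than through curl:

```shell
# Approximate the PLEG relist loop and report how long each pass takes.
# Assumption: docker is listening on the default unix socket.
SOCK=/var/run/docker.sock
for i in 1 2 3; do
    start=$(date +%s)
    curl -s --max-time 180 --unix-socket "$SOCK" \
        'http://v1.26/containers/json?all=1' >/dev/null 2>&1 \
        || echo "relist $i: no answer from the runtime"
    echo "relist $i took $(( $(date +%s) - start ))s"
    sleep 1
done
```

On a healthy node each pass should complete in well under a second; multi-second passes under load are an early warning sign.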
Diagnostic Steps
- Confirm if the following commands time out:
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
# docker ps
# docker info
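When the daemon is hung, these commands can block indefinitely. Wrapping each one in `timeout` (a sketch; 30s is an arbitrary deadline) confirms the hang without also hanging your session:

```shell
# Bound each check so a hung daemon does not hang the terminal;
# `timeout` exits 124 when the command is killed at the deadline.
timeout 30 docker ps      || echo "docker ps failed or timed out"
timeout 30 docker info    || echo "docker info failed or timed out"
timeout 30 curl -s --unix-socket /var/run/docker.sock \
    'http://v1.26/containers/json?all=1' >/dev/null \
    || echo "API list failed or timed out"
```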
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.