"PLEG is not healthy" errors on OpenShift nodes.
Environment
- OpenShift Container Platform 3
Issue
- OpenShift node becomes NotReady and events show that PLEG has become unresponsive:
  Ready False Mon, 13 Nov 2017 13:41:40 +0530 Sun, 05 Nov 2017 18:22:28 +0530 KubeletNotReady PLEG is not healthy: pleg was last seen active 3m20s ago; threshold is 3m0s
- PLEG is not healthy: pleg was last seen active 3m9.71452035s ago; threshold is 3m0s
- GenericPLEG: Unable to retrieve pods: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
- Node appears to be hung: stuck in NotReady status with PLEG errors and pods in Unknown status
Resolution
- For OCP 4.x, a separate solution addresses the PLEG issue: see Openshift nodes NotReady due to "PLEG is not healthy" issue
When a node enters a NotReady state due to an unhealthy PLEG, confirm that a container list command returns on the node.
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
If this request times out, gather a docker core dump and the host's logs. Open a support case if a root cause analysis is required.
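If the list call hangs, one way to capture daemon state before restarting anything is to send SIGUSR1 to dockerd, which makes it log goroutine stack traces. This is a sketch only; `dump_docker_stacks` is a hypothetical helper name, and the exact dump location varies by docker build:

```shell
# Hypothetical helper: ask dockerd to dump goroutine stack traces
# (written to the daemon log, or under /var/run/docker on newer builds).
dump_docker_stacks() {
    pid=$(pidof dockerd 2>/dev/null)
    if [ -z "$pid" ]; then
        echo "dockerd is not running"
        return 1
    fi
    kill -USR1 "$pid"
    # The traces land in the daemon log:
    journalctl -u docker --no-pager --since "-5 min" | tail -n 50
}
dump_docker_stacks || true
```

Attach the resulting traces to the support case along with the core and host logs.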
To get the node back to a working state while the issue is being investigated, a restart of docker is usually all that is needed.
# systemctl restart docker atomic-openshift-node
If the version of docker running is below docker-1.13.1-94, update docker to the latest 1.13 release, as multiple bug fixes in it have addressed docker hanging issues.
The PLEG issue is also fixed in docker-1.13.1-209, released on Feb 14, 2022 (RHBA-2022:0526).
# atomic-openshift-docker-excluder unexclude
# yum upgrade docker -y
# atomic-openshift-docker-excluder exclude
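After the upgrade, a quick way to confirm the node is on a fixed build (assuming the RPM-based install this article targets):

```shell
# Verify the installed docker build is at or above the fixed releases
# (docker-1.13.1-94, or docker-1.13.1-209 for the Feb 2022 errata).
rpm -q docker 2>/dev/null || echo "docker rpm not found"
# The running daemon's version (requires the daemon to be up):
docker version --format '{{.Server.Version}}' 2>/dev/null \
    || echo "docker daemon not responding"
```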
If this does not resolve the issue, the docker hang could be caused by problems with the docker storage. The fix is to upgrade docker to the latest 1.13 release and recreate the docker storage. Note that docker 1.13 is only supported with OpenShift 3.9+.
Drain Node
# oc adm drain <node>
Stop Services
# systemctl stop docker atomic-openshift-node
Upgrade docker
# atomic-openshift-docker-excluder unexclude
# yum upgrade docker -y
# atomic-openshift-docker-excluder exclude
Reset storage
# docker-storage-setup --reset
# rm -rf /var/lib/docker/*
# rm -rf /var/lib/dockershim/*
# rm -rf /var/lib/cni/networks/openshift-sdn/*
Set up storage again
# docker-storage-setup
# systemctl start docker atomic-openshift-node
Test running a container
# docker run --rm -it --entrypoint=echo registry.access.redhat.com/rhel7 hello
Mark Node as schedulable
# oc adm uncordon <node>
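After uncordoning, confirm the node actually reports Ready again. A small helper sketch (`node_ready` is a hypothetical name; assumes `oc` is logged in with access to the cluster):

```shell
# Print the Ready condition ("True"/"False") for a node.
node_ready() {
    oc get node "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
# Usage: node_ready <node>
# Or watch the node continuously: oc get node <node> -w
```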
Further PLEG issues can result from the node being overloaded or over-scheduled, running more than the maximum supported 250 pods per node.
- Max number of pods can be enforced by setting max-pods under kubeletArguments in node-config.yaml.
- Max pods per core should also be set, using pods-per-core under kubeletArguments in node-config.yaml.
- If the node is overloaded and high CPU load is seen on the node, CPU limits should be set on all pods.
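As a sketch, the two kubeletArguments look like this in node-config.yaml (values are illustrative; 250 is the documented per-node maximum, and 10 is the default pods-per-core):

```yaml
# node-config.yaml (fragment): enforce pod-density limits
kubeletArguments:
  max-pods:
    - "250"          # hard cap on pods per node
  pods-per-core:
    - "10"           # cap scaled by the node's core count
```

The effective limit is the lower of the two values for the node's core count. Restart the node service for the change to take effect.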
Root Cause
See Understanding: PLEG is not healthy for more details around this function.
PLEG is a simple loop that does the following:
- list pod sandbox containers
- list pod containers
- wait 1s
- return to step 1.
The first two steps result in a list call from OpenShift to docker and can be replicated with curl making a request to the docker socket:
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
PLEG is marked unhealthy if the interval between any two of these loop iterations exceeds 3 minutes. An unhealthy PLEG marks the node NotReady so that the scheduler stops placing additional pods on the node until the machine stabilizes.
Pods on a NotReady node are still given the amount of time specified by the pod-eviction-timeout parameter (default 5m) set in master-controllers. After that time has passed (and only if the issue persists), the node is evacuated to prevent a major outage.
On the other hand, if the node recovers from the condition that marked it NotReady (such as the docker daemon becoming stable again), the controller sets it back to Ready status.
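The relist loop described above can be approximated from the shell to watch for the latency spikes that trip the 3-minute threshold. This is a sketch only; the real PLEG calls the runtime API directly rather than through curl:

```shell
# Approximate the PLEG relist loop and report how long each pass takes.
# Assumption: docker is listening on the default unix socket.
SOCK=/var/run/docker.sock
for i in 1 2 3; do
    start=$(date +%s)
    curl -s --max-time 180 --unix-socket "$SOCK" \
        'http://v1.26/containers/json?all=1' >/dev/null 2>&1 \
        || echo "relist $i: no answer from the runtime"
    echo "relist $i took $(( $(date +%s) - start ))s"
    sleep 1
done
```

On a healthy node each pass should complete in well under a second; multi-second passes under load are an early warning sign.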
Diagnostic Steps
- Confirm if the following commands time out:
# curl --unix-socket /var/run/docker.sock http://v1.26/containers/json?all=1
# docker ps
# docker info
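When the daemon is hung, these commands can block indefinitely. Wrapping each one in `timeout` (a sketch; 30s is an arbitrary deadline) confirms the hang without also hanging your session:

```shell
# Bound each check so a hung daemon does not hang the terminal;
# `timeout` exits 124 when the command is killed at the deadline.
timeout 30 docker ps      || echo "docker ps failed or timed out"
timeout 30 docker info    || echo "docker info failed or timed out"
timeout 30 curl -s --unix-socket /var/run/docker.sock \
    'http://v1.26/containers/json?all=1' >/dev/null \
    || echo "API list failed or timed out"
```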
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.