OpenShift nodes NotReady due to "PLEG is not healthy" issue
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 3
- 4
Issue
- Nodes are going to NotReady state due to the error "container runtime is down, PLEG is not healthy"
Resolution
- For OpenShift 3.x, a separate solution addresses the PLEG issue. Refer to "PLEG is not healthy" errors on OpenShift 3.x nodes.
- For OpenShift 4.x, the steps below help to fix the PLEG errors:
- Cordon and drain the node. Refer to "Understanding how to evacuate pods on nodes"
$ oc adm cordon <node>
$ oc adm drain <node>
- Restart the kubelet and crio services
$ systemctl restart kubelet
$ systemctl restart crio
- Delete dead containers with the commands below:
$ sudo crictl ps -a | grep -i exited
$ sudo crictl rm $(sudo crictl ps -a | grep -i exited | awk '{print $1}')
- Delete the untagged images
$ sudo crictl images | grep "<none>" | awk '{print $3}' | xargs sudo crictl rmi
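The two cleanup commands above pass whatever the pipeline produces straight to crictl, so on a node with no exited containers or no untagged images they may run crictl rm or crictl rmi with no arguments. The following is a minimal sketch of the same cleanup with that case guarded. The function names (exited_container_ids, untagged_image_ids, cleanup_node) are hypothetical helpers introduced for illustration, and the parsing assumes the default table output of crictl (container ID in the first column of crictl ps -a; TAG and IMAGE ID in the second and third columns of crictl images).

```shell
# Print the IDs of exited containers from `crictl ps -a` style output on stdin.
# Like the original grep, this matches "exited" anywhere on the line, because
# the CREATED column contains spaces and shifts the position of STATE.
exited_container_ids() {
  awk 'NR > 1 && tolower($0) ~ /exited/ { print $1 }'
}

# Print the IDs of images whose TAG column is "<none>" from
# `crictl images` style output on stdin (columns: IMAGE, TAG, IMAGE ID, SIZE).
untagged_image_ids() {
  awk 'NR > 1 && $2 == "<none>" { print $3 }'
}

# Hypothetical wrapper: run as root on the affected node after it is drained.
cleanup_node() {
  local ids

  ids=$(crictl ps -a | exited_container_ids)
  # Only call crictl rm when at least one ID was found.
  if [ -n "$ids" ]; then
    # Intentionally unquoted so each ID becomes a separate argument.
    crictl rm $ids
  fi

  ids=$(crictl images | untagged_image_ids)
  # Only call crictl rmi when at least one ID was found.
  if [ -n "$ids" ]; then
    crictl rmi $ids
  fi
}
```

To use it, paste the functions into a root shell on the node and run cleanup_node. The two parsing helpers only read text, so they can be checked first against saved output, for example with crictl ps -a | exited_container_ids, before anything is removed.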
Root Cause
- A PLEG error might occur for the following reasons:
- Container runtime latency or timeout (performance degradation, deadlock, bugs…) during remote requests.
- Too many running pods for host resources, or too many running pods on high-spec hosts, to complete the [relist](https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes) within 3 minutes. Events and latency are proportional to the number of pods regardless of host resources.
- CNI bugs when getting a pod network status.
- To improve kubelet scalability and performance, lower the pod management overhead:
- Reduce unnecessary work during inactivity (no spec/state changes).
- Lower the concurrent requests to the container runtime.
- Make sure that the containers in the cluster have requests and limits set. Without them, nodes can run into out-of-memory situations and high CPU usage, which could impact node processes.
Diagnostic Steps
- Check if the kubelet is running
$ systemctl status kubelet
- Check if the unused containers and untagged images are present.
- Check the CPU and memory resources of the cluster nodes.
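As a rough aid for the checks above, the sketch below counts how often the kubelet has logged the PLEG error. The helper name count_pleg_errors is hypothetical, and the one-hour window in the usage comment is an arbitrary choice for illustration, not a recommended value.

```shell
# Count log lines on stdin that report the PLEG error (case-insensitive).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable in scripts that stop on errors.
count_pleg_errors() {
  grep -c -i 'PLEG is not healthy' || true
}

# Example usage on the affected node (as root):
#   journalctl -u kubelet --since "1 hour ago" | count_pleg_errors
#
# Related checks from the steps above:
#   crictl ps -a      # look for containers in the Exited state
#   crictl images     # look for images tagged <none>
#   oc adm top nodes  # node CPU and memory usage, if cluster metrics are available
```

A count that keeps growing between runs suggests the relist is still timing out on that node.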