OpenShift nodes NotReady due to "PLEG is not healthy" issue

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3
    • 4

Issue

  • Nodes go into the NotReady state with the error "container runtime is down, PLEG is not healthy"

Resolution

  1. For OpenShift 3.x, follow the dedicated solution that addresses this issue: "PLEG is not healthy" errors on OpenShift 3.x nodes

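  2. For OpenShift 4.x, the steps below help to fix the PLEG errors:

  • Cordon and drain the affected node (nodes running DaemonSet pods need --ignore-daemonsets for the drain to proceed)
     $ oc adm cordon <node>
     $ oc adm drain <node> --ignore-daemonsets
  • Restart the kubelet and crio services on the node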
     $ sudo systemctl restart kubelet
     $ sudo systemctl restart crio
  • Delete dead (exited) containers with the commands below
     $ sudo crictl ps -a | grep -i exited
     $ sudo crictl rm $(sudo crictl ps -a | grep -i exited | awk '{print $1}')
  • Delete the untagged images
     $ sudo crictl images | grep "<none>" | awk '{print $3}' | xargs sudo crictl rmi
  • Once the node reports Ready again, make it schedulable
     $ oc adm uncordon <node>
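
The cleanup pipelines above call crictl rm and crictl rmi with no arguments when nothing matches, which produces an error. The following is a minimal sketch of a guarded variant, assuming bash on the node and the default crictl table output (container ID in column 1 of crictl ps -a; tag in column 2 and image ID in column 3 of crictl images). The function names are hypothetical and not part of any Red Hat tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. The two filters only read text from stdin,
# so they can be tried on saved command output before touching a node.

# Print the IDs of exited containers from `crictl ps -a` output on stdin.
# Like the grep in the article, this matches "exited" anywhere in the row.
exited_container_ids() {
  awk 'NR > 1 && tolower($0) ~ /exited/ { print $1 }'
}

# Print the IDs of untagged images from `crictl images` output on stdin.
untagged_image_ids() {
  awk 'NR > 1 && $2 == "<none>" { print $3 }'
}

# Remove exited containers and untagged images, skipping empty lists.
cleanup_node() {
  local ids
  ids=$(sudo crictl ps -a | exited_container_ids)
  # $ids is intentionally unquoted so that each ID becomes one argument.
  [ -n "$ids" ] && sudo crictl rm $ids
  ids=$(sudo crictl images | untagged_image_ids)
  [ -n "$ids" ] && sudo crictl rmi $ids
  return 0
}
```

To preview what would be removed without deleting anything, run only the filters, for example: sudo crictl ps -a | exited_container_ids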

Root Cause

  • PLEG errors can occur for the following reasons:

    - Container runtime latency or timeouts (performance degradation, deadlock, bugs…) during remote requests.
    - Too many running pods for the host's resources, or, on high-spec hosts, simply too many running pods for the [relist](https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes) to complete within 3 minutes. The number of events and the relist latency grow in proportion to the number of pods, regardless of host resources.
    - CNI bugs when retrieving the pod network status.
    
  • The PLEG itself exists to improve kubelet scalability and performance by lowering the pod management overhead. Its goals are to:

    • Reduce unnecessary work during inactivity (no spec/state changes).
    • Lower the number of concurrent requests to the container runtime.
  • Make sure that the containers in the cluster have resource requests and limits set. Without them, workloads can cause out-of-memory situations and high CPU usage on the nodes, which can impact node processes.
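
Two of the causes above, pod density and missing requests or limits, can be checked from a workstation with oc. The following is a rough sketch, assuming bash and a logged-in oc session with permission to list pods in all namespaces. The function names are hypothetical. The custom-columns query prints <none> only when no container in a pod sets the value, so a pod with a mix of configured and unconfigured containers is not reported.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of oc.

# Count the pods scheduled on one node; more pods means a longer relist.
pods_on_node() {
  oc get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" | wc -l
}

# From a custom-columns table on stdin, print the rows with a missing value.
rows_with_missing_values() {
  awk 'NR > 1 && /<none>/'
}

# List pods in which no container sets a CPU request or a memory limit.
pods_without_resources() {
  oc get pods --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory' \
    | rows_with_missing_values
}
```

For example, pods_on_node <node> prints the pod count for a single node, and pods_without_resources prints one row per pod that needs attention.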

Diagnostic Steps

  • Check if the kubelet is running
     $ systemctl status kubelet
  • Check whether exited (unused) containers and untagged images are present on the node.
  • Check the CPU and memory resources of the cluster nodes.
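
The checks above can be gathered in one pass on the affected node. The following is a minimal sketch, assuming bash, sudo access on the node, and that the kubelet logs to the systemd journal under the kubelet unit. The function names and the one-hour window are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick look at one node.

# Count "PLEG is not healthy" lines in kubelet log text on stdin.
# grep -c exits non-zero when the count is 0, hence the `|| true`.
count_pleg_errors() {
  grep -c -i 'PLEG is not healthy' || true
}

# Print a short summary: kubelet state, PLEG errors, leftovers, load, memory.
node_diagnostics() {
  echo "kubelet state: $(systemctl is-active kubelet)"
  echo "PLEG errors in the last hour: $(sudo journalctl -u kubelet --since '1 hour ago' --no-pager | count_pleg_errors)"
  echo "Exited containers: $(sudo crictl ps -a | grep -c -i 'exited' || true)"
  echo "Untagged images: $(sudo crictl images | grep -c '<none>' || true)"
  uptime    # load averages
  free -m   # memory usage in MiB
}
```

Running node_diagnostics on the node prints the summary. From a workstation, oc adm top nodes gives a comparable view of CPU and memory usage across the cluster when metrics are available.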

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.