Troubleshooting Red Hat OpenShift Container Platform 4: Node NotReady


The NotReady status in a node can be caused by different issues, but the most common cause is that the kubelet.service on the node is not running or is unable to connect to the API server.
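Before connecting to individual nodes, a quick way to see which nodes are affected (a minimal sketch, assuming cluster-admin access with oc) is to filter the node list on its STATUS column:

```shell
# Print only nodes whose STATUS column is not exactly "Ready"
$ oc get nodes --no-headers | awk '$2 != "Ready"'
```

The awk filter prints any node whose status is not exactly Ready, so it also surfaces variants such as NotReady or Ready,SchedulingDisabled.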


Check for CPU soft lockups

Connect to the node(s) and check for soft lockup messages in the journal or dmesg, as explained in the solution virtual machine reports a "BUG: soft lockup" (or multiple at the same time):

$ ssh -i <path to sshkey> core@<node IP>
[...]
$ sudo journalctl --no-pager | grep -c "soft lockup"
[...]
$ sudo journalctl --no-pager | grep "soft lockup"

Check Kubelet status

Connect to the node over SSH as the core user, using the SSH key configured at installation time:

$ ssh -i <path to sshkey> core@<node IP>

Verify that kubelet.service is active and running:

$ sudo systemctl is-active kubelet.service
active 
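If the service is instead inactive or failed, a minimal recovery sketch is to inspect the most recent kubelet errors and then restart the unit:

```shell
# Review the last kubelet log entries before restarting the service
$ sudo journalctl -u kubelet.service --no-pager | tail -n 50
$ sudo systemctl restart kubelet.service
$ sudo systemctl is-active kubelet.service
```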

If an OCP node is NotReady after a cluster upgrade or node restart, follow the dedicated procedure for that scenario.

Node under Load

Check whether the node has enough resources with the following command, and look for a probable cause in the events:

$ oc describe node <node-name>
---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource                    Requests     Limits
--------                    --------     ------
cpu                         380m (25%)   270m (18%)
memory                      880Mi (11%)  250Mi (3%)
attachable-volumes-aws-ebs  0            0
Events:     
Type     Reason                   Age                From                      Message
----     ------                   ----               ----                      -------
Normal   NodeHasSufficientPID     6d (x5 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
Normal   NodeAllocatableEnforced  6d                 kubelet, m01.example.com  Updated Node Allocatable limit across pods
Normal   NodeHasSufficientMemory  6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientMemory
Normal   NodeHasNoDiskPressure    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasNoDiskPressure
Normal   NodeHasSufficientDisk    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientDisk
Normal   NodeHasSufficientPID     6d                 kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
Normal   Starting                 6d                 kubelet, m01.example.com  Starting kubelet.

Verify CSRs in Pending status

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-244x8   13h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-24kfd   20h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-256z7   18h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

If CSRs in Pending status are present, approve them with the following command:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve

Check certificate rotation process

Search for certificate rotation errors in Kubelet log:

$ journalctl -u kubelet.service --no-pager | grep "csr.go"
$ journalctl -u kubelet.service --no-pager | grep "certificate_manager.go"
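If rotation appears stuck, it can also help to check the validity dates of the current kubelet client certificate on the node (a sketch; the path below assumes the default kubelet PKI location):

```shell
# Show the subject and expiry date of the kubelet client certificate
$ sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -enddate
```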

Look for NotReady events in the Kubelet log

Look for the NodeNotReady or KubeletNotReady events in the journal logs of the kubelet:

$ journalctl -u kubelet.service --no-pager | grep -i KubeletNotReady
Feb 19 17:13:52.877852 ip-10-0-138-243 hyperkube[1486]: I0219 17:13:52.877752    1486 setters.go:555] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2021-02-19 17:13:52.87773318 +0000 UTC m=+116886.246819463 LastTransitionTime:2021-02-19 17:13:52.87773318 +0000 UTC m=+116886.246819463 Reason:KubeletNotReady Message:container runtime is down}

Check "PLEG is not healthy"

Verify whether "PLEG is not healthy" messages are present in the kubelet log:

$ journalctl -u kubelet.service --no-pager | grep "PLEG is not healthy"
Feb 18 08:45:52.293400 ip-10-0-138-243 hyperkube[1486]: E0218 08:45:52.293288    1486 kubelet.go:1772] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]

If present, the article Understanding: PLEG is not healthy can help with interpreting it.
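Since PLEG health depends on the container runtime answering the kubelet, a quick sketch is to confirm on the node that CRI-O is active and responding to CRI calls:

```shell
# Check that CRI-O is running and that crictl can talk to it
$ sudo systemctl is-active crio.service
$ sudo crictl ps >/dev/null && echo "container runtime responding"
```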

Node unable to contact the API server due to network issue

Check the kubelet journal for errors related to node status updates. In the example below, the kubelet is not able to resolve the IP address of the internal API load balancer via DNS.

$ journalctl --no-pager | grep "Error updating node status"
Jul 18 11:57:31 mynodename kubenswrapper[2782]: E0718 11:57:31.540565    2782 kubelet_node_status.go:487] "Error updating node status, will retry" err="error getting node \"mynodename\": Get \"https://api-int.cluster-name:6443/api/v1/nodes/mynodename.cluster-name?timeout=10s\": dial tcp: lookup api-int.cluster-name on x.x.x.x:53: dial udp x.x.x.x:53: connect: network is unreachable"
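From the affected node, name resolution and API reachability can be tested directly (a sketch; substitute the cluster's real internal API hostname for api-int.cluster-name):

```shell
# Resolve the internal API hostname, then probe the API health endpoint
$ getent hosts api-int.cluster-name
$ curl -k -s -o /dev/null -w '%{http_code}\n' https://api-int.cluster-name:6443/healthz
```

A 200 response from /healthz indicates the node can reach the API server; a resolution failure points back at DNS.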

Check MCO is not in degraded state

Check the machine-config ClusterOperator resource for any errors that would indicate a problem with specific nodes or MachineConfigPools:

$ oc describe clusteroperator machine-config

Check the machineconfiguration state of each node:

$ for node in $(oc get nodes -o name | awk -F'/' '{ print $2 }');do echo "-------------------- $node ------------------"; oc describe node $node | grep machineconfiguration.openshift.io/state; done
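A degraded pool usually points at the failing nodes, so it can also help to review the MachineConfigPool status conditions (a sketch; "worker" is an example pool name):

```shell
# List pool health, then inspect the conditions of a specific pool
$ oc get machineconfigpools
$ oc describe mcp worker | grep -A 5 'Conditions'
```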

See known causes for the Machine Config Operator being in a degraded state.

Node hang/Kernel Crashes/OOM issues

OpenShift Container Platform runs on RHCOS, which is based on RHEL. Refer to OpenShift 4 node with cgroup out of memory and oom-kill errors for troubleshooting OOM issues. For additional troubleshooting and to investigate kernel issues, refer to troubleshooting operating system issues.

CRI-O container runtime issues

Verify whether CRI-O panics; stack traces like the one below may be seen in the logs:

$ journalctl -u kubelet.service --no-pager
Mar 14 15:23:27 worker-1 systemd[1]: crio-conmon-5e4a5efe282d3f77fe472d8810fa9a8a61 df545a6087a7e8ecaa9379b7f1fa5c.scope: Consumed 55ms CPU time
Mar 14 15:23:27 worker-1 crio[1636]: panic: close of closed channel
Mar 14 15:23:27 worker-1 crio[1636]: goroutine 5778599 [running]:
Mar 14 15:23:27 worker-1 crio[1636]: panic(0x55c2b280a280, 0x55c2b2aa4f90)
Mar 14 15:23:27 worker-1 crio[1636]:         /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc001827530 sp=0xc001827468 pc=0x55c2b098e8a5
Mar 14 15:23:27 worker-1 crio[1636]: runtime.closechan(0xc00187e300)
Mar 14 15:23:27 worker-1 crio[1636]:         /usr/lib/golang/src/runtime/chan.go:363 +0x3f5 fp=0xc001827570 sp=0xc001827530 pc=0x55c2b095cbb5

Refer to crio panics with panic: close of closed channel.

See also Troubleshooting CRI-O container runtime issues.
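To gauge how often the runtime has crashed, the crio unit journal on the node can be searched for panic lines (a sketch, assuming the crio systemd unit name):

```shell
# Count Go panic lines in the crio unit journal
$ sudo journalctl -u crio.service --no-pager | grep -c 'panic:'
```

If panics are found, restarting crio.service and kubelet.service on the node (after draining it, if workloads allow) may restore the runtime.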

Check resources usage

If the SystemMemoryExceedsReservation alert is observed, refer to SystemMemoryExceedsReservation alert received in OpenShift 4.

Verify node performance: check whether the node lacks memory, CPU, or I/O bandwidth. Connect to the OpenShift Container Platform web console as cluster-admin, go to Monitoring -> Dashboards, and select the "Kubernetes / Compute Resources / Node (Pods)" dashboard to look for spikes in resource consumption:

"Kubernetes / Compute Resources / Node (Pods)" Dashboard.

From this view it is possible to determine whether any pod is consuming too many resources; a good practice is to configure resource requests and limits. Refer to how to specify limits and requests for cpu and memory in OpenShift for additional information.
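As an illustrative config fragment (all names are hypothetical), requests and limits are set per container in the pod spec:

```yaml
# Hypothetical example: per-container resource requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```

The scheduler places the pod based on requests, while limits cap what the container may consume on the node.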
