OCP 4 Node not ready after cluster upgrade or node restart
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- During an upgrade when the nodes are restarting
Issue
- Node is in the NotReady state after the cluster was upgraded
- Node is not becoming ready after a node reboot
- Container runtime (crio) on the node is not working properly
- Unable to get a debug shell on the node using oc debug node/<node-name> because the container runtime (crio) is not working
- Cannot generate a sosreport from the node because the container runtime (crio) is not working
Resolution
- The container runtime needs to be manually cleaned up and restarted.
  (Note: The following steps will delete all ephemeral storage that holds container images and container runtime storage.)
- Cordon the node (to avoid any workload getting scheduled once the node becomes ready), then drain it:
  # oc adm cordon node1.example.com
  # oc adm drain node1.example.com \
      --force=true --ignore-daemonsets --delete-emptydir-data --timeout=60s
  *Note*: older versions use `--delete-local-data` instead of `--delete-emptydir-data`.
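Since the drain flag differs between client versions, a small hypothetical helper can pick the right one. The 4.7 cutover used here is an assumption based on when the flag was renamed upstream; verify against `oc adm drain --help` for your client.

```shell
# Hypothetical helper: choose the drain flag for the client version.
# The >= 4.7 cutover is an assumption; confirm with `oc adm drain --help`.
drain_data_flag() {
  minor="$1"                        # OCP 4.x minor version, e.g. 7
  if [ "$minor" -ge 7 ]; then
    echo "--delete-emptydir-data"   # newer clients
  else
    echo "--delete-local-data"      # older clients
  fi
}

# On a live cluster (not run here) this would be used as:
#   oc adm drain node1.example.com --force=true --ignore-daemonsets \
#       "$(drain_data_flag 7)" --timeout=60s
drain_data_flag 7    # prints --delete-emptydir-data
```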
- Reboot the node, wait for it to come back, and observe the node status again:
  # systemctl reboot
- SSH into the node and become the root user:
  - SSH with the public key provided during the install, or with the password if one was set for the core user:
    # ssh core@node1.example.com
    # sudo -i
  - Alternatively, log in from the console on bare metal, or from the vSphere admin console, if the core password was set.
- Stop the kubelet service:
  # systemctl stop kubelet
- Try manually stopping and removing any running containers/pods using the following commands:
  # crictl stopp `crictl pods -q`   ## "stopp" with two "p"s for stopping pods
  # crictl stop `crictl ps -aq`
  # crictl rmp `crictl pods -q`
  # crictl rmp --force `crictl pods -q`
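The backtick substitutions above produce bare `crictl stopp` / `crictl stop` invocations when the pod or container list is already empty. A guarded sketch of the same sequence, with a dry-run switch so it can be inspected before running on the node (the `RUN=echo` convention and the made-up IDs are assumptions for illustration):

```shell
# Sketch of the cleanup above with two safeguards: empty ID lists are
# skipped, and RUN=echo makes it a dry run that only prints commands.
# On the node, set RUN= (empty) to execute for real.
RUN="${RUN:-echo}"

stop_and_remove() {
  pods="$1"                         # output of: crictl pods -q
  ctrs="$2"                         # output of: crictl ps -aq
  if [ -n "$pods" ]; then
    $RUN crictl stopp $pods         # "stopp" with two "p"s stops pods
  fi
  if [ -n "$ctrs" ]; then
    $RUN crictl stop $ctrs
  fi
  if [ -n "$pods" ]; then
    $RUN crictl rmp --force $pods
  fi
}

# Dry run with made-up IDs; prints the commands instead of running them.
stop_and_remove "pod1 pod2" "ctr1"
```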
- Stop the crio service:
  # systemctl stop crio
- Clear the container runtime ephemeral storage:
  # rm -rf /var/lib/containers/*
  # crio wipe -f
- Start the crio and kubelet services:
  # systemctl start crio
  # systemctl start kubelet
- If the cleanup worked as expected and the crio/kubelet services started, the node should become ready.
- Before marking the node schedulable, collect a sosreport from the node to investigate the root cause.
- Mark the node schedulable:
  # oc adm uncordon node1.example.com
Root Cause
The exact root cause is unknown and can vary, but essentially the container runtime (crio) became unstable and entered an inconsistent state.
Diagnostic Steps
- Node in NotReady state:
  # oc get nodes
  NAME                STATUS     ROLES    AGE   VERSION
  <...snip...>
  node1.example.com   NotReady   worker   25d   v1.19.0+e49167a
  <...snip...>
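To pull just the affected node names out of that listing, a small awk filter works. On a live cluster you would pipe in `oc get nodes --no-headers`; here a captured sample stands in for it:

```shell
# Sketch: print only the names of nodes whose STATUS contains NotReady.
# Live usage (not run here): oc get nodes --no-headers | not_ready_nodes
not_ready_nodes() { awk '$2 ~ /NotReady/ {print $1}'; }

# Captured sample standing in for `oc get nodes --no-headers` output.
sample='node1.example.com   NotReady   worker   25d   v1.19.0+e49167a
node2.example.com   Ready      worker   25d   v1.19.0+e49167a'

printf '%s\n' "$sample" | not_ready_nodes    # prints node1.example.com
```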
- machine-config, network and monitoring operators degraded:
  # oc get co
  NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
  <...snip...>
  machine-config   4.6.16    False       False         True       8h19m
  monitoring       4.6.16    False       False         True       5m58s
  network          4.6.16    True        True          True       8h20m
  <...snip...>
- Node's Ready condition shows runtime network not ready:
  # oc get node node1.example.com -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.message}{"\n"}{end}'
  'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?'
- All pods on the node are in Pending state:
  # oc get pods -A -o wide | grep 'node1.example.com'
  NAMESPACE                           NAME                          READY   STATUS    RESTARTS   AGE    IP          NODE
  <...snip...>
  openshift-sdn                       ovs-l88hd                     0/1     Pending   0          4h2m   10.8.23.3   node1.example.com
  openshift-machine-config-operator   machine-config-daemon-v45wq   0/2     Pending   0          3d     10.8.23.3   node1.example.com
  openshift-dns                       dns-default-jwwk2             0/3     Pending   0          3d                 node1.example.com
  <...snip...>
- crio service on the node complains about being unable to delete old pods and/or unknown CNI cache files:
  Feb 13 18:57:57 node1.example.com crio[1732]: time="2021-02-13 18:57:57.132606810Z" level=warning msg="Stopping container cee05940dbc79968cbca346801f287e99994095000af13aa188b3347bee8a15c with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"
  Feb 14 04:05:47 node1.example.com crio[1732]: time="2021-02-14 04:05:47.134324953Z" level=warning msg="Stopping container 045670ee4c8e3eed85d2993f914fabcb9d6c43b409505e4006fc9798078583d4 with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"
  <...snip...>
  Feb 14 04:08:15 node1.example.com crio[1732]: time="2021-02-14 04:08:15.032567695Z" level=warning msg="Unknown CNI cache file /var/lib/cni/results/openshift-sdn-c1c88dcb926a41151e8815cc03dc33ad64bdec53218868e35453be38adc77e77-eth0 kind \"\""
  Feb 14 04:08:15 node1.example.com crio[1732]: time="2021-02-14 04:08:15.231010108Z" level=warning msg="Unknown CNI cache file /var/lib/cni/results/openshift-sdn-4ee4131d45452d65baf8d91ad600ca4f30a3614f87ad0b645032409b836be43e-eth0 kind \"\""
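A quick way to gauge how widespread the stale CNI cache problem is on a node is to count those warnings in the crio journal. On the node you would feed this from `journalctl -u crio --no-pager`; here a captured sample (with shortened paths) stands in for the journal:

```shell
# Sketch: count "Unknown CNI cache file" warnings in crio journal output.
# Live usage (not run here): journalctl -u crio --no-pager | count_cni_cache_warnings
count_cni_cache_warnings() { grep -c 'Unknown CNI cache file'; }

# Captured sample standing in for the journal; paths shortened with "...".
journal_sample='Feb 14 04:08:15 node1.example.com crio[1732]: level=warning msg="Unknown CNI cache file /var/lib/cni/results/openshift-sdn-...-eth0 kind \"\""
Feb 14 04:08:15 node1.example.com crio[1732]: level=warning msg="Unknown CNI cache file /var/lib/cni/results/openshift-sdn-...-eth0 kind \"\""
Feb 14 04:05:47 node1.example.com crio[1732]: level=warning msg="Stopping container 045670... timed out"'

printf '%s\n' "$journal_sample" | count_cni_cache_warnings    # prints 2
```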
- Pods getting scheduled on the node are failing to start with reason FailedCreatePodSandBox and errors like "stat /usr/bin/pod: no such file or directory" and "cgroup: subsystem does not exist":
  # oc get events -n openshift-sdn
  LAST SEEN   TYPE      REASON                   OBJECT          MESSAGE
  2m16s       Warning   FailedCreatePodSandBox   pod/ovs-l88hd   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-02-14T12:43:29Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"/usr/bin/pod\": stat /usr/bin/pod: no such file or directory"
  2m22s       Warning   FailedCreatePodSandBox   pod/sdn-b84l4   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-02-14T12:43:24Z" level=warning msg="cgroup: subsystem does not exist" time="2021-02-14T12:43:24Z" level=warning msg="cgroup: subsystem does not exist" time="2021-02-14T12:43:24Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"/usr/bin/pod\": stat /usr/bin/pod: no such file or directory"
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.