OCP 4 Node not ready after cluster upgrade or node restart

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • During an upgrade, when the nodes are restarting

Issue

  • Node is in the NotReady state after the cluster was upgraded
  • Node does not become Ready after a reboot
  • Container runtime (crio) on the node is not working properly
  • Unable to get a debug shell on the node using oc debug node/<node-name> because container runtime (crio) is not working
  • Cannot generate sosreport from the node because container runtime (crio) is not working

Resolution

  • The container runtime needs to be manually cleaned up and restarted.
    (Note: The following steps delete all of the node's ephemeral container storage, including pulled container images and the container runtime's state.)

    1. Cordon the node (so that no workload gets scheduled onto it if it becomes Ready), then drain it.

       # oc adm cordon node1.example.com
      
       # oc adm drain node1.example.com \
           --force=true --ignore-daemonsets --delete-emptydir-data --timeout=60s
      
    *Note*: older versions of oc use `--delete-local-data` instead of `--delete-emptydir-data`
    
    2. Reboot the node and wait for it to come back. Observe the node status again.

       # systemctl reboot
      
    3. SSH into the node and become the root user

      • SSH with the SSH key provided during the installation:

          # ssh core@node1.example.com
          # sudo -i
        
      • Alternatively, log in from the console on bare metal, or from the vSphere admin console, if a password for the core user was set

    4. Stop the kubelet service:

       # systemctl stop kubelet
      
    5. Manually stop and remove any running pods and containers:

       # crictl stopp `crictl pods -q`        ##  "stopp" with two "p" for stopping pods
       # crictl stop `crictl ps -aq`
       # crictl rmp `crictl pods -q`
       # crictl rmp --force `crictl pods -q`
      
    6. Stop the crio service:

       # systemctl stop crio
      
    7. Clear the container runtime's ephemeral storage:

        # rm -rf /var/lib/containers/*
        # crio wipe -f
      
    8. Start the crio and kubelet services:

        # systemctl start crio
        # systemctl start kubelet
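
    The node-local steps 4 through 8 can be sketched as one function, run as root on the affected node. This is an illustrative sketch only, not a supported Red Hat tool; the `cleanup_node` name is hypothetical, and the commands are exactly those from the steps above.

```shell
#!/bin/bash
# Hypothetical sketch of steps 4-8, to be run as root on the affected node.
# Destructive: it wipes all container images and container runtime state.
cleanup_node() {
  systemctl stop kubelet

  # Stopping/removing pods may fail if crio is already wedged,
  # so each command is attempted independently (no early exit).
  crictl stopp $(crictl pods -q)     # "stopp" (two p's) stops pods
  crictl stop $(crictl ps -aq)
  crictl rmp $(crictl pods -q)
  crictl rmp --force $(crictl pods -q)

  systemctl stop crio

  # Wipe ephemeral container storage and crio's internal state.
  rm -rf /var/lib/containers/*
  crio wipe -f

  systemctl start crio
  systemctl start kubelet
}
```

    The ordering matters: kubelet must stop first so it does not recreate pods, and crio must stop before its storage is removed.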
      
  • If the cleanup worked as expected and the crio and kubelet services started, the node should become Ready.

  • Before marking the node schedulable, collect an sosreport from the node to investigate the root cause.

  • Mark the node schedulable

      # oc adm uncordon <node1>
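
  From the workstation, the recovery can be verified before uncordoning. A minimal sketch, assuming `oc wait` is available in the installed oc client; `verify_and_uncordon` is a hypothetical helper name:

```shell
# Hypothetical wrapper: block until the node reports Ready,
# then mark it schedulable again.
verify_and_uncordon() {
  node="$1"
  oc wait --for=condition=Ready "node/${node}" --timeout=300s || return 1
  oc adm uncordon "${node}"
}
```

  Waiting on the Ready condition first avoids scheduling workloads onto a node whose kubelet has not yet re-registered.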
    

Root Cause

The exact root cause is unknown and can vary; essentially, the container runtime (crio) became unstable and entered an inconsistent state.

Diagnostic Steps

  • Node in NotReady state:

      # oc get nodes
      NAME                        STATUS    ROLES         AGE  VERSION
      <...snip...>
      node1.example.com           NotReady  worker        25d  v1.19.0+e49167a
      <...snip...>
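
    On larger clusters, the affected nodes can be extracted from that output with a small awk filter on the STATUS column. A sketch; `notready_nodes` is a hypothetical helper name:

```shell
# Hypothetical helper: print the names of nodes whose STATUS is NotReady.
notready_nodes() {
  oc get nodes --no-headers | awk '$2 == "NotReady" { print $1 }'
}
```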
    
  • machine-config, network and monitoring operators degraded:

      # oc get co
      NAME                                      VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
      <...snip...>
      machine-config                            4.6.16   False      False        True      8h19m
      monitoring                                4.6.16   False      False        True      5m58s
      network                                   4.6.16   True       True         True      8h20m
      <...snip...>
    
  • Node's Ready condition shows runtime network not ready:

      # oc get node node1.example.com -o jsonpath='{range .status.conditions[?(.type=="Ready")]}{.message}{"\n"}{end}'
      'runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
       message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
       Has your network provider started?'
    
  • All pods on the node are in Pending state:

      # oc get pods -A -o wide | grep 'node1.example.com'
      NAMESPACE                             NAME                          READY  STATUS     RESTARTS  AGE    IP            NODE
      <...snip...>
      openshift-sdn                         ovs-l88hd                     0/1    Pending    0         4h2m   10.8.23.3     node1.example.com
      openshift-machine-config-operator     machine-config-daemon-v45wq   0/2    Pending    0         3d     10.8.23.3     node1.example.com
      openshift-dns                         dns-default-jwwk2             0/3    Pending    0         3d                   node1.example.com
      <...snip...>
    
  • The crio service on the node logs warnings about containers that time out while stopping and/or about unknown CNI cache files:

      Feb 13 18:57:57 node1.example.com crio[1732]: time="2021-02-13 18:57:57.132606810Z" level=warning msg="Stopping container cee05940dbc79968cbca346801f287e99994095000af13aa188b3347bee8a15c with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"
      Feb 14 04:05:47 node1.example.com crio[1732]: time="2021-02-14 04:05:47.134324953Z" level=warning msg="Stopping container 045670ee4c8e3eed85d2993f914fabcb9d6c43b409505e4006fc9798078583d4 with stop signal timed out: timeout reached after 30 seconds waiting for container process to exit"
      
      <...snip...>
      
      Feb 14 04:08:15 node1.example.com crio[1732]: time="2021-02-14 04:08:15.032567695Z" level=warning msg="Unknown CNI cache file 
      /var/lib/cni/results/openshift-sdn-c1c88dcb926a41151e8815cc03dc33ad64bdec53218868e35453be38adc77e77-eth0 kind \"\""
      Feb 14 04:08:15 node1.example.com crio[1732]: time="2021-02-14 04:08:15.231010108Z" level=warning msg="Unknown CNI cache file 
      /var/lib/cni/results/openshift-sdn-4ee4131d45452d65baf8d91ad600ca4f30a3614f87ad0b645032409b836be43e-eth0 kind \"\""
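
    If console access to the node is available, those warnings can be pulled out of the crio journal with a simple grep. A sketch; `crio_warnings` is a hypothetical helper name and the patterns match the log messages shown above:

```shell
# Hypothetical filter: surface stop-timeout and stale-CNI-cache warnings
# from the crio unit's journal on the node.
crio_warnings() {
  journalctl -u crio --no-pager \
    | grep -E 'timeout reached|Unknown CNI cache file'
}
```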
    
  • Pods getting scheduled on the node fail to start with reason FailedCreatePodSandBox and errors like `stat /usr/bin/pod: no such file or directory` and `cgroup: subsystem does not exist`:

      # oc get events -n openshift-sdn
      LAST SEEN  TYPE     REASON                  OBJECT         MESSAGE
      2m16s      Warning  FailedCreatePodSandBox  pod/ovs-l88hd  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-02-14T12:43:29Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"/usr/bin/pod\": stat /usr/bin/pod: no such file or directory"
      2m22s      Warning  FailedCreatePodSandBox  pod/sdn-b84l4  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-02-14T12:43:24Z" level=warning msg="cgroup: subsystem does not exist"
                                                              time="2021-02-14T12:43:24Z" level=warning msg="cgroup: subsystem does not exist"
                                                              time="2021-02-14T12:43:24Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"/usr/bin/pod\": stat /usr/bin/pod: no such file or directory"
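
    Instead of checking one namespace at a time, a field selector on `oc get events` can surface these warnings cluster-wide. A sketch; `sandbox_failures` is a hypothetical helper name:

```shell
# Hypothetical shortcut: list FailedCreatePodSandBox warnings across
# all namespaces using a field selector instead of per-namespace grep.
sandbox_failures() {
  oc get events -A \
    --field-selector reason=FailedCreatePodSandBox,type=Warning
}
```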
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.