Pods are not rescheduled when node becomes NotReady

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3
    • 4

Issue

  • During failover testing, bringing a node offline does not cause a pod that was scheduled on that node to reschedule on another node.
  • I lost a node, and the pods on the node did not come back up on another node as they should have. This resulted in unavailable services.
  • One node has become NotReady, and now there are pods stuck in Terminating state.
  • Pods on a node in state NotReady are not rescheduled and stay in Running state even after 15 minutes.

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

  • Check the pod's YAML, and/or the owning DaemonSet or StatefulSet, for advanced scheduling settings such as a nodeSelector or affinity/anti-affinity rules. Check the project/namespace configuration for the same values, as either can prevent the pod from rescheduling after OpenShift's default timeout interval of 10 minutes:

    spec:
      nodeSelector:
        node-role.kubernetes.io/infra-efk: "true"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: prometheus
                  operator: In
                  values:
                  - k8s
              namespaces:
              - openshift-monitoring
              topologyKey: kubernetes.io/hostname
            weight: 100
    

    If this is not the cause, or the advanced scheduling is present by default, consult the Kubernetes documentation on StatefulSets. When a node is unplugged, the pods on it enter the Terminating or Unknown state after a timeout. The StatefulSet controller will not reschedule those pods because it cannot determine whether the node has a network problem or has been powered off. Rescheduling a pod onto another node could create multiple pods with the same StatefulSet identity, which could cause a split-brain. The only ways to let such a terminating pod exit gracefully are to delete the node object from the cluster, or to bring the node back online. Either allows the cluster to confirm that the pod is dead and mark it as such. In most cases, a pod will regenerate on a different node when its node is unplugged, unless advanced scheduling prevents it from doing so.

  • This is expected behaviour for pods belonging to a DaemonSet: by design they are not evicted or rescheduled, and they can remain Running on the node even if it becomes unreachable. Bug 2088726 - oc is not reporting pods with status NodeLost for Daemonset pods when a node is marked NotReady was opened to explore improving how the oc CLI reports these pods when the related node is NotReady in RHOCP 4.
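
    The recovery options described above can be sketched as follows. This is a minimal sketch; `<node-name>`, `<pod-name>`, and `<namespace>` are placeholders to substitute with your own values:

    ```shell
    # Identify pods stuck in Terminating/Unknown on the NotReady node.
    oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

    # Option 1: bring the node back online, then confirm it reports Ready.
    oc get node <node-name>

    # Option 2: delete the node object so the control plane can mark the
    # pods as dead and the StatefulSet controller can recreate them.
    oc delete node <node-name>

    # Last resort: force-delete a stuck pod. Only do this if you are certain
    # the node is powered off; otherwise duplicate pod identities (and a
    # potential split-brain) are possible.
    oc delete pod <pod-name> -n <namespace> --grace-period=0 --force
    ```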

Root Cause

  • Advanced scheduling in the pod or project configuration is preventing the pod from rescheduling by design, or the node needs to become Ready again so that the pod can regenerate or report its status to the cluster. From the example above:

  • We have a nodeSelector pointing the pod to a node with the label "node-role.kubernetes.io/infra-efk". If that node is taken offline and is the only node with this label, the pod will not reappear.

  • We have an anti-affinity rule with a weight of 100. If the other nodes match the condition specified here, the scheduler will avoid placing the pod on them (a preferred rule is a soft constraint; a required rule blocks scheduling outright). The inverse is true for pod affinity. There is more documentation regarding affinity and anti-affinity in the official Kubernetes documentation.
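
    For the nodeSelector case above, you can verify whether any node still carries the required label. This is a sketch; the label is taken from the example, so substitute your own:

    ```shell
    # List nodes carrying the label required by the nodeSelector.
    # If none are listed (or none are Ready), the pod has nowhere to go.
    oc get nodes -l node-role.kubernetes.io/infra-efk=true

    # Show labels on all nodes to compare against the pod's nodeSelector.
    oc get nodes --show-labels
    ```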

Diagnostic Steps

  1. Check which pods are running on a node with:

    oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
    
  2. Bring that node offline

  3. If the pods don't reschedule, check configurations with:

    oc get pod <pod-id> -o yaml
    oc get project <projectname> -o yaml
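
    To narrow the output to just the scheduling-related fields, a jsonpath query can help. This is a sketch; substitute your own pod and project names:

    ```shell
    # Print only the scheduling-related fields of the pod spec.
    oc get pod <pod-id> -o jsonpath='{.spec.nodeSelector}{"\n"}'
    oc get pod <pod-id> -o jsonpath='{.spec.affinity}{"\n"}'
    oc get pod <pod-id> -o jsonpath='{.spec.tolerations}{"\n"}'

    # Check the namespace for a default node selector annotation.
    oc get namespace <projectname> -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}{"\n"}'
    ```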
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.