Pods are not getting allocated evenly on worker nodes in OpenShift

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3
    • 4
  • Scheduler

Issue

  • Why does RHOCP allocate more pods to particular nodes, causing a resource crunch on them, when other nodes have more resources available?
  • Is the Scheduler scheduling pods incorrectly?
  • How does the Scheduler schedule pods in OpenShift?
  • A node's system load is very high, while other nodes have available resources.

Resolution

The Scheduler is responsible for placing pods on nodes. The default Scheduler checks the requests configured for the pods, so pods without requests can be scheduled on nodes that are already under high load. Ensure that custom pods have correct requests configured so that the Scheduler can place them on appropriate nodes.
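As an illustration, requests and limits are set in the pod template of a workload. The names, image, and sizes below are placeholders, not values from any real environment; adjust them to the workload's actual needs:

```yaml
# Example only: placeholder names and sizes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: registry.example.com/example-app:latest   # placeholder image
        resources:
          requests:                # minimum the Scheduler reserves on a node
            cpu: 100m
            memory: 256Mi
          limits:                  # hard cap enforced at runtime
            cpu: 500m
            memory: 512Mi
```

The same requests can also be applied to an existing workload with, for example, `oc set resources deployment/example-app --requests=cpu=100m,memory=256Mi`.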

For Advanced Scheduling, refer to:

Root Cause

The Scheduler is responsible for placing pods on nodes. Refer to Understanding default scheduling for additional information about the default Scheduler.

Notes:

  • By specifying resource requests, you specify the minimum amount of resources a pod needs. The Scheduler uses this information to assign the pod to a node. Each node has a certain amount of CPU and memory it can allocate to pods, and the Scheduler only checks whether the unallocated resources meet the pod's resource requirements. If the amount of unallocated CPU or memory is less than what the pod requests, the pod will not be scheduled on that node, because the node does not meet the pod's minimum requirements.

  • The Scheduler does not look at how much of each individual resource is actually in use at the exact time of scheduling. Instead, it looks at the sum of the resources requested by the existing pods deployed on the node (so in `free -h` output, check the available column, not the free column). Even though existing pods may be using less than what they have requested, scheduling another pod based on actual resource consumption would break the assurance given to the already deployed pods.

  • Refer to Memory available in a node from the perspectives of the Kubelet and the default OpenShift scheduler for additional information about the memory that is checked when scheduling new pods on a node.
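The request-based check described above can be sketched with made-up numbers. All values below are illustrative, not taken from a real node:

```shell
# Illustrative scheduler-style fit check: compare a new pod's request against
# the node's unallocated capacity (allocatable minus already-requested),
# ignoring actual usage. All numbers are made up for this example.
alloc_cpu=3500   # node allocatable CPU, in millicores
req_sum=3097     # sum of CPU requests of pods already on the node
pod_req=500      # CPU request of the pod to be scheduled

unalloc=$((alloc_cpu - req_sum))
echo "unallocated: ${unalloc}m"
if [ "$unalloc" -ge "$pod_req" ]; then
  echo "pod fits on this node"
else
  echo "pod does not fit on this node"
fi
```

With these numbers the node has 403m unallocated, so a pod requesting 500m is rejected even if the node's actual CPU usage is low.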

Diagnostic Steps

  1. Check the available memory (the available column, not free) of a particular node:

    $ free -h  
                   total        used        free      shared  buff/cache   available
    Mem:           251G         64G         17G         89G        169G         97G
    Swap:            0B          0B          0B
    
  2. Check the CPU and memory requests and limits, and the number of pods running on each node:

        $ oc describe nodes  | awk 'BEGIN{ovnsubnet="";printf "|%s|||||%s|||||%s||%s\n%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n","CPU","MEM","PODs","OVN","NODENAME","Allocatable","Request","(%)","Limit","(%)","Allocatable","Request","(%)","Limit","(%)","Allocatable","Running","Node Subnet"}{if($1 == "Name:"){name=$2};if($1 == "k8s.ovn.org/node-subnets:"){ovnsubnet=$2};if($1 ~ "Allocatable:"){while($1 != "System"){if($1 == "cpu:"){Alloc_cpu=$2};if($1 == "memory:"){Alloc_mem=$2};if($1 == "pods:"){Alloc_pod=$2};getline}};if($1 == "Namespace"){getline;getline;pods_count=0;while($1 != "Allocated"){pods_count++;getline}};if($1 == "Resource"){while($1 != "Events:"){if($1 == "cpu"){req_cpu=$2;preq_cpu=$3;lim_cpu=$4;plim_cpu=$5};if($1 == "memory"){req_mem=$2;preq_mem=$3;lim_mem=$4;plim_mem=$5};getline};printf "%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n",name,Alloc_cpu,req_cpu,preq_cpu,lim_cpu,plim_cpu,Alloc_mem,req_mem,preq_mem,lim_mem,plim_mem,Alloc_pod,pods_count,ovnsubnet}}' | sed -e "s/{\"default\":\[\{0,1\}\"\([.\/0-9]*\)\"\]\{0,1\}\]}/\1/" | column -s'|' -t
    
                  CPU                                      MEM                                      PODs                OVN
        NODENAME  Allocatable Request (%)   Limit  (%)     Allocatable Request (%)   Limit   (%)    Allocatable Running Node Subnet
        master-0  3500m       2326m   (66%) 6010m  (171%)  15225968Ki  7012Mi  (47%) 2Gi     (13%)  250         39      10.1.0.0/23
        master-1  3500m       2700m   (77%) 6010m  (171%)  15225968Ki  8665Mi  (58%) 2Gi     (13%)  250         65      10.2.0.0/23
        master-2  3500m       2499m   (71%) 6010m  (171%)  15225968Ki  7563Mi  (50%) 2Gi     (13%)  250         43      10.3.0.0/23
        worker-0  3500m       3097m   (88%) 15010m (428%)  6980720Ki   6788Mi  (99%) 6528Mi  (95%)  250         50      10.4.0.0/23
        worker-1  3500m       1909m   (54%) 14410m (411%)  6980728Ki   6798Mi  (99%) 9360Mi  (137%) 250         42      10.5.0.0/23
        worker-2  3500m       1447m   (41%) 22010m (628%)  6980728Ki   6794Mi  (99%) 26240Mi (384%) 250         42      10.6.0.0/23
    
  3. Check if any Advanced Scheduling for pod placement is already configured.

  4. Find pods without requests using the script described in OpenShift nodes are being overloaded and going into NotReady state.

  5. Find additional information about pods with the script described in how to get an insight about containers status, restarts, exit code, reason and limits in OCP4?.
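As a quick illustration of step 1, the available column (what matters when judging headroom) can be extracted from `free` output. The sample figures below are copied from the example in step 1 so the snippet is self-contained; on a live node, pipe the real command instead: `free -h | awk '/^Mem:/ {print $7}'`.

```shell
# Extract the "available" memory column from free -h style output.
# Sample output embedded here so the snippet runs without a live node.
sample='               total        used        free      shared  buff/cache   available
Mem:           251G         64G         17G         89G        169G         97G
Swap:            0B          0B          0B'

echo "$sample" | awk '/^Mem:/ {print "available:", $7}'
```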


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.