Pods are not getting allocated evenly on worker nodes in OpenShift
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3
- Red Hat OpenShift Container Platform (RHOCP) 4
- Scheduler
Issue
- Why does RHOCP allocate more pods to particular nodes, causing a resource crunch on them, when other nodes have more resources available?
- Is the Scheduler scheduling pods incorrectly?
- How does the Scheduler schedule pods in OpenShift?
- A node's system load is very high, while other nodes have available resources.
Resolution
The Scheduler is responsible for placing pods on nodes. The default Scheduler checks the requests configured for the pods, so pods without requests can be scheduled on nodes that are already under high load. Ensure that custom pods have correct requests configured so that the Scheduler can place them on the nodes appropriately.
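For example, a pod can declare requests in its container spec as in the following sketch (the pod name, image, and values here are illustrative placeholders, not taken from this article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:                # minimum resources the Scheduler reserves on a node
        cpu: 250m
        memory: 512Mi
      limits:                  # optional upper bound enforced at runtime
        cpu: "1"
        memory: 1Gi
```

With `requests` set, the Scheduler only considers nodes whose unallocated CPU and memory can cover them.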
For advanced scheduling, refer to:
- Advanced Scheduling in OpenShift 3.
- Controlling pod placement using the scheduler in OpenShift 4.
Root Cause
The Scheduler is responsible for placing pods on nodes. Refer to understanding default scheduling for additional information about the default Scheduler.
Notes:
- By specifying resource requests, one specifies the minimum amount of resources a pod needs. The Scheduler uses this information to schedule a pod to a node. Each node has a certain amount of CPU and memory it can allocate to pods. The Scheduler only checks unallocated resources against the pod's resource requirements. If the amount of unallocated CPU/memory is less than what the pod requests, the pod will not be scheduled on that node, because the node does not meet the pod's minimum requirement.
- The Scheduler does not look at how much of each individual resource is actually being used at the exact time of scheduling. Instead, it looks at the sum of the resources requested by the existing pods deployed on the node (so from `free -h`, check the `available` memory column and not the `free` column). Even though existing pods may be using less than what they have requested, scheduling another pod based on actual resource consumption would break the assurance given to the already deployed pods.
- Refer to memory available in a node from the perspectives of Kubelet and the default OpenShift scheduler respectively for additional information about the memory checked for scheduling new pods in a node.
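The request-based fit check described above can be sketched as follows. This is a simplified illustration of the behavior, not the actual kube-scheduler code, and the node and pod figures are invented:

```python
# Simplified sketch of the Scheduler's request-based fit check.
# It compares the new pod's requests against the node's allocatable
# resources minus the SUM of requests of pods already on the node;
# actual resource usage at scheduling time is not considered.

def fits(node_allocatable, existing_requests, new_pod_request):
    for resource, allocatable in node_allocatable.items():
        already_requested = sum(p.get(resource, 0) for p in existing_requests)
        if already_requested + new_pod_request.get(resource, 0) > allocatable:
            return False  # node cannot honor the pod's minimum requirement
    return True

# Hypothetical node: 3500 millicores of CPU and 8 GiB of memory allocatable.
node = {"cpu_m": 3500, "mem_mi": 8192}
existing = [{"cpu_m": 2000, "mem_mi": 4096}, {"cpu_m": 1000, "mem_mi": 2048}]

print(fits(node, existing, {"cpu_m": 400, "mem_mi": 1024}))  # True
print(fits(node, existing, {"cpu_m": 600, "mem_mi": 1024}))  # False: 3600m > 3500m
```

Note that the second pod is rejected even if the existing pods are idle: only the requested amounts matter.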
Diagnostic Steps
- Check for the `available` memory (not `free`) of a particular node:

  ```
  $ free -h
                total        used        free      shared  buff/cache   available
  Mem:           251G         64G         17G         89G        169G         97G
  Swap:            0B          0B          0B
  ```
- Check the CPU and memory requests and limits, and the number of pods running, on each node:

  ```
  $ oc describe nodes | awk 'BEGIN{ovnsubnet="";printf "|%s|||||%s|||||%s||%s\n%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n","CPU","MEM","PODs","OVN","NODENAME","Allocatable","Request","(%)","Limit","(%)","Allocatable","Request","(%)","Limit","(%)","Allocatable","Running","Node Subnet"}{if($1 == "Name:"){name=$2};if($1 == "k8s.ovn.org/node-subnets:"){ovnsubnet=$2};if($1 ~ "Allocatable:"){while($1 != "System"){if($1 == "cpu:"){Alloc_cpu=$2};if($1 == "memory:"){Alloc_mem=$2};if($1 == "pods:"){Alloc_pod=$2};getline}};if($1 == "Namespace"){getline;getline;pods_count=0;while($1 != "Allocated"){pods_count++;getline}};if($1 == "Resource"){while($1 != "Events:"){if($1 == "cpu"){req_cpu=$2;preq_cpu=$3;lim_cpu=$4;plim_cpu=$5};if($1 == "memory"){req_mem=$2;preq_mem=$3;lim_mem=$4;plim_mem=$5};getline};printf "%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s|%s\n",name,Alloc_cpu,req_cpu,preq_cpu,lim_cpu,plim_cpu,Alloc_mem,req_mem,preq_mem,lim_mem,plim_mem,Alloc_pod,pods_count,ovnsubnet}}' | sed -e "s/{\"default\":\[\{0,1\}\"\([.\/0-9]*\)\"\]\{0,1\}\]}/\1/" | column -s'|' -t
            CPU                                       MEM                                            PODs                  OVN
  NODENAME  Allocatable  Request  (%)    Limit   (%)     Allocatable  Request  (%)    Limit    (%)     Allocatable  Running  Node Subnet
  master-0  3500m        2326m    (66%)  6010m   (171%)  15225968Ki   7012Mi   (47%)  2Gi      (13%)   250          39       10.1.0.0/23
  master-1  3500m        2700m    (77%)  6010m   (171%)  15225968Ki   8665Mi   (58%)  2Gi      (13%)   250          65       10.2.0.0/23
  master-2  3500m        2499m    (71%)  6010m   (171%)  15225968Ki   7563Mi   (50%)  2Gi      (13%)   250          43       10.3.0.0/23
  worker-0  3500m        3097m    (88%)  15010m  (428%)  6980720Ki    6788Mi   (99%)  6528Mi   (95%)   250          50       10.4.0.0/23
  worker-1  3500m        1909m    (54%)  14410m  (411%)  6980728Ki    6798Mi   (99%)  9360Mi   (137%)  250          42       10.5.0.0/23
  worker-2  3500m        1447m    (41%)  22010m  (628%)  6980728Ki    6794Mi   (99%)  26240Mi  (384%)  250          42       10.6.0.0/23
  ```

- Check if any advanced scheduling for pod placement is already configured.
- Find pods without requests using the script described in OpenShift nodes are being overloaded and going into NotReady state.
- Find additional information about pods with the script described in how to get an insight about containers status, restarts, exit code, reason and limits in OCP4?.
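As a rough complement to the referenced scripts, containers missing requests can also be spotted by walking the JSON from `oc get pods -A -o json`. The following is a sketch with inlined sample data standing in for real cluster output; the field paths follow the standard Pod spec:

```python
import json

# Sample data shaped like `oc get pods -A -o json`, trimmed to relevant fields.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "app1", "name": "web-1"},
   "spec": {"containers": [
     {"name": "web", "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}}]}},
  {"metadata": {"namespace": "app2", "name": "batch-1"},
   "spec": {"containers": [
     {"name": "worker", "resources": {}}]}}
]}
""")

def pods_without_requests(pod_list):
    """Return (namespace, pod, container) for containers missing CPU or memory requests."""
    missing = []
    for pod in pod_list["items"]:
        meta = pod["metadata"]
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            if "cpu" not in requests or "memory" not in requests:
                missing.append((meta["namespace"], meta["name"], container["name"]))
    return missing

print(pods_without_requests(sample))  # [('app2', 'batch-1', 'worker')]
```

Any containers reported this way are candidates for the resource crunch described above, since the Scheduler cannot account for them when placing pods.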
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.