Upgrading OpenShift master node resources hosted on OpenStack platform
Environment
- Red Hat OpenShift Container Platform 4.8 and later
- Red Hat OpenStack Platform 13.x and 16.x
Issue
- Facing high resource consumption on the cluster's master nodes
- Memory is currently set to 16 GB on all master nodes, which is the bare minimum
Resolution
- A rolling resize of the master nodes is required, i.e. one node at a time.
- This particular example can be used to increase the memory and vCPUs of a master node.
- Increasing the master node disk size requires some other considerations and is not covered here.
- The suggestion is to increase the memory to at least 32 GB, or 64 GB, or as per your requirement.
- Here are the steps to increase the memory of a master node running as an OpenStack instance:
1. Take a backup/snapshot of the master instance:
   $ openstack server list
   $ openstack server image create --name <image_name> <instance>
   Here <image_name> is your snapshot name and <instance> is the ID of the master node server.
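As a small convenience, the snapshot name can be generated from the server name plus a timestamp so repeated resizes do not collide. A minimal sketch; the server name `labocp-mst1` is only an example taken from the Diagnostic Steps output below, substitute your own:

```shell
#!/bin/bash
# Sketch: build a timestamped snapshot name for a master instance.
# "labocp-mst1" is an example server name from this cluster; replace it with yours.
master="labocp-mst1"
image_name="${master}-pre-resize-$(date +%Y%m%d-%H%M)"
echo "$image_name"
# Then create the snapshot with:
#   openstack server image create --name "$image_name" "$master"
```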
2. Take an ETCD backup following the official documentation.
3. Make sure all the cluster operators are stable:
   $ oc get co -w
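Instead of eyeballing the `-w` stream, the status columns can be checked mechanically: a stable operator shows `AVAILABLE=True`, `PROGRESSING=False`, `DEGRADED=False`. A sketch run here against a captured sample (the operator rows are illustrative, not from this cluster); on a live cluster, replace the heredoc with `oc get co --no-headers`:

```shell
#!/bin/bash
# Sketch: flag cluster operators that are not in the steady state
# AVAILABLE=True, PROGRESSING=False, DEGRADED=False.
# Live cluster equivalent: oc get co --no-headers | awk '...'
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED
unstable=$(awk '$3 != "True" || $4 != "False" || $5 != "False" {print $1}' <<'EOF'
authentication   4.8.0   True   False   False
etcd             4.8.0   True   True    False
kube-apiserver   4.8.0   True   False   False
EOF
)
echo "Unstable operators: ${unstable:-none}"
```

With the sample above, only `etcd` is flagged because it is still progressing; wait until the filter prints nothing before resizing.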
4. Make sure that the rest of the masters are completely healthy:
   $ oc get nodes | grep master
5. Mark the node as unschedulable:
   $ oc adm cordon <node_name>
   Check that the node status is Ready,SchedulingDisabled.
6. Evacuate the pods:
   $ oc adm drain <node_name> --delete-emptydir-data --grace-period=1 --ignore-daemonsets
7. Check the master node flavor from the OpenStack end:
   $ openstack server show <master ID>
   $ openstack flavor show <flavor ID>
8. Create a new flavor, or reuse an existing flavor that already has the intended memory size (i.e. 32/64 GB):
   $ openstack flavor list
   $ openstack flavor create --ram <size_mb> --disk <size_gb> --vcpus <no_vcpus> --project <project_id> <flavor_name>
   IMPORTANT: Whether you are creating a new flavor or using a pre-existing one, make sure parameters like --disk stay the same as in your existing flavor.
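Note that `--ram` is specified in MB, so a 32 GB flavor needs `--ram 32768`. A sketch of assembling the call; the flavor name `m1.master.32g` and the disk/vCPU values are assumed examples, and `--disk` must match whatever the current master flavor reports:

```shell
#!/bin/bash
# Sketch: compute parameters for a 32 GB master flavor.
# --ram is in MB; --disk MUST match the existing master flavor's disk size.
ram_gb=32
ram_mb=$((ram_gb * 1024))      # 32 GB -> 32768 MB
disk_gb=100                    # example: copy from `openstack flavor show`
vcpus=8                        # example vCPU count
flavor_name="m1.master.${ram_gb}g"   # hypothetical flavor name
echo "openstack flavor create --ram ${ram_mb} --disk ${disk_gb} --vcpus ${vcpus} ${flavor_name}"
```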
9. Resize the master node instance:
   $ openstack server resize --flavor <flavor> --wait <instance>
   - Replace <flavor> with the name or ID of the flavor that you retrieved in step 8.
   - Replace <instance> with the name or ID of the master node instance that you are resizing.
10. Confirm the resize operation (resizing takes time, ~10 minutes on average):
    Note that resizing can take time. The operating system on the instance performs a controlled shutdown before the instance is powered off and the instance is resized. During this time, the instance status is RESIZE. When the resize completes, the instance status changes to VERIFY_RESIZE.
    $ openstack server resize confirm <instance ID>
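The status transition described above (RESIZE to VERIFY_RESIZE) can be polled in a loop rather than watched by hand. A sketch of such a loop; the `poll_status` function below is a stub that simulates the transition so the example can be followed without a live cloud, and against a real cloud it would instead run `openstack server show <instance> -f value -c status`:

```shell
#!/bin/bash
# Sketch: poll the instance status until the resize leaves the RESIZE state.
# Live-cloud equivalent of poll_status:
#   status=$(openstack server show <instance> -f value -c status)
attempt=0
poll_status() {
  # Stub that simulates a resize finishing on the third poll.
  attempt=$((attempt + 1))
  if [ "$attempt" -lt 3 ]; then status="RESIZE"; else status="VERIFY_RESIZE"; fi
}

poll_status
while [ "$status" = "RESIZE" ]; do
  sleep 1                      # use a longer interval (e.g. 30s) in practice
  poll_status
done
echo "Final status: $status"
# Once VERIFY_RESIZE is reached: openstack server resize confirm <instance>
```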
11. Mark the node as schedulable when done:
    $ oc adm uncordon <master>
    RESCUE: For any problems with the resize, you can revert the operation at any time using the steps below:
    $ openstack server resize revert <instance>
    $ oc adm uncordon <master>
Root Cause
- For this particular example, the root causes were:
  - The kube-apiserver-labocp-mst2, catalog-operator-75c8ffd69b-q77tf, and etcd-labocp-mst2 pods were accounting for 30% of the system RAM.
  - Increased API requests, ETCD queries, and logging (openshift-logging/openshift-monitoring) on the control plane, among many other factors.
- At any given time, a single master node serves as the ETCD leader, the OVN primary, and the Kube-API primary; this might be another reason for increased resource consumption as the cluster grows with pods on other nodes.
Diagnostic Steps
- Worker nodes are deployed on bare metal, while master nodes utilize OpenStack Compute nodes.
- Since OCP master nodes are deployed on OSP Compute nodes, take care when choosing the resource counts: the Compute node must have enough resources available.
- Current master node resource utilization, where labocp-mst2 memory was 95% utilized:
  [root@ocp-bastion ~]# oc adm top node
  NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
  labocp-wo1    2379m        2%     30831Mi         6%
  labocp-wo2    2136m        2%     35477Mi         7%
  labocp-wo3    1297m        1%     61991Mi         12%
  labocp-mst1   1549m        19%    11029Mi         82%
  labocp-mst2   1563m        19%    12736Mi         95%
  labocp-mst3   2235m        28%    8524Mi          63%
  labocp-wo1    13229m       13%    77129Mi         15%
  labocp-wo2    4838m        5%     88265Mi         17%
  labocp-wo3    8006m        8%     82351Mi         16%
  labocp-wo4    5044m        5%     83211Mi         16%
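The same `oc adm top node` output can be filtered mechanically to flag nodes whose memory usage crosses a threshold. A sketch using a few rows from the output above as captured sample data; on a live cluster, pipe `oc adm top node --no-headers` into the same awk filter:

```shell
#!/bin/bash
# Sketch: flag nodes at or above 80% memory from `oc adm top node` output.
# Column 5 is MEMORY%; strip the trailing "%" before the numeric comparison.
# Live cluster equivalent: oc adm top node --no-headers | awk '...'
hot_nodes=$(awk '{gsub(/%/, "", $5); if ($5 + 0 >= 80) print $1}' <<'EOF'
labocp-wo1    2379m    2%     30831Mi    6%
labocp-mst1   1549m    19%    11029Mi    82%
labocp-mst2   1563m    19%    12736Mi    95%
labocp-mst3   2235m    28%    8524Mi     63%
EOF
)
echo "Nodes at >= 80% memory: $hot_nodes"
```

With the data above, this flags labocp-mst1 and labocp-mst2, the two nodes that motivated this resize.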
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.