Upgrading OpenShift master node resources hosted on OpenStack platform

Environment

  • Red Hat OpenShift Container Platform 4.8 and later
  • Red Hat OpenStack Platform 13.x and 16.x

Issue

  • High resource consumption on the cluster's master nodes
  • Memory is currently set to 16 GB for all master nodes, which is the bare minimum

Resolution

  • The master nodes must be resized in a rolling fashion, i.e. one node at a time.

  • This example can be used to increase the memory and vCPUs of a master node.

  • Increasing the master node disk size requires additional considerations and is not covered here.

  • The suggestion is to increase the memory to at least 32 GB, or to 64 GB, as per your requirements.

  • Here are the steps to increase the memory of a master node running as an OpenStack instance:

  1. Take a backup/snapshot of the master instance

    $ openstack server list
    $ openstack server image create --name <image_name> <instance>
    
    • <image_name> is the name for your snapshot.
    • <instance> is the name or ID of the master node server.
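    A consistent snapshot naming scheme makes the image easy to find if a revert is needed later. The helper below is an illustrative sketch, not part of the official procedure; the node name labocp-mst1 is taken from the diagnostics section of this article as an example:

    ```shell
    # Emit a timestamped snapshot name for a given instance,
    # e.g. labocp-mst1-pre-resize-20240101 (the date part will vary).
    snapshot_name() {
        printf '%s-pre-resize-%s\n' "$1" "$(date +%Y%m%d)"
    }

    # Usage against a live cloud (commented out; requires OpenStack credentials):
    # openstack server image create --name "$(snapshot_name labocp-mst1)" labocp-mst1
    # openstack image show "$(snapshot_name labocp-mst1)" -f value -c status   # wait for "active"
    ```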
  2. Take an etcd backup following the official documentation.

  3. Make sure all the cluster operators are stable

    $ oc get co -w 
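    The watch above requires reading the output by eye. As a hedged alternative, the small helper below filters `oc get co --no-headers` output for operators that are not in the healthy Available=True / Progressing=False / Degraded=False state (the column order is assumed from the standard `oc get co` layout):

    ```shell
    # Print the names of cluster operators that are not fully healthy.
    # Expects `oc get co --no-headers` output on stdin; columns are
    # NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    unstable_cluster_operators() {
        awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1 }'
    }

    # Usage (requires cluster access):
    # oc get co --no-headers | unstable_cluster_operators
    ```

    An empty result means all operators are stable.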
    
  4. Make sure that the rest of the masters are completely healthy.

    $ oc get nodes | grep {master}
    
  5. Mark the node as unschedulable.

    $ oc adm cordon ${node_name}   # check that the node status is Ready,SchedulingDisabled
    
  6. Evacuate the pods

    $ oc adm drain <NODENAME> --delete-emptydir-data --grace-period=1 --ignore-daemonsets
    
  7. Check the master node's current flavor from the OpenStack side.

    $ openstack server show <master ID>
    $ openstack flavor show <flavor ID>
    
  8. Create a new flavor, or, if you already have a flavor with the intended memory size (i.e. 32/64 GB), use that one.

    $ openstack flavor list
    $ openstack flavor create --ram <size_mb> --disk <size_gb> --vcpus <no_vcpus> --project <project_id> <flavor_name>
    

    IMPORTANT:

    • Whether you are creating a new flavor or using a pre-existing one, make sure parameters such as --disk match your existing flavor.
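    Note that --ram is specified in MB. A minimal sketch of the conversion follows; the flavor name ocp-master-32g and the --disk/--vcpus values are hypothetical and must be adjusted to match your existing master flavor:

    ```shell
    # Convert GB to the MB value expected by `openstack flavor create --ram`.
    gb_to_mb() {
        echo $(( $1 * 1024 ))
    }

    # Example (hypothetical flavor name; keep --disk and --vcpus identical to
    # the values shown by `openstack flavor show` for the existing flavor):
    # openstack flavor create --ram "$(gb_to_mb 32)" --disk 100 --vcpus 8 ocp-master-32g
    ```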
  9. Resize the master node instance

    $ openstack server resize --flavor <flavor> --wait <instance>
    
    • Replace <flavor> with the name or ID of the flavor that you retrieved in step 8.
    • Replace <instance> with the name or ID of the master node instance that you are resizing.
  10. Confirm the resize operation (resizing takes ~10 minutes on average)

    • Note that resizing can take time. The operating system on the instance performs a controlled shutdown before the instance is powered off and the instance is resized. During this time, the instance status is RESIZE. When the resize completes, the instance status changes to VERIFY_RESIZE.
    $ openstack server resize confirm <instance ID>
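    Rather than re-running `openstack server show` by hand until the status changes, the wait can be scripted. The polling helper below is a generic sketch; the `openstack` invocation in the usage comment assumes the instance ID from step 7:

    ```shell
    # Poll a status-printing command until it emits the target status.
    # $1: target status; remaining arguments: a command that prints the current status.
    wait_for_status() {
        target=$1; shift
        tries=0
        while [ "$tries" -lt 90 ]; do          # ~15 minutes at 10s per attempt
            [ "$("$@")" = "$target" ] && return 0
            tries=$((tries + 1))
            sleep 10
        done
        return 1
    }

    # Usage (requires OpenStack credentials):
    # wait_for_status VERIFY_RESIZE openstack server show <instance ID> -f value -c status
    # openstack server resize confirm <instance ID>
    ```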
    
  11. Mark the node as schedulable when done.

    $ oc adm uncordon <master>
    

    RESCUE:

    • For any problems with the resize, you can revert the operation at any time using the steps below:
    $ openstack server resize revert <instance>
    $ oc adm uncordon <master>
    

Root Cause

  • For this particular example, the root cause was:
  1. kube-apiserver-labocp-mst2, catalog-operator-75c8ffd69b-q77tf, etcd-labocp-mst2 pods were accounting for 30% of the system RAM
  2. Increased API requests, etcd queries, and logging (openshift-logging/openshift-monitoring) on the control plane, among many other factors.
  3. A single master node serves as the etcd leader, the OVN primary, and the kube-apiserver primary; this can be another source of increased resource consumption as the cluster grows with pods on other nodes.

Diagnostic Steps

  • Worker nodes are deployed on bare metal while master nodes utilize OpenStack Compute nodes.

  • Since OCP master nodes are deployed on OSP Compute nodes, take care when choosing the right resource
    count; the Compute node must have enough resources to back them.

  • Current master node resource utilization, where labocp-mst2 memory was 95% utilized

    [root@ocp-bastion ~]# oc adm top node
    NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    labocp-wo1   2379m        2%     30831Mi         6%
    labocp-wo2   2136m        2%     35477Mi         7%
    labocp-wo3   1297m        1%     61991Mi         12%
    labocp-mst1  1549m        19%    11029Mi         82%
    labocp-mst2  1563m        19%    12736Mi         95%
    labocp-mst3  2235m        28%    8524Mi          63%
    labocp-wo1   13229m       13%    77129Mi         15%
    labocp-wo2   4838m        5%     88265Mi         17%
    labocp-wo3   8006m        8%     82351Mi         16%
    labocp-wo4   5044m        5%     83211Mi         16%
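
    To spot over-committed nodes in output like the above without reading the table by eye, a small awk filter can flag nodes at or above a memory threshold. This is a convenience sketch, not part of the official diagnostics:

    ```shell
    # Print nodes whose MEMORY% meets or exceeds a threshold.
    # $1: threshold percentage; expects `oc adm top node` output (with header) on stdin.
    nodes_over_mem() {
        awk -v t="$1" 'NR > 1 { p = $5; sub(/%/, "", p); if (p + 0 >= t) print $1 }'
    }

    # Usage (requires cluster access):
    # oc adm top node | nodes_over_mem 80
    ```

    Against the sample output above, a threshold of 80 flags labocp-mst1 and labocp-mst2.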
    
