How to Safely Increase Resources on OpenShift Data Foundation (ODF) Storage Nodes

Solution Verified - Updated

Environment

Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Red Hat Ceph Storage (RHCS) 8.x

Issue

  • ODF storage nodes have specific architecture requirements, defined in section 7.3, "Resource requirements", of the Product Documentation.

  • In some instances, especially after ODF storage devices/disks have been scaled up (added), users may see pods stuck in a Pending status that cannot be scheduled. This is because each ODF device added to a storage node requests an additional 2 CPU and 5 GiB of memory from the node, regardless of the device size.

  • Although there are many different infrastructure types (AWS, VMware, bare metal, Azure, etc.), the process is typically the same. Generic terminology is used in this solution; infrastructure administrators will need to fill in the gaps for the specific platform being administered.
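To see how much CPU and memory the OSD pods are currently requesting (and on which nodes they run), a query along the following lines can help. This is a sketch: the `app=rook-ceph-osd` label is the usual Rook convention, and the assumption that the first container carries the requests may not hold for every pod layout, so verify both in your cluster.

```shell
# List each OSD pod with the CPU/memory it requests and the node it runs on.
# The app=rook-ceph-osd label and the "first container holds the requests"
# assumption are conventions, not guarantees - verify with `oc get pods --show-labels`.
osd_requests() {
  oc -n openshift-storage get pods -l app=rook-ceph-osd \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,CPU-REQ:.spec.containers[0].resources.requests.cpu,MEM-REQ:.spec.containers[0].resources.requests.memory'
}
```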

Resolution

Prerequisite: For ODF storage nodes, please ensure the rook-ceph-tools pod is enabled by following the Configuring the Rook-Ceph Toolbox in OpenShift Data Foundation 4.x solution.

WARNING: When performing multi-host/node shutdowns/reboots, it is extremely important to view the status of Ceph before and after each host shutdown/reboot, and to ONLY proceed to the next host when all PGs are active+clean (scrubbing states are OK).

  1. Capture the Status of Ceph:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status

    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 4h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 170 pgs
    objects: 8.22k objects, 28 GiB
    usage:   84 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     170 active+clean <------------------------------------------------ PGs need to be active+clean, scrubbing is ok as well
 
  io:
    client:   1.5 KiB/s rd, 946 KiB/s wr, 2 op/s rd, 4 op/s wr

NOTE: Ceph always keeps replicated copies of data spread across the storage nodes (the default is typically 3 copies). If a host is shut down in a cluster that has spare capacity to re-replicate, Ceph will begin moving the data elsewhere. During this process, the PGs will reflect active+undersized+degraded, which means the third copy of the data is missing. Because the storage hosts are only brought down temporarily to make the resource change, we want to avoid this "self-healing" (backfilling, remapping, etc.). Otherwise, once the host is brought back up, Ceph would recalculate the data placement and begin moving data back to the original host, wasting time and I/O. To prevent this, we temporarily set flags that stop data movement between host shutdowns. This is perfectly safe as long as one host is worked on at a time and Ceph health is inspected between shutdowns.

  2. Set the following Ceph flags to avoid PG movement:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set nobackfill
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set norebalance
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set noout
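The three commands above can be wrapped in a small helper so the same loop also reverses the flags in step 13 (a sketch, assuming the rook-ceph-tools deployment is enabled as described in the prerequisite):

```shell
# Set or unset the Ceph flags that pause data movement.
# Usage: ceph_flags set    (before the first node shutdown)
#        ceph_flags unset  (after the last node is back up and healthy)
ceph_flags() {
  local action="$1"
  for flag in nobackfill norebalance noout; do
    oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd "$action" "$flag"
  done
}
```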
  3. Cordon the node to be shut down:
$ oc adm cordon <node-name>
  4. After the node has been cordoned, find the OSDs and mons on the host to be shut down, and scale down those resources:
$ oc get pods -o wide -n openshift-storage | grep <node-name> | egrep 'osd|mon'
$ oc get deployment -n openshift-storage
$ oc scale deployment <deployment-name> <deployment-name> <deployment-name> --replicas=0 -n openshift-storage
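Since the same deployments are scaled back up in step 11, a small helper that takes the replica count and the deployment names avoids repeating the command (a sketch; the deployment names shown in the usage comment are illustrative - use the ones from your `oc get deployment` output):

```shell
# Scale a list of openshift-storage deployments to the given replica count.
# Usage: scale_storage_daemons 0 rook-ceph-osd-0 rook-ceph-mon-a   (scale down)
#        scale_storage_daemons 1 rook-ceph-osd-0 rook-ceph-mon-a   (scale up)
# The deployment names above are illustrative, not assumed to exist.
scale_storage_daemons() {
  local replicas="$1"
  shift
  for dep in "$@"; do
    oc -n openshift-storage scale deployment "$dep" --replicas="$replicas"
  done
}
```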

NOTE: Some PGs should now reflect undersized+degraded.

  5. Search for any NooBaa pods scheduled on the node to be shut down, and gracefully delete them so they can be rescheduled elsewhere:
    NOTE: If there are no NooBaa pods scheduled on the node, skip to step 6. If there is one, delete it. If there are multiple, delete the NooBaa pods in the following order:
  • noobaa-operator
  • noobaa-core
  • noobaa-endpoint
  • noobaa-db
$ oc get pods -o wide -n openshift-storage | grep <node-name>
$ oc delete pod <pod> -n openshift-storage
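The lookup-and-delete sequence above can be sketched as one helper that walks the documented deletion order. The pod-name prefixes are assumptions based on the default NooBaa naming; confirm them against your `oc get pods` output.

```shell
# Delete any NooBaa pods on the given node, in the documented order.
# Pod-name prefixes (noobaa-operator, noobaa-core, ...) are assumed defaults.
delete_noobaa_pods_on_node() {
  local node="$1"
  for prefix in noobaa-operator noobaa-core noobaa-endpoint noobaa-db; do
    oc get pods -n openshift-storage -o wide \
      | awk -v n="$node" -v p="$prefix" '$0 ~ n && $1 ~ ("^" p) {print $1}' \
      | while read -r pod; do
          oc delete pod "$pod" -n openshift-storage
        done
  done
}
```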
  6. Drain the node:
$ oc adm drain <node-name> --delete-emptydir-data --grace-period=1 --ignore-daemonsets
  7. Once the node drain message shows the drain was successful, shut down the node/host (there are many ways to do this; below is an example):
$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# shutdown now
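The same shutdown can also be issued non-interactively in one line, which is convenient when cycling through several nodes (a sketch of that wrapper):

```shell
# Shut down a node without opening an interactive debug shell.
shutdown_node() {
  oc debug node/"$1" -- chroot /host shutdown -h now
}
```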
  8. When the node is in a NotReady,SchedulingDisabled state, increase the resources on the host. For example:
  • Change the AWS/Azure instance type.
  • Edit the CPU/memory of the VM in the VMware console.
  • Add hardware to the bare-metal host.
  9. Power on the node. Once the node becomes Ready,SchedulingDisabled, uncordon it:
$ oc adm uncordon <node-name>
  10. Describe the node to validate that the changes are updated on the node:
$ oc describe node <node-name>

Capacity:
  cpu:                12  <---------------------------------------------- updated CPU count
  ephemeral-storage:  125293548Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             33554432Ki  <-------------------------------------- updated memory
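A quicker check than reading the full describe output is to query the capacity fields directly with a jsonpath query (a sketch):

```shell
# Print the CPU and memory capacity the kubelet reports for a node.
node_capacity() {
  oc get node "$1" \
    -o jsonpath='{.status.capacity.cpu}{" cpu, "}{.status.capacity.memory}{" memory"}{"\n"}'
}
```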
  11. Scale up the previously scaled-down mon and OSD deployments:
$ oc scale deployment <deployment-name> <deployment-name> <deployment-name> --replicas=1 -n openshift-storage
  12. Monitor Ceph:

WARNING: Do not shut down the next node until all PGs are active+clean; scrubbing states are OK.

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status

    health: HEALTH_WARN
            noout,nobackfill,norebalance flag(s) set
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 5h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
         flags noout,nobackfill,norebalance
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 170 pgs
    objects: 8.78k objects, 30 GiB
    usage:   90 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     170 active+clean  <------------------------------------------------ PGs need to be active+clean, scrubbing is ok as well
 
  io:
    client:   11 KiB/s rd, 3.7 MiB/s wr, 10 op/s rd, 8 op/s wr
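Per the warning above, the gate between nodes is that every PG is active+clean (scrubbing variants included). That check can be sketched as a poll loop; the list of "unsafe" PG states below is an assumption covering the common recovery states, not an exhaustive one.

```shell
# Succeed only when no PG is in a recovery-related state; reads `ceph pg stat`
# output on stdin. Scrubbing variants of active+clean are acceptable.
pgs_clean() {
  ! grep -Eq 'degraded|undersized|peering|remapped|backfill|recovery|incomplete|stale|down'
}

# Poll until the cluster is safe for the next node shutdown (interval arbitrary).
wait_for_clean_pgs() {
  until oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph pg stat | pgs_clean; do
    sleep 30
  done
}
```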
  13. Repeat steps 3-12 until resources have been increased on all desired nodes, then remove the flags and check Ceph:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset nobackfill
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset norebalance
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset noout

$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 5h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 170 pgs
    objects: 9.07k objects, 31 GiB
    usage:   93 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     170 active+clean <-----------------------------------------------
 
  io:
    client:   2.2 KiB/s rd, 744 KiB/s wr, 2 op/s rd, 4 op/s wr

Root Cause

When storage is expanded or scaled up (additional storage devices added), or when the architecture requirements are not met, pods can be left in a Pending state because no node can satisfy the resource requests needed for scheduling.

Diagnostic Steps

$ oc describe node <node-name>

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                11459m (99%)   11612m (100%) <----- maxed out
  memory             28163Mi (91%)  31360Mi (101%) <---- need more


$ oc get pods -n openshift-storage | grep Pending
noobaa-core-0                                  0/1     Pending     0          30m <---
noobaa-endpoint-fc58956d6-n4xdx                0/1     Pending     0          30m <---
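The grep above can miss pods or match unrelated lines; a server-side field selector lists only the Pending pods (a sketch; the namespace defaults to openshift-storage):

```shell
# List only the Pending pods in a namespace (defaults to openshift-storage).
pending_pods() {
  oc get pods -n "${1:-openshift-storage}" --field-selector=status.phase=Pending
}
```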

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.