How to Safely Increase Resources on OpenShift Data Foundation (ODF) Storage Nodes
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x
Red Hat Ceph Storage (RHCS) 8.x
Issue
- ODF storage nodes have specific architecture requirements, defined in section 7.3, "Resource requirements," of the Product Documentation.
- In some instances, especially after ODF storage devices/disks have been scaled up (added), users may see pods stuck in a Pending status that cannot be scheduled. This is because each ODF device added to a storage node requests an additional 2 CPU and 5 GiB of memory from the node, regardless of the device size.
- Although there are many different infrastructure types (AWS, VMware, Bare Metal, Azure, etc.), the process is typically the same. Generic terminology is used in this solution; infrastructure administrators will need to fill in the gaps for the specific platform being administered.
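As a rough capacity check, the per-device overhead described above can be multiplied out for a planned device count. A minimal sketch (the 2 CPU / 5 GiB per-device figures come from this article; the device count below is only an example):

```shell
#!/bin/sh
# Each ODF device on a storage node adds a request of 2 CPU and 5 GiB of
# memory, regardless of device size. The device count is an example value.
devices_per_node=3
osd_cpu_request=$((devices_per_node * 2))
osd_mem_request_gib=$((devices_per_node * 5))
echo "OSD requests per node: ${osd_cpu_request} CPU, ${osd_mem_request_gib} GiB memory"
```

Compare the result against the node's allocatable CPU/memory (visible in `oc describe node`) to see whether additional devices will fit.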
Resolution
Prerequisite: For ODF storage nodes, please ensure the rook-ceph-tools pod is enabled by following the Configuring the Rook-Ceph Toolbox in OpenShift Data Foundation 4.x solution.
WARNING: When performing multi-host/node shutdowns/reboots, it is extremely important to view the status of Ceph before and after each host shutdown/reboot, and to ONLY proceed to the next host when all PGs are active+clean (scrubbing is ok).
- Capture the Status of Ceph:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 2d)
mgr: a(active, since 4h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 170 pgs
objects: 8.22k objects, 28 GiB
usage: 84 GiB used, 1.4 TiB / 1.5 TiB avail
pgs: 170 active+clean <------------------------------------------------ PGs need to be active+clean, scrubbing is ok as well
io:
client: 1.5 KiB/s rd, 946 KiB/s wr, 2 op/s rd, 4 op/s wr
NOTE: Ceph always keeps replicated copies of data spread across storage nodes (3 copies is the typical default). If a host is shut down in a cluster with more than three disks, Ceph will begin moving the data elsewhere. During this process, the PGs will reflect active+undersized+degraded, which means the third copy of the data is missing. Because the storage hosts are only being brought down temporarily to make the resource change, we want to avoid this "self-healing" process (backfilling, remapping, etc.). Once the host is brought back up, Ceph would recalculate the data placement and begin moving data back to the original host, so letting recovery start only wastes time and I/O. To prevent that, flags are temporarily put in place to stop data movement between host shutdowns. This is perfectly safe as long as one host is worked on at a time and Ceph health is inspected between shutdowns.
- Set the following Ceph flags to avoid PG movement:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set nobackfill
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set norebalance
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set noout
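The three commands above can also be generated from a single loop. A dry-run sketch (the commands are printed for review rather than executed; swap `set` for `unset` to produce the matching cleanup commands used at the end of this procedure):

```shell
#!/bin/sh
# Dry run: build the flag-setting commands so they can be reviewed before
# being executed against a live cluster.
flag_cmds=$(for flag in nobackfill norebalance noout; do
  echo "oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd set $flag"
done)
printf '%s\n' "$flag_cmds"
```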
- Cordon the node to be shut down:
$ oc adm cordon <node-name>
- After the node has been cordoned, find the OSDs and mons on the host to be shut down, and scale down those resources:
$ oc get pods -o wide -n openshift-storage | grep <node-name> | egrep 'osd|mon'
$ oc get deployment -n openshift-storage
$ oc scale deployment <deployment-name> <deployment-name> <deployment-name> --replicas=0 -n openshift-storage
NOTE: Some PGs should now reflect undersized+degraded.
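The scale-down of a node's OSD/mon deployments can be scripted in the same dry-run style. A sketch; the deployment names below are placeholders for illustration, and on a real cluster they come from the `oc get pods`/`oc get deployment` output above:

```shell
#!/bin/sh
# Placeholder deployment names; on a real cluster derive them from:
#   oc get pods -o wide -n openshift-storage | grep <node-name> | egrep 'osd|mon'
node_deployments="rook-ceph-osd-0 rook-ceph-mon-a"
scale_cmds=$(for d in $node_deployments; do
  echo "oc scale deployment $d --replicas=0 -n openshift-storage"
done)
printf '%s\n' "$scale_cmds"   # review the commands, then run them
```

Keeping the same list of names makes the later scale-up step trivial: reuse it with `--replicas=1`.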
- Search for any NooBaa pod scheduled on the node to be shut down, and gracefully delete it so it can be rescheduled elsewhere:
NOTE: If there are no NooBaa pods scheduled on the node, skip to step 6; if there is one, delete it. If there are multiple, delete the NooBaa pods in the following order:
- noobaa-operator
- noobaa-core
- noobaa-endpoint
- noobaa-db
$ oc get pods -o wide -n openshift-storage | grep <node-name>
$ oc delete pod <pod> -n openshift-storage
- Drain the node:
$ oc adm drain <node-name> --delete-emptydir-data --grace-period=1 --ignore-daemonsets
- Once the node drain message shows the drain was successful, shut down the node/host (there are many ways to do this; below is an example):
$ oc debug node/<node-name>
$ chroot /host
$ shutdown now
- When the node is in a NotReady,SchedulingDisabled state, increase the resources on the host. For example:
- Change the AWS/Azure Instance.
- Edit the CPU/Memory of the VM in the VMware console.
- Add hardware to Bare Metal.
- Power on the node. Once the node becomes Ready,SchedulingDisabled, uncordon it:
$ oc adm uncordon <node-name>
- Describe the node to validate that the changes are updated on the node:
$ oc describe node <node-name>
Capacity:
cpu: 12 <------------------------------------------------------
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 33554432Ki <------------------------------------
- Scale up the previously scaled-down mon and OSD deployments:
$ oc scale deployment <deployment-name> <deployment-name> <deployment-name> --replicas=1 -n openshift-storage
- Monitor Ceph:
WARNING: Do not shut down the next node until all PGs are active+clean (scrubbing is ok).
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status
health: HEALTH_WARN
noout,nobackfill,norebalance flag(s) set
services:
mon: 3 daemons, quorum a,b,c (age 2d)
mgr: a(active, since 5h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
flags noout,nobackfill,norebalance
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 170 pgs
objects: 8.78k objects, 30 GiB
usage: 90 GiB used, 1.4 TiB / 1.5 TiB avail
pgs: 170 active+clean <------------------------------------------------ PGs need to be active+clean, scrubbing is ok as well
io:
client: 11 KiB/s rd, 3.7 MiB/s wr, 10 op/s rd, 8 op/s wr
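The "all PGs active+clean before the next node" gate can also be checked by parsing the ceph status output. A sketch, with sample text standing in for a live `oc exec ... ceph status` call (states such as active+clean+scrubbing still match, which is acceptable per the warning above):

```shell
#!/bin/sh
# Succeeds only when the active+clean PG count equals the total PG count.
# $1 is the full text of `ceph status`.
pgs_all_clean() {
  total=$(printf '%s\n' "$1" | awk '/pools,/ {print $(NF-1)}')
  clean=$(printf '%s\n' "$1" | awk '/active\+clean/ {s += $(NF-1)} END {print s + 0}')
  [ -n "$total" ] && [ "$total" -eq "$clean" ]
}

# Sample taken from the status output above; on a live cluster use:
#   status=$(oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status)
status='  pools:   12 pools, 170 pgs
  pgs:     170 active+clean'
if pgs_all_clean "$status"; then
  echo "All PGs active+clean: safe to proceed to the next node"
else
  echo "PGs not clean yet: wait before touching the next node"
fi
```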
- Repeat steps 3-12 until all desired resources have been upgraded, then remove the flags and check Ceph:
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset nobackfill
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset norebalance
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph osd unset noout
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph status
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 2d)
mgr: a(active, since 5h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 170 pgs
objects: 9.07k objects, 31 GiB
usage: 93 GiB used, 1.4 TiB / 1.5 TiB avail
pgs: 170 active+clean <-----------------------------------------------
io:
client: 2.2 KiB/s rd, 744 KiB/s wr, 2 op/s rd, 4 op/s wr
Root Cause
When storage devices are added (scaling up), or when the architecture requirements are not met, the node can run out of allocatable CPU and memory. Pods are then left in a Pending state because the scheduler cannot satisfy their resource requests on any node.
Diagnostic Steps
$ oc describe node <node-name>
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 11459m (99%) 11612m (100%) <----- maxed out
memory 28163Mi (91%) 31360Mi (101%) <---- need more
$ oc get pods -n openshift-storage | grep Pending
noobaa-core-0 0/1 Pending 0 30m <---
noobaa-endpoint-fc58956d6-n4xdx 0/1 Pending 0 30m <---
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.