rook-ceph-osd-X Pod Stuck in CrashLoopBackOff/init after Node Reboot/OCP Upgrade monclient(hunting) - OpenShift Data Foundation
Environment
Red Hat OpenShift Container Platform (RHOCP) v4.x
Red Hat OpenShift Data Foundation (RHODF) v4.x
Red Hat OpenShift Container Storage (RHOCS) v4.x
Issue
In situations where the Local Storage Operator (LSO) cannot use /dev/disk/by-id/ symlinks, users will sometimes incorrectly configure their localvolume/localvolumediscovery objects to use the non-persistent /dev/sdN or /dev/vdN device files. When this occurs, the path in the PV will reflect /dev/sdX or /dev/vdX. This causes issues when the device file names change (e.g. /dev/sda changes to /dev/sdb) during a node reboot or an OCP upgrade, after which the PV can no longer communicate with the proper symlink located in the /mnt/local-storage/<storageclass>/ directory on the node.
If you are on vSphere and disk.EnableUUID is not set to TRUE, running $ ls -l /dev/disk/by-id will NOT display any UUIDs. When this is the case, please refer to the KCS article How to check and set the disk.EnableUUID parameter from VM in vSphere for OpenShift Container Platform.
2024-06-21T17:10:24.362235635Z debug 2024-06-21T17:10:24.361+0000 7fc951f22700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
Related Articles:
Disk device (/dev/sda - /dev/sdb - /dev/sdc - /dev/sdd) changes after each node rebooting and impact pods using Local Storage Operator
Resolution
Preface:
It's important to understand that this is a workflow. Depending on the exact circumstances, there could be one OSD down, or many. If many OSDs are down, safeguards such as a pod disruption budget or flags set by the rook-ceph-operator (nobackfill, norecover, etc.) may be in place to protect data integrity.
In this workflow, each node will eventually have to be shut down one at a time (while monitoring Ceph) to edit the VM. It is therefore important to first get the OSDs up and running, allow the rebalance to complete and Ceph to return HEALTH_OK with all PGs active+clean, and only then fix the UUIDs/symlinks on each node.
Warning: Work one node at a time. DO NOT proceed to the next node until ALL OSDs on the previous node are up and running. Losing too many OSDs could jeopardize data integrity (data loss).
- With a notepad tool, run the following commands and capture the output; begin sorting OSD IDs, PVs, PVCs, hostnames, and the paths of the devices.
$ oc get pod -n openshift-storage -o 'custom-columns=NAME:.metadata.labels.ceph-osd-id,PVCNAME:.spec.volumes[*].persistentVolumeClaim' | grep -v none
Example:
NAME PVCNAME
0 map[claimName:ocs-deviceset-0-data-089lnk]
1 map[claimName:ocs-deviceset-0-data-1sr5dx]
2 map[claimName:ocs-deviceset-1-data-0nk5m5]
$ oc get pv -o 'custom-columns=HOST-NAME:.metadata.labels.kubernetes\.io/hostname,PVCNAME:.spec.claimRef.name,SYMLINK:.spec.local.path,DEV-NAME:.metadata.annotations.storage\.openshift\.com/device-name' | grep -v none
Example:
HOST-NAME PVCNAME SYMLINK DEV-NAME
node.1.com ocs-deviceset-0-data-1sr5dx /mnt/local-storage/localblock/sdb sdb
node.2.com ocs-deviceset-0-data-089lnk /mnt/local-storage/localblock/sdb sdb
node.3.com ocs-deviceset-1-data-0nk5m5 /mnt/local-storage/localblock/sdb sdb
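The two outputs above can be cross-referenced by joining on the PVC name to get a single OSD-to-host-to-symlink table. A minimal sketch, using sample files populated with the example data above (in practice, you would save the live `oc get` output to these files instead):

```shell
#!/usr/bin/env bash
# Sketch: join OSD->PVC output with PV host/path output on the PVC name.
# Sample data is copied from the example outputs in this article.
cat > osd_pvc.txt <<'EOF'
0 ocs-deviceset-0-data-089lnk
1 ocs-deviceset-0-data-1sr5dx
2 ocs-deviceset-1-data-0nk5m5
EOF

cat > pv_info.txt <<'EOF'
node.1.com ocs-deviceset-0-data-1sr5dx /mnt/local-storage/localblock/sdb sdb
node.2.com ocs-deviceset-0-data-089lnk /mnt/local-storage/localblock/sdb sdb
node.3.com ocs-deviceset-1-data-0nk5m5 /mnt/local-storage/localblock/sdb sdb
EOF

# First pass records the OSD id per PVC; second pass prints osd, host, symlink.
awk 'NR==FNR { osd[$2]=$1; next } { printf "osd.%s %s %s\n", osd[$2], $1, $3 }' \
  osd_pvc.txt pv_info.txt
# Prints, e.g.:
#   osd.1 node.1.com /mnt/local-storage/localblock/sdb
#   osd.0 node.2.com /mnt/local-storage/localblock/sdb
#   osd.2 node.3.com /mnt/local-storage/localblock/sdb
```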
- Set up two CLI Terminal windows for commands and monitoring.
NOTE: Depending on how many OSDs are provisioned on the node, this may be a trial-and-error task: a symlink command is issued to correct the symlink, followed by a pod deletion to test whether the correct symlink is in place. It is also helpful to have three CLI terminal windows open (one debugged into the current node, one to run oc commands, and one to monitor Ceph). To run Ceph commands, see the Configuring the Rook-Ceph Toolbox in OpenShift Data Foundation 4.x solution article.
Commands to monitor Ceph once rsh'd into the rook-ceph-tools pod:
$ ceph status      #<------------- monitor PGs and OSDs
$ ceph osd tree    #<------------- better look at the OSDs/hosts
Note: In a scenario where some OSDs are up and some are down (shown in the outputs of the commands above), we can determine that the OSDs that are up/in have the CORRECT path in their PV; those devices did not shift names.
Example of an OSD that is up/running:
$ oc get pods -n openshift-storage | grep rook-ceph-osd | grep Running
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-567869c6c4-5b56t 2/2 Running 0 76m
$ oc describe pod -n openshift-storage rook-ceph-osd-0-567869c6c4-5b56t | grep -i claim
ClaimName: ocs-deviceset-0-data-089lnk
$ oc get pv | grep ocs-deviceset-0-data-089lnk
local-pv-9a68ad40 1TiB RWO Delete Bound openshift-storage/ocs-deviceset-0-data-089lnk localblock
$ oc describe pv local-pv-9a68ad40 | grep -i path
Path: /mnt/local-storage/localblock/sdc <--- WE CAN RULE THIS OUT. THE OSD IS UP/RUNNING. THE SYMLINK TO /dev/sdc IS CORRECT (DON'T TOUCH).
- Debug into the first node and correct the symlink(s) on the first node.
$ oc debug node/node.1.com
$ chroot /host
- Depending on how many devices are on the host, this may be easier or more difficult; a good command to run to at least start distinguishing the devices is:
$ lsblk
Note: OSD devices will likely be the devices with no underlying/nested partitions and will usually (not always) be larger devices such as 500GiB, 1TiB, 2TiB, etc.
Example:
$ lsblk
sda 259:5 0 1T 0 disk <------ device shifted from sdb to sda
- Once it's been determined where the device went/may have gone, navigate to the LSO symlink folder.
$ cd /mnt/local-storage/<storageclass-name>/
$ ls -ltr
lrwxrwxrwx. 1 root root 68 Apr 26 18:32 sdb -> /dev/sdb <------------- symlink
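A quick way to see which symlinks in the LSO directory have gone stale is to test whether each link still resolves to an existing target. A minimal sketch, demonstrated against a throwaway directory with hypothetical names (on the node you would set dir=/mnt/local-storage/<storageclass-name> and skip the setup lines):

```shell
#!/usr/bin/env bash
# Sketch: flag dangling symlinks in an LSO storageclass directory.
# Demo setup: one healthy link and one whose target vanished (device shifted).
dir="$(mktemp -d)"
touch "$dir/devA"                 # stands in for a real /dev node
ln -s "$dir/devA" "$dir/sdc"      # healthy symlink
ln -s "$dir/devGONE" "$dir/sdb"   # target gone: this is the shifted device

for link in "$dir"/*; do
  [ -L "$link" ] || continue      # only inspect symlinks
  if [ -e "$link" ]; then
    echo "OK       $(basename "$link") -> $(readlink "$link")"
  else
    echo "DANGLING $(basename "$link") -> $(readlink "$link")"
  fi
done
```

Any link reported as DANGLING points at a device file that no longer exists and is a candidate for the symlink correction below.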
- Now that we have confirmed the device shifted to /dev/sda in our example above, attempt to correct the symlink.
$ cd /mnt/local-storage/<storageclass-name>/
$ ln -sf /dev/sda sdb
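Before deleting the pod, it is worth confirming the link now points where intended. A minimal sketch of the same `ln -sf` pattern in a throwaway directory (on the node, the target would be the real /dev/sda and the link the real symlink name):

```shell
#!/usr/bin/env bash
# Sketch: re-point an existing symlink in place and verify the new target.
dir="$(mktemp -d)"; cd "$dir"
ln -s /dev/old-name sdb    # stale link, as left behind after the device shift
ln -sf /dev/sda sdb        # -f replaces the existing link atomically in place
readlink sdb               # prints: /dev/sda
```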
- Delete the pod.
$ oc delete pod -n openshift-storage rook-ceph-osd-X-<pod-name>
NOTE: If the pod comes up Running, the symlink correction was successful. However, as discussed previously, if there are many OSDs on the host this may become a trial-and-error process. If so, repeat steps 4-7 until all OSDs on the node are up.
- Once all OSDs are up/running on the first node, proceed to the next node and repeat steps 3-7 for the remaining nodes, only proceeding to the next node when all OSDs on the node currently being worked are up/Running.
- After all OSDs on all nodes are running, monitor the rebalance in Ceph and do not bring down any node to enable UUIDs until all PGs are active+clean.
$ oc rsh -n openshift-storage rook-ceph-tools-<pod-name>
$ ceph status
health: HEALTH_OK <------------------------------------- Needs to be before bringing down the first node
services:
mon: 3 daemons, quorum a,b,c (age 84m)
mgr: b(active, since 84m), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 6 osds: 6 up (since 84m), 6 in (since 84m)
data:
volumes: 1/1 healthy
pools: 4 pools, 209 pgs
objects: 99 objects, 154 MiB
usage: 421 MiB used, 600 GiB / 600 GiB avail
pgs: 209 active+clean <---------------------------- Needs to be before bringing down the first node
io:
client: 1.2 KiB/s rd, 2.7 KiB/s wr, 2 op/s rd, 0 op/s wr
- Once rebalancing is finished (Step 9), begin enabling disk.EnableUUID=TRUE and uuid.action=keep on each host, one at a time, and only proceed to the next node once Ceph again reflects the example in Step 9. To gracefully bring down a storage node, follow the steps to scale down any mon/OSD on the node being worked on. Additionally, delete any noobaa pod on that node once it has been cordoned. See the How to safely reboot an OCS/ODF 4 node solution for more info. Lastly, DO NOT forget to scale the mons/OSDs back up once the node has been uncordoned and is in Ready status.
- The following documentation will assist in this process (since this is infrastructure-specific, relevant VMware docs are provided; however, it is up to the user to pursue the most up-to-date documentation for their virtual environment):
Use of VMware VMotion with OpenShift Container Storage / OpenShift Data Foundation
- Once UUIDs have been enabled ON THE FIRST NODE (devices may shift again), you can now run $ ls -l /dev/disk/by-id on the node and begin seeing UUIDs. Both scsi and wwn UUIDs should be present. scsi IDs were used frequently in prior versions of LSO, but LSO now defaults to wwn UUIDs; although either will work, use the wwn UUIDs when fixing symlinks.
- To begin correcting the symlinks to the UUIDs: this may again be a trial-and-error process; however, a helpful Ceph command that may reveal which UUID is associated with which OSD is the following:
$ oc rsh -n openshift-storage rook-ceph-tools-<pod-name>
Example for osd.0. Change ID number based on desired OSD:
$ ceph osd metadata 0
<omitted-for-space>
"device_ids": "wwn-0x6000c29a28e6d9e51bf43c386e40d6b9", <---- does this have a UUID? If not continue with trial/error (steps 3-7)
- Correct the symlink.
Example:
$ oc debug node/node.1.com
$ chroot /host
$ lsblk
$ ls -l /dev/disk/by-id
$ cd /mnt/local-storage/<storageclass-name>/
$ ln -sf /dev/disk/by-id/wwn-0x6000c29a28e6d9e51bf43c386e40d6b9 sdb
- Delete the pod.
$ oc delete pod -n openshift-storage rook-ceph-osd-X-<pod-name>
NOTE: If the pod comes up Running, the final symlink correction was successful. Once all OSD symlinks on the node are corrected, the node will finally be able to survive reboots. Repeat on the remaining nodes while monitoring Ceph.
Root Cause
Without disk.EnableUUID set in a virtualized environment, LSO falls back to non-persistent device names, for example /dev/sdb, /dev/sdc, etc., which can shift upon node reboot.
Diagnostic Steps
$ oc get pods -n openshift-storage | grep rook-ceph-osd
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-bd9dd975f-bt2k6 2/2 Running 6 99d
rook-ceph-osd-1-64d4b849f4-llmhz 1/2 CrashLoopBackOff 13 2h
rook-ceph-osd-2-77c6965975-sdx5b 2/2 Running 0 46m
rook-ceph-osd-3-67665f9d4-h9d75 1/2 CrashLoopBackOff 15 2h
rook-ceph-osd-4-f985555f9-nf2pg 2/2 Running 4 99d
rook-ceph-osd-5-5748b4cdc-tqvmh 2/2 Running 4 99d
$ oc logs -n openshift-storage -f rook-ceph-osd-1-64d4b849f4-llmhz
2024-06-21T17:10:24.362235635Z debug 2024-06-21T17:10:24.361+0000 7fc951f22700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
$ oc describe pod -n openshift-storage rook-ceph-osd-1-64d4b849f4-llmhz | grep -i claim
ClaimName: ocs-deviceset-0-data-089lnk
$ oc get pv | grep ocs-deviceset-0-data-089lnk
local-pv-9a68ad40 1TiB RWO Delete Bound openshift-storage/ocs-deviceset-0-data-089lnk localblock
$ oc describe pv local-pv-9a68ad40 | grep -i path
Path: /mnt/local-storage/localblock/sdc <------------------ Not UUID
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.