How to Redeploy rook-ceph-mon Resources in OpenShift Data Foundation (ODF)
Environment
- Red Hat OpenShift Container Platform (RHOCP) v4.x
- Red Hat OpenShift Data Foundation (RHODF) v4.x
Issue
There are instances where the rook-ceph-mon resources must be redeployed, such as an OSD migration to a new datastore/storageclass with PVC-backed rook-ceph-mon resources. For OSD migrations to a new datastore/storageclass where the old storageclass/datastore will be removed, redeploying the mons is a mandatory step: removing the old storageclass while the mons are still backed by that storageclass can cause data loss.
Additionally, there may be inconsistencies between the rook-ceph-mon placement (scheduled nodes) and the rook-ceph-mon-endpoints configmap, where a redeployment of the rook-ceph-mon resource is needed.
Accomplishing the redeployment is safe and straightforward as long as the warning below is followed, as this process only scales a rook-ceph-mon resource down long enough for Rook to reconcile the discrepancy.
Resolution
NOTE: The rook-ceph-operator pod MUST be running for the steps below to be successful.
**WARNING:** This solution must begin with all three rook-ceph-mon resources in quorum. Perform a rook-ceph-mon migration ONE MON AT A TIME, moving to the next mon ONLY after the replacement mon has joined quorum.
NOTE: ONLY for users who have migrated the OSD datastore/storageclass with PVC-backed rook-ceph-mon pods: ensure the NEW storageclass is now shown in the dataPVCTemplate of the StorageCluster CR and/or is now reflected in the CephCluster CR:
$ oc get storagecluster -n openshift-storage ocs-storagecluster -o yaml | grep -A10 dataPVCTemplate
$ oc get cephcluster -n openshift-storage ocs-storagecluster-cephcluster -o yaml | grep storageClassName
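The same check can be scripted instead of grepping the full YAML. A minimal sketch; the jsonpath assumes the mons are PVC-backed via the standard Rook `spec.mon.volumeClaimTemplate` field, so verify against your CR if the output is empty:

```shell
# Print the storage class configured for the mon volume claim template in the
# CephCluster CR. An empty result means the mons are not PVC-backed or the
# field lives elsewhere in your CR version.
mon_storageclass() {
  oc get cephcluster -n openshift-storage ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.mon.volumeClaimTemplate.spec.storageClassName}'
}
```

Compare the printed value against the name of the new storageclass before proceeding.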
PROCEDURE:
- Validate all three (or more) mons are in quorum:
$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph status -c /var/lib/rook/openshift-storage/openshift-storage.config
health: HEALTH_OK <------------------------ Ceph is Healthy
services:
mon: 3 daemons, quorum a,c,d (age 41s) <--- ALL THREE MONS ARE IN QUORUM, SAFE TO MIGRATE
mgr: a(active, since 5s), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 44h), 3 in (since 44h)
data:
volumes: 1/1 healthy
pools: 4 pools, 256 pgs
objects: 578 objects, 1.8 GiB
usage: 6.5 GiB used, 293 GiB / 300 GiB avail
pgs: 256 active+clean
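The quorum count can also be checked programmatically rather than by eye. A sketch that parses the `mon:` line of the plain `ceph status` output shown above (the helper name is an assumption):

```shell
# Count the mons listed after "quorum" on the `mon:` line of `ceph status`
# output, e.g. "mon: 3 daemons, quorum a,c,d (age 41s)" -> 3.
quorum_count() {
  awk '/mon:/ { for (i = 1; i <= NF; i++) if ($i == "quorum") print split($(i+1), m, ",") }'
}

# Example against the sample status line above; in the cluster, pipe the real
# `ceph status` output through the helper instead.
printf '%s\n' 'mon: 3 daemons, quorum a,c,d (age 41s)' | quorum_count   # prints 3
```

If the count is lower than the number of configured mons, do not begin (or continue) the migration.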
- Backup the current rook-ceph-mon-endpoints configmap and any affected mon deployment(s):
$ oc get cm -n openshift-storage rook-ceph-mon-endpoints -o yaml > rook-ceph-mon-endpoints.yaml
$ oc get deployment -n openshift-storage rook-ceph-mon-<X> -o yaml > rook-ceph-mon-<X>.yaml
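When several mons will be migrated, all of the backups can be taken in one pass. A sketch; the output file names are an assumption, so adjust them to your own conventions:

```shell
# Back up the endpoints configmap and every rook-ceph-mon deployment at once.
backup_mons() {
  oc get cm -n openshift-storage rook-ceph-mon-endpoints -o yaml > rook-ceph-mon-endpoints.yaml
  for dep in $(oc get deployment -n openshift-storage -l app=rook-ceph-mon -o name); do
    # $dep looks like "deployment.apps/rook-ceph-mon-a"; strip the prefix for the file name.
    oc get "$dep" -n openshift-storage -o yaml > "${dep##*/}.yaml"
  done
}
```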
- Scale the first affected rook-ceph-mon down:
$ oc scale deployment -n openshift-storage rook-ceph-mon-<X> --replicas=0
- Wait approximately 10-15 minutes; the rook-ceph-operator will deploy a new mon, likely using a new/different letter, and will remove the old deployment once the replacement mon joins quorum:
$ oc get deployment -n openshift-storage -l app=rook-ceph-mon
NAME READY UP-TO-DATE AVAILABLE AGE
rook-ceph-mon-a 1/1 1 1 62m
rook-ceph-mon-c 0/0 0 0 44h
rook-ceph-mon-d 1/1 1 1 22h
$ oc get deployment -n openshift-storage -l app=rook-ceph-mon
NAME READY UP-TO-DATE AVAILABLE AGE
rook-ceph-mon-a 1/1 1 1 67m
rook-ceph-mon-c 0/0 0 0 44h <--- old mon
rook-ceph-mon-d 1/1 1 1 22h
rook-ceph-mon-e 0/1 1 0 13s <--- new mon
$ oc get deployment -n openshift-storage -l app=rook-ceph-mon
NAME READY UP-TO-DATE AVAILABLE AGE
rook-ceph-mon-a 1/1 1 1 67m
rook-ceph-mon-d 1/1 1 1 22h
rook-ceph-mon-e 1/1 1 1 30s <---- Complete
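Rather than re-running `oc get deployment` by hand, the wait can be scripted as a simple poll. A sketch; the 15-minute timeout and 10-second interval are assumptions, and the mon letter (e.g. `c`) is supplied by the user:

```shell
# Poll until the scaled-down mon deployment has been removed by the operator,
# or give up after roughly 15 minutes.
wait_for_removal() {
  old_mon=$1
  for _ in $(seq 1 90); do   # 90 iterations x 10s = 15 minutes
    if ! oc get deployment -n openshift-storage "rook-ceph-mon-${old_mon}" >/dev/null 2>&1; then
      echo "rook-ceph-mon-${old_mon} removed"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for rook-ceph-mon-${old_mon} removal" >&2
  return 1
}
```

Usage: `wait_for_removal c`, then run the verification step below before touching the next mon.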
- Verification steps before moving to the next rook-ceph-mon migration (if applicable):
$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph status -c /var/lib/rook/openshift-storage/openshift-storage.config
health: HEALTH_OK <------------------------ Ceph is Healthy
services:
mon: 3 daemons, quorum a,d,e (age 41s) <--- ALL THREE MONS ARE IN QUORUM, SAFE TO MIGRATE NEXT MON
mgr: a(active, since 5s), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 44h), 3 in (since 44h)
data:
volumes: 1/1 healthy
pools: 4 pools, 256 pgs
objects: 578 objects, 1.8 GiB
usage: 6.5 GiB used, 293 GiB / 300 GiB avail
pgs: 256 active+clean
- Repeat the scale-down, wait, and verification steps above for each additional rook-ceph-mon resource that needs to be redeployed.
NOTE: ONLY for users who have migrated the OSD datastore/storageclass with PVC-backed rook-ceph-mon pods: ensure the NEW storageclass is now reflected on the rook-ceph-mon PVCs:
$ oc get pvc -n openshift-storage | grep rook-ceph-mon
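This check can be made strict so no mon PVC is missed. A sketch; the new storageclass name is passed in by the user, and the STORAGECLASS column position (field 6 in default `oc get pvc` output) is an assumption that may shift between client versions:

```shell
# Exit non-zero if any rook-ceph-mon PVC still references a storage class
# other than the expected new one, printing the offending PVCs.
check_mon_pvcs() {
  new_sc=$1
  oc get pvc -n openshift-storage | awk -v sc="$new_sc" \
    '/rook-ceph-mon/ && $6 != sc { print $1 " still on " $6; bad = 1 } END { exit bad }'
}
```

Usage: `check_mon_pvcs <new-storageclass>`; a silent exit 0 means every mon PVC is on the new storageclass.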
Root Cause
The rook-ceph-mon resources must be redeployed so that PVC-backed rook-ceph-mon pods can still access their storage and maintain quorum.
In ODF versions bundled with Rook v1.15 or later, a change was introduced in the ClusterController logic that strictly enforces nodeAffinity and podAntiAffinity, whereas previous versions were more permissive of placement violations. This can cause the rook-ceph-mon resources to enter a down state if the discrepancy is not resolved prior to an upgrade.
Diagnostic Steps
Ensure the node column on the pods matches the data/mapping in the rook-ceph-mon-endpoints configmap:
NOTE: For PVC-backed monitors, mapping will reflect null.
$ oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide
$ oc get cm -n openshift-storage rook-ceph-mon-endpoints -o yaml
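The two outputs can be pulled in a more directly comparable form. A sketch; the `mon` pod label and the `data.mapping` configmap key reflect typical Rook deployments and should be verified against your cluster:

```shell
# Show where each mon pod is scheduled next to the node mapping recorded in
# the rook-ceph-mon-endpoints configmap (for PVC-backed mons the mapping
# entries are null, per the note above).
diag_mon_placement() {
  oc get pods -n openshift-storage -l app=rook-ceph-mon \
    -o custom-columns='MON:.metadata.labels.mon,NODE:.spec.nodeName'
  oc get cm -n openshift-storage rook-ceph-mon-endpoints \
    -o jsonpath='{.data.mapping}{"\n"}'
}
```

Any mon whose scheduled node disagrees with its mapping entry is a candidate for the redeployment procedure above.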
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.