OpenShift Data Foundation Disaster Recovery misconfigured after upgrade from ODF v4.17.z to ODF v4.18.0

Solution Verified - Updated

Environment

  • ODF Multicluster Orchestrator v4.18.0
  • OpenShift DR Hub Operator v4.18.0
  • OpenShift Data Foundation v4.18.0 - Internal Mode only
  • Regional-DR for Ceph RBD-based StorageClasses

Issue

When ODF Multicluster Orchestrator and OpenShift DR Hub Operator are upgraded from 4.17.z to 4.18.0, some of the Disaster Recovery resources are misconfigured in internal mode ODF deployments. This impacts Disaster Recovery of workloads that use the "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-ceph-rbd-virtualization" StorageClasses.

Resolution

  1. On the ACM hub cluster, wait for the "odfmo-controller-manager" deployment in the "openshift-operators" namespace to be upgraded and running.
$ oc get deploy -n openshift-operators -l control-plane=odfmo-controller-manager

$ oc get po -n openshift-operators -l control-plane=odfmo-controller-manager
  2. On the ACM ManagedClusters, wait for the "token-exchange-agent" deployment in the "openshift-storage" namespace to be upgraded and running.
$ oc get deploy -n openshift-storage -l app=token-exchange-agent

$ oc get pod -n openshift-storage -l app=token-exchange-agent
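The two wait steps above can also be scripted. Below is a minimal polling sketch; `wait_until` is a hypothetical helper, and on a real cluster `oc rollout status deploy/<name>` achieves the same thing natively.

```shell
# wait_until: re-run a command until it succeeds or the attempt budget is
# spent. A generic sketch of the "wait until upgraded and running" steps;
# on a real cluster the command would be something like:
#   wait_until 60 oc rollout status deploy/odfmo-controller-manager -n openshift-operators
wait_until() {
  attempts=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 2
  done
}

# Demo with a trivially succeeding command:
wait_until 5 true && echo "deployment ready"
```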
  3. Once all the upgrades are complete, restart the deployment from step 1.
$ oc scale deployment -n openshift-operators -l=control-plane=odfmo-controller-manager --replicas=0
deployment.apps/odfmo-controller-manager scaled

$ oc scale deployment -n openshift-operators -l=control-plane=odfmo-controller-manager --replicas=1
deployment.apps/odfmo-controller-manager scaled
  4. On the ACM hub, wait about 2 minutes for the deployment to reconcile the configuration.
$ oc get pods -n openshift-operators -l control-plane=odfmo-controller-manager
  5. On the ACM hub, get the peerClasses from the DRPolicy.
$ oc get drpolicy -ojsonpath='{.items[*].status.async.peerClasses}' | jq '.[]' | jq -s | yq -P
- clusterIDs:
    - 41a21f36-3dff-40c7-8350-eb7d3640222c
    - 09ea103c-58d1-4981-9289-06c350d94d4e
  storageClassName: ocs-storagecluster-cephfs
  storageID:
    - 350ae6729fc893fe4ae51447385839bd
    - c88ad8ac880c5f0bc064060ac6508a89
- clusterIDs:
    - 41a21f36-3dff-40c7-8350-eb7d3640222c
    - 09ea103c-58d1-4981-9289-06c350d94d4e
  replicationID: 8c092570dcaebf7ef780283fbf2bcd1a
  storageClassName: ocs-storagecluster-ceph-rbd
  storageID:
    - fc858d18de2dc749b718b6e88e291586
    - 75a882159e715101111dd54e23366a29
- clusterIDs:
    - 41a21f36-3dff-40c7-8350-eb7d3640222c
    - 09ea103c-58d1-4981-9289-06c350d94d4e
  replicationID: 8c092570dcaebf7ef780283fbf2bcd1a
  storageClassName: ocs-storagecluster-ceph-rbd-virtualization
  storageID:
    - fc858d18de2dc749b718b6e88e291586
    - 75a882159e715101111dd54e23366a29
  6. On the ACM hub, patch the manifestwork for each problematic VRG with the peerClasses from step 5.
$ oc get manifestwork -A | grep vrg-mw
c1k     app-busybox-cephfs-1-placement-drpc-app-busybox-cephfs-1-vrg-mw   5d14h
c1k     app-busybox-cephfs-2-placement-drpc-app-busybox-cephfs-2-vrg-mw   5d14h
c1k     app-busybox-rbd-2-placement-drpc-app-busybox-rbd-2-vrg-mw         5d14h
c1k     imp-app-1-openshift-dr-ops-vrg-mw                                 5d4h
c1k     imp-app-2-openshift-dr-ops-vrg-mw                                 5d15h
c1k     imp-app-3-6-7-openshift-dr-ops-vrg-mw                             5d4h
c1k     imp-app-4-5-openshift-dr-ops-vrg-mw                               5d4h
c2k     app-busybox-cephfs-1-placement-drpc-app-busybox-cephfs-1-vrg-mw   5d14h
c2k     app-busybox-cephfs-2-placement-drpc-app-busybox-cephfs-2-vrg-mw   5d14h
c2k     app-busybox-rbd-2-placement-drpc-app-busybox-rbd-2-vrg-mw         5d4h
c2k     imp-app-1-openshift-dr-ops-vrg-mw                                 5d15h
c2k     imp-app-2-openshift-dr-ops-vrg-mw                                 5d4h
c2k     imp-app-3-6-7-openshift-dr-ops-vrg-mw                             5d15h
c2k     imp-app-4-5-openshift-dr-ops-vrg-mw                               5d7h
$ oc edit manifestwork imp-app-2-openshift-dr-ops-vrg-mw -n c1k

Copy the value from step 5 and replace the value under "peerClasses:" in the manifestwork.
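As an alternative to `oc edit`, the patch can be constructed and applied non-interactively. This is only a sketch, not a verified procedure: the manifest index (`0`) and the location of `peerClasses` inside the VRG spec are assumptions that must be checked against the actual manifestwork before patching.

```shell
# Build a JSON patch that replaces the VRG's peerClasses inside the manifestwork.
# ASSUMPTIONS: the VRG is manifest 0 in spec.workload.manifests, and its
# peerClasses live under spec.async.peerClasses - verify both with:
#   oc get manifestwork imp-app-2-openshift-dr-ops-vrg-mw -n c1k -o yaml
peer_classes='[{"clusterIDs":["41a21f36-3dff-40c7-8350-eb7d3640222c","09ea103c-58d1-4981-9289-06c350d94d4e"],"replicationID":"8c092570dcaebf7ef780283fbf2bcd1a","storageClassName":"ocs-storagecluster-ceph-rbd","storageID":["fc858d18de2dc749b718b6e88e291586","75a882159e715101111dd54e23366a29"]}]'
patch="[{\"op\": \"replace\", \"path\": \"/spec/workload/manifests/0/spec/async/peerClasses\", \"value\": $peer_classes}]"
echo "$patch"
# Then apply it (sketch):
#   oc patch manifestwork imp-app-2-openshift-dr-ops-vrg-mw -n c1k --type=json -p "$patch"
```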

Root Cause

When ODF Multicluster Orchestrator is upgraded to v4.18.0, it creates deployments on the ACM hub and on the managed clusters. The deployment on the hub starts before the deployments on all managed clusters, but the hub needs updated data from the managed clusters to update the DR configuration. This race condition in the operator causes the DR configuration to be updated with stale data, and the configuration is not reconciled again even after fresh data becomes available.

Diagnostic Steps

  • On each ManagedCluster that has DR enabled, run the following commands and verify the output.
$ oc get volumereplicationclass -o yaml | grep -i storageid
ramendr.openshift.io/storageid: 7f68704cb3e5484e9a82d405ae95f968
ramendr.openshift.io/storageid: 7f68704cb3e5484e9a82d405ae95f968
ramendr.openshift.io/storageid: 7f68704cb3e5484e9a82d405ae95f968
  • For each distinct "storageid" in the VolumeReplicationClasses, there should be a corresponding "storageid" in the RBD StorageClasses.
$ oc get storageclass ocs-storagecluster-ceph-rbd ocs-storagecluster-ceph-rbd-virtualization -o yaml | grep -i storageid
ramendr.openshift.io/storageid: fc858d18de2dc749b718b6e88e291586
ramendr.openshift.io/storageid: fc858d18de2dc749b718b6e88e291586
  • Here the entries do not match, which means this cluster is impacted.
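The comparison above can be scripted. This is a minimal sketch: `compare_ids` is a hypothetical helper (not part of any Red Hat tooling), and the commented `oc` pipelines show how the ID lists would be captured on a real cluster.

```shell
# compare_ids: report whether two newline-separated ID lists contain the
# same set of unique values (hypothetical helper for the check above).
compare_ids() {
  a=$(printf '%s\n' "$1" | sort -u)
  b=$(printf '%s\n' "$2" | sort -u)
  if [ "$a" = "$b" ]; then
    echo "storageIDs match"
  else
    echo "storageID mismatch - cluster is impacted"
  fi
}

# On a real cluster the lists would come from:
#   vrc_ids=$(oc get volumereplicationclass -o yaml | grep -i storageid | awk '{print $NF}')
#   sc_ids=$(oc get storageclass ocs-storagecluster-ceph-rbd \
#     ocs-storagecluster-ceph-rbd-virtualization -o yaml | grep -i storageid | awk '{print $NF}')
# Demo with the IDs from the outputs above:
compare_ids "7f68704cb3e5484e9a82d405ae95f968" "fc858d18de2dc749b718b6e88e291586"
```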

  • Additionally, you can inspect the VRGs.

$ oc get vrg -A
NAMESPACE              NAME                                  DESIREDSTATE   CURRENTSTATE
app-busybox-cephfs-1   app-busybox-cephfs-1-placement-drpc   primary        Primary
app-busybox-cephfs-2   app-busybox-cephfs-2-placement-drpc   primary        Primary
app-busybox-rbd-2      app-busybox-rbd-2-placement-drpc      primary        Primary
openshift-dr-ops       imp-app-1                             secondary      Secondary
openshift-dr-ops       imp-app-2                             primary        Primary
openshift-dr-ops       imp-app-3-6-7                         secondary      Secondary
openshift-dr-ops       imp-app-4-5                           secondary      Secondary
  • If a VRG has events like the one below, the cluster is impacted.
$ oc describe vrg imp-app-2 -n openshift-dr-ops
Events:
  Type     Reason            Age                 From                               Message
  ----     ------            ----                ----                               -------
  Warning  FailedValidation  35s (x572 over 8h)  controller_VolumeReplicationGroup  storageID mismatch between peerClass (7f68704cb3e5484e9a82d405ae95f968) and StorageClass (75a882159e715101111dd54e23366a29)
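To scan all VRGs at once for the mismatch event, a loop like the following can help. This is a sketch: `find_impacted` is a hypothetical helper, the real `oc describe` call is shown in comments, and a stub stands in for it in the demo.

```shell
# find_impacted: read "namespace name" pairs on stdin and flag each VRG whose
# describe output mentions the storageID mismatch. The checker command is a
# parameter so the logic can be demonstrated without a cluster.
find_impacted() {
  check=$1
  while read -r ns name _; do
    if $check "$ns" "$name" | grep -q "storageID mismatch"; then
      echo "impacted: $ns/$name"
    fi
  done
}

# On a real cluster:
#   describe_vrg() { oc describe vrg "$2" -n "$1"; }
#   oc get vrg -A --no-headers | find_impacted describe_vrg
# Demo with a stub that flags only imp-app-2:
stub() { [ "$2" = imp-app-2 ] && echo "storageID mismatch between peerClass and StorageClass" || echo ok; }
printf '%s\n' "openshift-dr-ops imp-app-1" "openshift-dr-ops imp-app-2" | find_impacted stub
```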

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.