DG Operator 8 Recovering from misbehaved upgrade
Environment
- Red Hat OpenShift Container Platform (OCP)
- 4.x
- Red Hat Data Grid (RHDG)
- 8.x
- Operator
Issue
How to recover from a misbehaved upgrade of the Data Grid 8 Operator?
Resolution
If the DG Operator upgrade misbehaves, follow the recovery steps below.
For the DG upgrade procedure itself, see How to upgrade from DG 8.1/8.2 to DG 8.3 in DG 8 Operator.
For the DG server upgrade with DG Operator version 8.4.x, the user can set the version by changing spec.image in the Infinispan CR, as explained in Data Grid Operator 8.4.x spec.version feature.
Recovering from misbehaved upgrade [Important]
The upgrade is performed by the OLM (Operator Lifecycle Manager), not by the DG Operator itself. In a few cases, for a variety of reasons, the upgrade might misbehave: a resource (Cache CR, Infinispan CR) might be lost, or an unstable state might occur.
To guard against data loss, take a backup before upgrading, either via a Backup CR or via the CLI. The backup contains the caches and cache entries and can be applied again via a restore (CLI command or Restore CR).
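A minimal sketch of taking a backup via a Backup CR before the upgrade. The function name, the backup name pre-upgrade-backup, and the cluster name dg-cluster are placeholders chosen for this example. The manifest uses only the spec.cluster field; verify the fields available in your Operator version with oc explain backup.spec before relying on it.

```shell
# Print a minimal Backup CR manifest.
# $1 = name for the Backup CR (placeholder), $2 = name of the Infinispan CR to back up.
make_backup_cr() {
  cat <<EOF
apiVersion: infinispan.org/v2alpha1
kind: Backup
metadata:
  name: $1
spec:
  cluster: $2
EOF
}

# Usage (run in the project that holds the Infinispan CR):
#   make_backup_cr pre-upgrade-backup dg-cluster | oc apply -f -
#   oc get backup pre-upgrade-backup -o yaml   # check the status before upgrading
```

Printing the manifest instead of applying it directly makes it possible to review or store the YAML before it is sent to the cluster.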
To recover from a misbehaved upgrade, it might be necessary to delete the CRs, the Operator, the projects (the Operator's project and the projects holding the Infinispan CRs), and finally the CRDs, which the user can list via oc api-resources -o wide | grep infinispan. The CRDs are cluster-wide resources, not namespace-bound, and are installed when the Operator is installed or upgraded.
After the cleanup, install the target version of the DG Operator.
Procedure:
- set replicas to 0 on the Infinispan CR
- delete the Infinispan/Cache CRs
- delete the operator:
$ oc delete operator <old-operator>
- delete the namespace:
$ oc delete project $projectname
- delete the CRDs: see below
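The first two steps of the procedure could be scripted roughly as follows. This is a sketch: the function name and its arguments are hypothetical, and it assumes all Cache CRs in the project belong to the cluster being removed.

```shell
# Scale the Infinispan CR to zero replicas, then delete the Cache CRs and the
# Infinispan CR itself.
# $1 = Infinispan CR name, $2 = project (namespace) that holds it.
scale_down_and_delete_crs() {
  cr_name="$1"
  ns="$2"
  # Step 1: set spec.replicas to 0 on the Infinispan CR
  oc patch infinispan "$cr_name" -n "$ns" --type merge -p '{"spec":{"replicas":0}}' || return 1
  # Step 2: delete every Cache CR in the project, then the Infinispan CR
  oc delete cache --all -n "$ns" || return 1
  oc delete infinispan "$cr_name" -n "$ns"
}

# Usage:
#   scale_down_and_delete_crs dg-cluster my-dg-project
```

If one of the delete commands hangs, the CR is probably held by a finalizer; the finalizer section further down covers that case.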
Example of listing and deleting the CRDs:
###
### To list CRDs
$ oc get crd | grep infinispan
backups.infinispan.org 2023-08-10T16:23:20Z
batches.infinispan.org 2023-08-10T16:23:20Z
caches.infinispan.org 2023-08-10T16:23:20Z
infinispans.infinispan.org 2023-08-10T16:23:20Z
restores.infinispan.org 2023-08-10T16:23:20Z
###
### To delete:
$ oc delete crd backups.infinispan.org
customresourcedefinition.apiextensions.k8s.io "backups.infinispan.org" deleted
$ oc delete crd batches.infinispan.org
customresourcedefinition.apiextensions.k8s.io "batches.infinispan.org" deleted
$ oc delete crd restores.infinispan.org
customresourcedefinition.apiextensions.k8s.io "restores.infinispan.org" deleted
$ oc delete crd caches.infinispan.org
customresourcedefinition.apiextensions.k8s.io "caches.infinispan.org" deleted
$ oc delete crd infinispans.infinispan.org
customresourcedefinition.apiextensions.k8s.io "infinispans.infinispan.org" deleted
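The individual delete commands above can also be run in a loop. The sketch below assumes every CRD whose name ends in infinispan.org belongs to the DG Operator and should be removed; the function names are illustrative.

```shell
# List the CRDs whose name ends in "infinispan.org", in resource/name form.
list_infinispan_crds() {
  oc get crd -o name | grep 'infinispan\.org$'
}

# Delete each CRD returned by list_infinispan_crds, one at a time.
delete_infinispan_crds() {
  list_infinispan_crds | while read -r crd; do
    oc delete "$crd"
  done
}

# Usage:
#   list_infinispan_crds      # review the list first
#   delete_infinispan_crds
```

Reviewing the list before deleting is worthwhile: deleting a CRD also removes every CR of that kind in the whole cluster.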
If a Cache or Infinispan CR hangs or gets stuck on deletion, remove the metadata.finalizers section from that CR.
Stuck Cache and Infinispan CRs can otherwise prevent the respective CRD and namespace from being deleted.
In this case:
$ oc edit cache dg-cluster-nyc-operator-cache-03 <--- remove the metadata.finalizers section
cache.infinispan.org/dg-cluster-nyc-operator-cache-03 edited
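As a non-interactive alternative to oc edit, the finalizers can be cleared with a merge patch. This is a sketch: the function name is hypothetical, and the project name in the usage line is a placeholder.

```shell
# Remove metadata.finalizers from a stuck CR so that its deletion can complete.
# $1 = resource kind (cache or infinispan), $2 = CR name, $3 = project (namespace).
clear_finalizers() {
  kind="$1"
  name="$2"
  ns="$3"
  # In a JSON merge patch, setting a field to null removes it.
  oc patch "$kind" "$name" -n "$ns" --type merge -p '{"metadata":{"finalizers":null}}'
}

# Usage:
#   clear_finalizers cache dg-cluster-nyc-operator-cache-03 my-dg-project
```

Clearing finalizers skips whatever cleanup the Operator would have done for that CR, so it fits best when the CR is going to be deleted anyway.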
CRD deletion
It is not enough to delete only the operator itself: the CRDs will remain in etcd.
In case the project gets stuck, see DG project does not get deleted.
In case a PVC migration is required, see Migrate persistent data to another Storage Class in DG 8 Operator in OCP 4. For more information about the PV, see Reading DG 8 PV Data.
Webhook issues
If the webhook (or its configuration) misbehaves, the user can delete the webhooks and re-install the DG Operator so that the OLM/CO re-installs the webhooks. See the solution DataGrid 8 Operator Webhooks in OCP 4.
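Before deleting anything, it can help to see which webhook configurations exist. The sketch below assumes the DG Operator's webhook configurations contain "infinispan" in their names; that filter is an assumption made for this example, so check the linked solution for the actual names.

```shell
# List validating and mutating webhook configurations whose name mentions
# "infinispan" (assumed naming, see the note above).
list_infinispan_webhooks() {
  oc get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name \
    | grep -i 'infinispan'
}

# Usage:
#   list_infinispan_webhooks
#   oc delete <one of the names printed above>
```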