Creating an Automated OADP Backup for OpenShift Data Foundation (NooBaa-DB)
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.18 and below
Red Hat OpenShift API for Data Protection (OADP) 1.4.1+
Issue
Preface:
Because the ODF Operator and its included resources are supported by the ODF team, and the OADP Operator is supported by the Shift Storage team, the author of this solution collaborated with various teams and colleagues to produce it.
This solution was tested on OCP v4.17.17 and ODF v4.17.5 by storing objects in buckets, backing up with the OADP application, and then deleting the db-noobaa-db-pg-0 PVC. This caused NooBaa to enter a Configuring phase, and the bucket contents could no longer be accessed via s3. Once the db was restored, the objects were accessible again, and read/write operations for both new and old objects were possible.
time="2025-01-23T19:46:45Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=UNAUTHORIZED Message=account not found 661ff9fa3e37850029df061a"
Accidental db-noobaa-db-pg-0 PVC deletion is one of the most common causes of NooBaa data loss. Once NooBaa entered this state, a restore was created in OADP using the backup ID. Performing the steps in this solution restored NooBaa to a Ready phase.
Best Practices:
The best practice for storing ODF/NooBaa (MCG) backups is to store them OUTSIDE of the cluster; for example, in a NooBaa bucket on a different OCP cluster or in an AWS s3 bucket. Because not all customers have access to external environments for s3 backup storage (a separate cluster, cloud/AWS), or may be in a disconnected environment, this solution uses Ceph's internal Rados Gateway (RGW) to store the backups.
Although using an internal object service for backup storage is not best practice, the vast majority of observed data losses show that if backups had simply been stored in an internal s3 solution such as RGW (because an external solution was not available), a restore would have been possible. Deviating from best practices is not recommended; however, all customers, including those in disconnected environments, have access to RGW, which is why this solution demonstrates RGW (greater reach).
In Bare Metal/VM environments, RGW is created by default during the creation of the storagesystem/storagecluster. In other environments where the cephobjectstore (RGW) resource is not created, it can be created manually. See Enable use of the RGW on an OCS internal deployment for more information.
If AWS or another s3-compatible storage is available, simply substitute the bucket name, external route (endpoint), access key, and secret access key in the DataProtectionApplication YAML to reflect the backup destination bucket.
Resolution
DISCLAIMER: It is the user's responsibility to perform their OWN research regarding the configuration of custom/tailored OpenShift backups based on user requirements, in addition to testing the restore. This solution only serves as a template to assist in the OADP backup process. Red Hat accepts NO responsibility regarding the effectiveness of this solution. Additionally, many variables can be considered when backing up with OADP. Below is a configuration for backing up the NooBaa application, NooBaa-DB, and many other resources in the openshift-storage namespace. A portion of the information below is derived from the OADP Backing up applications product documentation for additional guidance. Lastly, please consider backup space and backup retention/number of copies when planning your backups.
Configuration:
- Install the latest OADP Operator. By default, the operator installs into the openshift-adp namespace. However, the operator can be installed in multiple/separate namespaces if you want to create multiple instances.
- Create the s3 Object Bucket Claim (OBC) where backups will be stored. To reiterate, best practice is to store backups outside of the cluster; see your s3 bucket/cloud storage provider's documentation for bucket creation. This solution uses Ceph's Rados Gateway (RGW), which all customers have access to. Follow section 9.3. Creating an Object Bucket Claim using the OpenShift Web Console, name the OBC, and be sure to select the ocs-storagecluster-ceph-rgw storageclass.
After the OBC is created, scroll down on the OBC page to find hidden values that can be revealed as needed. These values contain the target bucket name (the long generated name, NOT the OBC name), the access/secret access key, and the endpoint. The endpoint shown is the internal service endpoint; for external s3 buckets you will need the external route/endpoint. Internal mode clusters can use the internal RGW service or the external RGW route; external mode clusters need the external route. All four items are required to create the DataProtectionApplication.
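If you prefer the CLI, the same values can be read from the ConfigMap and Secret that the OBC provisioner creates under the same name as the claim. This is a sketch: the OBC name backup-obc below is a hypothetical example; substitute your own claim name and namespace.

```shell
# Hypothetical OBC name; substitute your own claim name.
OBC=backup-obc
NS=openshift-storage

# Target bucket name (the long generated name, not the OBC name)
oc -n "$NS" get cm "$OBC" -o jsonpath='{.data.BUCKET_NAME}{"\n"}'

# Access key and secret access key (base64-encoded in the Secret)
oc -n "$NS" get secret "$OBC" -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d; echo
oc -n "$NS" get secret "$OBC" -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d; echo

# Internal service endpoint host and port
oc -n "$NS" get cm "$OBC" -o jsonpath='{.data.BUCKET_HOST}:{.data.BUCKET_PORT}{"\n"}'
```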
- Using a text editor, create a file named credentials.yaml containing the access key and secret access key, then create a secret from it in the namespace where the OADP Operator is installed.
a. Create the file.
credentials.yaml file contents:
[default]
aws_access_key_id=<ACCESS_KEY_ID>
aws_secret_access_key=<SECRET_ACCESS_KEY>
b. Create the secret:
$ oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials.yaml
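Steps a and b can be combined into a small script. The key values below are placeholders to be replaced with the values revealed on the OBC page; the final grep is only a sanity check on the file format.

```shell
#!/bin/sh
# Placeholder values; substitute the access/secret keys from your OBC.
ACCESS_KEY_ID="REPLACE_WITH_ACCESS_KEY"
SECRET_ACCESS_KEY="REPLACE_WITH_SECRET_KEY"

# Write the credentials file in the AWS shared-credentials format Velero expects.
cat > credentials.yaml <<EOF
[default]
aws_access_key_id=${ACCESS_KEY_ID}
aws_secret_access_key=${SECRET_ACCESS_KEY}
EOF

# Sanity check: both key lines should be present.
grep -c '^aws_' credentials.yaml   # prints 2
```

Then create the secret exactly as in step b: oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials.yaml.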
- In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the DataProtectionApplication (DPA) tile. Click YAML view, and input the following contents:
Common Fields to Change: name, bucket, s3Url, credential <--(the credential name must match the secret name)
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero
  namespace: openshift-adp
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      defaultPlugins:
        - openshift
        - aws
        - kubevirt
  snapshotLocations:
    - velero:
        config:
          profile: default
          region: us-east-1
        provider: aws
  backupLocations:
    - velero:
        config:
          insecureSkipTLSVerify: 'true'
          profile: default
          region: us-east-1
          s3ForcePathStyle: 'true'
          s3Url: 'http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80'
        credential:
          key: cloud
          name: cloud-credentials
        objectStorage:
          bucket: rgw-long-bucket-name-xxxxx-xxxx-xxxx
          prefix: velero
        default: true
        provider: aws
NOTE: Once the application is created you should see Condition: Reconciled, which indicates success. The BackupStorageLocation and VolumeSnapshotLocation tabs will now be populated with a name; you will need that name for the backup. In the example above, the application was named velero, so the BackupStorageLocation and VolumeSnapshotLocation will be named velero-1.
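A quick CLI check (a sketch, assuming the DPA above named velero in the default openshift-adp namespace) to confirm reconciliation and that the backup storage location is usable:

```shell
# The DPA condition should report Reconciled
oc -n openshift-adp get dpa velero -o jsonpath='{.status.conditions[0].type}{"\n"}'

# The BackupStorageLocation created from the DPA; phase should be Available
oc -n openshift-adp get backupstoragelocation velero-1 -o jsonpath='{.status.phase}{"\n"}'
```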
Backup:
- Create a backup. In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the Backup (B) tile. Click YAML view, and input the following contents:
Common Fields to Change: volumeSnapshotLocations, storageLocation
One-Time Backups:
apiVersion: velero.io/v1
kind: Backup
metadata:
  generateName: openshift-storage-backup-
  labels:
    app: openshift-storage
    component: noobaa-db
  namespace: openshift-adp
spec:
  csiSnapshotTimeout: 10m0s
  datamover: kopia
  defaultVolumesToFsBackup: true
  hooks:
    resources:
      - includedNamespaces:
          - openshift-storage
        labelSelector:
          matchExpressions:
            - key: noobaa-db
              operator: In
              values:
                - postgres
        name: pg_dump
        pre:
          - exec:
              command:
                - mkdir
                - -p
                - /var/lib/pgsql/data/backup
              container: db
              onError: Continue
              timeout: 10s
          - exec:
              command:
                - rm
                - -f
                - /var/lib/pgsql/data/backup/pg_dumpall.gz
              container: db
              onError: Continue
              timeout: 10s
          - exec:
              command:
                - /bin/bash
                - -c
                - /usr/bin/pg_dumpall -U postgres -c | /usr/bin/gzip -9 -f > /var/lib/pgsql/data/backup/pg_dumpall.gz
              container: db
              onError: Fail
              timeout: 1800s
  includedNamespaces:
    - openshift-storage
  itemOperationTimeout: 0h5m0s
  metadata: {}
  snapshotMoveData: true
  storageLocation: velero-1
  ttl: 24h0m0s
  volumeSnapshotLocations:
    - velero-1
Scheduled Backups (the example below runs every day at 2 AM; change as needed):
apiVersion: velero.io/v1
kind: Schedule
metadata:
  annotations: {}
  name: daily-openshift-storage-backup
  namespace: openshift-adp
  labels:
    app: openshift-storage
    component: noobaa-db
spec:
  schedule: '0 2 * * *'
  skipImmediately: false
  template:
    csiSnapshotTimeout: 10m0s
    datamover: kopia
    defaultVolumesToFsBackup: true
    hooks:
      resources:
        - includedNamespaces:
            - openshift-storage
          labelSelector:
            matchExpressions:
              - key: noobaa-db
                operator: In
                values:
                  - postgres
          name: pg_dump
          pre:
            - exec:
                command:
                  - mkdir
                  - -p
                  - /var/lib/pgsql/data/backup
                container: db
                onError: Continue
                timeout: 10s
            - exec:
                command:
                  - rm
                  - -f
                  - /var/lib/pgsql/data/backup/pg_dumpall.gz
                container: db
                onError: Continue
                timeout: 10s
            - exec:
                command:
                  - /bin/bash
                  - -c
                  - /usr/bin/pg_dumpall -U postgres -c | /usr/bin/gzip -9 -f > /var/lib/pgsql/data/backup/pg_dumpall.gz
                container: db
                onError: Fail
                timeout: 1800s # 30m
    includedNamespaces:
      - openshift-storage
    itemOperationTimeout: 0h30m0s
    snapshotMoveData: true
    storageLocation: velero-1
    ttl: 78h0m0s
    volumeSnapshotLocations:
      - velero-1
Once the backup is created it will progress, and you should see a green checkmark when it completes. Each backup has its own backup ID, which will be needed for the restore.
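The backup IDs and phases can also be listed from the CLI (a sketch; the namespace assumes the default openshift-adp install):

```shell
# Each NAME in the output is a backup ID usable in a restore's backupName field.
oc -n openshift-adp get backup \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```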
Restore:
When creating a restore, OADP validates the backup contents against the current contents of the openshift-storage namespace and patches/restores what is missing. The restore below will work in many scenarios; however, it was specifically tested against the most common cause of NooBaa data loss: accidental PVC deletion. In testing, the db-noobaa-db-pg-0 PVC was deleted along with the pod, which freed up the volume mount. Once the noobaa-db-pg-0 pod came back up, NooBaa reconciled by creating a new PVC that was missing the old data, and all access to previous OBCs was lost.
- In order to successfully restore NooBaa, particularly noobaa-db-pg-0 and its database contents, the following prerequisites must be completed.
Warning: It is not recommended to test this in production; please test in a sandbox (lab). Only perform these steps if your data has been validated as missing and needs restoration. If in doubt, follow the steps in the Backup and Restore for Multicloud Object Gateway database (NooBaa DB) solution prior to executing the steps below, in case the db is still usable and another, separate issue is at play.
a. Scale down the NooBaa stack:
$ oc -n openshift-storage scale --replicas=0 deploy/noobaa-operator deploy/noobaa-endpoint sts/noobaa-db-pg sts/noobaa-core
b. Delete the current noobaa-db-pg-0 pod PVC.
$ oc -n openshift-storage delete pvc -l noobaa-db=postgres
c. Delete the noobaa-db-pg statefulset:
$ oc -n openshift-storage delete statefulset/noobaa-db-pg
d. Validate whether the noobaa-operator-lock configmap is present. If it is, delete it. It will most likely reappear during the restore and will need to be deleted again.
$ oc -n openshift-storage get cm noobaa-operator-lock
$ oc -n openshift-storage delete cm noobaa-operator-lock
- Create a restore. In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the Restore (R) tile. Click YAML view, and input the following contents:
Common Fields to Change: backupName <--(must match the backup ID that will be used)
apiVersion: velero.io/v1
kind: Restore
metadata:
  generateName: openshift-storage-restore-
  namespace: openshift-adp
  labels:
    app: openshift-storage
    component: noobaa-db
spec:
  excludedResources:
    - events.events.k8s.io
    - backingstores.noobaa.io
    - bucketclasses.noobaa.io
    - cephblockpoolradosnamespaces.ceph.rook.io
    - cephblockpools.ceph.rook.io
    - cephbucketnotifications.ceph.rook.io
    - cephbuckettopics.ceph.rook.io
    - cephclients.ceph.rook.io
    - cephclusters.ceph.rook.io
    - cephcosidrivers.ceph.rook.io
    - cephfilesystemmirrors.ceph.rook.io
    - cephfilesystems.ceph.rook.io
    - cephfilesystemsubvolumegroups.ceph.rook.io
    - cephnfses.ceph.rook.io
    - cephobjectrealms.ceph.rook.io
    - cephobjectstores.ceph.rook.io
    - cephobjectstoreusers.ceph.rook.io
    - cephobjectzonegroups.ceph.rook.io
    - cephobjectzones.ceph.rook.io
    - cephrbdmirrors.ceph.rook.io
    - ceph.rook.io
    - namespacestores.noobaa.io
    - noobaaaccounts.noobaa.io
    - noobaas.noobaa.io
    - objectbuckets.objectbucket.io
    - storagesystem.odf.openshift.io
    - storagecluster.ocs.openshift.io
    - csiaddons.openshift.io
    - deployment.apps
    - rbac.authorization.k8s.io
    - RoleBinding
    - rolebindings
    - Role
    - roles
    - Route
    - events
    - jobs
    - Lease
    - DaemonSet
    - ControllerRevision
    - ClusterServiceVersion
    - Endpoints
    - EndpointSlice
    - HorizontalPodAutoscaler
    - OCSInitialization
    - OperatorCondition
    - OperatorGroup
    - ReclaimSpaceJob
    - ReplicaSet
    - authorization.openshift.io
  hooks:
    resources:
      - includedNamespaces:
          - openshift-storage
        labelSelector:
          matchExpressions:
            - values:
                - postgres
              key: noobaa-db
              operator: In
        postHooks:
          - exec:
              command:
                - "/bin/bash"
                - "-c"
                - /usr/bin/zcat /var/lib/pgsql/data/backup/pg_dumpall.gz | /usr/bin/pg_restore -U postgres
              container: db
              execTimeout: 180s
              waitForReady: true
              onError: Fail
        name: pg_restore
  includedNamespaces:
    - openshift-storage
  backupName: ${BACKUPID}
- Once the restore is created, you will see the noobaa-db-pg-0 pod come back up running with its contents restored. You may see an indefinite Progressing or a Partially Failed message while restoring. This is normal; scale the NooBaa stack back up.
$ oc -n openshift-storage scale --replicas=1 deploy/noobaa-operator deploy/noobaa-endpoint sts/noobaa-db-pg sts/noobaa-core
- The noobaa-operator-lock configmap will reappear and place a lock on the noobaa-operator, which prevents any further reconciliation. If it is present, delete it, and the noobaa-operator will begin to reconcile.
$ oc -n openshift-storage get cm noobaa-operator-lock
$ oc -n openshift-storage delete cm noobaa-operator-lock
- At this point your s3 buckets may be reachable. If they are not yet reachable, execute the steps in section 12.1. Restoring the Multicloud Object Gateway of the ODF Troubleshooting documentation. After ~3 minutes, s3 bucket access should be restored and NooBaa should be in a Ready phase.
$ oc get noobaa -n openshift-storage
- Because of the noobaa-operator-lock configmap, the restore job does complete successfully, but it will keep the state of "Progressing" or "Partially Failed." This is expected behavior. Once all data is confirmed restored, you can delete the restore job. Note that the restore job carries a finalizer; remove the finalizer in its yaml for the restore job to go away completely.
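The cleanup can be sketched as follows, assuming a hypothetical restore name; the oc patch clears the finalizer if the deletion hangs in Terminating:

```shell
# Hypothetical restore name; substitute the generated name of your restore.
RESTORE=openshift-storage-restore-xxxxx

# Request deletion; the finalizer may hold the object in Terminating.
oc -n openshift-adp delete restore "$RESTORE" --wait=false

# Clear the finalizer so the restore object is fully removed.
oc -n openshift-adp patch restore "$RESTORE" --type merge \
  -p '{"metadata":{"finalizers":null}}'
```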
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.