Creating an Automated OADP Backup for OpenShift Data Foundation (NooBaa-DB)
Environment
Red Hat OpenShift Container Platform (OCP) 4.x
Red Hat OpenShift Data Foundation (ODF) 4.18 and below
Red Hat OpenShift API for Data Protection (OADP) 1.4.1+
Issue
Preface:
Because the ODF Operator and its included resources are supported by the ODF team, and the OADP Operator is supported by the Shift Storage team, the author of this solution collaborated with various teams and colleagues to produce it.
This solution was tested on OCP v4.17.17 and ODF v4.17.5 by storing objects in buckets, backing up with the OADP application, and then deleting the db-noobaa-db-pg-0 PVC. This caused NooBaa to enter a Configuring phase, and the bucket contents could no longer be accessed via s3. Once the db was restored, the objects were accessible again, and read/write operations for both new and old objects were possible.
time="2025-01-23T19:46:45Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=UNAUTHORIZED Message=account not found 661ff9fa3e37850029df061a"
Accidental db-noobaa-db-pg-0 PVC deletion is one of the most common causes of NooBaa data loss. Once NooBaa entered this state, a restore was created in OADP using the backup ID. Performing the steps in this solution restored NooBaa to a Ready phase.
Best Practices:
The best practice for storing ODF/NooBaa (MCG) backups is to store them OUTSIDE of the cluster; for example, in a NooBaa bucket on a different OCP cluster or in an AWS s3 bucket. Because not all customers have access to external environments for s3 backup storage (a separate cluster, cloud/AWS), or may be in a disconnected environment, this solution uses Ceph's internal Rados Gateway (RGW) to store the backups.
Although using an internal object service for backup storage is not best practice, the vast majority of observed data losses show that if backups had simply been stored in an internal s3 solution such as RGW (because an external solution was not available), a restore would have been possible. Deviating from best practices is not recommended; however, all customers, including those in disconnected environments, have access to RGW, which is why this solution demonstrates RGW (greater reach).
In Bare Metal/VM environments, RGW is created by default during the creation of the storagesystem/storagecluster. In other environments where the cephobjectstore (RGW) resource is not created, it can be created manually. See Enable use of the RGW on an OCS internal deployment for more information.
If AWS or another s3-compatible storage is available, simply substitute the bucket name, external route (endpoint), access key, and secret access key in the DataProtectionApplication YAML to reflect the backup destination bucket.
Resolution
DISCLAIMER: It is the user's responsibility to perform their OWN research regarding the configuration of custom/tailored OpenShift backups based on user requirements, in addition to testing the restore. This solution only serves as a template to assist in the OADP backup process. Red Hat accepts NO responsibility regarding the effectiveness of this solution. Additionally, many variables can be considered when backing up with OADP. Below is a configuration for backing up the NooBaa application, NooBaa-DB, and many other resources in the openshift-storage namespace. A portion of the information below is derived from the OADP Backing up applications product documentation for additional guidance. Lastly, please consider backup space and backup retention/number of copies when planning your backups.
Configuration:
- Install the latest OADP Operator. By default, the operator installs into the openshift-adp namespace. However, the operator can be installed in multiple/separate namespaces if you want to create multiple instances.
- Create the s3 Object Bucket Claim (OBC) where backups will be stored. To reiterate, best practice is to store backups outside of the cluster; see your s3 bucket/cloud storage provider's documentation for bucket creation. This solution uses Ceph's Rados Gateway (RGW), which all customers have access to. Follow section 9.3. Creating an Object Bucket Claim using the OpenShift Web Console, name the OBC, and be sure to select the ocs-storagecluster-ceph-rgw storageclass.
After the OBC is created, scroll down on the OBC page to find hidden values that can be revealed as needed. These values contain the target bucket name (the long generated name, NOT the OBC name), the access/secret access key, and the endpoint. The endpoint shown is the internal service endpoint; for external s3 buckets you will need the external route/endpoint. Internal mode clusters can use the internal RGW service or the external RGW route; external mode clusters need the external route. All four items are required to create the DataProtectionApplication.
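If you prefer the CLI, the same values can be read from the ConfigMap and Secret that the OBC provisioner creates under the same name as the claim. This is a sketch: the OBC name backup-obc below is a hypothetical example; substitute your own claim name and namespace.

```shell
# Hypothetical OBC name; substitute your own claim name.
OBC=backup-obc
NS=openshift-storage

# Target bucket name (the long generated name, not the OBC name)
oc -n "$NS" get cm "$OBC" -o jsonpath='{.data.BUCKET_NAME}{"\n"}'

# Access key and secret access key (base64-encoded in the Secret)
oc -n "$NS" get secret "$OBC" -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d; echo
oc -n "$NS" get secret "$OBC" -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d; echo

# Internal service endpoint host and port
oc -n "$NS" get cm "$OBC" -o jsonpath='{.data.BUCKET_HOST}:{.data.BUCKET_PORT}{"\n"}'
```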
- Using a text editor, create a file named credentials.yaml containing the access key and secret access key, then create a secret from it in the namespace where the OADP Operator is installed.
a. Create the file.
credentials.yaml file contents:
[default]
aws_access_key_id=<ACCESS_KEY_ID>
aws_secret_access_key=<SECRET_ACCESS_KEY>
b. Create the secret:
$ oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials.yaml
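Steps a and b can be combined into a small script. The key values below are placeholders to be replaced with the values revealed on the OBC page; the final grep is only a sanity check on the file format.

```shell
#!/bin/sh
# Placeholder values; substitute the access/secret keys from your OBC.
ACCESS_KEY_ID="REPLACE_WITH_ACCESS_KEY"
SECRET_ACCESS_KEY="REPLACE_WITH_SECRET_KEY"

# Write the credentials file in the AWS shared-credentials format Velero expects.
cat > credentials.yaml <<EOF
[default]
aws_access_key_id=${ACCESS_KEY_ID}
aws_secret_access_key=${SECRET_ACCESS_KEY}
EOF

# Sanity check: both key lines should be present.
grep -c '^aws_' credentials.yaml   # prints 2
```

Then create the secret exactly as in step b: oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials.yaml.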
- In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the DataProtectionApplication (DPA) tile. Click YAML view, and input the following contents:
Common Fields to Change: name, bucket, s3Url, credential <--(the credential name must match the secret name)
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero
  namespace: openshift-adp
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      defaultPlugins:
        - openshift
        - aws
        - kubevirt
  snapshotLocations:
    - velero:
        config:
          profile: default
          region: us-east-1
        provider: aws
  backupLocations:
    - velero:
        config:
          insecureSkipTLSVerify: 'true'
          profile: default
          region: us-east-1
          s3ForcePathStyle: 'true'
          s3Url: 'http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80'
        credential:
          key: cloud
          name: cloud-credentials
        objectStorage:
          bucket: rgw-long-bucket-name-xxxxx-xxxx-xxxx
          prefix: velero
        default: true
        provider: aws
NOTE: Once the application is created you should see Condition: Reconciled, which indicates success. The BackupStorageLocation and VolumeSnapshotLocation tabs will now be populated with a name; you will need that name for the backup. In the example above, the application was named velero, so the BackupStorageLocation and VolumeSnapshotLocation will be named velero-1.
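A quick CLI check (a sketch, assuming the DPA above named velero in the default openshift-adp namespace) to confirm reconciliation and that the backup storage location is usable:

```shell
# The DPA condition should report Reconciled
oc -n openshift-adp get dpa velero -o jsonpath='{.status.conditions[0].type}{"\n"}'

# The BackupStorageLocation created from the DPA; phase should be Available
oc -n openshift-adp get backupstoragelocation velero-1 -o jsonpath='{.status.phase}{"\n"}'
```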
Backup:
- Create a backup. In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the Backup (B) tile. Click YAML view, and input the following contents:
Common Fields to Change: volumeSnapshotLocations, storageLocation
One-Time Backups:
apiVersion: velero.io/v1
kind: Backup
metadata:
  generateName: openshift-storage-backup-
  labels:
    app: openshift-storage
    component: noobaa-db
  namespace: openshift-adp
spec:
  csiSnapshotTimeout: 10m0s
  datamover: kopia
  defaultVolumesToFsBackup: true
  hooks:
    resources:
      - includedNamespaces:
          - openshift-storage
        labelSelector:
          matchExpressions:
            - key: noobaa-db
              operator: In
              values:
                - postgres
        name: pg_dump
        pre:
          - exec:
              command:
                - mkdir
                - -p
                - /var/lib/pgsql/data/backup
              container: db
              onError: Continue
              timeout: 10s
          - exec:
              command:
                - rm
                - -f
                - /var/lib/pgsql/data/backup/pg_dumpall.gz
              container: db
              onError: Continue
              timeout: 10s
          - exec:
              command:
                - /bin/bash
                - -c
                - /usr/bin/pg_dumpall -U postgres -c | /usr/bin/gzip -9 -f > /var/lib/pgsql/data/backup/pg_dumpall.gz
              container: db
              onError: Fail
              timeout: 1800s
  includedNamespaces:
    - openshift-storage
  itemOperationTimeout: 0h5m0s
  metadata: {}
  snapshotMoveData: true
  storageLocation: velero-1
  ttl: 24h0m0s
  volumeSnapshotLocations:
    - velero-1
Scheduled Backups (the example below runs every day at 2 AM; change as needed):
apiVersion: velero.io/v1
kind: Schedule
metadata:
  annotations: {}
  name: daily-openshift-storage-backup
  namespace: openshift-adp
  labels:
    app: openshift-storage
    component: noobaa-db
spec:
  schedule: '0 2 * * *'
  skipImmediately: false
  template:
    csiSnapshotTimeout: 10m0s
    datamover: kopia
    defaultVolumesToFsBackup: true
    hooks:
      resources:
        - includedNamespaces:
            - openshift-storage
          labelSelector:
            matchExpressions:
              - key: noobaa-db
                operator: In
                values:
                  - postgres
          name: pg_dump
          pre:
            - exec:
                command:
                  - mkdir
                  - -p
                  - /var/lib/pgsql/data/backup
                container: db
                onError: Continue
                timeout: 10s
            - exec:
                command:
                  - rm
                  - -f
                  - /var/lib/pgsql/data/backup/pg_dumpall.gz
                container: db
                onError: Continue
                timeout: 10s
            - exec:
                command:
                  - /bin/bash
                  - -c
                  - /usr/bin/pg_dumpall -U postgres -c | /usr/bin/gzip -9 -f > /var/lib/pgsql/data/backup/pg_dumpall.gz
                container: db
                onError: Fail
                timeout: 1800s # 30m
    includedNamespaces:
      - openshift-storage
    itemOperationTimeout: 0h30m0s
    snapshotMoveData: true
    storageLocation: velero-1
    ttl: 78h0m0s
    volumeSnapshotLocations:
      - velero-1
Once the backup is created it will progress, and you should see a green checkmark when it completes. Each backup has its own backup ID, which will be needed for the restore.
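The backup IDs and phases can also be listed from the CLI (a sketch; the namespace assumes the default openshift-adp install):

```shell
# Each NAME in the output is a backup ID usable in a restore's backupName field.
oc -n openshift-adp get backup \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```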
Restore:
When creating a restore, OADP validates the backup contents against the current contents of the openshift-storage namespace and patches/restores what is missing. The restore below will work in many scenarios; however, it was specifically tested against the most common cause of NooBaa data loss: accidental PVC deletion. In testing, the db-noobaa-db-pg-0 PVC was deleted along with the pod, which freed up the volume mount. Once the noobaa-db-pg-0 pod came back up, NooBaa reconciled by creating a new PVC that was missing the old data, and all access to previous OBCs was lost.
- In order to successfully restore NooBaa, particularly noobaa-db-pg-0 and its database contents, the following prerequisites must be completed.
Warning: It is not recommended to test this in production; please test in a sandbox (lab). Only perform these steps if your data has been validated as missing and needs restoration. If in doubt, follow the steps in the Backup and Restore for Multicloud Object Gateway database (NooBaa DB) solution prior to executing the steps below, in case the db is still usable and another, separate issue is at play.
a. Scale down the NooBaa stack:
$ oc -n openshift-storage scale --replicas=0 deploy/noobaa-operator deploy/noobaa-endpoint sts/noobaa-db-pg sts/noobaa-core
b. Delete the current noobaa-db-pg-0 pod PVC.
$ oc -n openshift-storage delete pvc -l noobaa-db=postgres
c. Delete the noobaa-db-pg statefulset:
$ oc -n openshift-storage delete statefulset/noobaa-db-pg
d. Validate whether the noobaa-operator-lock configmap is present. If it is, delete it. It will most likely reappear during the restore and will need to be deleted again.
$ oc -n openshift-storage get cm noobaa-operator-lock
$ oc -n openshift-storage delete cm noobaa-operator-lock
- Create a restore. In the OCP Console, navigate to Operators -> Installed Operators -> OADP Operator and click "Create Instance" in the Restore (R) tile. Click YAML view, and input the following contents:
Common Fields to Change: backupName <--(must match the backup ID that will be used)
apiVersion: velero.io/v1
kind: Restore
metadata:
  generateName: openshift-storage-restore-
  namespace: openshift-adp
  labels:
    app: openshift-storage
    component: noobaa-db
spec:
  excludedResources:
    - events.events.k8s.io
    - backingstores.noobaa.io
    - bucketclasses.noobaa.io
    - cephblockpoolradosnamespaces.ceph.rook.io
    - cephblockpools.ceph.rook.io
    - cephbucketnotifications.ceph.rook.io
    - cephbuckettopics.ceph.rook.io
    - cephclients.ceph.rook.io
    - cephclusters.ceph.rook.io
    - cephcosidrivers.ceph.rook.io
    - cephfilesystemmirrors.ceph.rook.io
    - cephfilesystems.ceph.rook.io
    - cephfilesystemsubvolumegroups.ceph.rook.io
    - cephnfses.ceph.rook.io
    - cephobjectrealms.ceph.rook.io
    - cephobjectstores.ceph.rook.io
    - cephobjectstoreusers.ceph.rook.io
    - cephobjectzonegroups.ceph.rook.io
    - cephobjectzones.ceph.rook.io
    - cephrbdmirrors.ceph.rook.io
    - ceph.rook.io
    - namespacestores.noobaa.io
    - noobaaaccounts.noobaa.io
    - noobaas.noobaa.io
    - objectbuckets.objectbucket.io
    - storagesystem.odf.openshift.io
    - storagecluster.ocs.openshift.io
    - csiaddons.openshift.io
    - deployment.apps
    - rbac.authorization.k8s.io
    - RoleBinding
    - rolebindings
    - Role
    - roles
    - Route
    - events
    - jobs
    - Lease
    - DaemonSet
    - ControllerRevision
    - ClusterServiceVersion
    - Endpoints
    - EndpointSlice
    - HorizontalPodAutoscaler
    - OCSInitialization
    - OperatorCondition
    - OperatorGroup
    - ReclaimSpaceJob
    - ReplicaSet
    - authorization.openshift.io
  hooks:
    resources:
      - includedNamespaces:
          - openshift-storage
        labelSelector:
          matchExpressions:
            - values:
                - postgres
              key: noobaa-db
              operator: In
        postHooks:
          - exec:
              command:
                - "/bin/bash"
                - "-c"
                - /usr/bin/zcat /var/lib/pgsql/data/backup/pg_dumpall.gz | /usr/bin/pg_restore -U postgres
              container: db
              execTimeout: 180s
              waitForReady: true
              onError: Fail
        name: pg_restore
  includedNamespaces:
    - openshift-storage
  backupName: ${BACKUPID}
- Once the restore is created, you will see the noobaa-db-pg-0 pod come back up running with its contents restored. You may see an indefinite Progressing or a Partially Failed message while restoring. This is normal; scale the NooBaa stack back up.
$ oc -n openshift-storage scale --replicas=1 deploy/noobaa-operator deploy/noobaa-endpoint sts/noobaa-db-pg sts/noobaa-core
- The noobaa-operator-lock configmap will reappear and place a lock on the noobaa-operator, which prevents any further reconciliation. If it is present, delete it, and the noobaa-operator will begin to reconcile.
$ oc -n openshift-storage get cm noobaa-operator-lock
$ oc -n openshift-storage delete cm noobaa-operator-lock
- At this point your s3 buckets may be reachable. If they are not yet reachable, execute the steps in section 12.1. Restoring the Multicloud Object Gateway of the ODF Troubleshooting documentation. After ~3 minutes, s3 bucket access should be restored and NooBaa should be in a Ready phase.
$ oc get noobaa -n openshift-storage
- Because of the noobaa-operator-lock configmap, the restore job does complete successfully, but it will keep the state of "Progressing" or "Partially Failed." This is expected behavior. Once all data is confirmed restored, you can delete the restore job. Note that the restore job carries a finalizer; remove the finalizer in its yaml for the restore job to go away completely.
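The cleanup can be sketched as follows, assuming a hypothetical restore name; the oc patch clears the finalizer if the deletion hangs in Terminating:

```shell
# Hypothetical restore name; substitute the generated name of your restore.
RESTORE=openshift-storage-restore-xxxxx

# Request deletion; the finalizer may hold the object in Terminating.
oc -n openshift-adp delete restore "$RESTORE" --wait=false

# Clear the finalizer so the restore object is fully removed.
oc -n openshift-adp patch restore "$RESTORE" --type merge \
  -p '{"metadata":{"finalizers":null}}'
```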
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.