Investigating and Fixing Duplicates inside NooBaa-DB PostgreSQL DB Causing Backingstore Flapping or Performance Degradation in OpenShift Data Foundation (ODF)

Solution Verified - Updated

Environment

Red Hat OpenShift Container Platform (RHOCP) v4.x
Red Hat OpenShift Data Foundation (RHODF) v4.18 and below
Red Hat OpenShift Data Foundation (RHODF) v4.19 and above (see minor change in commands)

Issue

  • ODF object buckets are experiencing intermittent connectivity. This behavior is commonly observed after the manual PostgreSQL v12 to v15 upgrade performed as part of upgrading from ODF v4.14 to ODF v4.15+. Observations include long queries reported in the noobaa-core pod logs, increased resource consumption on noobaa-db-pg-0, and timeouts in the noobaa-endpoint pod.
$ oc adm top pods | grep noobaa
NAME                                                                CPU          Memory
noobaa-core-0                                                       349m         667Mi
noobaa-db-pg-0                                                      12188m       15199Mi <----
noobaa-endpoint-556cfc78cf-j84cb                                    150m         825Mi
noobaa-endpoint-1-logs.out:Mar-25 7:10:26.870 [Endpoint/13]  [WARN] core.sdk.object_io:: upload_object: failed upload { bucket: <omitted>, key: '<omitted>', md_conditions: undefined, obj_id: '67e256e275e186000df74245', size: 8605, num_parts: 1, md5_b64: 'd7+e/SPZ16Qp/oflPYGzTA==', sha256_b64: '<omitted>' } [RpcError: duplicate key value violates unique constraint "idx_btree_objectmds_latest_version_index"] { rpc_code: 'INTERNAL', rpc_data: { retryable: true } }
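To gauge how widespread the error is, the duplicate-key RPC errors can be counted across the endpoint pods. This is a sketch; the label selector noobaa-s3=noobaa is an assumption, so verify the endpoint pod labels with oc get pods --show-labels first:

```shell
# Count duplicate-key RPC errors per endpoint pod (label selector is an assumption).
for pod in $(oc get pods -n openshift-storage -l noobaa-s3=noobaa -o name); do
  echo "== ${pod} =="
  oc logs -n openshift-storage "${pod}" | grep -c 'duplicate key value violates unique constraint'
done
```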
  • The resolution steps can also be applied in ODF 4.19 or higher, where the NooBaa DB runs in HA mode with CloudNativePG (CNPG) and two noobaa-db pods. In this example, noobaa-db-pg-cluster-1 is the primary, so commands are performed only on the primary pod.

    $ oc get cluster
    NAME                   AGE     INSTANCES   READY   STATUS                     PRIMARY
    noobaa-db-pg-cluster   2d10h   2           2       Cluster in healthy state   noobaa-db-pg-cluster-1
    
    noobaa-db-pg-cluster-1                                            1/1     Running     0               2d10h
    noobaa-db-pg-cluster-2                                            1/1     Running     0               2d10h
    

    ODF 4.19 or higher: In the Resolution section below, some commands need a minor change:

    Replace noobaa-db-pg-0 with the primary pod name (in this example, noobaa-db-pg-cluster-1).

    Replace /var/lib/pgsql/data/ with /var/lib/postgres/data/.

Resolution

Pre-requisite: Download the fix_id_dups.sql script attached to this solution and save it to your bastion host.

  1. The less aggressive way to accomplish this process is to implement the following two articles and monitor the database for data reduction.

How to reduce the retention period of deleted metadata in the ODF Multicloud Gateway MCG Noobaa DB database

How to increase the frequency of deleting metadata in the ODF Multicloud Gateway MCG Noobaa DB database when the Noobaa DB is constantly growing

If there is a significant drop in data usage on the PVC, you can stop here. If there is not, and duplicates are still at play, continue to step 2.
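One way to watch for that drop in data usage is to check filesystem consumption inside the db pod. This is a sketch; the data path shown applies to ODF 4.18 and below (use /var/lib/postgres/data for 4.19+):

```shell
# Report space used by the PostgreSQL data directory on the db PVC.
oc exec -n openshift-storage noobaa-db-pg-0 -- df -h /var/lib/pgsql/data
```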

  2. Capture the output of the database size and the tables size by following the steps outlined in the How to Check the Size/Consumption of the PostgreSQL Database in the db-noobaa-db-pg-0 PVC article and save those outputs in the event Red Hat Support requests them. Additionally, use the outputs to ensure there is enough free space in the db-noobaa-db-pg-0 PVC for the database dump (backup). If there isn't enough free space, please expand the PVC by following the steps outlined in the Expanding the db-noobaa-db-pg-0 PVC solution.
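The size checks can be captured with standard PostgreSQL size functions. A sketch, using the nbcore database name referenced elsewhere in this solution:

```shell
# Total database size.
oc exec -n openshift-storage noobaa-db-pg-0 -- psql -d nbcore -c \
  "SELECT pg_size_pretty(pg_database_size('nbcore'));"

# Ten largest tables (including indexes and TOAST data).
oc exec -n openshift-storage noobaa-db-pg-0 -- psql -d nbcore -c \
  "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total
     FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
```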

  3. Stop the operator, core, and endpoints. Validate pod termination:

 $ oc scale deployment -n openshift-storage noobaa-operator noobaa-endpoint --replicas=0
 $ oc scale sts -n openshift-storage noobaa-core --replicas=0
  4. Perform a backup of the db and copy the db out of the pod and onto your bastion for safe-keeping (change the target directory name to match your bastion):
$ oc exec -n openshift-storage -it noobaa-db-pg-0 -- bash -c 'pg_dumpall -U postgres > /var/lib/pgsql/data/dump.sql'
  5. Validate the dump was successful:
$ oc exec -n openshift-storage -it noobaa-db-pg-0 -- bash -c 'tail -n100 /var/lib/pgsql/data/dump.sql'
-- PostgreSQL database dump complete <--------------- SHOULD SEE THIS
--
-- PostgreSQL database cluster dump complete
  6. Copy the dump outside of the pod:
$ oc cp -n openshift-storage noobaa-db-pg-0:/var/lib/pgsql/data/dump.sql /<target-directory-name>/dump.sql
  7. Copy the script into the db pod (change the source directory name to match your bastion):
$ oc cp -n openshift-storage /<source-directory-name>/fix_id_dups.sql noobaa-db-pg-0:/var/lib/pgsql/data/fix_id_dups.sql

NOTE: To validate data integrity, compare against the checksum value below:

md5sum fix_id_dups_0.sql 
99839229a035dd01027feab1e1773fd2  fix_id_dups_0.sql
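After copying, the checksum can be re-verified inside the pod. A sketch; the path matches the oc cp destination above, so adjust it if your filename differs:

```shell
# Verify the script inside the pod matches the published checksum.
oc exec -n openshift-storage noobaa-db-pg-0 -- md5sum /var/lib/pgsql/data/fix_id_dups.sql
```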
  8. Use rsh to enter the db pod and navigate to the data directory:
$ oc -n openshift-storage rsh noobaa-db-pg-0

$ cd /var/lib/pgsql/data/
  9. Execute the fix_id_dups.sql script:
$ psql -d nbcore -f fix_id_dups.sql
  10. Connect to the nbcore database, ensure the current tables are owned by noobaa, then rebuild the indexes.

a. Connect to nbcore:

$ psql -d nbcore

b. Copy/Paste the following and execute:

DO $$
DECLARE
    row record;
BEGIN
    FOR row IN 
        SELECT schemaname, tablename 
        FROM pg_tables 
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
    LOOP
        EXECUTE format('ALTER TABLE %I.%I OWNER TO noobaa', row.schemaname, row.tablename);
    END LOOP;
END;
$$; 

c. Reindex the nbcore database:

REINDEX DATABASE nbcore;
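d. Optionally, as a sanity check, confirm the unique indexes named in the errors shown in this solution now exist (a sketch; extend the list with any other index names from your logs):

```sql
-- Should return one row per index once the reindex has succeeded.
SELECT indexname FROM pg_indexes
 WHERE indexname IN ('idx_btree_objectmds_latest_version_index',
                     'idx_btree_datachunks__id_index');
```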

  11. Disconnect from the nbcore database, exit the db pod, and scale the NooBaa pods back up:
nbcore=# \q
$ exit
$ oc scale deployment -n openshift-storage noobaa-operator noobaa-endpoint --replicas=1
$ oc scale sts -n openshift-storage noobaa-core --replicas=1
  12. Allow a few minutes for the pods to start running. Once they are fully running, perform one final restart of ALL NooBaa pods in accordance with section 12.1. Restoring the Multicloud Object Gateway in the product documentation.

  13. Check the following command outputs to validate that NooBaa and the backingstore(s) are in PHASE: Ready status and verify that the previous behavior is no longer apparent.

$ oc get noobaa -n openshift-storage

$ oc get backingstore -n openshift-storage

Root Cause

This is commonly seen after upgrading to PostgreSQL 15 (ODF v4.14 -> ODF v4.15): several indexes were not created because duplicated rows violated the unique index requirements, leading to degraded performance and operations.

Related Articles:

Recovering Multicloud Object Gateway upgrade failure in ODF 4.15

Recovery NooBaa's PostgreSQL upgrade failure in OpenShift Data Foundation 4.16

Diagnostic Steps

  1. Count the LONG QUERY warnings in the noobaa-core logs:
$ oc logs -n openshift-storage noobaa-core-0 | grep -i "QUERY_LOG: LONG QUERY" | wc -l
473

Mar-7 17:34:35.245 [BGWorkers/32]  [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
Mar-7 17:34:50.259 [BGWorkers/32]  [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
  2. Detailed example of the SQL query:
$ oc logs -n openshift-storage noobaa-core-0  | grep -i "QUERY_LOG: LONG QUERY" -A1

Mar-12 8:42:08.240 [BGWorkers/31]  [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) - 
            please check whether the DB and core pods have sufficient CPU and memory  {"tag":"T00000000|Q00098767","took":"8898.4ms","query":{"text":"SELECT * FROM datablocks WHERE (data ? 'deleted' and  NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000"},"clients_pool":{"total":10,"waiting":0,"idle":9},"stack":["    at log_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:192:17)","    at _do_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:260:19)","    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)","    at async PostgresTable.find (/root/node_modules/noobaa-core/src/util/postgres_client.js:872:25)","    at async AgentBlocksReclaimer.run_agent_blocks_reclaimer (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:39:28)"]}

Notice the query above is:

SELECT * FROM datablocks WHERE (data ? 'deleted' and  NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000
  3. This query is repeated again and again in the noobaa-core logs:
$ grep LONG noobaa-core-0-core.log -A1 |grep "SELECT \* FROM datablocks WHERE (data ? 'deleted' and  NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000"|wc -l
58

$ grep LONG noobaa-core-0-core.log|wc -l
59

This other LONG QUERY was observed only once:

Mar-12 8:21:18.466 [BGWorkers/31]  [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) - 
            please check whether the DB and core pods have sufficient CPU and memory  {"tag":"T00000000|Q00000102","took":"26794.0ms","query":{"text":"SELECT COUNT(*) FROM datachunks WHERE data->>'dedup_key'='[object Object]'"},"clients_pool":{"total":10,"waiting"
:0,"idle":9},"stack":["    at log_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:192:17)","    at _do_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:260:19)","    at process.processTicksAndRejections (node:internal/process/tas
k_queues:95:5)","    at async PostgresTable.countDocuments (/root/node_modules/noobaa-core/src/util/postgres_client.js:694:25)","    at async Promise.all (index 1)","    at async Promise.all (index 1)"]}


  4. Check the noobaa-endpoint pod logs for timeouts:
$ oc logs -n openshift-storage noobaa-endpoint-<pod-name> | grep -i timeout
Read timeout on endpoint URL: "https://s3.openshift-storage.svc:443
Error: RPC REQUEST TIMEOUT while waiting for response

  5. Obtain the PostgreSQL logs by following the How to Obtain NooBaa DB PostgreSQL Logs and Configurations - OpenShift Data Foundation (ODF) article and look for unique index creation errors:

postgresql-Wed.log:2025-02-26 13:17:18.231 UTC [925] ERROR:  could not create unique index "idx_btree_datachunks__id_index"
postgresql-Wed.log-2025-02-26 13:17:18.231 UTC [925] DETAIL:  Key ((data ->> '_id'::text))=(679319b8214872000df42e6e) is duplicated.<----
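The duplicated rows behind such an error can be enumerated directly. This is a sketch against the datachunks table named in the error above; substitute the table that matches the index reported in your own logs:

```shell
# List _id values that appear more than once in the affected table.
oc exec -n openshift-storage noobaa-db-pg-0 -- psql -d nbcore -c \
  "SELECT data->>'_id' AS id, COUNT(*)
     FROM datachunks
    GROUP BY 1 HAVING COUNT(*) > 1 LIMIT 20;"
```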

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.