Investigating and Fixing Duplicates inside NooBaa-DB PostgreSQL DB Causing Backingstore Flapping or Performance Degradation in OpenShift Data Foundation (ODF)
Environment
Red Hat OpenShift Container Platform (RHOCP) v4.x
Red Hat OpenShift Data Foundation (RHODF) v4.18 and below
Red Hat OpenShift Data Foundation (RHODF) v4.19 and above (see the minor command changes noted below)
Issue
- ODF object buckets are experiencing intermittent connectivity. This behavior is commonly observed after the manual PostgreSQL v12-to-v15 upgrade performed when upgrading from ODF v4.14 to ODF v4.15 or later. Observations include long-running queries reported by the noobaa-core pod, increased resource consumption on noobaa-db-pg-0, and timeouts on the noobaa-endpoint pod.
$ oc adm top pods | grep noobaa
NAME                               CPU      MEMORY
noobaa-core-0                      349m     667Mi
noobaa-db-pg-0                     12188m   15199Mi   <----
noobaa-endpoint-556cfc78cf-j84cb   150m     825Mi
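To spot the offending pod quickly in captured output, a small helper can parse `oc adm top pods`-style lines and flag pods above a CPU threshold. This is a sketch only; the `flag_high_cpu` function name and the 4000m threshold are arbitrary choices, not part of any Red Hat tooling:

```shell
#!/bin/sh
# flag_high_cpu: print pod names whose CPU column ("NNNm" millicores) exceeds
# a threshold. Reads `oc adm top pods`-style lines on stdin; threshold is $1.
flag_high_cpu() {
  threshold="$1"
  while read -r name cpu _mem; do
    millicores="${cpu%m}"
    case "$millicores" in
      '' | *[!0-9]*) continue ;;   # skip headers and malformed lines
    esac
    if [ "$millicores" -gt "$threshold" ]; then
      echo "$name"
    fi
  done
}

# Example with the captured output above (threshold of 4000m is arbitrary):
flag_high_cpu 4000 <<'EOF'
noobaa-core-0                      349m    667Mi
noobaa-db-pg-0                     12188m  15199Mi
noobaa-endpoint-556cfc78cf-j84cb   150m    825Mi
EOF
# prints: noobaa-db-pg-0
```

In live use, pipe the command output in directly: `oc adm top pods | grep noobaa | flag_high_cpu 4000`.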
noobaa-endpoint-1-logs.out:Mar-25 7:10:26.870 [Endpoint/13] [WARN] core.sdk.object_io:: upload_object: failed upload { bucket: <omitted>, key: '<omitted>', md_conditions: undefined, obj_id: '67e256e275e186000df74245', size: 8605, num_parts: 1, md5_b64: 'd7+e/SPZ16Qp/oflPYGzTA==', sha256_b64: '<omitted>' } [RpcError: duplicate key value violates unique constraint "idx_btree_objectmds_latest_version_index"] { rpc_code: 'INTERNAL', rpc_data: { retryable: true } }
- The resolution steps can also be applied in ODF 4.19 or higher, where noobaa-db runs in HA mode with CNPG and two noobaa-db pods. In this example, noobaa-db-pg-cluster-1 is the primary, so commands are performed only on the primary pod:

$ oc get cluster
NAME                   AGE     INSTANCES   READY   STATUS                     PRIMARY
noobaa-db-pg-cluster   2d10h   2           2       Cluster in healthy state   noobaa-db-pg-cluster-1

noobaa-db-pg-cluster-1   1/1   Running   0   2d10h
noobaa-db-pg-cluster-2   1/1   Running   0   2d10h

ODF 4.19 or higher: in the Resolution section below, some commands need a minor change:
- Where it says noobaa-db-pg-0, use the primary pod (in this example, noobaa-db-pg-cluster-1).
- Where it says /var/lib/pgsql/data/, use /var/lib/postgres/data/.
Resolution
Pre-requisite: Download the fix_id_dups.sql script attached to this solution and save it to your bastion host.
- The less aggressive way to accomplish this process is to implement the following two articles and monitor the database for data reduction. If there is a significant drop in data usage on the PVC, you can stop here. If there is not, and duplicates are still present, continue with the steps below.
- Capture the output of the database size and the table sizes by following the steps outlined in the How to Check the Size/Consumption of the PostgreSQL Database in the db-noobaa-db-pg-0 PVC article and save those outputs in case Red Hat Support requests them. Additionally, use the outputs to ensure there is enough free space in the db-noobaa-db-pg-0 PVC for the database dump (backup). If there isn't enough free space, expand the PVC by following the steps outlined in the Expanding the db-noobaa-db-pg-0 PVC solution.
- Stop the operator, core, and endpoint pods. Validate pod termination:
$ oc scale deployment -n openshift-storage noobaa-operator noobaa-endpoint --replicas=0
$ oc scale sts -n openshift-storage noobaa-core --replicas=0
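Validating termination can be scripted with a simple retry loop. The sketch below is generic (the `wait_until` helper is an assumption, not an `oc` feature); the commented-out line shows how it would be pointed at the real cluster check:

```shell
#!/bin/sh
# wait_until: retry a command every second until it succeeds or the timeout
# (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout="$1"; shift
  elapsed=0
  until "$@"; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Real-world check (assumption: no matching pods means termination is done):
#   wait_until 120 sh -c '! oc get pods -n openshift-storage | grep -qE "noobaa-(core|endpoint)"'
```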
- Perform a backup of the db and copy the db out of the pod and onto your bastion for safe-keeping (change the target directory name to match your bastion):
$ oc exec -n openshift-storage -it noobaa-db-pg-0 -- bash -c 'pg_dumpall -U postgres > /var/lib/pgsql/data/dump.sql'
- Validate the dump was successful:
$ oc exec -n openshift-storage -it noobaa-db-pg-0 -- bash -c 'tail -n100 /var/lib/pgsql/data/dump.sql'
-- PostgreSQL database dump complete <--------------- SHOULD SEE THIS
--
-- PostgreSQL database cluster dump complete
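This check can be automated by inspecting the tail of the dump file for the completion marker that pg_dumpall writes. The sketch below uses a fabricated sample file for illustration; the `dump_complete` helper is an assumption, not part of PostgreSQL:

```shell
#!/bin/sh
# dump_complete: succeed only if the last lines of the given dump file contain
# the "PostgreSQL database cluster dump complete" marker written by pg_dumpall.
dump_complete() {
  tail -n 5 "$1" | grep -q 'PostgreSQL database cluster dump complete'
}

# Illustration with a fabricated dump tail:
printf '%s\n' '-- PostgreSQL database dump complete' '--' \
  '-- PostgreSQL database cluster dump complete' > /tmp/dump_check_example.sql
if dump_complete /tmp/dump_check_example.sql; then
  echo "dump looks complete"
fi
```

Against the real pod, the same idea would be `oc exec -n openshift-storage noobaa-db-pg-0 -- tail -n 5 /var/lib/pgsql/data/dump.sql | grep -q 'cluster dump complete'`.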
- Copy the dump outside of the pod:
$ oc cp -n openshift-storage noobaa-db-pg-0:/var/lib/pgsql/data/dump.sql /<target-directory-name>/dump.sql
- Copy the script into the db pod (change the source directory name to match your bastion):
$ oc cp -n openshift-storage /<source-directory-name>/fix_id_dups.sql noobaa-db-pg-0:/var/lib/pgsql/data/fix_id_dups.sql
NOTE: To ensure/validate data integrity, the md5 checksum value is below:
md5sum fix_id_dups_0.sql
99839229a035dd01027feab1e1773fd2 fix_id_dups_0.sql
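To script the integrity check, compare the computed md5 against the expected value and stop on a mismatch. This is a sketch; the `verify_md5` helper is an assumption, and the demo hash below belongs to a throwaway file, not to fix_id_dups.sql:

```shell
#!/bin/sh
# verify_md5: succeed only if the file's md5 matches the expected hash.
verify_md5() {
  expected="$1"; file="$2"
  actual=$(md5sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Demo with a throwaway file whose md5 is well known ("hello\n"):
printf 'hello\n' > /tmp/md5_demo.txt
if verify_md5 b1946ac92492d2347c6235b4d2611184 /tmp/md5_demo.txt; then
  echo "checksum OK"
fi
```

For the real script, substitute the checksum published above and the downloaded fix_id_dups.sql path.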
- rsh into the db pod and navigate to the data folder:
$ oc -n openshift-storage rsh noobaa-db-pg-0
$ cd /var/lib/pgsql/data/
- Execute the fix_id_dups.sql script:
$ psql -d nbcore -f fix_id_dups.sql
- Connect to the nbcore database and ensure the current tables are set to the owner noobaa, then rebuild the indexes.
a. Connect to nbcore:
$ psql -d nbcore
b. Copy/Paste the following and execute:
DO $$
DECLARE
row record;
BEGIN
FOR row IN
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
LOOP
EXECUTE format('ALTER TABLE %I.%I OWNER TO noobaa', row.schemaname, row.tablename);
END LOOP;
END;
$$;
c. Reindex the nbcore database:
REINDEX DATABASE nbcore;
- Disconnect from the nbcore database, exit the db pod, and scale the NooBaa pods back up:
nbcore=# \q
$ exit
$ oc scale deployment -n openshift-storage noobaa-operator noobaa-endpoint --replicas=1
$ oc scale sts -n openshift-storage noobaa-core --replicas=1
- Allow a few minutes for the pods to come up. Once they are fully running, give ALL NooBaa pods one final restart in accordance with the 12.1. Restoring the Multicloud Object Gateway product documentation.
- Check the following command outputs to validate that NooBaa and the backingstore(s) are in PHASE: Ready status, and verify that the previous behavior is no longer apparent.
$ oc get noobaa -n openshift-storage
$ oc get backingstore -n openshift-storage
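The readiness check can also be done non-interactively by parsing the PHASE column. This is a sketch fed with sample output; the `all_ready` helper is an assumption, and the field index (`$2`) matches the sample layout shown here, so adjust it if your `oc get backingstore` output includes extra columns such as TYPE:

```shell
#!/bin/sh
# all_ready: succeed only if every non-header line has "Ready" in the PHASE
# (second) column of `oc get backingstore`-style output read from stdin.
all_ready() {
  awk 'NR > 1 && $2 != "Ready" { bad = 1 } END { exit bad }'
}

# Example against sample output
# (real usage: oc get backingstore -n openshift-storage | all_ready):
all_ready <<'EOF'
NAME                           PHASE   AGE
noobaa-default-backing-store   Ready   2d10h
EOF
```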
Root Cause
Commonly seen after upgrading to PostgreSQL 15 (ODF v4.14 -> ODF v4.15): several indexes were not created because duplicated rows violated the unique index requirements, leading to degraded performance and operations.
Related Articles:
Recovering Multicloud Object Gateway upgrade failure in ODF 4.15
Recovery NooBaa's PostgreSQL upgrade failure in OpenShift Data Foundation 4.16
Diagnostic Steps
$ oc logs -n openshift-storage noobaa-core-0 | grep -i "QUERY_LOG: LONG QUERY" | wc -l
473
Mar-7 17:34:35.245 [BGWorkers/32] [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
Mar-7 17:34:50.259 [BGWorkers/32] [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
- Detailed example of the SQL query:
$ oc logs -n openshift-storage noobaa-core-0 | grep -i "QUERY_LOG: LONG QUERY" -A1
Mar-12 8:42:08.240 [BGWorkers/31] [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
please check whether the DB and core pods have sufficient CPU and memory {"tag":"T00000000|Q00098767","took":"8898.4ms","query":{"text":"SELECT * FROM datablocks WHERE (data ? 'deleted' and NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000"},"clients_pool":{"total":10,"waiting":0,"idle":9},"stack":[" at log_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:192:17)"," at _do_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:260:19)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async PostgresTable.find (/root/node_modules/noobaa-core/src/util/postgres_client.js:872:25)"," at async AgentBlocksReclaimer.run_agent_blocks_reclaimer (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:39:28)"]}
Notice the query above is:
SELECT * FROM datablocks WHERE (data ? 'deleted' and NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000
- We have seen this query repeated again and again in the noobaa-core logs:
$ grep LONG noobaa-core-0-core.log -A1 |grep "SELECT \* FROM datablocks WHERE (data ? 'deleted' and NOT data ? 'reclaimed') ORDER BY data->>'_id' DESC LIMIT 1000"|wc -l
58
$ grep LONG noobaa-core-0-core.log|wc -l
59
and this other LONG QUERY was observed only once:
Mar-12 8:21:18.466 [BGWorkers/31] [WARN] core.util.postgres_client:: QUERY_LOG: LONG QUERY (OVER 5000 ms) -
please check whether the DB and core pods have sufficient CPU and memory {"tag":"T00000000|Q00000102","took":"26794.0ms","query":{"text":"SELECT COUNT(*) FROM datachunks WHERE data->>'dedup_key'='[object Object]'"},"clients_pool":{"total":10,"waiting"
:0,"idle":9},"stack":[" at log_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:192:17)"," at _do_query (/root/node_modules/noobaa-core/src/util/postgres_client.js:260:19)"," at process.processTicksAndRejections (node:internal/process/tas
k_queues:95:5)"," at async PostgresTable.countDocuments (/root/node_modules/noobaa-core/src/util/postgres_client.js:694:25)"," at async Promise.all (index 1)"," at async Promise.all (index 1)"]}
$ oc logs -n openshift-storage noobaa-endpoint-<pod-name> | grep -i timeout
Read timeout on endpoint URL: "https://s3.openshift-storage.svc:443
Error: RPC REQUEST TIMEOUT while waiting for response
- The duplicate key error is visible in the PostgreSQL logs (see How to Obtain NooBaa DB PostgreSQL Logs and Configurations - OpenShift Data Foundation (ODF)):
postgresql-Wed.log:2025-02-26 13:17:18.231 UTC [925] ERROR: could not create unique index "idx_btree_datachunks__id_index"
postgresql-Wed.log-2025-02-26 13:17:18.231 UTC [925] DETAIL: Key ((data ->> '_id'::text))=(679319b8214872000df42e6e) is duplicated.<----
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.