4.20 Release Notes

Red Hat OpenShift Data Foundation 4.20

Release notes for features and enhancements, known issues, and other important information.

Red Hat Storage Documentation Team

Abstract

The release notes for Red Hat OpenShift Data Foundation 4.20 summarize all new features and enhancements, notable technical changes, and any known bugs upon general availability.

Chapter 1. Overview

Red Hat OpenShift Data Foundation is software-defined storage that is optimized for container environments. It runs as an operator on OpenShift Container Platform to provide highly integrated and simplified persistent storage management for containers.

Red Hat OpenShift Data Foundation is integrated into the latest Red Hat OpenShift Container Platform to address platform services, application portability, and persistence challenges. It provides a highly scalable backend for the next generation of cloud-native applications, built on a technology stack that includes Red Hat Ceph Storage, the Rook.io Operator, and NooBaa’s Multicloud Object Gateway technology.

Red Hat OpenShift Data Foundation is designed for FIPS. When running on RHEL or RHEL CoreOS booted in FIPS mode, OpenShift Container Platform core components use the RHEL cryptographic libraries submitted to NIST for FIPS validation on only the x86_64, ppc64le, and s390x architectures. For more information about the NIST validation program, see Cryptographic Module Validation Program. For the latest NIST status of the individual versions of the RHEL cryptographic libraries submitted for validation, see Compliance Activities and Government Standards.

Red Hat OpenShift Data Foundation provides a trusted, enterprise-grade application development environment that simplifies and enhances the user experience across the application lifecycle in a number of ways:

  • Provides block storage for databases.
  • Provides shared file storage for continuous integration, messaging, and data aggregation.
  • Provides object storage for cloud-first development, archival, backup, and media storage.
  • Scales applications and data exponentially.
  • Attaches and detaches persistent data volumes at an accelerated rate.
  • Stretches clusters across multiple data centers or availability zones.
  • Establishes a comprehensive application container registry.
  • Supports the next generation of OpenShift workloads such as Data Analytics, Artificial Intelligence, Machine Learning, Deep Learning, and Internet of Things (IoT).
  • Dynamically provisions not only application containers, but also data service volumes and containers, as well as additional OpenShift Container Platform nodes, Elastic Block Store (EBS) volumes, and other infrastructure services.

1.1. About this release

Red Hat OpenShift Data Foundation 4.20 is now available. New enhancements, features, and known issues that pertain to OpenShift Data Foundation 4.20 are included in this topic.

Red Hat OpenShift Data Foundation 4.20 is supported on Red Hat OpenShift Container Platform version 4.20. For more information, see Red Hat OpenShift Data Foundation Supportability and Interoperability Checker.

For Red Hat OpenShift Data Foundation life cycle information, refer to Product Life Cycles.

1.2. Important notice regarding Regional Disaster Recovery

Important

Do not enroll new workloads for DR protection using Regional Disaster Recovery (RDR) in OpenShift Data Foundation 4.20.0.

Red Hat recommends waiting for a future release in which this issue is resolved.

Chapter 2. New features

This section describes new features introduced in Red Hat OpenShift Data Foundation 4.20.

2.1. Recipes with exec hooks for Disaster Recovery workloads

You can now use recipes with exec hooks in DR workloads. This expands support for a broader range of workloads by enabling more flexible and dynamic execution during DR operations.

2.2. Independent virtual machine (VM) DR control within namespaces

You can now perform failover and failback operations on individual VMs within a namespace, rather than being limited to managing all DR-protected VMs at once. This is valuable for scenarios like load balancing across clusters, where customers may want to migrate only a subset of VMs in a given namespace.

Chapter 3. Enhancements

This section describes the major enhancements introduced in Red Hat OpenShift Data Foundation 4.20.

3.1. Support forceful deployment of ODF

A new flag has been added to enable forceful deployment of the storage cluster. This simplifies redeployment in automated workflows.

For more information, see the knowledgebase article Install Red Hat OpenShift Data Foundation 4.X in internal-attached mode using command line interface.

3.2. ODF Multus Support with IPv6

The Multus networking feature is enhanced to support IPv6 in addition to IPv4. Either IPv4 or IPv6 can be configured for the Multus network.

For more information, see Multus architecture for OpenShift Data Foundation.
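As an illustrative sketch only, a Multus NetworkAttachmentDefinition can carry an IPv6 range through the whereabouts IPAM plugin. All names and the address range below are placeholders, not defaults:

```yaml
# Placeholder example: the network name, host interface ("master"), and
# address range are illustrative. The "range" key holds an IPv6 prefix
# here; an IPv4 CIDR works the same way.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "fd00:10:1::/64"
      }
    }'
```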

3.3. Automated key rotation and support for missing KMS in encrypted StorageClasses

With this release, annotations required for encrypted StorageClasses are now automatically added when Key Management Services (KMS) are missing. This streamlines key rotation and improves security configuration.

3.4. Pool level health status

Pool level alerts for near full and full status of the pool are now available with actionable messages.

For more information, see Resolving cluster alerts.

3.5. Multicloud Object Gateway

3.5.1. Unified CLI: mcg-cli capabilities integrated into odf-cli

Multicloud Object Gateway (MCG) commands are now available through the odf-cli utility. This enhancement consolidates ODF, Ceph, and MCG operations into a single command-line interface, eliminating the need to download and manage multiple binaries.

3.5.2. Public access limit option for S3 resources in MCG object browser

A new option is available in the MCG object browser to configure public access limits for S3 resources. This enhancement improves control over data exposure and strengthens security for object storage.

For more information about this procedure, see Setting up public access limit to S3 resources using MCG object browser.

3.5.3. Option to disable external access routes to MCG

A new configuration option is available to disable all routes that enable external access to the Multicloud Object Gateway (MCG). This feature helps ensure that MCG services are only accessible within the OpenShift environment.

For more information, see Securing Multicloud Object Gateway.

3.5.4. Bucket-Level metrics for replication state

New metrics have been introduced to provide detailed visibility into the replication progress of the buckets. These metrics help determine data safety and availability on the secondary site. The following metrics are now available per bucket, per replication cycle:

  • Total number of objects scanned
  • Number of objects successfully replicated
  • Number of objects that failed to replicate

For more information, see Obtaining metrics to reflect bucket replication state.

3.5.5. MCG introduces Metrics and AlertRule for detecting failures relating to noobaa-db

A new alert detects failures relating to noobaa-db, improving visibility into the health of the NooBaa DB. It helps identify internal failures where components are disconnected from each other even though each component on its own reports a ready state.

For more information, see Resolving alerts and errors.

3.6. Alert triggered when CSI clones near soft limit

An alert is triggered when CSI clones or snapshots approach the soft limit of 200. This notification recommends switching to volume snapshot cloning for better performance, helping to avoid delays caused by excessive clone operations.

3.7. Improvements to disaster recovery uninstall workflow

This release introduces improvements to the DR uninstall workflow, streamlining the removal of resources created during various stages of DR deployment. Previously, uninstalling DR components required manual cleanup across multiple layers. With this enhancement, the uninstall process is more intuitive and automated.

3.8. Smarter resource merging for component configuration

Resource requirements for OpenShift Data Foundation components are now merged with default values instead of being fully replaced. If a user specifies only one resource type such as memory, OpenShift Data Foundation will automatically apply default values for the other type, like CPU. Previously, partial specifications caused missing fields to be dropped, leading to components running without complete resource settings and resulting in unpredictable performance. This enhancement ensures safer and more balanced configurations with minimal user input.
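For example, an override that sets only memory for one component (the component key and value below are illustrative) now has the default CPU values merged in, instead of leaving them unset:

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      limits:
        memory: "16Gi"   # user-specified value; illustrative
      requests:
        memory: "16Gi"   # default CPU requests/limits are merged in automatically
```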

(DFBUGS-426)

3.9. Selective merging of placement configurations

Placement configuration for components has been improved to support selective merging. Previously, specifying any placement section such as, Tolerations would override the entire default placement, leaving other sections like Node Affinity or Topology Spread Constraints empty. This caused incomplete or suboptimal placement configurations.

With this enhancement, OpenShift Data Foundation now merges user-defined placement values with the default configuration. Users can specify only the sections they want to customize, while defaults for other placement types are preserved, ensuring consistent and expected placement behavior for components.
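For example, a placement override that specifies only tolerations (the taint key below is illustrative) no longer clears the default node affinity or topology spread constraints:

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:
      tolerations:
      - key: example.com/dedicated   # illustrative taint key
        operator: Exists
        effect: NoSchedule
      # defaults for nodeAffinity and topologySpreadConstraints are preserved
```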

(DFBUGS-3835)

3.10. Support custom SCCs for VolSync DataMover pods

VolSync DataMover pods could not access data when using custom Security Context Constraints (SCCs), leading to sync failures in clusters with custom configurations. The DRPC spec now includes a VolSyncSpec field that allows users to configure the following:

  • MoverSecurityContext: Defines the PodSecurityContext for DataMover pods
  • MoverServiceAccount: Specifies a custom Kubernetes ServiceAccount for fine-grained RBAC control

These configurations are propagated to the VolumeReplicationGroup (VRG), ensuring VolSync components inherit the required privileges for successful operation.
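A DRPC using the new fields might look like the following sketch. The names are placeholders, and the YAML casing of the fields is an assumption based on the Go field names above:

```yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-drpc            # placeholder
  namespace: busybox-sample     # placeholder
spec:
  volSyncSpec:
    moverSecurityContext:       # PodSecurityContext applied to DataMover pods
      runAsUser: 1000
      fsGroup: 1000
    moverServiceAccount: custom-volsync-sa   # ServiceAccount bound to a custom SCC
```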

(DFBUGS-3713)

3.11. Configurable memory and CPU for kube-rbac-proxy in ocs-metrics-exporter

Users can now configure memory and CPU for kube-rbac-proxy pods through the custom resource (CR). This enhancement addresses out-of-memory (OOM) issues encountered during ocs-metrics-exporter operations; by adjusting the resources, users can avoid these issues.
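As a sketch, the override follows the standard Kubernetes resource-requirements shape. The exact custom resource and the key under which this fragment is placed are assumptions, so confirm against the product documentation:

```yaml
resources:
  kube-rbac-proxy:
    requests:
      cpu: "50m"       # illustrative values
      memory: "64Mi"
    limits:
      memory: "128Mi"
```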

(DFBUGS-3286)

Chapter 4. Behavior changes

4.2. Warning when there is a conflict between two pods using the same volume

OpenShift now fires the following warning when there is a conflict between two pods using the same volume: selinux_warning_controller_selinux_volume_conflict.

Chapter 5. Technology previews

This section describes the technology preview features introduced in Red Hat OpenShift Data Foundation 4.20 under Technology Preview support limitations.

Important

Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

Technology Preview features are provided with a limited support scope, as detailed on the Customer Portal: Technology Preview Features Support Scope.

5.1. ARM architecture support in OpenShift Data Foundation

OpenShift Data Foundation components can now run on ARM-based clusters.

Chapter 6. Developer previews

This section describes the developer preview features introduced in Red Hat OpenShift Data Foundation 4.20.

Important

Developer preview features are subject to Developer Preview support limitations. Developer preview releases are not intended to be run in production environments. Clusters deployed with developer preview features are considered development clusters and are not supported through the Red Hat Customer Portal case management system. If you need assistance with developer preview features, reach out to the ocs-devpreview@redhat.com mailing list and a member of the Red Hat Development Team will assist you as quickly as possible based on availability and work schedules.

6.1. Erasure coding support in internal mode for RBD and CephFS

Erasure coding support for OpenShift Data Foundation (ODF) internal mode helps to enhance storage efficiency and reduce infrastructure costs. Erasure coding enables more space-efficient data protection compared to traditional replication, making it an ideal solution for customers seeking scalable and cost-effective storage.

For more information, see the knowledgebase article, RBD and CephFS Erasure Coding in Internal Mode.

6.2. Enhanced Disaster Recovery validation with odf dr

The odf dr tool provides streamlined validation for disaster recovery setups and protected applications. This tool reduces manual effort across clusters and enables faster, more automated issue detection and reporting.

6.3. Improved object quota enforcement

This update enhances control over object quotas by optimizing the timing of size calculations. Previously, enforcement relied on a 2-minute statistics cycle, which could allow excessive uploads and risk storage exhaustion. The new approach reduces latency in quota checks, enabling more responsive and accurate enforcement.

For more information, see Quota support for OBC provided by ODF Multicloud Object Gateway.

6.4. Incremental snapshot support for RBD

Support for incremental snapshots in RBD volumes improves snapshot performance and storage efficiency. Unlike full snapshots, incremental snapshots capture only the changes since the last snapshot, reducing both time and storage overhead.

For more information, see the knowledgebase article, Configure external-snapshot-metadata sidecar container for RBD deployment.

6.5. Persistent Volume health metrics

PV-level health monitoring provides detailed status for individual persistent volumes. This enhancement helps administrators to more easily detect and diagnose issues, which can improve operational visibility and reduce troubleshooting time.

For more information, see the knowledgebase article, Enabling VolumeCondition reporting for CephFS PersistentVolumeClaims.

Chapter 7. Bug fixes

This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.20.

7.1. Multicloud Object Gateway

  • Noobaa certificate verification for NamespaceStore endpoints

    Previously, missing validation of CA bundle when mounting NamespaceStore endpoints caused failures in loading and consuming provided CA bundles. Validation for CA bundles has now been added to ensure proper certificate verification.

    (DFBUGS-2712)

  • Support for AWS region ap-east-2 in Noobaa operator

    Previously, the ap-east-2 region was missing from the MCG operator-supported regions list, preventing creation of a default BackingStore when deployed in this region. The missing region has now been added to the supported list.

    (DFBUGS-2802)

  • Noobaa no longer fails to issue deletes to RGW

    A configuration change caused delays in deleting large numbers of small objects from the underlying RGW storage. This impacted performance during high-volume delete operations. The issue was resolved by reverting the configuration change, eliminating the delay in deletion from the underlying storage.

    (DFBUGS-2916)

7.2. Disaster recovery

  • ACM console view persistence on hard refresh

    Previously, a hard refresh from the ACM console caused the view to revert to the OCP (local-cluster) console. This was because Multicluster Orchestrator console routes were not registered properly for ACM (all clusters) view, which disrupted the expected navigation behavior. The routing logic has now been corrected, and refreshing the browser no longer changes the active view. Users remain in the ACM console as intended.

    (DFBUGS-4061)

  • DR status now visible for VMs

    The DR Status was missing on the VM list page, and the Remove disaster recovery option was not available when managing the VMs protected using label selectors. This happened because the UI could not correctly identify the VM’s cluster and its DRPC.

    The issue was fixed by reading the VM cluster from the correct field and improving how DRPCs are parsed when label selectors are used. Now, both the DR Status and the Remove disaster recovery options work as expected.

    (DFBUGS-4286)

  • Disabling DR for a CephFS application with consistency groups enabled no longer leaves some resources behind

    Disabling DR for a CephFS application with consistency groups enabled no longer leaves any resources behind. Manual cleanup is no longer required.

    (DFBUGS-2950)

  • s3StoreProfile in ramen-hub-operator-config after upgrade from 4.18 to 4.19

    Previously, after upgrading from 4.18 to 4.19, the ramen-hub-operator-config ConfigMap was overwritten with default values from the Ramen-hub CSV. This caused loss of custom S3Profiles and other configurations added by the Multicluster Orchestrator (MCO) operator. The issue has been fixed to preserve custom entries during upgrade, preventing disruption in S3 profile configurations.

    (DFBUGS-3634)

  • virtualmachines.kubevirt.io resource no longer fails restore due to mac allocation failure on relocate

    Previously, when a virtual machine was relocated back to the preferred cluster, the relocation could fail because its MAC address was unavailable. This occurred if the virtual machine was not fully cleaned up on the preferred cluster after being failed over to the failover cluster. This cleanup process has been corrected, ensuring successful relocation to the preferred cluster.

    (BZ#2295404)

  • Failover process no longer fails when the ReplicationDestination resource has not been created yet

    Previously, if the user initiated a failover before the LastGroupSyncTime was updated, the failover process would fail. This failure was accompanied by an error message indicating that the ReplicationDestination does not exist.

    This issue has been resolved, and failover works as expected.

    (DFBUGS-632)

  • After Relocation of consistency groups based workload, synchronization no longer stops

    Previously, when applications using CephRBD volumes with volume consistency groups were running and the secondary managed cluster went offline, replication for these volumes could stop indefinitely—even after the secondary cluster came back online. During this condition, the Volume SynchronizationDelay alert was triggered, starting with a Warning status and later escalating to Critical, indicating replication had ceased for the affected volumes. This issue has been resolved to ensure replication resumes automatically when the secondary cluster is restored.

    (DFBUGS-3812)

7.3. Rook

  • Ceph monitor endpoints fully visible

    Previously, only one of the three Ceph monitor endpoints was visible due to missing entries in the CSI ConfigMap. This left CSI with a single mon endpoint and no fault tolerance.

    The issue was fixed by adding all monitor endpoints to the ConfigMap. Now, all mons are visible, and CSI communication is fault-tolerant.

    (DFBUGS-4344)

7.4. OpenShift Data Foundation console

  • Fixed StorageSystem creation wizard issues

    Previously, the Network Type field for Host was missing, resulting in empty network details and a misleading tooltip that described Multus instead of the actual host configuration. This caused confusion in the summary view, where users saw no network information and an inaccurate tooltip.

    With this update, the tooltips were removed and replaced with radio buttons featuring correct labels and descriptions.

    (DFBUGS-2582)

  • Force delete option restored for stuck StorageConsumer

    Previously, users were unable to forcefully delete a StorageConsumer resource if it was stuck in a deletion state due to the presence of a deletionTimeStamp.

    This issue has been resolved by updating the Actions menu to enable Delete StorageConsumer even when a deletionTimeStamp is present. As a result, you can force delete StorageConsumer resources when required.

    (DFBUGS-2819)

  • Fix for Disaster Recovery misconfiguration after upgrade from v4.17.z to v4.18

    Previously, the upgrade process resulted in incorrect DR resource configurations, impacting workloads that rely on ocs-storagecluster-ceph-rbd and ocs-storagecluster-ceph-rbd-virtualization storage classes.

    With this fix, the DR resources are correctly configured after the upgrade.

    (DFBUGS-1804)

  • Warning message in the UI right after creation of StorageCluster no longer appears

    Previously, a warning popup appeared in the UI during the creation of a StorageSystem or StorageCluster. This was caused by the Virtualization StorageClass not being annotated with storageclass.kubevirt.io/is-default-virt-class: "true", by default, after deployment.

    With this fix, the required annotation is applied automatically, preventing unnecessary warnings.

    (DFBUGS-2921)

  • PVC type misclassification resolved in UI

    Previously, the UI incorrectly displayed block PVCs as filesystem PVCs due to an outdated filtering method that relied on assumptions based on VRG naming conventions. This led to confusion, as the PVC type was inaccurately reported.

    To address this, the filter distinguishing block and filesystem PVCs is removed, acknowledging that a group can contain both types. This change eliminates misclassification and ensures accurate representation of PVCs in the UI.

    (DFBUGS-4219)

  • Bucket Lifecycle rule deletion now supported

    Previously, it was not possible to delete the last remaining bucket lifecycle rule due to a backend error: attempting to update the LifecycleConfiguration with empty rules triggered a 500 response.

    This has been fixed by switching to deleteBucketLifecycle for cases where the entire lifecycle configuration needs to be cleaned up. As a result, you can delete all bucket lifecycle rules without encountering errors.

    (DFBUGS-2960)

  • CephFS volume filtering corrected in the UI

    Previously, the UI filtering for CephFS volumes did not function correctly and mistakenly excluded CephFS PVCs when the "block" option was selected. This was due to an outdated filtering method based on VRG naming assumptions that no longer apply.

    To resolve this, the block/filesystem filter is removed, recognizing that a group might contain both types of PVCs. This fix eliminates misclassification and ensures accurate display of CephFS volumes in the UI.

    (DFBUGS-4065)

  • Alert for essential OpenShift Data Foundation pods down during capacity addition

    Previously, there was no test to check if the essential OpenShift Data Foundation pods were working, leading to an error when adding capacity.

    To address this issue, if essential pods are down when attempting to add capacity, the user is alerted and not allowed to proceed.

    (DFBUGS-1755)

  • Support external Red Hat Ceph Storage deployment on KubeVirt nodes

    Previously, on OpenShift Container Platform deployed on KubeVirt nodes, there was no option to deploy OpenShift Data Foundation with external Red Hat Ceph Storage (RHCS) due to the Infrastructure CR reporting oVirt and KubeVirt as separate platforms.

    With this fix, KubeVirt is added to the allowed list of platforms. As a result, you can create or link external RHCS storage systems from the UI.

    (DFBUGS-4018)

7.5. OCS operator

  • Missing Toleration for Prometheus Operator in ROSA HCP Deployments

    Previously, the prometheus-operator pod in Red Hat OpenShift Service on AWS (ROSA) with hosted control planes (HCP) was missing the required tolerations, so the pod had to be manually patched after creation.

    With this fix, the tolerations are correctly applied during deployment, eliminating the need for manual intervention.

    (DFBUGS-1272)

  • Service "ocs-provider-server" is invalid: spec.ports[0].nodePort: Invalid value: 31659: provided port is already allocated error no longer appears while reconciling

    Previously, the ocs-operator deployed a service using port 31659, which could conflict with an existing nodePort service that is already using the same port. This conflict caused the ocs-operator deployment to fail, resulting in upgrade reconciliation getting stuck.

    With this fix, the port allocation is handled more safely to avoid clashes with existing services.

    (DFBUGS-1831)

  • ocs-metrics-exporter inherits node selector

    Previously, the ocs-metrics-exporter did not inherit the node selector configuration, causing scheduling issues. This has been resolved by ensuring the node selector is properly applied, as detailed in this Red Hat Solution.

    (DFBUGS-3728)

7.6. Ceph monitoring

  • Clone count alert now fires promptly when 200+ clones are created

    The clone count alert was previously stuck in a Pending state and failed to fire in a timely manner when over 200 clones were created. This was caused by the alert’s firing threshold being set to 30 minutes, resulting in a long delay. To resolve this, the firing time was reduced from 30 minutes to 30 seconds. As a result, the alert now fires as expected, providing timely notifications when the clone count exceeds the threshold.

    (DFBUGS-3869)

  • Correct runbook URL for HighRBDCloneSnapshotCount alert

    The runbook URL linked to the 'HighRBDCloneSnapshotCount' alert was previously incorrect, leading users to a non-existent help page. This issue has been fixed by updating the alert configuration with the correct URL.

    (DFBUGS-3949)

Chapter 8. Known issues

This section describes the known issues in Red Hat OpenShift Data Foundation 4.20.

8.1. Disaster recovery

  • Regional-DR is not supported in environments deployed on IBM Z

    Regional-DR is not supported in OpenShift Data Foundation environments deployed on IBM Z because ACM 2.15 is not supported on this platform for this release. This impacts both new and upgraded deployments on IBM Z.

    (DFBUGS-5369)

  • Node crash results in kubelet service failure causing Data Foundation in error state

    An unexpected node crash in an OpenShift cluster might leave the node stuck in a NotReady state and affect the storage cluster.

    Workaround:

  • Get the pending CSR:

    $ oc get csr | grep Pending
  • Approve the pending CSR:

    $ oc adm certificate approve <csr>

    (DFBUGS-3636)

  • CIDR range does not persist in csiaddonsnode object when the respective node is down

    When a node is down, the Classless Inter-Domain Routing (CIDR) information disappears from the csiaddonsnode object. This impacts the fencing mechanism when it is required to fence the impacted nodes.

    Workaround: Collect the CIDR information immediately after the NetworkFenceClass object is created.

    (DFBUGS-2948)

  • After node replacement, new mon pod is failing to schedule

    After node replacement, the new mon pod fails to schedule on the newly added node. As a result, the mon pod is stuck in the Pending state, which impacts the storagecluster status, with one mon unavailable.

    Workaround: Manually update the new mon deployment with the correct nodeSelector.
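    For example, the mon deployment can be pinned to the new node with a nodeSelector. The fragment below is a sketch; the hostname is a placeholder for the newly added node:

    ```yaml
    # Fragment of the mon deployment spec; "new-node-1" is a placeholder
    # for the hostname of the newly added node.
    spec:
      template:
        spec:
          nodeSelector:
            kubernetes.io/hostname: new-node-1
    ```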

    (DFBUGS-2918)

  • ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode

    When a CRUSH rule in a Red Hat Ceph Storage cluster has multiple take steps, the ceph df report shows the wrong maximum available size for associated pools.

    (DFBUGS-1748)

  • DRPCs protect all persistent volume claims created on the same namespace

    On the hub cluster, when a namespace hosts multiple disaster recovery (DR) protected workloads, a DRPlacementControl resource that does not isolate PVCs by workload using its spec.pvcSelector field protects all the persistent volume claims (PVCs) within that namespace.

    As a result, PVCs can match the spec.pvcSelector of multiple DRPlacementControl resources across workloads. Or, if the selector is missing across all workloads, replication management can manage each PVC multiple times, leading to data corruption or invalid operations based on individual DRPlacementControl actions.

    Workaround: Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.

    Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.
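    For example, after labeling one workload's PVCs (the label key and value below are illustrative), a DRPlacementControl created from the command line can select only that subset:

    ```yaml
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPlacementControl
    metadata:
      name: app1-drpc        # placeholder
      namespace: app1-ns     # placeholder
    spec:
      pvcSelector:
        matchLabels:
          app: app1          # only PVCs labeled app=app1 are protected by this DRPC
    ```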

    (DFBUGS-1749)

  • Disabled PeerReady flag prevents changing the action to Failover

    The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete due to the cluster being offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.

    Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.

    (DFBUGS-665)

  • Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster

    When two data centers lose connection with each other but are still connected to the Arbiter node, there is a flaw in the election logic that causes an infinite election among Ceph Monitors. As a result, the Monitors are unable to elect a leader and the Ceph cluster becomes unavailable. Also, IO is paused during the connection loss.

    Workaround: Shut down the monitors of any one data zone by bringing down the zone nodes. Additionally, you can reset the connection scores of the surviving Monitor pods.

    As a result, the Monitors can form a quorum, Ceph becomes available again, and I/O resumes.

    (DFBUGS-425)

  • RBD applications fail to Relocate when using stale Ceph pool IDs from replacement cluster

    For applications created before the new peer cluster exists, the RBD PVC cannot be mounted, because the CephBlockPoolID mapping in the CSI configmap is not updated when a peer cluster is replaced.

    Workaround: Update the rook-ceph-csi-mapping-config configmap with the CephBlockPoolID mapping on the peer cluster that was not replaced. This enables mounting the RBD PVC for the application.
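    For example (assuming the default openshift-storage namespace):

```shell
# Inspect the current CephBlockPoolID mapping on the surviving cluster
oc get configmap rook-ceph-csi-mapping-config -n openshift-storage -o yaml

# Add or correct the mapping between the old and new pool IDs, then save
oc edit configmap rook-ceph-csi-mapping-config -n openshift-storage
```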

    (DFBUGS-527)

  • Information about lastGroupSyncTime is lost after hub recovery for the workloads which are primary on the unavailable managed cluster

    Applications that were previously failed over to a managed cluster do not report a lastGroupSyncTime, which triggers the VolumeSynchronizationDelay alert. This happens because, when the ACM hub and a managed cluster that are part of the DRPolicy become unavailable, a new ACM hub cluster is reconstructed from the backup.

    Workaround: If the managed cluster to which the workload was failed over is unavailable, you can still failover to a surviving managed cluster.

    (DFBUGS-376)

  • MCO operator reconciles the veleroNamespaceSecretKeyRef and CACertificates fields

    When the OpenShift Data Foundation operator is upgraded, the CACertificates and veleroNamespaceSecretKeyRef fields under s3StoreProfiles in the Ramen config are lost.

    Workaround: If the Ramen config has the custom values for the CACertificates and veleroNamespaceSecretKeyRef fields, then set those custom values after the upgrade is performed.
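    For example (a sketch; the ConfigMap name and namespace below are the usual defaults for the Ramen hub operator and may differ in your deployment):

```shell
# Re-add the custom CACertificates and veleroNamespaceSecretKeyRef values
# under s3StoreProfiles after the upgrade
oc edit configmap ramen-hub-operator-config -n openshift-operators
```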

    (DFBUGS-440)

  • For discovered apps with CephFS, sync stops after failover

    For CephFS-based workloads, synchronization of discovered applications may stop at some point after a failover or relocation. This can occur with a Permission Denied error reported in the ReplicationSource status.

    Workaround:

    • For Non-Discovered Applications

      • Delete the VolumeSnapshot:

        $ oc delete volumesnapshot -n <vrg-namespace> <volumesnapshot-name>

        The snapshot name usually starts with the PVC name followed by a timestamp.

      • Delete the VolSync Job:

        $ oc delete job -n <vrg-namespace> <pvc-name>

        The job name matches the PVC name.

    • For Discovered Applications

      Use the same steps as above, except that <vrg-namespace> is replaced by the application workload namespace, not the VRG namespace.

    • For Workloads Using Consistency Groups

      • Delete the ReplicationGroupSource:

        $ oc delete replicationgroupsource -n <namespace> <name>
      • Delete All VolSync Jobs in that Namespace:

        $ oc delete jobs --all -n <namespace>

        In this case, <namespace> refers to the namespace of the workload (either discovered or not), and <name> refers to the name of the ReplicationGroupSource resource.

        (DFBUGS-2883)

  • Remove DR option is not available for discovered apps on the Virtual machines page

    The Remove DR option is not available for discovered applications listed on the Virtual machines page.

    Workaround:

    1. Add the missing label to the DRPlacementControl:

      $ oc label drplacementcontrol <drpcname> \
      odf.console.selector/resourcetype=virtualmachine \
      -n openshift-dr-ops
    2. Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:

      $ oc patch drplacementcontrol <drpcname> \
      -n openshift-dr-ops \
      --type='merge' \
      -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'

      (DFBUGS-2823)

  • DR Status is not displayed for discovered apps on the Virtual machines page

    DR Status is not displayed for discovered applications listed on the Virtual machines page.

    Workaround:

    1. Add the missing label to the DRPlacementControl:

      $ oc label drplacementcontrol <drpcname> \
      odf.console.selector/resourcetype=virtualmachine \
      -n openshift-dr-ops
    2. Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:

      $ oc patch drplacementcontrol <drpcname> \
      -n openshift-dr-ops \
      --type='merge' \
      -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'

      (DFBUGS-2822)

  • PVCs deselected after failover do not clean up stale entries in the secondary VRG, causing the subsequent relocate to fail

    If PVCs were deselected after a workload failover and a subsequent relocate operation is performed back to the preferredCluster, stale PVCs may still be reported in the VRG. As a result, the DRPC may report its Protected condition as False, with a message similar to the following:

    VolumeReplicationGroup (/) on cluster is not reporting any lastGroupSyncTime as primary, retrying till status is met.

    Workaround:

    To resolve this issue, manually clean up the stale PVC entries (that is, those deselected after failover) from the VRG status.

    1. Identify the stale PVCs that were deselected after failover and are no longer intended to be protected.
    2. Edit the VRG status on the ManagedCluster named <managed-cluster-name>:

      $ oc edit vrg <vrg-name> -n <vrg-namespace> --subresource=status
    3. Remove the stale PVC entries from the status.protectedPVCs section.

      Once the stale entries are removed, the DRPC will recover and report as healthy.
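      To help identify the stale entries, the PVC names recorded in the VRG status can be listed, for example:

```shell
# List PVC names currently recorded in the VRG status
oc get vrg <vrg-name> -n <vrg-namespace> \
  -o jsonpath='{.status.protectedPVCs[*].name}'
```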

      (DFBUGS-2932)

  • Secondary PVCs aren’t removed when DR protection is removed for discovered apps

    On the secondary cluster, CephFS PVCs linked to a workload are usually managed by the VolumeReplicationGroup (VRG). However, when a workload is discovered using the Discovered Applications feature, the associated CephFS PVCs are not marked as VRG-owned. As a result, when the workload is disabled, these PVCs are not automatically cleaned up and become orphaned.

    Workaround: To clean up the orphaned CephFS PVCs after disabling DR protection for a discovered workload, manually delete them using the following command:

    $ oc delete pvc <pvc-name> -n <pvc-namespace>

    (DFBUGS-2827)

  • DRPC in Relocating state after minor upgrade

    After upgrading from version 4.19 to 4.20, the DRPC (Disaster Recovery Placement Control) may enter a Relocating state. During this process, a new VGR (VolumeGroupReplication) is created with a different naming convention, resulting in two VGRs attempting to claim the same PVC. This conflict can cause temporary instability in the DRPC status.

    Workaround: Delete the old VGR (the one with the previous naming convention). The new VGR will then successfully claim the PVC, and the DRPC will return to a healthy state after some time.
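    For example (names are placeholders):

```shell
# List the VolumeGroupReplication resources; the stale one
# uses the old naming convention
oc get volumegroupreplication -n <workload-namespace>

# Delete the stale VGR so the new one can claim the PVC
oc delete volumegroupreplication <old-vgr-name> -n <workload-namespace>
```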

    (DFBUGS-4450)

  • Ceph in warning state after adding capacity to cluster

    After device replacement or capacity addition, Ceph can be in HEALTH_WARN state with the mon reporting slow ops. However, there is no impact to the usability of the cluster.

    (DFBUGS-1273)

  • OSD pods restart during add capacity

    OSD pods restart after performing cluster expansion by adding capacity to the cluster. However, no impact to the cluster is observed apart from the pod restarts.

    (DFBUGS-1426)

  • Sync stops after PVC deselection

    When a PersistentVolumeClaim (PVC) is added to or removed from a group by modifying its label to match or unmatch the group criteria, sync operations may unexpectedly stop. This occurs due to stale protected PVC entries remaining in the VolumeReplicationGroup (VRG) status.

    Workaround:

    Manually edit the VRG’s status field to remove the stale protected PVC:

    $ oc edit vrg <vrg-name> -n <vrg-namespace> --subresource=status

    (DFBUGS-4012)

  • PVs Remain Stuck in Released State After Workload Deletion

    After the final sync, all temporary PVs and PVCs are deleted; however, for some PVs the persistentVolumeReclaimPolicy remains set to Retain, causing those PVs to stay in the Released state.

    Workaround:

    Edit the PV's persistentVolumeReclaimPolicy using the command:

    $ oc edit pv <pv-name>

    Change persistentVolumeReclaimPolicy to Delete. The stuck PVs are then removed.
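    Alternatively, the reclaim policy can be patched non-interactively:

```shell
# Set the reclaim policy to Delete without opening an editor
oc patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
```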

    (DFBUGS-4535)

8.2. Multicloud Object Gateway

  • Unable to create new OBCs using Noobaa

    When provisioning an NSFS bucket via an ObjectBucketClaim (OBC), the default filesystem path is expected to use the bucket name. However, if a path is set in OBC.Spec.AdditionalConfig, it should take precedence. This behavior is currently inconsistent, resulting in failures when creating new OBCs.
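    For reference, a sketch of an OBC that sets the filesystem path explicitly (field names follow the ObjectBucketClaim API; the storage class and path are placeholders):

```shell
oc apply -f - <<'EOF'
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: my-nsfs-bucket
  namespace: openshift-storage
spec:
  generateBucketName: my-nsfs-bucket
  storageClassName: <nsfs-storage-class>
  additionalConfig:
    path: <filesystem-path>
EOF
```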

    (DFBUGS-3817)

8.3. Ceph

  • Poor CephFS performance on stretch clusters

    Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server pods (MDS) on multi-site Data Foundation clusters.

    (DFBUGS-1753)

  • SELinux relabelling issue with a very high number of files

    When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and is tied to how SELinux relabelling is handled by the kubelet. The issue is observed with any filesystem-based volume that has a very high file count. In OpenShift Data Foundation, it is seen when using CephFS-based volumes with a very high number of files. There are multiple ways to work around this issue; depending on your business needs, you can choose one of the workarounds from the knowledge base solution https://access.redhat.com/solutions/6221251.

    (Jira#3327)

8.4. CSI Addons

  • Orphaned CSIAddonsNode CRs leading to errant sidecar connection attempts by the csi-addons-controller-manager pod

    When a worker node is deleted, a stale CSIAddonsNode CR owned by the DaemonSet remains in the cluster. This results in repeated failed connection attempts by csi-addons to an endpoint that no longer exists.

    Workaround: Manually identify and delete the CSIAddonsNode resources linked to the removed worker node.

    $ oc get csiaddonsnodes --no-headers | awk '{print $1}' | \
      grep -v -f <(oc get nodes --no-headers | awk '{print $1}') | \
      xargs -r oc delete csiaddonsnode

    (DFBUGS-4466)

8.5. OpenShift Data Foundation console

  • UI temporarily shows an "Unauthorized" error and a blank loading screen during ODF operator installation

    During OpenShift Data Foundation operator installation, the InstallPlan sometimes transiently goes missing, which causes the page to show an unknown status. This does not happen regularly. As a result, the title and messages go missing for a few seconds.

    (DFBUGS-3574)

  • Optimize DRPC creation when multiple workloads are deployed in a single namespace

    When multiple applications refer to the same placement, enabling DR for any one of them enables it for all applications that refer to that placement.

    If the applications are created after the creation of the DRPC, the PVC label selector in the DRPC might not match the labels of the newer applications.

    Workaround: In such cases, it is recommended to disable DR and enable it again with the correct label selector.

    (DFBUGS-120)

8.6. ODF-CLI

  • ODF-CLI tools misidentify stale volumes

    The stale subvolume CLI tool misidentifies valid CephFS persistent volume claims (PVCs) as stale due to an issue in the stale subvolume identification tool. As a result, the stale subvolume identification functionality is not available until the issue is fixed.

    (DFBUGS-3778)