4.20 Release Notes
Abstract
Release notes for features and enhancements, known issues, and other important information.
Chapter 1. Overview
Red Hat OpenShift Data Foundation is software-defined storage that is optimized for container environments. It runs as an operator on OpenShift Container Platform to provide highly integrated and simplified persistent storage management for containers.
Red Hat OpenShift Data Foundation is integrated into the latest Red Hat OpenShift Container Platform to address platform services, application portability, and persistence challenges. It provides a highly scalable backend for the next generation of cloud-native applications, built on a technology stack that includes Red Hat Ceph Storage, the Rook.io Operator, and NooBaa’s Multicloud Object Gateway technology.
Red Hat OpenShift Data Foundation is designed for FIPS. When running on RHEL or RHEL CoreOS booted in FIPS mode, OpenShift Container Platform core components use the RHEL cryptographic libraries submitted to NIST for FIPS validation on only the x86_64, ppc64le, and s390x architectures. For more information about the NIST validation program, see Cryptographic Module Validation Program. For the latest NIST status for the individual versions of the RHEL cryptographic libraries submitted for validation, see Compliance Activities and Government Standards.
Red Hat OpenShift Data Foundation provides a trusted, enterprise-grade application development environment that simplifies and enhances the user experience across the application lifecycle in a number of ways:
- Provides block storage for databases.
- Provides shared file storage for continuous integration, messaging, and data aggregation.
- Provides object storage for cloud-first development, archival, backup, and media storage.
- Scales applications and data exponentially.
- Attaches and detaches persistent data volumes at an accelerated rate.
- Stretches clusters across multiple data centers or availability zones.
- Establishes a comprehensive application container registry.
- Supports the next generation of OpenShift workloads such as Data Analytics, Artificial Intelligence, Machine Learning, Deep Learning, and Internet of Things (IoT).
- Dynamically provisions not only application containers, but also data service volumes and containers, as well as additional OpenShift Container Platform nodes, Elastic Block Store (EBS) volumes, and other infrastructure services.
1.1. About this release
Red Hat OpenShift Data Foundation 4.20 is now available. New enhancements, features, and known issues that pertain to OpenShift Data Foundation 4.20 are included in this topic.
Red Hat OpenShift Data Foundation 4.20 is supported on Red Hat OpenShift Container Platform version 4.20. For more information, see the Red Hat OpenShift Data Foundation Supportability and Interoperability Checker.
For Red Hat OpenShift Data Foundation life cycle information, refer to Product Life Cycles.
1.2. Important notice regarding Regional Disaster Recovery
Do not enroll new workloads for DR protection using Regional Disaster Recovery (RDR) in OpenShift Data Foundation 4.20.0.
It is recommended to wait for a future release where this issue will be resolved.
Chapter 2. New features
This section describes new features introduced in Red Hat OpenShift Data Foundation 4.20.
2.1. Recipes with exec hooks for Disaster Recovery workloads
You can now use recipes with exec hooks in DR workloads. This expands support for a broader range of workloads by enabling more flexible and dynamic execution during DR operations.
2.2. Independent virtual machine (VM) DR control within namespaces
You can now perform failover and failback operations on individual VMs within a namespace, rather than being limited to managing all DR-protected VMs at once. This is valuable for scenarios like load balancing across clusters, where customers may want to migrate only a subset of VMs in a given namespace.
Chapter 3. Enhancements
This section describes the major enhancements introduced in Red Hat OpenShift Data Foundation 4.20.
3.1. Support forceful deployment of ODF
A new flag has been added to enable forceful deployment of the storage cluster. This helps with automated redeployment.
For more information, see the knowledgebase article Install Red Hat OpenShift Data Foundation 4.X in internal-attached mode using command line interface.
3.2. ODF Multus Support with IPv6
The Multus networking feature is enhanced to support IPv6 in addition to IPv4. Either IPv4 or IPv6 can be configured for the Multus network.
For more information, see Multus architecture for OpenShift Data Foundation.
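The documented pattern for Multus networks in OpenShift Data Foundation is a macvlan NetworkAttachmentDefinition with whereabouts IPAM. The following is a hedged sketch of such a definition using an IPv6 range; the name, namespace, host interface (ens3), and address range are placeholders, not values from this release:

```yaml
# Sketch: a NetworkAttachmentDefinition for an IPv6 Multus network.
# Adjust the master interface and the range to your environment.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "fd00:10:1::/64"
      }
    }
```

An IPv4 network would use the same structure with an IPv4 CIDR in the range field.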
3.3. Automated key rotation and support for missing KMS in encrypted StorageClasses
With this release, annotations required for encrypted StorageClasses are now automatically added when Key Management Services (KMS) are missing. This streamlines key rotation and improves security configuration.
3.4. Pool level health status
Pool level alerts for near full and full status of the pool are now available with actionable messages.
For more information, see Resolving cluster alerts.
3.5. Multicloud Object Gateway
3.5.1. Unified CLI: mcg-cli capabilities integrated into odf-cli
Multicloud Object Gateway (MCG) commands are now available through the odf-cli utility. This enhancement consolidates ODF, Ceph, and MCG operations into a single command-line interface, eliminating the need to download and manage multiple binaries.
3.5.2. Public access limit option for S3 resources in MCG object browser
A new option is available in the MCG object browser to configure public access limits for S3 resources. This enhancement improves control over data exposure and strengthens security for object storage.
For more information about this procedure, see Setting up public access limit to S3 resources using MCG object browser.
3.5.3. Option to disable external access routes to MCG
A new configuration option is available to disable all routes that enable external access to the Multicloud Object Gateway (MCG). This feature helps ensure that MCG services are only accessible within the OpenShift environment.
For more information, see Securing Multicloud Object Gateway.
3.5.4. Bucket-Level metrics for replication state
New metrics have been introduced to provide detailed visibility into the replication progress of the buckets. These metrics help determine data safety and availability on the secondary site. The following metrics are now available per bucket, per replication cycle:
- Total number of objects scanned
- Number of objects successfully replicated
- Number of objects that failed to replicate
For more information, see Obtaining metrics to reflect bucket replication state.
3.5.5. MCG introduces Metrics and AlertRule for detecting failures relating to noobaa-db
A new alert detects failures relating to noobaa-db, improving the health visibility of the NooBaa DB.
It helps identify internal failures where components are disconnected from one another even though each component by itself reports a ready state.
For more information, see Resolving alerts and errors.
3.6. Alert triggered when CSI clones near soft limit
An alert is triggered when CSI clones or snapshots approach the soft limit of 200. This notification recommends switching to volume snapshot cloning for better performance, helping to avoid delays caused by excessive clone operations.
3.7. Improvements to disaster recovery uninstall workflow
This release introduces improvements to the DR uninstall workflow, streamlining the removal of resources created during various stages of DR deployment. Previously, uninstalling DR components required manual cleanup across multiple layers. With this enhancement, the uninstall process is more intuitive and automated.
3.8. Smarter resource merging for component configuration
Resource requirements for OpenShift Data Foundation components are now merged with default values instead of being fully replaced. If a user specifies only one resource type such as memory, OpenShift Data Foundation will automatically apply default values for the other type, like CPU. Previously, partial specifications caused missing fields to be dropped, leading to components running without complete resource settings and resulting in unpredictable performance. This enhancement ensures safer and more balanced configurations with minimal user input.
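As an illustration of the merging behavior, the sketch below specifies only memory for one component in the StorageCluster CR; with this enhancement, the unspecified CPU values are filled in from the defaults instead of being dropped. The component key and values are placeholders (assumed field layout), not prescribed settings:

```yaml
# Sketch: partial resource specification for the mgr component.
# CPU requests/limits are now merged in from defaults automatically.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mgr:
      requests:
        memory: "4Gi"
      limits:
        memory: "4Gi"
```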
3.9. Selective merging of placement configurations
Placement configuration for components has been improved to support selective merging. Previously, specifying any placement section such as, Tolerations would override the entire default placement, leaving other sections like Node Affinity or Topology Spread Constraints empty. This caused incomplete or suboptimal placement configurations.
With this enhancement, OpenShift Data Foundation now merges user-defined placement values with the default configuration. Users can specify only the sections they want to customize, while defaults for other placement types are preserved, ensuring consistent and expected placement behavior for components.
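For example, a user who only needs custom tolerations can now specify just that section; the sketch below is illustrative (the toleration key and value are placeholders), and the default node affinity and topology spread constraints are preserved rather than overridden:

```yaml
# Sketch: customizing only tolerations in the placement configuration.
# Other placement sections keep their default values after merging.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:
      tolerations:
      - key: "node.example.com/storage"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
```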
3.10. Support custom SCCs for VolSync DataMover pods
VolSync DataMover pods could not access data when using custom Security Context Constraints (SCCs), leading to sync failures in clusters with custom configurations. The DRPC spec now includes a VolSyncSpec field that allows users to configure the following:
- MoverSecurityContext: Defines the PodSecurityContext for DataMover pods
- MoverServiceAccount: Specifies a custom Kubernetes ServiceAccount for fine-grained RBAC control
These configurations are propagated to the VolumeReplicationGroup (VRG), ensuring VolSync components inherit the required privileges for successful operation.
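A DRPC carrying these settings might look like the following sketch. The exact YAML field casing (volSync, moverSecurityContext, moverServiceAccount) is assumed from the field names described above; verify it against the DRPlacementControl CRD in your cluster. The names and values are placeholders:

```yaml
# Sketch: DRPC configured so VolSync DataMover pods run with a custom
# security context and service account (field names assumed).
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-drpc
  namespace: busybox-sample
spec:
  volSync:
    moverSecurityContext:
      runAsUser: 1000
      fsGroup: 1000
    moverServiceAccount: custom-volsync-mover
```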
3.11. Configurable memory and CPU for kube-rbac-proxy in ocs-metrics-exporter
Users can now configure memory and CPU for kube-rbac-proxy pods through the custom resource (CR). This enhancement addresses out-of-memory (OOM) issues encountered during ocs-metrics-exporter operations. By allowing resource adjustments, users can prevent running into these issues.
Chapter 4. Behavior changes
4.2. Warning when there is a conflict between two pods using the same volume
OpenShift now fires the following warning when there is a conflict between two pods using the same volume: selinux_warning_controller_selinux_volume_conflict.
Chapter 5. Technology previews
This section describes the technology preview features introduced in Red Hat OpenShift Data Foundation 4.20 under Technology Preview support limitations.
Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Technology Preview features are provided with a limited support scope, as detailed on the Customer Portal: Technology Preview Features Support Scope.
5.1. ARM architecture support in OpenShift Data Foundation
OpenShift Data Foundation components can now run on ARM-based clusters.
Chapter 6. Developer previews
This section describes the developer preview features introduced in Red Hat OpenShift Data Foundation 4.20.
Developer preview features are subject to Developer preview support limitations. Developer preview releases are not intended to be run in production environments. The clusters deployed with developer preview features are considered to be development clusters and are not supported through the Red Hat Customer Portal case management system. If you need assistance with developer preview features, reach out to the ocs-devpreview@redhat.com mailing list and a member of the Red Hat Development Team will assist you as quickly as possible based on availability and work schedules.
6.1. Erasure coding support in internal mode for RBD and CephFS
Erasure coding support for OpenShift Data Foundation (ODF) internal mode helps to enhance storage efficiency and reduce infrastructure costs. Erasure coding enables more space-efficient data protection compared to traditional replication, making it an ideal solution for customers seeking scalable and cost-effective storage.
For more information, see the knowledgebase article, RBD and CephFS Erasure Coding in Internal Mode.
6.2. Enhanced Disaster Recovery validation with odf dr
The odf dr tool provides streamlined validation for disaster recovery setups and protected applications. This tool reduces manual effort across clusters and enables faster, more automated issue detection and reporting.
6.3. Improved object quota enforcement
This update enhances control over object quotas by optimizing the timing of size calculations. Previously, enforcement relied on a 2-minute statistics cycle, which could allow excessive uploads and risk storage exhaustion. The new approach reduces latency in quota checks, enabling more responsive and accurate enforcement.
For more information, see Quota support for OBC provided by ODF Multicloud Object Gateway.
6.4. Incremental snapshot support for RBD
Support for incremental snapshots in RBD volumes improves snapshot performance and storage efficiency. Unlike full snapshots, incremental snapshots capture only the changes since the last snapshot, reducing both time and storage overhead.
For more information, see the knowledgebase article, Configure external-snapshot-metadata sidecar container for RBD deployment.
6.5. Persistent Volume health metrics
PV-level health monitoring provides detailed status for individual persistent volumes. This enhancement helps administrators to more easily detect and diagnose issues, which can improve operational visibility and reduce troubleshooting time.
For more information, see the knowledgebase article, Enabling VolumeCondition reporting for CephFS PersistentVolumeClaims.
Chapter 7. Bug fixes
This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.20.
7.1. Multicloud Object Gateway
Noobaa certificate verification for NamespaceStore endpoints
Previously, missing validation of CA bundle when mounting NamespaceStore endpoints caused failures in loading and consuming provided CA bundles. Validation for CA bundles has now been added to ensure proper certificate verification.
Support for AWS region ap-east-2 in Noobaa operator
Previously, the ap-east-2 region was missing from the MCG operator-supported regions list, preventing creation of a default BackingStore when deployed in this region. The missing region has now been added to the supported list.
Noobaa no longer fails to issue deletes to RGW
A configuration change caused delays in deleting large numbers of small objects from the underlying RGW storage. This impacted performance during high-volume delete operations. The issue was resolved by reverting the configuration change, eliminating the delay in deletion from the underlying storage.
7.2. Disaster recovery
ACM console view persistence on hard refresh
Previously, a hard refresh from the ACM console caused the view to revert to the OCP (local-cluster) console. This was because Multicluster Orchestrator console routes were not registered properly for ACM (all clusters) view, which disrupted the expected navigation behavior. The routing logic has now been corrected, and refreshing the browser no longer changes the active view. Users remain in the ACM console as intended.
DR status now visible for VMs
The DR Status was missing on the VM list page, and the Remove disaster recovery option was not available when managing the VMs protected using label selectors. This happened because the UI could not correctly identify the VM’s cluster and its DRPC.
The issue was fixed by reading the VM cluster from the correct field and improving how DRPCs are parsed when label selectors are used. Now, both the DR Status and the Remove disaster recovery options work as expected.
Disabling DR for a CephFS application with consistency groups enabled no longer leaves some resources behind
Disabling DR for a CephFS application with consistency groups enabled no longer leaves any resources behind. Manual cleanup is no longer required.
s3StoreProfile in ramen-hub-operator-config preserved after upgrade from 4.18 to 4.19
Previously, after upgrading from 4.18 to 4.19, the ramen-hub-operator-config ConfigMap was overwritten with default values from the Ramen-hub CSV. This caused loss of custom S3Profiles and other configurations added by the Multicluster Orchestrator (MCO) operator. The issue has been fixed to preserve custom entries during upgrade, preventing disruption in S3 profile configurations.
virtualmachines.kubevirt.io resource no longer fails restore due to MAC allocation failure on relocate
Previously, when a virtual machine was relocated back to the preferred cluster, the relocation could fail because its MAC address was unavailable. This occurred if the virtual machine was not fully cleaned up on the preferred cluster after being failed over to the failover cluster. This cleanup process has been corrected, ensuring successful relocation to the preferred cluster.
Failover process no longer fails when the ReplicationDestination resource has not been created yet
Previously, if the user initiated a failover before the LastGroupSyncTime was updated, the failover process would fail. This failure was accompanied by an error message indicating that the ReplicationDestination does not exist. This issue has been resolved, and failover works as expected.
After relocation of a consistency-groups-based workload, synchronization no longer stops
Previously, when applications using CephRBD volumes with volume consistency groups were running and the secondary managed cluster went offline, replication for these volumes could stop indefinitely, even after the secondary cluster came back online. During this condition, the VolumeSynchronizationDelay alert was triggered, starting with a Warning status and later escalating to Critical, indicating replication had ceased for the affected volumes. This issue has been resolved to ensure replication resumes automatically when the secondary cluster is restored.
7.3. Rook
Ceph monitor endpoints fully visible
Previously, only one of the three Ceph monitor endpoints was showing up due to missing entries in the CSI ConfigMap. This meant CSI could communicate with only one of the mons, providing no fault tolerance.
The issue was fixed by adding all monitor endpoints to the ConfigMap. Now, all mons are visible, and CSI communication is fault-tolerant.
7.4. OpenShift Data Foundation console
Fixed StorageSystem creation wizard issues
Previously, the Network Type field for Host was missing, resulting in empty network details and a misleading tooltip that described Multus instead of the actual host configuration. This caused confusion in the summary view, where users saw no network information and an inaccurate tooltip.
With this update, the tooltips were removed and replaced with radio buttons featuring correct labels and descriptions.
Force delete option restored for stuck StorageConsumer
Previously, users were unable to forcefully delete a StorageConsumer resource if it was stuck in a deletion state due to the presence of a deletionTimeStamp.
This issue has been resolved by updating the Actions menu to enable Delete StorageConsumer even when a deletionTimeStamp is present. As a result, you can force delete StorageConsumer resources when required.
Fix for Disaster Recovery misconfiguration after upgrade from v4.17.z to v4.18
Previously, the upgrade process resulted in incorrect DR resource configurations, impacting workloads that rely on the ocs-storagecluster-ceph-rbd and ocs-storagecluster-ceph-rbd-virtualization storage classes.
With this fix, the DR resources are correctly configured after the upgrade.
Warning message in the UI right after creation of StorageCluster no longer appears
Previously, a warning popup appeared in the UI during the creation of a StorageSystem or StorageCluster. This was caused by the Virtualization StorageClass not being annotated with storageclass.kubevirt.io/is-default-virt-class: "true" by default after deployment.
With this fix, the required annotation is applied automatically, preventing unnecessary warnings.
PVC type misclassification resolved in UI
Previously, the UI incorrectly displayed block PVCs as filesystem PVCs due to an outdated filtering method that relied on assumptions based on VRG naming conventions. This led to confusion, as the PVC type was inaccurately reported.
To address this, the filter distinguishing block and filesystem PVCs is removed, acknowledging that a group can contain both types. This change eliminates misclassification and ensures accurate representation of PVCs in the UI.
Bucket Lifecycle rule deletion now supported
Previously, it was not possible to delete the last remaining bucket lifecycle rule due to a backend error: attempting to update the LifecycleConfiguration with empty rules triggered a 500 response.
This has been fixed by switching to deleteBucketLifecycle for cases where the entire lifecycle configuration needs to be cleaned up. As a result, you can delete all bucket lifecycle rules without encountering errors.
CephFS volume filtering corrected in the UI
Previously, the UI filtering for CephFS volumes was not functioning correctly and also mistakenly excluded the CephFS PVCs when the "block" option was selected. This was due to the outdated filtering method based on VRG naming assumptions that no longer apply.
To resolve this, the block/filesystem filter is removed, recognizing that a group might contain both types of PVCs. This fix eliminates misclassification and ensures accurate display of CephFS volumes in the UI.
Alert for essential OpenShift Data Foundation pods down during capacity addition
Previously, there was no test to check if the essential OpenShift Data Foundation pods were working, leading to an error when adding capacity.
To address this issue, if essential pods are down when attempting to add capacity, the user is alerted and not allowed to proceed.
Support external Red Hat Ceph Storage deployment on KubeVirt nodes
Previously, on OpenShift Container Platform deployed on KubeVirt nodes, there was no option to deploy OpenShift Data Foundation with external Red Hat Ceph Storage (RHCS) due to the Infrastructure CR reporting oVirt and KubeVirt as separate platforms.
With this fix, KubeVirt is added to the allowed list of platforms. As a result, you can create or link external RHCS storage systems from the UI.
7.5. OCS operator
Missing Toleration for Prometheus Operator in ROSA HCP Deployments
Previously, the prometheus-operator pod in Red Hat OpenShift Service on AWS (ROSA) with hosted control planes (HCP) was missing the required tolerations, and it was necessary to manually patch the pod after creation to apply them.
With this fix, the tolerations are correctly applied during deployment, eliminating the need for manual intervention.
Service "ocs-provider-server" is invalid: spec.ports[0].nodePort: Invalid value: 31659: provided port is already allocated error no longer appears while reconciling
Previously, the ocs-operator deployed a service using port 31659, which could conflict with an existing nodePort service already using the same port. This conflict caused the ocs-operator deployment to fail, resulting in upgrade reconciliation getting stuck. With this fix, port allocation is handled more safely to avoid clashes with existing services.
ocs-metrics-exporter inherits node selector
Previously, the ocs-metrics-exporter did not inherit the node selector configuration, causing scheduling issues. This has been resolved by ensuring the node selector is properly applied, as detailed in this Red Hat Solution.
7.6. Ceph monitoring
Clone count alert now fires promptly when 200+ clones are created
The clone count alert was previously stuck in a Pending state and failed to fire in a timely manner when over 200 clones were created. This was caused by the alert’s firing threshold being set to 30 minutes, resulting in a long delay. To resolve this, the firing time was reduced from 30 minutes to 30 seconds. As a result, the alert now fires as expected, providing timely notifications when the clone count exceeds the threshold.
Correct runbook URL for HighRBDCloneSnapshotCount alert
The runbook URL linked to the HighRBDCloneSnapshotCount alert was previously incorrect, leading users to a non-existent help page. This issue has been fixed by updating the alert configuration with the correct URL.
Chapter 8. Known issues
This section describes the known issues in Red Hat OpenShift Data Foundation 4.20.
8.1. Disaster recovery
Regional-DR is not supported in environments deployed on IBM Z
Regional-DR is not supported in OpenShift Data Foundation environments deployed on IBM Z because ACM 2.15 is not supported on this platform for this release. This impacts both new and upgraded deployments on IBM Z.
Node crash results in kubelet service failure causing Data Foundation in error state
An unexpected node crash in an OpenShift cluster might lead to the node being stuck in a NotReady state and affect the storage cluster.
Workaround:
Get the pending CSR:
$ oc get csr | grep Pending
Approve the pending CSR:
$ oc adm certificate approve <csr>
CIDR range does not persist in the csiaddonsnode object when the respective node is down
When a node is down, the Classless Inter-Domain Routing (CIDR) information disappears from the csiaddonsnode object. This impacts the fencing mechanism when it is required to fence the impacted nodes.
Workaround: Collect the CIDR information immediately after the NetworkFenceClass object is created.
After node replacement, new mon pod is failing to schedule
After node replacement, the new mon pod fails to schedule itself on the newly added node. As a result, the mon pod is stuck in the Pending state, which impacts the storage cluster status with a mon being unavailable.
Workaround: Manually update the new mon deployment with the correct nodeSelector.
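The edit amounts to pointing the mon deployment at the replacement node. The following is a sketch of the relevant stanza only; the deployment name (for example, rook-ceph-mon-a) and the hostname value are placeholders for your environment:

```yaml
# Sketch: nodeSelector stanza of the mon deployment after editing.
# Replace <new-node-name> with the hostname label of the new node.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: <new-node-name>
```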
ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode
When a CRUSH rule in a Red Hat Ceph Storage cluster has multiple take steps, the ceph df report shows the wrong maximum available size for associated pools.
DRPCs protect all persistent volume claims created on the same namespace
In namespaces that host multiple disaster recovery (DR) protected workloads, any DRPlacementControl resource on the hub cluster that does not specify and isolate PVCs based on the workload using its spec.pvcSelector field protects all the persistent volume claims (PVCs) within the namespace.
This results in PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or, if the selector is missing across all workloads, replication management potentially managing each PVC multiple times, causing data corruption or invalid operations based on individual DRPlacementControl actions.
Workaround: Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.
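A DRPlacementControl scoped to a single workload's PVCs through a unique label might look like the following sketch; the resource names, namespace, and label key/value are placeholders:

```yaml
# Sketch: DRPC whose pvcSelector matches only PVCs labeled app=app1,
# so PVCs of other workloads in the same namespace are not picked up.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app1-drpc
  namespace: app1-namespace
spec:
  pvcSelector:
    matchLabels:
      app: app1
```

Each workload's PVCs would carry the corresponding label (here, app: app1) so that no two DRPlacementControl selectors overlap.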
Disabled PeerReady flag prevents changing the action to Failover
The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete due to the cluster being offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.
Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.
Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster
When two data centers lose connection with each other but are still connected to the Arbiter node, there is a flaw in the election logic that causes an infinite election among Ceph Monitors. As a result, the Monitors are unable to elect a leader and the Ceph cluster becomes unavailable. Also, IO is paused during the connection loss.
Workaround: Shut down the monitors of any one data zone by bringing down the zone nodes. Additionally, you can reset the connection scores of the surviving Monitor pods.
As a result, Monitors can form a quorum, Ceph becomes available again, and IO resumes.
RBD applications fail to Relocate when using stale Ceph pool IDs from replacement cluster
For applications created before the new peer cluster is created, it is not possible to mount the RBD PVC because, when a peer cluster is replaced, the CephBlockPoolID mapping in the CSI configmap is not updated.
Workaround: Update the rook-ceph-csi-mapping-config configmap with the cephBlockPoolID mapping on the peer cluster that is not replaced. This enables mounting the RBD PVC for the application.
Information about lastGroupSyncTime is lost after hub recovery for the workloads which are primary on the unavailable managed cluster
Applications that were previously failed over to a managed cluster do not report a lastGroupSyncTime, thereby triggering the alert VolumeSynchronizationDelay. This is because when the ACM hub and a managed cluster that are part of the DRPolicy are unavailable, a new ACM hub cluster is reconstructed from the backup.
Workaround: If the managed cluster to which the workload was failed over is unavailable, you can still fail over to a surviving managed cluster.
MCO operator reconciles the veleroNamespaceSecretKeyRef and CACertificates fields
When the OpenShift Data Foundation operator is upgraded, the CACertificates and veleroNamespaceSecretKeyRef fields under s3StoreProfiles in the Ramen config are lost.
Workaround: If the Ramen config has custom values for the CACertificates and veleroNamespaceSecretKeyRef fields, then set those custom values after the upgrade is performed.
For discovered apps with CephFS, sync stops after failover
For CephFS-based workloads, synchronization of discovered applications may stop at some point after a failover or relocation. This can occur with a Permission Denied error reported in the ReplicationSource status.
Workaround:
For Non-Discovered Applications
Delete the VolumeSnapshot:
$ oc delete volumesnapshot -n <vrg-namespace> <volumesnapshot-name>
The snapshot name usually starts with the PVC name followed by a timestamp.
Delete the VolSync Job:
$ oc delete job -n <vrg-namespace> <pvc-name>
The job name matches the PVC name.
For Discovered Applications
Use the same steps as above, except <namespace> refers to the application workload namespace, not the VRG namespace.
For Workloads Using Consistency Groups
Delete the ReplicationGroupSource:
$ oc delete replicationgroupsource -n <namespace> <name>
Delete All VolSync Jobs in that Namespace:
$ oc delete jobs --all -n <namespace>
In this case, <namespace> refers to the namespace of the workload (either discovered or not), and <name> refers to the name of the ReplicationGroupSource resource.
Remove DR option is not available for discovered apps on the Virtual machines page
The Remove DR option is not available for discovered applications listed on the Virtual machines page.
Add the missing label to the DRPlacementControl:
$ oc label drplacementcontrol <drpcname> \
    odf.console.selector/resourcetype=virtualmachine \
    -n openshift-dr-ops
Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
$ oc patch drplacementcontrol <drpcname> \
    -n openshift-dr-ops \
    --type='merge' \
    -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
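After both the label and the patch are applied, the DRPlacementControl carries the selector label and the recipe parameter. The following is an illustrative fragment of the resulting resource; the DRPC and virtual machine names are placeholders:

```yaml
# Hypothetical DRPlacementControl after the workaround is applied.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: <drpcname>                  # placeholder
  namespace: openshift-dr-ops
  labels:
    odf.console.selector/resourcetype: virtualmachine
spec:
  kubeObjectProtection:
    recipeParameters:
      PROTECTED_VMS:
      - <vm-name>                   # placeholder
```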
DR Status is not displayed for discovered apps on the Virtual machines page
DR Status is not displayed for discovered applications listed on the Virtual machines page.
Workaround:
Add the missing label to the DRPlacementControl:
$ oc label drplacementcontrol <drpcname> \
    odf.console.selector/resourcetype=virtualmachine \
    -n openshift-dr-ops
Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
$ oc patch drplacementcontrol <drpcname> \
    -n openshift-dr-ops \
    --type='merge' \
    -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
Deselecting PVCs after failover does not clean up the stale entries in the secondary VRG, causing the subsequent relocate to fail
If PVCs were deselected after a workload failover, and a subsequent relocate operation is performed back to the preferredCluster, stale PVCs may still be reported in the VRG. As a result, the DRPC may report its Protected condition as False, with a message similar to the following: VolumeReplicationGroup (/) on cluster is not reporting any lastGroupSyncTime as primary, retrying till status is met.
Workaround:
To resolve this issue, manually clean up the stale PVCs (that is, those deselected after failover) from VRG status.
- Identify the stale PVCs that were deselected after failover and are no longer intended to be protected.
Edit the VRG status on the ManagedCluster named <managed-cluster-name>:
$ oc edit vrg --subresource=status -n <vrg-namespace> <vrg-name>
Remove the stale PVC entries from the status.protectedPVCs section. Once the stale entries are removed, the DRPC recovers and reports as healthy.
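For illustration, each entry under status.protectedPVCs is a list item keyed by the PVC name; the names below are hypothetical. Delete the whole list item for every PVC that was deselected after failover:

```yaml
# Hypothetical fragment of the VRG status subresource.
status:
  protectedPVCs:
  - name: busybox-pvc-1             # deselected after failover: remove this entire item
    namespace: busybox-workloads
  - name: busybox-pvc-2             # still protected: keep this item unchanged
    namespace: busybox-workloads
```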
Secondary PVCs aren’t removed when DR protection is removed for discovered apps
On the secondary cluster, CephFS PVCs linked to a workload are usually managed by the VolumeReplicationGroup (VRG). However, when a workload is discovered using the Discovered Applications feature, the associated CephFS PVCs are not marked as VRG-owned. As a result, when the workload is disabled, these PVCs are not automatically cleaned up and become orphaned.
Workaround: To clean up the orphaned CephFS PVCs after disabling DR protection for a discovered workload, manually delete them using the following command:
$ oc delete pvc <pvc-name> -n <pvc-namespace>
DRPC in Relocating state after minor upgrade
After upgrading from version 4.19 to 4.20, the DRPC (Disaster Recovery Placement Control) may enter a Relocating state. During this process, a new VGR (VolumeGroupReplication) is created with a different naming convention, resulting in two VGRs attempting to claim the same PVC. This conflict can cause temporary instability in the DRPC status.
Workaround: Delete the old VGR (the one with the previous naming convention). The new VGR will then successfully claim the PVC, and the DRPC will return to a healthy state after some time.
Ceph in warning state after adding capacity to cluster
After device replacement or capacity addition, Ceph may be in a HEALTH_WARN state with the mon reporting slow ops. However, there is no impact to the usability of the cluster.
OSD pods restart during add capacity
OSD pods restart after performing cluster expansion by adding capacity to the cluster. However, no impact to the cluster is observed apart from the pods restarting.
Sync stops after PVC deselection
When a PersistentVolumeClaim (PVC) is added to or removed from a group by modifying its label to match or unmatch the group criteria, sync operations may unexpectedly stop. This occurs due to stale protected PVC entries remaining in the VolumeReplicationGroup (VRG) status.
Workaround:
Manually edit the VRG’s status field to remove the stale protected PVC:
$ oc edit vrg <vrg-name> -n <vrg-namespace> --subresource=status
PVs remain stuck in Released state after workload deletion
After the final sync, all temporary PVs/PVCs are deleted; however, for some PVs, the persistentVolumeReclaimPolicy remains set to Retain, causing the PVs to stay in the Released state.
Workaround:
Edit the PV's persistentVolumeReclaimPolicy using the command:
$ oc edit pv <pv-name>
Change persistentVolumeReclaimPolicy to Delete. The stuck PVs will disappear.
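The change amounts to a single field in the PV spec; after the edit, the relevant fragment should read as follows (a minimal sketch):

```yaml
# Hypothetical PV fragment after the workaround.
spec:
  persistentVolumeReclaimPolicy: Delete   # changed from Retain
```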
8.2. Multicloud Object Gateway
Unable to create new OBCs using NooBaa
When provisioning an NSFS bucket via an ObjectBucketClaim (OBC), the default filesystem path is expected to use the bucket name. However, if a path is set in OBC.Spec.AdditionalConfig, it should take precedence. This behavior is currently inconsistent, resulting in failures when creating new OBCs.
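For reference, an OBC that requests an explicit NSFS path does so under spec.additionalConfig. The following is an illustrative claim; the claim name, namespace, storage class, and path values are hypothetical:

```yaml
# Hypothetical ObjectBucketClaim setting an explicit NSFS path.
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: nsfs-obc                    # placeholder
  namespace: app-namespace          # placeholder
spec:
  generateBucketName: nsfs-bucket   # placeholder
  storageClassName: openshift-storage.noobaa.io
  additionalConfig:
    path: exported/dir              # expected to override the default bucket-name path
```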
8.3. Ceph
Poor CephFS performance on stretch clusters
Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server pods (MDS) on multi-site Data Foundation clusters.
SELinux relabelling issue with a very high number of files
When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and it is tied to how SELinux relabelling is handled by Kubelet. This issue is observed with any filesystem based volumes having very high file counts. In OpenShift Data Foundation, the issue is seen when using CephFS based volumes with a very high number of files. There are multiple ways to work around this issue. Depending on your business needs you can choose one of the workarounds from the knowledgebase solution https://access.redhat.com/solutions/6221251.
8.4. CSI Addons
Orphaned CSIAddonsNode CRs leading to errant sidecar connection attempts by the csi-addons-controller-manager pod
When a worker node is deleted, a stale CSIAddonsNode owned by the DaemonSet remains in the cluster. This results in false connection attempts by csi-addons to an endpoint that no longer exists.
Workaround: Manually identify and delete the CSIAddonsNode resources linked to the removed worker node:
# oc get csiaddonsnodes --no-headers | awk '{print $1}' | grep -v -f <(oc get nodes --no-headers | awk '{print $1}') | xargs -r oc delete csiaddonsnode
8.5. OpenShift Data Foundation console
UI temporarily shows an "Unauthorized" error and a blank loading screen during ODF operator installation
During OpenShift Data Foundation operator installation, the InstallPlan sometimes transiently goes missing, which causes the page to show an unknown status. This does not happen regularly. As a result, the messages and title go missing for a few seconds.
Optimize DRPC creation when multiple workloads are deployed in a single namespace
When multiple applications refer to the same placement, then enabling DR for any of the applications enables it for all the applications that refer to the placement.
If the applications are created after the creation of the DRPC, the PVC label selector in the DRPC might not match the labels of the newer applications.
Workaround: In such cases, disabling DR and enabling it again with the right label selector is recommended.
8.6. ODF-CLI
ODF-CLI tools misidentify stale volumes
The stale subvolume CLI tool misidentifies valid CephFS persistent volume claims (PVCs) as stale due to an issue in the stale subvolume identification tool. As a result, the stale subvolume identification functionality is not available until the issue is fixed.