Configuring Non-Graceful Node Shutdown Handling in OpenShift Data Foundation 4.21 (Developer Preview)

Important: A Developer Preview feature is subject to Developer Preview support limitations. Developer Preview features are not intended to be run in production environments. Clusters deployed with Developer Preview features are considered development clusters and are not supported through the Red Hat Customer Portal case management system. Developer Preview features are meant for customers who are willing to evaluate new products or releases in an early stage of product development. If you need assistance with Developer Preview features, reach out to the ocs-devpreview@redhat.com mailing list and a member of the Red Hat development team will assist you as quickly as possible based on availability and work schedules. To learn more about the support scope, refer to the Developer Preview support scope KCS article.

Environment

  • Red Hat OpenShift Data Foundation 4.21

Why is this feature needed?

When a node becomes dysfunctional or is intentionally drained in OpenShift, the node.kubernetes.io/out-of-service taint may be applied to mark the node as unavailable (see the Kubernetes non-graceful node shutdown documentation). This results in the forceful deletion of pods scheduled on the node and the cleanup of their associated VolumeAttachment objects.

Without fencing enabled, the CSI driver cannot revoke the node's access to storage volumes during a non-graceful shutdown. The node may still hold active mounts, open file handles, or client sessions. This can lead to data corruption, as applications may still be running on the broken node with active client sessions even though the node is marked as out of service.
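For context, the out-of-service taint described above is applied with a standard oc adm taint command. The following sketch uses a hypothetical node name; only apply the taint after confirming the node is genuinely down (for example, powered off):

```shell
# Hypothetical node name; substitute the actual failed node.
NODE=worker-1

# Mark the node as out of service. Only do this after confirming the node
# is truly down, otherwise concurrent writes can corrupt data.
oc adm taint nodes "$NODE" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint (trailing "-") after the node is repaired and rejoins the cluster.
oc adm taint nodes "$NODE" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```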

Configuration

Note: This is a Developer Preview feature in ODF 4.21 and is not enabled by default.

Configure the CSI drivers to handle out-of-service taints by fencing the node:

Procedure

  1. Enable fencing for the RBD CSI driver:

    oc edit drivers.csi.ceph.io openshift-storage.rbd.csi.ceph.com -n openshift-storage
    

    Add the enableFencing field under the spec section:

    spec:
      enableFencing: true
    
  2. Enable fencing for the CephFS CSI driver:

    oc edit drivers.csi.ceph.io openshift-storage.cephfs.csi.ceph.com -n openshift-storage
    

    Add the enableFencing field under the spec section:

    spec:
      enableFencing: true
    
  3. Wait for the CSI pod rollout to complete:

    Monitor the rollout status for both drivers. The CSI controller and node plugin pods will be restarted automatically.

  4. Verify that the --enable-fencing=true flag is set on the CSI deployment and daemonset pods for each driver:

    For RBD driver:

       oc get deployment openshift-storage.rbd.csi.ceph.com-ctrlplugin -n openshift-storage -oyaml | grep "enable-fencing"
               - --enable-fencing=true
    
       oc get daemonsets.apps openshift-storage.rbd.csi.ceph.com-nodeplugin -n openshift-storage -oyaml | grep "enable-fencing"
               - --enable-fencing=true
    

    For CephFS driver:

       oc get deployment openshift-storage.cephfs.csi.ceph.com-ctrlplugin -n openshift-storage -oyaml | grep "enable-fencing"
               - --enable-fencing=true
    
       oc get daemonsets.apps openshift-storage.cephfs.csi.ceph.com-nodeplugin -n openshift-storage -oyaml | grep "enable-fencing"
               - --enable-fencing=true
    

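The edits in steps 1 and 2 can also be applied non-interactively, which is convenient for scripting. The following is a sketch using oc patch with a merge patch; the driver names and namespace are taken from the steps above, and the rollout status commands cover the wait in step 3:

```shell
# Merge patch that sets spec.enableFencing on the Driver CR.
PATCH='{"spec":{"enableFencing":true}}'

# Apply the patch to both the RBD and CephFS drivers.
for driver in openshift-storage.rbd.csi.ceph.com openshift-storage.cephfs.csi.ceph.com; do
  oc patch drivers.csi.ceph.io "$driver" -n openshift-storage \
    --type merge -p "$PATCH"
done

# Wait for the restarted CSI controller and node plugin pods to roll out.
for driver in openshift-storage.rbd.csi.ceph.com openshift-storage.cephfs.csi.ceph.com; do
  oc rollout status deployment/"$driver"-ctrlplugin -n openshift-storage
  oc rollout status daemonset/"$driver"-nodeplugin -n openshift-storage
done
```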
Expected Outcome

Once fencing is enabled, when a node is marked with the node.kubernetes.io/out-of-service taint:

  • The CSI driver will fence the node, preventing it from accessing Ceph storage.
  • Applications on the failed node can no longer write to volumes, preventing data corruption.
  • RWO volumes can be safely mounted on healthy nodes, allowing workloads to recover.
  • VolumeAttachment objects will be cleaned up safely.
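The outcomes above can be observed with standard commands after a node has been tainted. A sketch follows; the node name is hypothetical:

```shell
# Hypothetical node name; substitute the tainted node.
NODE=worker-1

# Pods on the tainted node should be force-deleted and rescheduled on healthy nodes.
oc get pods -A -o wide --field-selector spec.nodeName="$NODE"

# VolumeAttachment objects for the fenced node should disappear as cleanup completes.
oc get volumeattachments -o wide | grep "$NODE"
```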

Troubleshooting

Issue: The --enable-fencing=true flag is not present in the pod specifications after editing the Driver CR.

Resolution:

  • Verify that the Driver CR was saved correctly:

     oc get drivers.csi.ceph.io openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml | grep enableFencing
    
  • Check the CSI operator logs for any errors:

     oc logs -n openshift-storage deployment/ceph-csi-controller-manager
    
  • Ensure all CSI pods have restarted. Delete them manually if necessary to trigger a restart.
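If the pods have not picked up the flag, a rollout restart of the CSI workloads (using the deployment and daemonset names shown in step 4) is a cleaner alternative to deleting pods by hand:

```shell
# Restart the controller and node plugin workloads for both drivers so they
# pick up the updated --enable-fencing flag.
for driver in openshift-storage.rbd.csi.ceph.com openshift-storage.cephfs.csi.ceph.com; do
  oc rollout restart deployment/"$driver"-ctrlplugin -n openshift-storage
  oc rollout restart daemonset/"$driver"-nodeplugin -n openshift-storage
done
```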
