How to deploy Ceph CSI Driver controllerPlugin and nodePlugin pods on infrastructure nodes in OpenShift Data Foundation 4.19+

Solution Unverified - Updated

Environment

  • Red Hat OpenShift Container Platform 4.19+
  • Red Hat OpenShift Data Foundation 4.19+
  • Ceph CSI Operator with csi.ceph.io/v1 Driver custom resources
  • Namespace: openshift-storage

Issue

How can the Ceph CSI Driver pods be scheduled on OpenShift infrastructure nodes?

For example, an administrator wants to place the following CSI driver pod types on nodes labeled as node-role.kubernetes.io/infra:

  • controllerPlugin pods, such as CSI provisioner, attacher, resizer, snapshotter, and other controller sidecars
  • nodePlugin pods (the CSI node plugin DaemonSet pods), but only when the storage design requires the CSI node plugin to run exclusively on infrastructure nodes

The administrator might already have configured StorageCluster.spec.placement, but the CSI driver pods are still not placed on infrastructure nodes.

Resolution

The CSI driver pods are controlled by the csi.ceph.io/v1 Driver custom resources. Configure the controllerPlugin placement under the Driver CR, not only under the StorageCluster CR. Configure nodePlugin placement only when the impact on all nodes that consume CSI volumes has been reviewed.

1. Confirm the Driver CRs

oc -n openshift-storage get drivers.csi.ceph.io

Example output:

NAME                                    AGE
openshift-storage.cephfs.csi.ceph.com   31m
openshift-storage.rbd.csi.ceph.com      31m

Depending on which storage types are enabled, the list might include one or more of the following Driver CRs:

openshift-storage.rbd.csi.ceph.com
openshift-storage.cephfs.csi.ceph.com
openshift-storage.nfs.csi.ceph.com

Each Driver CR controls its own CSI controller and node plugin resources. If both RBD and CephFS volumes are used in the environment, verify and configure both openshift-storage.rbd.csi.ceph.com and openshift-storage.cephfs.csi.ceph.com.

For this article, the RBD driver is used as the main example:

oc get driver openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml

For the CephFS driver, use the CephFS Driver CR name instead:

oc get driver openshift-storage.cephfs.csi.ceph.com -n openshift-storage -o yaml

2. Confirm the infrastructure node label

The target nodes must have the infrastructure node label:

oc get nodes -l node-role.kubernetes.io/infra --show-labels

If the label is missing, add it to the intended infrastructure nodes:

oc label node <node_name> node-role.kubernetes.io/infra=""

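If the infrastructure nodes are tainted with NoSchedule, the CSI pods also need a matching toleration, which is configured in the Driver CR in step 3. For example, the infra nodes might have been tainted with a command such as the following: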

oc adm taint node <node_name> node-role.kubernetes.io/infra=:NoSchedule
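
Before deciding whether a toleration is needed, it can help to list the taints currently set on the infra nodes. The following is a minimal sketch, not part of the product: the function name show_infra_taints is hypothetical, and the output format is only one possible choice.

```shell
# List each infra node with the keys and effects of its taints.
# show_infra_taints is an illustrative helper name, not an oc subcommand.
show_infra_taints() {
  oc get nodes -l node-role.kubernetes.io/infra \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\t"}{.spec.taints[*].effect}{"\n"}{end}'
}
```

Run show_infra_taints after defining the function. A node that shows only its name has no taints, so affinity alone is enough for that node.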

3. Configure controllerPlugin node affinity and configure nodePlugin only when required

Edit the Driver CR that matches the CSI type that must be placed on infra nodes.

For RBD:

oc edit driver openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml

For CephFS:

oc edit driver openshift-storage.cephfs.csi.ceph.com -n openshift-storage -o yaml

The Driver CR exposes the same Kubernetes pod nodeAffinity structure under both plugin types. This can be confirmed with oc explain:

oc explain drivers.csi.ceph.io.spec.controllerPlugin.affinity.nodeAffinity --recursive

Example output:

GROUP:      csi.ceph.io
KIND:       Driver
VERSION:    v1

FIELD: nodeAffinity <Object>

DESCRIPTION:
    Describes node affinity scheduling rules for the pod.

FIELDS:
  preferredDuringSchedulingIgnoredDuringExecution <[]Object>
    preference <Object> -required-
      matchExpressions <[]Object>
        key <string> -required-
        operator <string> -required-
        values <[]string>
      matchFields <[]Object>
        key <string> -required-
        operator <string> -required-
        values <[]string>
    weight <integer> -required-
  requiredDuringSchedulingIgnoredDuringExecution <Object>
    nodeSelectorTerms <[]Object> -required-
      matchExpressions <[]Object>
        key <string> -required-
        operator <string> -required-
        values <[]string>
      matchFields <[]Object>
        key <string> -required-
        operator <string> -required-
        values <[]string>

Use the same structure under spec.nodePlugin.affinity.nodeAffinity only when configuring the CSI node plugin pods is required.

Add or update the following fields under spec to place the CSI controller pods on infrastructure nodes:

spec:
  controllerPlugin:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/infra
              operator: Exists
    tolerations:
    - key: node-role.kubernetes.io/infra
      operator: Exists
      effect: NoSchedule

This configuration has the following effect:

  • controllerPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution forces the CSI controller pods to be scheduled only on nodes with the node-role.kubernetes.io/infra label.
  • The toleration allows the CSI controller pods to be scheduled onto infra nodes when the infra nodes have the node-role.kubernetes.io/infra=:NoSchedule taint.
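
As a non-interactive alternative to oc edit, the same controllerPlugin fields could be applied with a JSON merge patch. This is a sketch under assumptions: the helper names infra_controller_patch and patch_controller_plugin are hypothetical, and the patch content simply mirrors the YAML shown above.

```shell
# Print a JSON merge patch that pins controllerPlugin to infra nodes.
# The content mirrors the controllerPlugin YAML shown in this step.
infra_controller_patch() {
  cat <<'EOF'
{
  "spec": {
    "controllerPlugin": {
      "affinity": {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {"matchExpressions": [
                {"key": "node-role.kubernetes.io/infra", "operator": "Exists"}
              ]}
            ]
          }
        }
      },
      "tolerations": [
        {"key": "node-role.kubernetes.io/infra", "operator": "Exists", "effect": "NoSchedule"}
      ]
    }
  }
}
EOF
}

# Apply the patch to one Driver CR, passed as the first argument.
# patch_controller_plugin is an illustrative helper name.
patch_controller_plugin() {
  oc patch drivers.csi.ceph.io "$1" -n openshift-storage \
    --type merge -p "$(infra_controller_patch)"
}
```

For example, patch_controller_plugin openshift-storage.rbd.csi.ceph.com patches the RBD driver. Note that a merge patch replaces lists as a whole, so any tolerations already present under controllerPlugin would be overwritten; review the current Driver CR first.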

Important: Do not configure nodePlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution by default. The CSI node plugin normally runs as a DaemonSet on each node where workloads might need to mount or use CSI volumes. If nodePlugin is restricted only to infra nodes, application pods on non-infra worker nodes might fail to mount or use PVCs backed by that CSI driver.

Configure nodePlugin only when the environment is intentionally designed so that the corresponding CSI node plugin is required only on infra nodes. For that special case, add the following fields under spec as well:

spec:
  nodePlugin:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/infra
              operator: Exists
    tolerations:
    - key: node-role.kubernetes.io/infra
      operator: Exists
      effect: NoSchedule

This optional nodePlugin configuration has the following effect:

  • nodePlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution causes the CSI node plugin DaemonSet pods to run only on nodes with the node-role.kubernetes.io/infra label.
  • The toleration allows the CSI node plugin DaemonSet pods to be scheduled onto infra nodes when the infra nodes have the node-role.kubernetes.io/infra=:NoSchedule taint.

Not generally recommended: The CSI node plugin is normally expected to run on every worker node, or at least on every node where application workloads might need to mount volumes from this CSI driver. Restricting the nodePlugin DaemonSet to infra nodes should be used only when the cluster design ensures that the corresponding storage-consuming workloads also run only on those nodes, or when there is another confirmed requirement to limit the CSI node plugin to infra nodes.
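
To gauge the impact before restricting nodePlugin, one option is to list the worker nodes that do not carry the infra label, because those nodes would stop running the CSI node plugin. The sketch below assumes that regular worker nodes carry the node-role.kubernetes.io/worker label; the function name is hypothetical.

```shell
# List worker nodes that do not carry the infra label. These nodes would no
# longer run the CSI node plugin if nodePlugin were pinned to infra nodes.
# Assumption: regular worker nodes have the node-role.kubernetes.io/worker label.
nodes_without_node_plugin() {
  oc get nodes \
    -l 'node-role.kubernetes.io/worker,!node-role.kubernetes.io/infra' \
    -o name
}
```

If nodes_without_node_plugin prints any node that runs pods with RBD or CephFS PVCs, restricting nodePlugin is likely to break volume mounts on that node.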

4. Save the YAML change and repeat it for the required Driver CRs

After adding the YAML under spec.controllerPlugin, and under spec.nodePlugin only when required, save and exit the editor. The operator reconciles the Driver CR and updates the generated CSI workloads.

Repeat the same YAML change for each Ceph CSI Driver CR whose controller pods must be moved to infra nodes. Add the optional nodePlugin placement only for Driver CRs where the CSI node plugin must also be limited to infra nodes. For example, if the following command shows both CephFS and RBD drivers, edit both Driver CRs when both CSI types are expected to run on infra nodes:

oc -n openshift-storage get drivers.csi.ceph.io
NAME                                    AGE
openshift-storage.cephfs.csi.ceph.com   31m
openshift-storage.rbd.csi.ceph.com      31m

Example edit commands:

oc edit driver openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml
oc edit driver openshift-storage.cephfs.csi.ceph.com -n openshift-storage -o yaml

5. Verify the Driver CR configuration

oc get driver openshift-storage.rbd.csi.ceph.com \
  -n openshift-storage \
  -o jsonpath='{.spec.controllerPlugin.affinity}{"\n"}{.spec.nodePlugin.affinity}{"\n"}'

oc get driver openshift-storage.cephfs.csi.ceph.com \
  -n openshift-storage \
  -o jsonpath='{.spec.controllerPlugin.affinity}{"\n"}{.spec.nodePlugin.affinity}{"\n"}'

Also check the tolerations:

oc get driver openshift-storage.rbd.csi.ceph.com \
  -n openshift-storage \
  -o jsonpath='{.spec.controllerPlugin.tolerations}{"\n"}{.spec.nodePlugin.tolerations}{"\n"}'

oc get driver openshift-storage.cephfs.csi.ceph.com \
  -n openshift-storage \
  -o jsonpath='{.spec.controllerPlugin.tolerations}{"\n"}{.spec.nodePlugin.tolerations}{"\n"}'
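
When several Driver CRs exist, the checks above can be combined into one loop over every Driver CR in the namespace. This is a minimal sketch; the function name show_driver_placement is hypothetical, and it relies on the common pattern of passing the output of -o name back to oc get.

```shell
# Print the placement-related fields of every Ceph CSI Driver CR.
# The body runs in a subshell, so ns and d do not leak into the caller.
show_driver_placement() (
  ns="${1:-openshift-storage}"
  for d in $(oc -n "$ns" get drivers.csi.ceph.io -o name); do
    echo "== $d"
    # One line each: controller affinity, controller tolerations,
    # node affinity, node tolerations. Unset fields print as empty lines.
    oc -n "$ns" get "$d" \
      -o jsonpath='{.spec.controllerPlugin.affinity}{"\n"}{.spec.controllerPlugin.tolerations}{"\n"}{.spec.nodePlugin.affinity}{"\n"}{.spec.nodePlugin.tolerations}{"\n"}'
  done
)
```

Calling show_driver_placement without arguments uses the openshift-storage namespace.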

6. Verify the generated CSI pods

Check where the CSI pods are running:

oc get pods -n openshift-storage -o wide | grep -Ei 'csi.*(ctrl|node|nfs)plugin'

Check the node names and confirm that the nodes have the infra label:

oc get pods -n openshift-storage \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' \
  | grep -Ei 'csi.*(ctrl|node|nfs)plugin'

oc get nodes -l node-role.kubernetes.io/infra

If old pods are still running on non-infra nodes after the Driver CR is updated, wait for the operator to reconcile the generated resources. If required, delete the old CSI pods so that the owning Deployment or DaemonSet recreates them with the updated scheduling rules:

oc delete pod -n openshift-storage <csi_pod_name>

Do not delete all CSI pods at the same time in a production cluster unless the maintenance impact is understood.
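
The manual comparison of pod nodes against infra nodes can also be scripted. The following sketch prints only the CSI controller pods that are not on an infra node. It assumes that controller pod names contain "ctrlplugin", consistent with the grep patterns used in this article; the function name is hypothetical.

```shell
# Print CSI controller pods that are NOT running on an infra node.
# Assumption: controller pod names contain "ctrlplugin".
# The body runs in a subshell, so the variables stay local.
csi_ctrl_pods_off_infra() (
  ns="${1:-openshift-storage}"
  # One infra node name per line, without the "node/" prefix.
  infra_nodes="$(oc get nodes -l node-role.kubernetes.io/infra -o name | sed 's|^node/||')"

  oc get pods -n "$ns" --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' \
    | grep -Ei 'csi.*ctrlplugin' \
    | while read -r pod node; do
        # Exact, whole-line match of the pod's node against the infra list.
        if ! printf '%s\n' "$infra_nodes" | grep -Fxq -- "$node"; then
          printf '%s\t%s\n' "$pod" "$node"
        fi
      done
)
```

Empty output from csi_ctrl_pods_off_infra suggests that all controller pods are on infra nodes. Pods that are not yet scheduled show <none> as the node and are also listed.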

Important considerations

controllerPlugin versus nodePlugin

The controllerPlugin usually runs as a Deployment and contains controller-side CSI components such as provisioner, attacher, resizer, and snapshotter sidecars. Moving these pods to infra nodes is the most common requirement.

The nodePlugin usually runs as a DaemonSet. Restricting nodePlugin to infra nodes means the CSI node plugin will not run on non-infra worker nodes. As a result, application pods running on non-infra worker nodes might fail to mount or use volumes backed by that CSI driver.

Only restrict nodePlugin to infra nodes when one of the following is true:

  • The workloads that use this CSI driver are also restricted to infra nodes.
  • The cluster design explicitly requires this CSI node plugin only on infra nodes.
  • The impact to volume attach and mount operations on non-infra nodes has been reviewed and accepted.

If workloads on regular worker nodes need to use RBD or CephFS PVCs, do not restrict the corresponding CSI nodePlugin only to infra nodes.

requiredDuringSchedulingIgnoredDuringExecution versus preferredDuringSchedulingIgnoredDuringExecution

Use requiredDuringSchedulingIgnoredDuringExecution when the pods must run only on infra nodes. If no infra node is available, or if the infra nodes do not have enough resources, the pods remain in the Pending state.

Use preferredDuringSchedulingIgnoredDuringExecution when infra nodes should be preferred but not strictly required. Example:

spec:
  controllerPlugin:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
            - key: node-role.kubernetes.io/infra
              operator: Exists

Tolerations are required only when infra nodes are tainted

Node affinity selects the target nodes based on labels. It does not bypass taints.

If infra nodes have the following taint:

node-role.kubernetes.io/infra=:NoSchedule

then the Driver CR must include a matching toleration for the pod type that needs to run on those nodes.

Root Cause

StorageCluster.spec.placement and Driver.spec.*Plugin.affinity configure different sets of components.

StorageCluster.spec.placement is used for ODF or Rook-managed Ceph components such as mons, mgrs, OSDs, MDS, RGW, and related ODF components.

Ceph CSI driver pods are generated from the csi.ceph.io/v1 Driver custom resources. Therefore, to place CSI driver pods on infra nodes, configure the placement under the corresponding Driver CR, such as:

spec.controllerPlugin.affinity.nodeAffinity
spec.controllerPlugin.tolerations
spec.nodePlugin.affinity.nodeAffinity
spec.nodePlugin.tolerations

Diagnostic Steps

Confirm that the Driver CRD exposes the expected affinity fields:

oc explain drivers.csi.ceph.io.spec.controllerPlugin.affinity.nodeAffinity --recursive
oc explain drivers.csi.ceph.io.spec.nodePlugin.affinity.nodeAffinity --recursive

The expected output should include both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution under nodeAffinity.

Check whether the Driver CR contains the expected configuration:

oc get driver openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml

oc get driver openshift-storage.cephfs.csi.ceph.com -n openshift-storage -o yaml

Check whether the generated workloads contain the expected affinity and tolerations:

oc get deploy,ds -n openshift-storage | grep -Ei 'csi.*(ctrl|node|nfs)plugin'

oc get deploy -n openshift-storage -o yaml | grep -A30 -B5 'nodeAffinity'
oc get ds -n openshift-storage -o yaml | grep -A30 -B5 'nodeAffinity'
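
As a more targeted alternative to grepping the full YAML, a jsonpath query can print one line per generated workload. This is a sketch, assuming the generated workload names match the same pattern used elsewhere in this article; the function name is hypothetical.

```shell
# Print name, nodeAffinity, and tolerations for each CSI Deployment and
# DaemonSet, one workload per line, tab-separated.
show_csi_workload_affinity() {
  oc get deploy,ds -n openshift-storage \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.affinity.nodeAffinity}{"\t"}{.spec.template.spec.tolerations}{"\n"}{end}' \
    | grep -Ei 'csi.*(ctrl|node|nfs)plugin'
}
```

An empty second column for a controller workload would indicate that the affinity from the Driver CR has not yet been propagated.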

Check for scheduling failures:

oc get pods -n openshift-storage | grep -Ei 'csi.*(ctrl|node|nfs)plugin' | grep -Ei 'Pending|ContainerCreating'

oc describe pod -n openshift-storage <pending_csi_pod_name>

Common scheduling-related events include:

node(s) didn't match Pod's node affinity/selector
node(s) had untolerated taint {node-role.kubernetes.io/infra: }
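
The two messages above point to different fixes, and a single event can contain both. The following sketch lists FailedScheduling events and prints a hint for each known message. The function names and the hint wording are hypothetical; the events query lists all FailedScheduling events in the namespace, not only those for CSI pods.

```shell
# List FailedScheduling events in the namespace.
failed_scheduling_events() {
  oc get events -n openshift-storage --field-selector reason=FailedScheduling
}

# Read event text on stdin and print a hint for each known message.
# Two separate case statements are used because one event line can
# contain both messages.
explain_scheduling_events() {
  while IFS= read -r line; do
    case "$line" in
      *"didn't match Pod's node affinity"*)
        echo "affinity: check the infra label on the nodes and the nodeAffinity in the Driver CR" ;;
    esac
    case "$line" in
      *"untolerated taint"*)
        echo "taint: add a matching toleration to the Driver CR" ;;
    esac
  done
}
```

A possible usage is failed_scheduling_events | explain_scheduling_events.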

Reverting the change

To remove the strict infra placement from the Driver CR, edit each Driver CR that was changed and remove the custom affinity and tolerations that were added under controllerPlugin, and under nodePlugin if that optional placement was configured.

For RBD:

oc edit driver openshift-storage.rbd.csi.ceph.com -n openshift-storage -o yaml

For CephFS:

oc edit driver openshift-storage.cephfs.csi.ceph.com -n openshift-storage -o yaml

Remove the following fields if they were added only for infra placement:

spec:
  controllerPlugin:
    affinity: ...
    tolerations: ...
  nodePlugin:
    affinity: ...
    tolerations: ...

After the operator reconciles the Driver CR, the generated CSI Deployment and DaemonSet pods are recreated with the default scheduling configuration.
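
The removal can also be done without an editor by using a JSON merge patch, in which a null value deletes the key. This is a sketch only: the function name is hypothetical, and the patch removes all affinity and tolerations under controllerPlugin, including any that were not added for infra placement, so review the current values first.

```shell
# Remove controllerPlugin affinity and tolerations from one Driver CR.
# In a JSON merge patch, a null value deletes the corresponding key.
remove_controller_placement() {
  oc patch drivers.csi.ceph.io "$1" -n openshift-storage --type merge \
    -p '{"spec":{"controllerPlugin":{"affinity":null,"tolerations":null}}}'
}
```

For example, remove_controller_placement openshift-storage.rbd.csi.ceph.com reverts the RBD driver. If nodePlugin placement was also configured, the same pattern applies with nodePlugin in place of controllerPlugin.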


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.