Why are pods mounting volumes with a huge number of files failing to start after upgrading to OpenShift Data Foundation 4.12?

Solution Unverified - Updated

Environment

  • Red Hat OpenShift Data Foundation, versions:
    • v4.12
    • v4.13

Issue

  • After upgrading from ODF 4.11 to 4.12, pods attaching volumes with many files (on the order of millions) fail to start with a timeout. Below are some sample events; in this example, the affected pod is named simple-app-67dfcff4c8-v7gxv. Note the events timed out waiting for the condition:

      oc describe pod simple-app-67dfcff4c8-v7gxv
      
      	Type     Reason                  Age                    From                                                   Message
          ----     ------                  ----                   ----                                                   -------
          Normal   Scheduled               <unknown>                                                                     Successfully assigned test/simple-app-67dfcff4c8-v7gxv to worker-2.example.com
          Normal   SuccessfulAttachVolume  30m                    attachdetach-controller                                AttachVolume.Attach succeeded for volume "pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b"
          Warning  FailedMount             23m (x2 over 25m)      kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
          Warning  FailedMount             19m (x3 over 28m)      kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
          Warning  FailedMount             4m32s (x5 over 15m)    kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
          Warning  FailedMount             2m15s (x2 over 6m46s)  kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
    
  • The volume is correctly mounted in the node hosting the pod. In this example, it's a CephFS volume:

      $ oc debug node/worker-2.example.com
      # chroot /host
      sh-4.4# mount -l | grep cephfs
        <mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/c7dcb34afe060d6cd58e994fc5c10868624970393d6415e2c085b8c6630532b0/globalmount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
        <mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/pods/029e640c-2db9-49dd-ae9e-5215de6b11f7/volumes/kubernetes.io~csi/pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b/mount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
    

    This mount point is also writable.

  • Why is this issue occurring? How can this problem be prevented?

Resolution

There are two workarounds available:

  • Option A

    • This is the preferred alternative. Add fsGroupChangePolicy to the pod's security context with the value OnRootMismatch. Example:

        securityContext:
          fsGroupChangePolicy: "OnRootMismatch"
      

    This recommendation is based on KCS article 6221251 - When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?

    • Please note that the first time a pod remounts the CephFS file system after upgrading to ODF 4.12, it might take a long time to start. This is due to the permission change performed as a consequence of the security changes introduced in ODF 4.12: the root of the file system is modified from permissions

         rwxrwxrwx root root /root-mount-point
      

    to permissions

           rwxrwsrwx root <fsgroupID> /root-mount-point
    
     The amount of time the pod takes to start is not predictable, as it depends on the hardware capacity of the worker nodes and the number of files inside the affected CephFS persistent volume. Once the permissions are changed after this first remount, subsequent pod restarts will be immediate.
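
    For context, a minimal pod manifest sketch showing where fsGroupChangePolicy sits at the pod level, next to fsGroup. The pod name, GID, image, and PVC name below are illustrative placeholders, not taken from this case:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-app                  # illustrative name
spec:
  securityContext:                  # pod-level, not container-level
    fsGroup: 1000                   # example GID
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-cephfs-pvc    # placeholder PVC name
```

    With OnRootMismatch, kubelet walks the volume only when the ownership and permissions of the volume's root directory do not already match the requested fsGroup, so the recursive change runs once rather than on every mount.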
    
  • Option B

    The permission change will not occur if the fsGroup setting is not applied to the application pod. To achieve that:

    1. Ensure the pod user is added to anyuid SCC. Sample command:

       oc adm policy add-scc-to-user anyuid -z default -n <namespace>
      

    In this example, the service account applied to the application pod was default. Please replace it with the service account/user specified in the application pod. Refer to the OpenShift documentation on security context constraints (SCCs) for further information.

    2. This is the most crucial part. In the pod specification, remove all references to the fsGroup setting and use runAsGroup instead. This prevents the permission change and, hence, the delay during pod startup. Sample security context to apply:

       securityContext:
         runAsUser: <uid required>
         runAsNonRoot: true
         runAsGroup: <gid required>  # Use runAsGroup, not fsGroup
      

    Please note that not setting fsGroup has the following drawbacks:

    • If the permissions of the volume are not 777, or if the volume has a non-uniform set of permissions (for instance, one directory inside the volume having permissions 755 assigned to UID:GID 300:300 and another directory with permissions 755 assigned to 400:400), this approach will not work.
    • The user set in runAsUser must not change and must be the same for all pods accessing this volume.
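
    Putting Option B together, a minimal pod manifest sketch with runAsGroup and no fsGroup. The pod name, UID/GID, image, and PVC name are illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-app                  # illustrative name
spec:
  serviceAccountName: default       # must be added to the anyuid SCC (see step 1)
  securityContext:                  # note: no fsGroup, so no recursive chown/chmod
    runAsUser: 1000                 # example UID; must stay the same for all pods using this volume
    runAsNonRoot: true
    runAsGroup: 1000                # example GID matching the group owning the volume's files
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-cephfs-pvc    # placeholder PVC name
```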

Root Cause

Starting with OpenShift Data Foundation 4.12, the following significant changes were applied:

  1. ODF changed the default permissions of a volume to the more secure 755 instead of 777.
  2. FSGroupPolicy was set to File (instead of ReadWriteOnceWithFSType, as in ODF 4.11) to allow application access to volumes based on FSGroup. As a result, Kubernetes uses fsGroup to change the permissions and ownership of the volume to match the fsGroup requested in the pod's security context, regardless of fstype or access mode.
  3. Due to the huge number of files, changing permissions and ownership takes a long time.
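
To illustrate why item 3 is costly: the recursive change kubelet performs is roughly equivalent to visiting every entry in the volume, as in this sketch against a throwaway directory. The paths and GID are illustrative, and the chgrp is left commented out because it requires privileges:

```shell
# Simulate the per-file work kubelet does when applying fsGroup to a volume.
MOUNT=$(mktemp -d)                      # stand-in for the CephFS mount point
mkdir -p "$MOUNT/data"
touch "$MOUNT/data/file1" "$MOUNT/data/file2"

# kubelet effectively does the equivalent of (GID 1000 is an example fsGroup):
#   chgrp -R 1000 "$MOUNT"              # ownership change; requires privileges
chmod -R g+rwX "$MOUNT"                 # g+rwX: rw on files, rwx on directories

# Every entry is visited once, so the cost grows linearly with the file count:
find "$MOUNT" | wc -l
```

On a volume with millions of files, this walk is what makes the first mount after the upgrade so slow.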

Diagnostic Steps

Verify that the affected pod is recursively changing the permissions of the CephFS volume.

  1. Get the pod ID:

     $ oc get pod <pod-name> -o jsonpath='{.metadata.uid}'
    
  2. Get the node where the pod is running:

     $ oc get pod <pod-name> -o wide
    
  3. Access the node hosting the pod and verify that the recursive ownership change is taking place:

      $ oc debug node/<node-id>
      # chroot /host
      # journalctl -u kubelet --no-pager | grep <podID>
    

    There should be messages like the one below:

     kubenswrapper[3792179]: I1009 14:04:46.239499 3792179 volume_linux.go:109] "Perform recursive ownership change for directory" path="/var/lib/kubelet/pods/<podID>....
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.