Why are pods mounting volumes with a huge number of files failing to start after upgrading to OpenShift Data Foundation 4.12?
Environment
- Red Hat OpenShift Data Foundation:
  - v4.12
  - v4.13
Issue
- After upgrading from ODF 4.11 to 4.12, pods attaching volumes with many files (in the order of millions) fail to start with a timeout. Below are some sample events. In this example, the affected pod is named `simple-app-67dfcff4c8-v7gxv`; note the `timed out waiting for the condition` events:

```
$ oc describe pod simple-app-67dfcff4c8-v7gxv
Type     Reason                  Age                     From                           Message
----     ------                  ----                    ----                           -------
Normal   Scheduled               <unknown>                                              Successfully assigned test/simple-app-67dfcff4c8-v7gxv to worker-2.example.com
Normal   SuccessfulAttachVolume  30m                     attachdetach-controller        AttachVolume.Attach succeeded for volume "pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b"
Warning  FailedMount             23m (x2 over 25m)       kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
Warning  FailedMount             19m (x3 over 28m)       kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
Warning  FailedMount             4m32s (x5 over 15m)     kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[volume-wz7wf kube-api-access-hmlzb]: timed out waiting for the condition
Warning  FailedMount             2m15s (x2 over 6m46s)   kubelet, worker-2.example.com  Unable to attach or mount volumes: unmounted volumes=[volume-wz7wf], unattached volumes=[kube-api-access-hmlzb volume-wz7wf]: timed out waiting for the condition
```
- The volume is correctly mounted on the node hosting the pod. In this example, it is a CephFS volume:

```
$ oc debug node/worker-2.example.com
# chroot /host
sh-4.4# mount -l | grep cephfs
<mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/c7dcb34afe060d6cd58e994fc5c10868624970393d6415e2c085b8c6630532b0/globalmount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
<mon-ip-1>:6789,<mon-ip-2>:6789,<mon-ip-3>:6789,<mon-ip-4>:6789,<mon-ip-5>:6789:/volumes/csi/csi-vol-5c8b10be-6760-11ee-ada8-0a580a800215/c7feed46-8956-45c2-b71f-906ae4ad4718 on /host/var/lib/kubelet/pods/029e640c-2db9-49dd-ae9e-5215de6b11f7/volumes/kubernetes.io~csi/pvc-0ea2d69a-8e9d-41b2-bfa5-85d0ede6211b/mount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=my-filesystem)
```

This mount point is also writable.
- Why is this issue occurring, and how can the problem be prevented?
Resolution
There are two workarounds available:
Option A

This is the preferred alternative. Set `fsGroupChangePolicy` in the security context of the pod to the value `OnRootMismatch`. Example:

```yaml
securityContext:
  fsGroupChangePolicy: "OnRootMismatch"
```

This recommendation is based on the KCS article 6221251 - When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?
Please note that the first time the pod remounts the CephFS file system after upgrading to ODF 4.12, it might take a long time to start. This is due to the permission change performed as a consequence of the security changes introduced in ODF 4.12: the root of the file system is modified from permissions

```
rwxrwxrwx root root        /root-mount-point
```

to permissions

```
rwxrwsrwx root <fsgroupID> /root-mount-point
```

The amount of time the pod takes to start is not predictable, as it depends on the hardware capacity of the worker nodes and the number of files inside the affected CephFS persistent volume. Once the permissions are changed after this first remount, subsequent pod restarts will be immediate.
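The effect of `OnRootMismatch` can be sketched in shell. This is a simplified illustration, not the actual kubelet implementation: a temporary directory stands in for the CephFS volume root and the current group stands in for the pod's `fsGroup`. The point is that only the volume root is inspected before deciding whether to walk the whole tree.

```shell
# Simplified sketch of the OnRootMismatch policy (hypothetical stand-ins:
# a temp directory plays the CephFS volume root, the current group plays
# the pod's fsGroup). Only the root entry is checked.
VOLUME=$(mktemp -d)
FSGROUP=$(id -g)
mkdir -p "$VOLUME/dir1" && touch "$VOLUME/dir1/file1"

chgrp "$FSGROUP" "$VOLUME"   # root already owned by the expected group

if [ "$(stat -c '%g' "$VOLUME")" = "$FSGROUP" ]; then
    RESULT="skip"            # no recursive walk over millions of files
else
    RESULT="recursive"       # kubelet would chgrp/chmod every entry here
fi
echo "ownership change decision: $RESULT"
rm -rf "$VOLUME"
```

With the default policy (`Always`), kubelet performs the recursive change on every mount regardless of the root's ownership, which is why `OnRootMismatch` avoids the startup delay after the first remount.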
Option B

The permission change will not occur if the `fsGroup` setting is not applied to the application pod. For that:

- Ensure the pod user is added to the `anyuid` SCC. Sample command:

```
$ oc adm policy add-scc-to-user anyuid -z default -n <namespace>
```

In this example, the service account applied to the application pod was `default`; replace it with the service account/user specified in the application pod. Please refer to links 1 and 2 in the OpenShift documentation for further information on SCCs.

- This is the most crucial part. In the pod specification, remove all references to the `fsGroup` setting and use `runAsGroup` instead. This will prevent the permission change and, hence, the delay during pod startup. Sample security context to apply:

```yaml
securityContext:
  runAsUser: <uid required>
  runAsNonRoot: true
  runAsGroup: <gid required>  # Use runAsGroup and not fsGroup
```
Please note that not setting `fsGroup` has the following drawbacks:

- If the permissions of the volume are not `777`, or if there is a non-uniform set of permissions (for instance, one directory inside the volume having permissions `755` assigned to `UID:GID 300:300` and another directory with permissions `755` assigned to `400:400`), this approach will not work.
- The user set in `runAsUser` must not change and must match across all the pods accessing this volume.
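To check whether a volume meets the uniform-permissions requirement, the ownership and mode of every entry can be listed from the node under the volume's mount path. The sketch below builds a hypothetical temporary tree so the command can be demonstrated outside a cluster; in practice, replace `VOLUME` with the real mount path.

```shell
# List every distinct owner:group and mode combination in the tree
# (hypothetical sample tree; replace VOLUME with the volume mount path).
VOLUME=$(mktemp -d)
mkdir -p "$VOLUME/a" "$VOLUME/b"
touch "$VOLUME/a/f1" "$VOLUME/b/f2"

# One output line per distinct combination found in the tree.
find "$VOLUME" -printf '%u:%g %m\n' | sort -u
COMBOS=$(find "$VOLUME" -printf '%u:%g %m\n' | sort -u | wc -l)
echo "distinct owner/mode combinations: $COMBOS"
rm -rf "$VOLUME"
```

If the output shows several different `UID:GID` pairs across the volume, the permissions are non-uniform and Option B will not work reliably.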
Root Cause
Starting with OpenShift Data Foundation 4.12, there are significant changes applied:

- ODF changed the default permissions of a volume to the more secure `755` instead of `777`.
- `FSGroupPolicy` was set to `File` (instead of `ReadWriteOnceWithFSType` in ODF 4.11) to allow application access to volumes based on `FSGroup`. This involves Kubernetes using `fsGroup` to change the permissions and ownership of the volume to match the user-requested `fsGroup` in the pod's `SecurityPolicy`, regardless of fstype or access mode.
- Due to the huge number of files, changing permissions and ownership takes a long time.
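What kubelet does when `FSGroupPolicy` is `File` can be approximated by the following shell sketch. This is an illustration of the behavior, not kubelet's actual code, and uses hypothetical stand-ins (a temporary directory for the volume, the current group for the pod's `fsGroup`): every entry gets its group and group permissions changed, so the cost grows linearly with the file count.

```shell
# Rough approximation of the recursive ownership change performed with
# fsGroupPolicy: File (temp directory and current group used as
# hypothetical stand-ins for the volume and the pod's fsGroup).
VOLUME=$(mktemp -d)
FSGROUP=$(id -g)
mkdir -p "$VOLUME/d1" && touch "$VOLUME/d1/f1"

# One chgrp/chmod per entry: with millions of files this linear walk is
# what makes the first mount after the upgrade take so long.
find "$VOLUME" -exec chgrp "$FSGROUP" {} + -exec chmod g+rw {} +
find "$VOLUME" -type d -exec chmod g+s {} +   # setgid on directories

MODE=$(stat -c '%A' "$VOLUME/d1")
echo "directory mode after change: $MODE"
rm -rf "$VOLUME"
```

This is also why `OnRootMismatch` (Option A) helps: it skips this walk entirely on every mount after the first one.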
Diagnostic Steps
Verify that the affected pod is recursively changing the permissions of the CephFS volume.
- Get the pod ID:

```
$ oc get pod <pod-name> -o jsonpath='{.metadata.uid}'
```

- Get the node where the pod is running:

```
$ oc get pod <pod-name> -o wide
```

- Connect to the node hosting the pod and verify that the recursive ownership change is taking place:

```
$ oc debug node/<node-id>
sh-4.4# chroot /host
sh-4.4# journalctl -u kubelet --no-pager | grep <podID>
```

There should be messages like the one below:

```
kubenswrapper[3792179]: I1009 14:04:46.239499 3792179 volume_linux.go:109] "Perform recursive ownership change for directory" path="/var/lib/kubelet/pods/<podID>....
```
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.