Pods using Persistent Volumes with high file counts fail to start or take an excessive amount of time in OpenShift
Environment
- Red Hat OpenShift Container Platform (RHOCP)
  - 3
  - 4
- Docker Container Engine
- CRI-O Container Engine
- SELinux
Issue
- Pod deployments are failing with the following message:
  Error: Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded
- Pods are not able to start and fall into `CreateContainerError` status:
  mypod-5-1111a 0/1 CreateContainerError 0 7m29s
- When attaching volumes to pods in Red Hat OpenShift Container Platform, why do pods sometimes not start, or otherwise take an excessive amount of time to start?
- The volumes themselves have very high file counts, often measured in tens of thousands of files and directories (or higher).
- Starting the pods without the high file count volumes allows the pod to become `Ready` quickly (but without access to the data the volume provides).
- Entire nodes are sometimes marked as `NotReady` due to this issue, because the container runtime (`docker` or `cri-o`) is unresponsive (as seen with hung `docker ps` or `crictl ps` commands).
- When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve `Ready` state?
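To judge whether a volume falls into the range described above, its files and directories can be counted directly. The following is a minimal sketch, assuming a shell with `find` and `wc` is available wherever the volume is mounted (for example inside the pod or on the node). The `count_entries` name and the `/var/lib/data` path are hypothetical.

```shell
# Count every file and directory below a path without crossing into
# other mounted filesystems (-xdev). The result includes the path itself.
count_entries() {
    find "$1" -xdev 2>/dev/null | wc -l
}

# Hypothetical mount path; replace it with where the volume is mounted.
volume_path="/var/lib/data"
if [ -d "$volume_path" ]; then
    echo "entries under ${volume_path}: $(count_entries "$volume_path")"
fi
```

On a volume with a very large number of entries this walk can itself take a long time, which gives a rough feel for the cost of any recursive operation over the same tree.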
Resolution
In upstream Kubernetes, and therefore within OpenShift, there exist two issues that occur when Persistent Volumes are mounted to pods upon container creation, either of which can significantly delay a container's ability to start:
- File ownership update causes a significant delay in container startup.
- SELinux file context relabeling causes a significant delay in container startup.
The sections below describe the current state of resolution for each issue. For a deeper understanding of the technical aspects of each issue, proceed to the "Root Cause" section below.
File Ownership Update
- This issue can be mitigated by applying `fsGroupChangePolicy` with the value `OnRootMismatch` to the security context of all affected Pods. With this value, the recursive ownership and permission change is skipped whenever the root directory of the volume already has the expected ownership and permissions, so the entire volume no longer has file permissions re-applied on every mount (although it does not prevent it completely).
- The `fsGroupChangePolicy` must be applied to all Pods that use PVs and can be affected by this issue.
- To specify security settings for the whole Pod, include the `securityContext` field in the Pod spec. The security settings specified at Pod spec level will apply to all Containers in the Pod.
- As an example, the `securityContext` of a pod may have the following line to avoid this problem:
  securityContext:
    fsGroupChangePolicy: "OnRootMismatch"
Note: the volume being mounted into the pod must support `fsGroup` permissions functionality, otherwise the above parameter will have no effect. For more information on using `fsGroup` to reduce pod timeouts for volumes with a large number of inodes, see our documentation.
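To show where the setting sits in a complete manifest, the sketch below writes a Pod definition to a file. It is only an illustration: the pod name, image, mount path and claim name are hypothetical placeholders, and `fsGroup` itself is left out on the assumption that the SCC in use assigns it.

```shell
# Write a hypothetical Pod manifest with fsGroupChangePolicy set at Pod level.
# Names, image and claim are placeholders, not values from a real cluster.
cat > pod-fsgroup-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example
spec:
  securityContext:
    # Pod-level field; it is not available in a container securityContext.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: registry.example.com/example/app:latest
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc
EOF

# The file could then be created in the affected project:
# oc create -f pod-fsgroup-example.yaml -n <namespace>
```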
SELinux File Context Relabeling
IMPORTANT: if the volumes affected by the long SELinux relabeling are accessed via the `ReadWriteOnce` (RWO) access mode, and the cluster is running OpenShift 4.16 or newer, consider first using the `ReadWriteOncePod` (RWOP) access mode in combination with the `seLinuxMount` option instead of this solution, if the CSI driver used supports it.
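As a rough illustration of the RWOP alternative, the sketch below writes a claim that requests the `ReadWriteOncePod` access mode. The claim name, size and storage class are hypothetical, and the commented check assumes that the `seLinuxMount` option mentioned above corresponds to the `seLinuxMount` field of the `CSIDriver` object.

```shell
# Hypothetical claim requesting the ReadWriteOncePod access mode.
# Claim name, size and storage class are placeholders.
cat > pvc-rwop-example.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-rwop-pvc
spec:
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 10Gi
  storageClassName: example-csi-storageclass
EOF

# Assumed check of whether the CSI driver declares SELinux mount support:
# oc get csidriver <driver-name> -o jsonpath='{.spec.seLinuxMount}'
# Create the claim in the affected project:
# oc create -f pvc-rwop-example.yaml -n <namespace>
```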
Additional Notes:
- If using OpenShift Data Foundation / OpenShift Container Storage, please review the workarounds to skip SELinux relabeling in OpenShift Data Foundation / OpenShift Container Storage.
- There currently exist two workarounds for skipping the SELinux relabeling for a volume. These workarounds are OCP generic. Refer to the article on automatically implementing the SELinux relabeling workaround discussed below using Red Hat Advanced Cluster Management for Kubernetes.
- Please contact Red Hat Technical Support for direct assistance with this issue.
- For appropriate links and technical explanations, please refer to the "Root Cause" section below.
Manual Skip SELinux Relabeling with spc_t
- This approach is the simplest, but requires the user to have SecurityContextConstraints (SCC) permission to update the SELinux type of the pod. Furthermore, in case of a container runtime vulnerability, if the container is not running with the `restricted` SCC, the container could possibly access any file on the host (see below for details).
- This approach is the only one available in 4.7, first appearing in 4.7.37. It is also present in 4.8.16 and 4.9.2.
- To implement this workaround, the user must specify `type: "spc_t"` in either a pod or container `securityContext`:
  securityContext:
    seLinuxOptions:
      type: "spc_t"
  If a pod `securityContext` has the type `spc_t` set, then this type will be inherited by containers having no type specified at all. When this option is configured, CRI-O will skip the relabel, leaving the volume's labels as they previously were.
- `spc_t` is a special SELinux type, standing for super privileged container type. A container having this type will not be constrained by SELinux policies.
- If the pod is running with an SCC having `runAsUser` set to `MustRunAsRange`, like the `restricted` SCC, this is safe because file access is already completely constrained. In the case of a container escape, the container process is running as a random UID and has no rights to modify anything on the host, although it could read world-readable files.
- If the pod is running with an SCC having `runAsUser` set to `RunAsAny`, this is less safe because in case of a container runtime escape like https://access.redhat.com/security/vulnerabilities/RHSB-2021-004 an unconstrained container process would be able to overwrite files on the host.
- By default, OpenShift runs containers with the `restricted` SCC.
Steps for skipping SELinux Relabeling with spc_t
1. Create a custom SCC:
$ cat custom_scc.yaml
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
  name: custom
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
$ oc create -f custom_scc.yaml
$ oc get scc
custom false <no value> RunAsAny MustRunAsRange MustRunAs RunAsAny <no value> false ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]
restricted false <no value> MustRunAs MustRunAsRange MustRunAs RunAsAny <no value> false ["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]
2. Assign it to the default service account in a project where deployment is stuck:
$ oc adm policy add-scc-to-user custom -z default -n <namespace>
Note: If the pods do not use the default service account, replace `default` with the service account used by the pods.
3. Add the following to the securityContext of the deployment:
securityContext:
  fsGroupChangePolicy: OnRootMismatch
  seLinuxOptions:
    type: spc_t
Be careful when changing the default Service Account, as this can impact Operator deployments.
Note: It appears that the behaviour has changed between OCP 4.10 and OCP 4.11: when `spc_t` is set inside the deployment, the pods are not restarted, and they also do not start when first run. To mitigate this, add the custom `scc` and then add the service account to this role.
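For orientation, the sketch below writes a complete Deployment in which the settings from step 3 sit in the pod template's `securityContext`. All names, the image and the claim are hypothetical placeholders. The service account is not set, so the pods would run as `default`, matching step 2.

```shell
# Write a hypothetical Deployment whose pod template carries the
# securityContext from step 3. Names, image and claim are placeholders.
cat > deployment-spc-t-example.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch
        seLinuxOptions:
          type: spc_t
      containers:
      - name: app
        image: registry.example.com/example/app:latest
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: example-pvc
EOF

# The file could then be applied in the project used in step 2:
# oc apply -f deployment-spc-t-example.yaml -n <namespace>
```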
Semi-Automatic Skip SELinux Relabeling with spc_t (OCP Version 4.13+)
The Cluster Resource Override Operator (CRO) has been extended to allow customers to automate skipping the SELinux relabeling with `spc_t`. The CRO applies the workaround per namespace, provided the CRO is configured with the `forceSelinuxRelabel` attribute set to `true`.
The steps to enable the workaround require the operator to be installed via the OperatorHub following the documentation.
1. Create the Namespace
The namespace needs to be created with the following labels. Create a file named cro-workaround-ns.yaml with the following contents:
apiVersion: v1
kind: Namespace
metadata:
  name: selinux-relabel
  labels:
    clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"
    forceselinuxrelabel.admission.node.openshift.io/enabled: "true"
2. Run the following command to create the namespace:
$ oc create -f cro-workaround-ns.yaml
3. Install the Cluster Resource Override Operator:
On step 3(b) of the installation instructions, set the following configuration. You may change the percentages to something other than 100 if you wish to use CRO’s resource overrides.
apiVersion: operator.autoscaling.openshift.io/v1
kind: ClusterResourceOverride
metadata:
  name: cluster
spec:
  podResourceOverride:
    spec:
      memoryRequestToLimitPercent: 100
      cpuRequestToLimitPercent: 100
      limitCPUToMemoryPercent: 100
      forceSelinuxRelabel: true
The CRO is now installed and configured. The last item to be configured is the Security Context Constraint (SCC). The SCC needs to be created and then applied to the namespace in which the pods that need the workaround run.
4. Save the following to cro-workaround-scc.yaml to create the SCC called custom:
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
  name: custom
priority: 5
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
5. Create the SCC (SCCs are cluster-scoped, so no namespace is specified):
$ oc create -f cro-workaround-scc.yaml
6. Apply the SCC to your namespace:
$ oc adm policy add-scc-to-user custom -z default -n selinux-relabel
Where selinux-relabel is the namespace created in step 1.
7. The pod or the deployment should have the label forceselinuxrelabel.admission.node.openshift.io/enabled so that the spc_t context can be applied automatically.
Snippet of a Deployment
template:
  metadata:
    labels:
      app: nginx
      forceselinuxrelabel.admission.node.openshift.io/enabled: "true"
Skip SELinux Relabeling if already done with an annotation
- This option is a bit more complex, but is also more secure. It involves adding a custom `RuntimeClass`, which, when configured correctly in CRI-O, can interpret an annotation to skip the relabel if the top-level of the volume is found to have the correct label.
- It requires 4.8.16, 4.9.2 or any release > 4.10.3.
- A drawback of this approach is that the volume will have to be labeled at least once.
- This will be done automatically by CRI-O, but could incur a container creation timeout.
- A consequence of this is that the container processes may fail to access sub-paths of the volume if they're relabeled.
- An improvement in the SELinux relabeling code causes the top-level of the directory to be labeled last. Thus, assuming another process doesn't attempt to relabel a file in the volume, and assuming CRI-O doesn't crash during the initial relabel, the volume should be accessible to the container after the initial relabel.
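Because the skip depends on the label of the volume's top-level directory, it can help to inspect that label by hand. The sketch below only splits a context string into its type and level; the function names and the example context are illustrative assumptions, and it does not reproduce CRI-O's own check.

```shell
# An SELinux context has the form user:role:type:level, where the level
# may itself contain a colon (for example s0:c25,c26).
selinux_type_of() {
    printf '%s\n' "$1" | cut -d: -f3
}

selinux_level_of() {
    printf '%s\n' "$1" | cut -d: -f4-
}

# Hypothetical context, as might be reported for a labeled volume directory.
example_context='system_u:object_r:container_file_t:s0:c25,c26'
echo "type:  $(selinux_type_of "$example_context")"
echo "level: $(selinux_level_of "$example_context")"

# On an SELinux-enabled node with GNU coreutils, the real context of a
# directory could be read with:
# stat -c '%C' <volume-mount-path>
```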
1. First, a MachineConfig will need to be created to configure CRI-O with a customized runtime class. This runtime class will be the same as the default one, but with an entry in allowed_annotations. The resulting CRI-O configuration file and subsequent MachineConfig could look like:
[crio.runtime.runtimes.selinux]
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-selinux-configuration
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4XQpydW50aW1lX3BhdGggPSAiL3Vzci9iaW4vcnVuYyIKcnVudGltZV9yb290ID0gIi9ydW4vcnVuYyIKcnVudGltZV90eXBlID0gIm9jaSIKYWxsb3dlZF9hbm5vdGF0aW9ucyA9IFsiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIl0K
        mode: 0640
        overwrite: true
        path: /etc/crio/crio.conf.d/01-selinux.conf
  osImageURL: ""
2. Next, a RuntimeClass must be created in the API server. The name should match that described in the CRI-O config above:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: selinux
handler: selinux
3. Finally, the pod should have the annotation set in its metadata, as well as the runtime class configured:
apiVersion: v1
kind: Pod
metadata:
  name: sandbox
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: "true"
...
spec:
  runtimeClassName: selinux
...
- Now, when the pod is created, CRI-O will run it with the RuntimeClass `selinux`, which is configured to be allowed to process the annotation `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel`. This pod has the value "true" for `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel`, which means the SELinux relabel will be skipped if the volume is already correctly labeled.
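Before applying the MachineConfig from step 1, the base64 payload in its `source` field can be decoded to confirm that it matches the intended CRI-O configuration. This is a minimal sketch, assuming a `base64` command that accepts `-d`; the `decode_crio_dropin` function name is hypothetical.

```shell
# Base64 payload copied from the MachineConfig source field in step 1
# (the part after "base64,").
encoded='W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4XQpydW50aW1lX3BhdGggPSAiL3Vzci9iaW4vcnVuYyIKcnVudGltZV9yb290ID0gIi9ydW4vcnVuYyIKcnVudGltZV90eXBlID0gIm9jaSIKYWxsb3dlZF9hbm5vdGF0aW9ucyA9IFsiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIl0K'

# Print the CRI-O drop-in that the MachineConfig would write to the node.
decode_crio_dropin() {
    printf '%s' "$encoded" | base64 -d
}

decode_crio_dropin

# An edited drop-in file could be re-encoded as a single line with:
# base64 < 01-selinux.conf | tr -d '\n'
```

The decoded text should be the five-line `[crio.runtime.runtimes.selinux]` section shown in step 1.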
Root Cause
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
There are two distinct issues described in this article. The cause for each is discussed below.
Recursive File Ownership Delays
- When a pod is created and requires a volume to be mounted, the node's `kubelet` process performs a recursive file ownership change across the entire volume, causing significant delays in pod creation.
- This issue, while not technically "solved", can be worked around by instructing the `kubelet` to perform the bare minimum of file system permissions changes by implementing the workaround mentioned in the above "Resolution" section.
- A Kubernetes Enhancement Proposal on github.com discusses a more long-term fix to completely avoid permissions checking by the `kubelet`.
- Work is ongoing upstream to include this in Kubernetes and, provided the solution is accepted by the community, it should be rebased into the Red Hat OpenShift Container Platform product.
- Please contact Red Hat Support if you believe you are experiencing this issue.
Recursive SELinux File Context Delays
- When a pod is created and requires a volume to be mounted, the container runtime (either Docker or CRI-O on OpenShift nodes depending on version) is instructed by the node's `kubelet` process upon pod creation to relabel the entire volume with proper SELinux contexts.
- There exist some ways to work around this. By default, volume SELinux context relabeling happens for every volume on container startup and can cause significant delays when the volume has many files and directories, as the procedure has to occur asynchronously, recursively through the entire volume. However, if a pod is configured to have the `spc_t` type, or is correctly configured to have `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` and the volume is already correctly labeled, then the relabel will be skipped.
- A Kubernetes Enhancement Proposal issue on github.com exists, with work from Red Hat within it, attempting to provide a method in a future version of Kubernetes (and therefore OpenShift) to avoid recursive relabeling that won't require as invasive/insecure changes to the pod spec. This has been partially implemented in OpenShift 4.16 using the `ReadWriteOncePod` (RWOP) access mode, and work is ongoing for the `RWO`/`RWX` access modes.
- Discussion of relying on the upcoming feature `fsGroupChangePolicy` has occurred within a GitHub issue and can be read about within a KEP update, but is subject to change.
- Please contact Red Hat Support if you believe you are experiencing this issue.
Diagnostic Steps
To confirm whether SELinux relabeling is causing timeouts when starting the Pod, a coredump taken from CRI-O during the timeouts is needed to validate that it is stuck relabeling all of the files.
In the coredump, look for recursive file.walk functions on goroutines that are performing SELinux relabels in CRI-O, indicating that it is hanging or busy performing an excessive relabel due to the number of files on the PV.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.