ceph-csi-controller-manager Pod has High Restarts and/or is Stuck in CrashLoopBackOff
Environment
- Red Hat OpenShift Container Platform 4.17+
- Red Hat OpenShift Data Foundation (ODF) 4.17+
Issue
- The ceph-csi-controller-manager-* pod is stuck in CrashLoopBackOff
- The error OOMKilled can be seen in the pod description
Resolution
ⓘ This process involves editing the ceph-csi-operator ClusterServiceVersion (CSV). During an OpenShift Data Foundation upgrade, the new CSV will overwrite the edited one, so this solution may need to be re-applied if the ceph-csi-controller-manager pod continues to be OOMKilled.
- Capture the current CephCSI Operator version:
$ oc get csv -n openshift-storage -l operators.coreos.com/cephcsi-operator.openshift-storage=
- Edit the cephcsi-operator CSV:
$ oc edit csv -n openshift-storage cephcsi-operator.v<version>-rhodf
- Increase the memory requests and limits for both containers in the CSV:
<extra-output-removed-for-space>
resources:
limits:
cpu: 500m
memory: 128Mi <--- Increase to 256Mi
requests:
cpu: 5m
memory: 64Mi <--- Increase to 128Mi
<extra-output-removed-for-space>
resources:
limits:
cpu: 500m
memory: 128Mi <--- Increase to 256Mi
requests:
cpu: 10m
memory: 64Mi <--- Increase to 128Mi
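Instead of the interactive oc edit, the same change can be sketched as a single JSON patch. The deployment and container indices (0, and 0/1) below are assumptions, not values confirmed by this article; verify them against your CSV (for example with the jq query in the verification step) before applying.

```shell
# Sketch: non-interactive version of the CSV edit above. The deployment and
# container indices (0, 0/1) are assumptions -- confirm them against your
# CSV before applying the patch.
CSV="cephcsi-operator.v4.17.2-rhodf"   # substitute your version
PATCH='[
  {"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"},
  {"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory", "value": "128Mi"},
  {"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/limits/memory", "value": "256Mi"},
  {"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/requests/memory", "value": "128Mi"}
]'
# Check that the patch document is well-formed JSON before applying it:
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch OK"
# Apply it (requires cluster access):
# oc patch csv -n openshift-storage "$CSV" --type=json -p="$PATCH"
```

Note that, as with oc edit, this change lives in the CSV and will be reverted by an ODF upgrade.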
- Verify that the limits and requests were properly adjusted:
Utilizing jsonpath:
$ oc get csv -n openshift-storage cephcsi-operator.v<version>-rhodf -o jsonpath='{.spec.install.spec.deployments[*].spec.template.spec.containers[*].resources}'
Utilizing jq:
$ oc get csv -n openshift-storage cephcsi-operator.v<version>-rhodf -o json | jq '.spec.install.spec.deployments[].spec.template.spec.containers[].resources'
Example:
$ oc get csv -n openshift-storage cephcsi-operator.v4.17.2-rhodf -o json | jq '.spec.install.spec.deployments[].spec.template.spec.containers[].resources'
{
  "limits": {
    "cpu": "500m",
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "5m",
    "memory": "128Mi"
  }
}
{
  "limits": {
    "cpu": "500m",
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "10m",
    "memory": "128Mi"
  }
}
- If frequent pod restarts are still observed, increase the memory requests and limits further.
Root Cause
The memory requests and limits configured for the ceph-csi-controller-manager pod may be insufficient for its workload. Increasing the memory requests and limits should alleviate the OOMKilled terminations.
Diagnostic Steps
- Check the pods in the openshift-storage namespace and validate that the ceph-csi-controller-manager-* pod is in the CrashLoopBackOff state or is experiencing a high number of restarts.
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
ceph-csi-controller-manager-847c49bf46-lprvv 1/2 CrashLoopBackOff 13 (33s ago) 49m
csi-addons-controller-manager-68cffdb84b-9k7nn 2/2 Running 2 (39m ago) 49m
- The Last State: of the manager container in the ceph-csi-controller-manager-* pod will be Terminated, and the Reason: will be OOMKilled
$ oc -n openshift-storage describe pod ceph-csi-controller-manager-847c49bf46-lprvv
<extra-output-removed-for-space>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled <---
Exit Code: 137
Started: Thu, 12 Dec 2024 15:44:07 -0500
Finished: Thu, 12 Dec 2024 15:44:25 -0500
Ready: False
Restart Count: 13
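The Exit Code: 137 shown above is itself a signature of the OOM kill: a process terminated by a fatal signal exits with 128 plus the signal number, and the kernel's OOM killer uses SIGKILL (signal 9).

```shell
# Exit codes for signal-terminated processes are 128 + signal number.
# SIGKILL (used by the kernel OOM killer) is signal 9:
echo $((128 + 9))   # prints 137
```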
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.