ceph-csi-controller-manager Pod has High Restarts and/or is Stuck in CrashLoopBackOff


Environment

  • Red Hat OpenShift Container Platform 4.17+
  • Red Hat OpenShift Data Foundation (ODF) 4.17+

Issue

  • The ceph-csi-controller-manager-* pod is stuck in CrashLoopBackOff
  • The error OOMKilled can be seen in the pod description

Resolution

This process involves editing the ceph-csi-operator ClusterServiceVersion (CSV). During an OpenShift Data Foundation upgrade, the new CSV will overwrite the previous one, so this solution may need to be re-applied if the ceph-csi-controller-manager pod continues to be OOMKilled.

  1. Capture the current CephCSI Operator version:

$ oc get csv -n openshift-storage -l operators.coreos.com/cephcsi-operator.openshift-storage=

  2. Edit the cephcsi-operator CSV:

$ oc edit csv -n openshift-storage cephcsi-operator.v<version>-rhodf

  3. Increase the memory requests and limits on both of the containers in the CSV:
<extra-output-removed-for-space>
                resources:  
                  limits:   
                    cpu: 500m
                    memory: 128Mi <--- Increase to 256Mi
                  requests: 
                    cpu: 5m 
                    memory: 64Mi <--- Increase to 128Mi
<extra-output-removed-for-space>
                resources:  
                  limits:   
                    cpu: 500m
                    memory: 128Mi <--- Increase to 256Mi
                  requests: 
                    cpu: 10m
                    memory: 64Mi <--- Increase to 128Mi
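Where a scripted, repeatable change is preferred over interactive editing, the same adjustment can be sketched with a JSON patch and `oc patch`. This is not part of the official procedure; in particular, the deployment index (0) and container indices (0 and 1) below are assumptions and must be verified against your CSV first. The jq sanity check runs locally; the `oc patch` command requires cluster access.

```shell
# Build a JSON patch that raises the memory values on both containers.
# NOTE: deployment index 0 and container indices 0/1 are assumptions --
# confirm them in your CSV before applying.
PATCH='[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory","value":"128Mi"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/limits/memory","value":"256Mi"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/requests/memory","value":"128Mi"}
]'

# Sanity-check the patch locally before touching the cluster:
echo "$PATCH" | jq -e 'length == 4 and all(.[]; .op == "replace")' > /dev/null \
  && echo "patch OK"

# Apply on a live cluster (requires cluster access; not run here):
#   oc patch csv -n openshift-storage cephcsi-operator.v<version>-rhodf --type=json -p "$PATCH"
```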
  4. Verify that the limits and requests were properly adjusted:

Using jsonpath:

$ oc get csv cephcsi-operator.v<version>-rhodf -o jsonpath='{.spec.install.spec.deployments[*].spec.template.spec.containers[*].resources}'

Using jq:

$ oc get csv cephcsi-operator.v<version>-rhodf -o json | jq '.spec.install.spec.deployments[].spec.template.spec.containers[].resources'

Example:

$ oc get csv cephcsi-operator.v4.17.2-rhodf -o json | jq '.spec.install.spec.deployments[].spec.template.spec.containers[].resources'
{
  "limits": {
    "cpu": "500m",
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "5m",
    "memory": "128Mi"
  }
}
{
  "limits": {
    "cpu": "500m",
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "10m",
    "memory": "128Mi"
  }
}
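The verification step can also be turned into a scripted pass/fail check. The snippet below is a sketch: the inline SAMPLE JSON stands in for the output of `oc get csv <name> -o json` so the jq filter can be demonstrated without a cluster.

```shell
# Pass/fail check: every container in the CSV must carry the increased values.
CHECK='[.spec.install.spec.deployments[].spec.template.spec.containers[].resources]
       | all(.limits.memory == "256Mi" and .requests.memory == "128Mi")'

# Inline sample standing in for the live CSV JSON (illustration only):
SAMPLE='{"spec":{"install":{"spec":{"deployments":[{"spec":{"template":{"spec":{"containers":[{"resources":{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}}]}}}}]}}}}'

echo "$SAMPLE" | jq -e "$CHECK" > /dev/null && echo "resources updated"

# On a live cluster:
#   oc get csv -n openshift-storage cephcsi-operator.v<version>-rhodf -o json | jq -e "$CHECK"
```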
  5. If frequent pod restarts are still observed, consider increasing the memory requests and limits further.

Root Cause

The default memory requests and limits for the ceph-csi-controller-manager deployment may be insufficient, causing the kernel to OOM-kill the manager container. Increasing the memory requests and limits should alleviate this issue.

Diagnostic Steps

  • Check the pods in the openshift-storage namespace and validate that the ceph-csi-controller-manager-* pod is in the CrashLoopBackOff state or is experiencing a high number of pod restarts.
$ oc get pods -n openshift-storage
NAME                                               READY   STATUS             RESTARTS       AGE
ceph-csi-controller-manager-847c49bf46-lprvv       1/2     CrashLoopBackOff   13 (33s ago)   49m
csi-addons-controller-manager-68cffdb84b-9k7nn     2/2     Running            2 (39m ago)    49m
  • In the pod description, the Last State: of the manager container in the ceph-csi-controller-manager-* pod will be Terminated, and the Reason: will be OOMKilled:
$ oc -n openshift-storage describe pod ceph-csi-controller-manager-847c49bf46-lprvv
<extra-output-removed-for-space>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled <--- 
      Exit Code:    137
      Started:      Thu, 12 Dec 2024 15:44:07 -0500
      Finished:     Thu, 12 Dec 2024 15:44:25 -0500
    Ready:          False
    Restart Count:  13
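To surface OOMKilled containers across the namespace without reading each pod description, the same fields can be extracted with a jq filter. This is a sketch: the inline SAMPLE stands in for the output of `oc get pods -n openshift-storage -o json`, using the manager container and restart count shown above.

```shell
# Extract the container name and restart count for any container whose
# last termination reason was OOMKilled.
FILTER='.items[].status.containerStatuses[]?
        | select(.lastState.terminated.reason? == "OOMKilled")
        | "\(.name) restarts=\(.restartCount)"'

# Inline sample standing in for live pod JSON (illustration only):
SAMPLE='{"items":[{"status":{"containerStatuses":[{"name":"manager","restartCount":13,"lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}}]}'

echo "$SAMPLE" | jq -r "$FILTER"

# On a live cluster:
#   oc get pods -n openshift-storage -o json | jq -r "$FILTER"
```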
