How to clean CRI-O storage in Red Hat OpenShift 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Container runtime: CRI-O
Issue
- Note: This should only be used if CRI-O is not able to create new workloads. It should not be a catch-all remediation for delays in pod startup.
- In particular, this procedure should not be used when encountering errors of the form Error reserving ctr name %s for id %s: name is reserved. These errors happen because of a bottleneck on the node, and removing and recreating all of the pods is only likely to exacerbate the issue.
- How to wipe CRI-O ephemeral storage?
- The kubelet service is restarting continuously on a node.
- A node can't run any pods, with container or CRI-O errors including but not limited to the following:
- Failing pod sandboxes due to container-related errors:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container XXX: error recreating the missing symlinks: error reading name of symlink for XXX: open /var/lib/containers/storage/overlay/XXX/link: no such file or directory
- Container creation failing with the can't stat lower layer error:
can't stat lower layer ... because it does not exist. Going through storage to recreate the missing symlinks.
- CRI-O failures due to container storage issues:
Failed to remove storage directory: unlinkat /var/lib/containers/storage/overlay-containers/586f92c81c4576e97be1091af010449e04feedaed726d28b8eb840ff87ec10c8/userdata/shm: device or resource busy
- CRI-O is continuously killed by SIGABRT, generating a stack trace that may look similar to the example in "Diagnostic Steps".
- Failing to pull images due to errors committing the images:
Failed to pull image "registry.redhat.io/ocs4/cephcsi-rhel8@sha256:bf274db28eb745135995c598c7c938509b1a22e4d526956c66eef4416af919c6": rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:2c2be27b4878555a6af9e39d5494444d6152e2a94aafff5d7132c515c373dcf5": error creating layer with ID "a5a1f55d627a5893af066519c6113c122becb8627cb9aca8e299138e406a46e1": Stat /var/lib/containers/storage/overlay/0c51fae87c95e09f4dfccee417713913004c97821d72473d07faeba9c4132761: no such file or directory
Resolution
This procedure will wipe the CRI-O ephemeral storage completely.
Without Node Reboot
See the official documentation.
NOTE: Currently (version 4.11 and older), following the documentation may result in Error finding container errors. The issue is that the networking plugin pod needs to be deleted last so that it can tear down the networking resources of the other pods.
Instead of simply using crictl rmp -fa, the operator can use the following commands:
$ ssh -i .ssh/id_rsa core@<worker-node-ip>
$ sudo -i
# systemctl stop kubelet.service
# for pod in $(crictl pods -q); do if [[ "$(crictl inspectp $pod | jq -r .status.linux.namespaces.options.network)" != "NODE" ]]; then crictl rmp -f $pod; fi; done
# crictl rmp -fa
Continue from step 4 below.
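The two-phase removal order used by the loop above can be illustrated with a dry run. This sketch uses made-up sandbox names and network modes instead of real crictl output: sandboxes whose network namespace is not "NODE" are removed first, while host-network sandboxes (such as the networking plugin pod) are left for the final crictl rmp -fa.

```shell
# Dry-run sketch of the two-phase sandbox removal (sample data, not real
# `crictl inspectp` output). Host-network sandboxes report network "NODE".
phase1=""; phase2=""
for entry in "pod-a:POD" "pod-b:POD" "sdn-x:NODE"; do
  pod="${entry%%:*}"; net="${entry##*:}"
  if [ "$net" != "NODE" ]; then
    phase1="$phase1 $pod"     # removed by the filtered loop
  else
    phase2="$phase2 $pod"     # removed last by `crictl rmp -fa`
  fi
done
echo "removed first:$phase1"
echo "removed last:$phase2"
```

Deleting the host-network (networking plugin) sandbox last ensures it is still available to clean up networking resources for every other pod it served.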
With Node Reboot
NOTE: Steps 2 through 5 need to be executed on the node that is drained. As part of the process, kubelet will be disabled; for this reason, make sure you can SSH to the node before beginning.
- Drain the node as a cluster-admin user.
### For RHOCP >= 4.7, see https://access.redhat.com/solutions/6801291
# oc adm drain NODENAME --ignore-daemonsets --delete-emptydir-data --disable-eviction --force
### For RHOCP <= 4.6
# oc adm drain NODENAME --ignore-daemonsets --delete-local-data --disable-eviction --force
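The only difference between the two commands is the flag name. A hypothetical helper (not part of the official tooling) that picks the right flag from a 4.y version string, assuming the rename happened in 4.7 as reflected above, could look like this:

```shell
# Pick the drain flag for a given RHOCP 4.y version string. Assumes
# --delete-local-data was renamed to --delete-emptydir-data in 4.7.
drain_flag() {
  minor="${1#4.}"
  if [ "$minor" -ge 7 ]; then
    echo "--delete-emptydir-data"
  else
    echo "--delete-local-data"
  fi
}
drain_flag 4.12   # --delete-emptydir-data
drain_flag 4.5    # --delete-local-data
```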
- Disable the kubelet service (CRI-O won't start either, because it is not enabled on its own and only starts as a dependency of the kubelet).
$ sudo -i
# systemctl disable kubelet.service
- Reboot the node. Try a soft reboot first, with a command like this:
# systemctl reboot
If the VM doesn't reboot cleanly after having performed this command, try to force a hard reboot.
- Perform the following steps before re-enabling and starting the kubelet.
$ sudo -i
# rm -rvf /var/lib/containers/*
# crio wipe -f
If rm -rvf fails due to "Device or resource busy", make sure that CRI-O is stopped before trying again:
# rm -rvf /var/lib/containers/*
rm: cannot remove '/var/lib/containers/storage/overlay': Device or resource busy
# systemctl stop crio.service
# rm -rvf /var/lib/containers/*
# crio wipe -f
# systemctl start crio.service
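The stop-and-retry sequence above follows a simple pattern: attempt the wipe, and if it fails with "Device or resource busy", stop crio.service and retry. The sketch below stubs the wipe with a counter so it runs anywhere; on a real node the try_wipe body would be the actual rm -rvf /var/lib/containers/* and crio wipe -f commands, and the echo lines would be the systemctl calls.

```shell
# Stubbed sketch of the wipe-retry pattern: the first attempt "fails"
# (simulating busy storage), so CRI-O is stopped before the second attempt.
attempts=0
try_wipe() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]     # simulate: first attempt fails, second succeeds
}
if ! try_wipe; then
  echo "wipe failed, stopping crio.service first"   # systemctl stop crio.service
  try_wipe && echo "wipe succeeded on retry"        # then systemctl start crio.service
fi
```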
- After that, enable and start the kubelet.service again:
# systemctl enable --now kubelet.service
- After a few minutes, the node should be in Ready status again. Execute the following command from the host where you are logged in to the cluster:
# oc get nodes
- When the node is in Ready status, uncordon it so new pods can be scheduled. From the host where you are logged in to the cluster, execute the following command:
# oc adm uncordon NODENAME
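A cordoned node shows "Ready,SchedulingDisabled" in the STATUS column of oc get nodes until it is uncordoned. This sketch parses a made-up oc get nodes listing (hypothetical node names) to confirm a node is both Ready and schedulable:

```shell
# Extract the STATUS column for one node from sample `oc get nodes` output.
# A schedulable node shows plain "Ready"; a cordoned one shows
# "Ready,SchedulingDisabled".
status=$(cat <<'EOF' | awk '$1 == "worker-1" { print $2 }'
NAME       STATUS                     ROLES    AGE   VERSION
worker-1   Ready                      worker   42d   v1.25.4
worker-2   Ready,SchedulingDisabled   worker   42d   v1.25.4
EOF
)
echo "worker-1 status: $status"
```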
- Check new pods are being deployed correctly.
Diagnostic Steps
- Check pod logs and events from the namespace:
# oc logs <pod-name>
# oc get events
# oc describe pod/<pod-name>
- Check pods status and CRI-O logs on a worker node:
# oc get pods --all-namespaces --field-selector spec.nodeName=NODENAME
# oc adm node-logs --unit crio NODENAME
- Check the logs of CRI-O:
# journalctl -u crio
The logs may show that CRI-O is being killed with SIGABRT, with a stack trace similar to this:
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: crio.service: Main process exited, code=killed, status=6/ABRT
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: crio.service: Failed with result 'signal'.
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: crio.service: Consumed 861ms CPU time
May 24 07:22:01 odf-01.qa.ocp.example.com systemd-coredump[1276211]: Process 1276062 (crio) of user 0 dumped core.
Stack trace of thread 1276205:
#0  0x000055f0c63e7961 runtime.raise (crio)
#1  0x000055f0c63c35f1 runtime.sigfwdgo (crio)
#2  0x000055f0c63c1df4 runtime.sigtrampgo (crio)
#3  0x000055f0c63e7ce3 runtime.sigtramp (crio)
#4  0x00007fc0a0899b20 __restore_rt (libpthread.so.0)
#5  0x000055f0c63e7961 runtime.raise (crio)
#6  0x000055f0c63ab62e runtime.fatalpanic (crio)
#7  0x000055f0c63aaf65 runtime.gopanic (crio)
#8  0x000055f0c6ea0e4b github.com/cri-o/cri-o/vendor/go.etcd.io/bbolt.(*freelist).read (crio)
#9  0x000055f0c6eab597 github.com/cri-o/cri-o/vendor/go.etcd.io/bbolt.(*DB).loadFreelist.func1 (crio)
#10 0x000055f0c63ff1ce sync.(*Once).doSlow (crio)
#11 0x000055f0c6e9b68c github.com/cri-o/cri-o/vendor/go.etcd.io/bbolt.(*DB).loadFreelist (crio)
#12 0x000055f0c6e9b12f github.com/cri-o/cri-o/vendor/go.etcd.io/bbolt.Open (crio)
#13 0x000055f0c6eadf95 github.com/cri-o/cri-o/vendor/github.com/containers/image/v5/pkg/blobinfocache/boltdb.(*cache).update (crio)
#14 0x000055f0c6eae68f github.com/cri-o/cri-o/vendor/github.com/containers/image/v5/pkg/blobinfocache/boltdb.(*cache).RecordKnownLocation (crio)
#15 0x000055f0c7195849 github.com/cri-o/cri-o/vendor/github.com/containers/image/v5/docker.(*dockerImageSource).GetBlob (crio)
#16 0x000055f0c70e354a github.com/cri-o/cri-o/vendor/github.com/containers/image/v5/copy.(*imageCopier).copyLayer (crio)
#17 0x000055f0c70ebda5 github.com/cri-o/cri-o/vendor/github.com/containers/image/v5/copy.(*imageCopier).copyLayers.func1 (crio)
#18 0x000055f0c63e6141 runtime.goexit (crio)
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: crio.service: Service RestartSec=100ms expired, scheduling restart.
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: crio.service: Scheduled restart job, restart counter is at 54727.
May 24 07:22:01 odf-01.qa.ocp.example.com systemd[1]: Stopping Kubernetes Kubelet...
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.