ODF is showing "nearfull osd(s)" warning
Environment
ODF 4.x
Issue
- The ODF console is showing a "nearfull osd(s)" warning.
- The "ceph status" output in an ODF cluster shows a "nearfull osd(s)" warning:
cluster:
id: xxxxxxxx-xxx-xxx-xxxx-xxxxxxxxxxx
health: HEALTH_WARN
2 nearfull osd(s)
11 pool(s) nearfull
Resolution
There are multiple ways to reclaim space and potentially resolve this warning:

1. Delete any unneeded CephFS or RBD PVCs from the cluster.
2. Delete any unneeded CephFS or RBD snapshots from the cluster.
3. Run fstrim -av on all OCP worker nodes. (Note: the fstrim command could in theory have a temporary negative effect on the OSDs' client I/O throughput. If you are at all concerned about this, please schedule this activity during a maintenance window. If you would like to run fstrim on individual RBD PVCs, or would just like more information about what this does, please refer to this KCS article.)
   $ for i in $(oc get node -l '!node-role.kubernetes.io/master' -o name); do oc debug $i -- chroot /host fstrim -av; done
4. Run reclaim space jobs on each RBD PVC. This is slower and more tedious than Step 3, but the benefit of this process is the ability to schedule the activity to minimize impact during peak operating times. The reclaim space job performs an fstrim and an rbd sparsify, resulting in a deeper clean. Given the manual nature of this procedure, it is recommended to target only the largest RBD PVCs. Please see the steps outlined in Chapter 9 of our documentation for more information: Chapter 9. Reclaiming space on target volumes
5. Check for stale or orphaned CephFS volumes: https://access.redhat.com/solutions/7130240
6. Check for stale or orphaned RBD volumes: https://access.redhat.com/solutions/6293581
7. If Elasticsearch or Loki is using CephFS or RBD PVCs, or RGW/NooBaa OBCs, for log storage, consider altering your cluster and application log retention policies: https://access.redhat.com/solutions/4099671
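Steps 1 and 2 can be approached by first finding the largest candidates. A minimal sketch, assuming the default ODF storage class names ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs (adjust to your environment):

```shell
# List PVCs backed by the ODF storage classes across all namespaces,
# so the largest requested capacities are easy to spot.
oc get pvc -A | grep -E 'ocs-storagecluster-ceph-rbd|ocs-storagecluster-cephfs'

# List CSI volume snapshots across all namespaces.
oc get volumesnapshot -A

# Delete an unneeded PVC or snapshot (hypothetical names shown):
# oc delete pvc <pvc-name> -n <namespace>
# oc delete volumesnapshot <snapshot-name> -n <namespace>
```

Verify with the application owner that a PVC or snapshot is truly unneeded before deleting it; space is only returned to the cluster once the backing image is removed.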
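The reclaim space job in Step 4 is created from a ReclaimSpaceJob custom resource provided by CSI Addons. A minimal sketch, assuming a hypothetical RBD PVC named my-rbd-pvc in namespace my-app (see Chapter 9 of the documentation for the authoritative procedure):

```shell
# Create a one-shot ReclaimSpaceJob targeting a single RBD PVC.
# The job must be created in the same namespace as the PVC.
cat <<'EOF' | oc apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: ReclaimSpaceJob
metadata:
  name: reclaim-my-rbd-pvc
  namespace: my-app
spec:
  target:
    persistentVolumeClaim: my-rbd-pvc
EOF

# Check the job's result once it completes:
oc get reclaimspacejob reclaim-my-rbd-pvc -n my-app -o yaml
```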
If the methods listed above do not resolve the nearfull warnings, you may need to expand the ODF cluster with additional OSDs. Please refer to our documentation for instructions on adding OSDs to the ODF cluster.
If none of the above solutions resolve the warning, please open a case with Red Hat Support for assistance.
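Expansion typically means increasing the count of a storage device set in the StorageCluster resource. A sketch only, assuming the default StorageCluster name ocs-storagecluster and a single device set; confirm capacity, node, and placement prerequisites in the official scaling documentation before changing anything:

```shell
# Inspect the current device-set count (default StorageCluster name assumed).
oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.spec.storageDeviceSets[0].count}'

# Illustrative example: raise the count from 1 to 2, which adds one new OSD
# per replica in that device set.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
  -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2}]'
```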
Root Cause
ODF, like any storage solution, fills up with data over time. This warning is communicating that the underlying storage devices are filling up, and the capacity of the storage devices is at risk of exceeding the "full threshold" (85% by default in ODF). It is much harder to address capacity problems once a cluster has reached its full threshold, so please take action to address capacity problems while the OSDs are "nearfull" rather than "full".
Diagnostic Steps
- Check the ODF console for warnings related to "nearfull osd(s)"
- Run the following command from an oc client and look for any warnings related to "nearfull osd(s)":
$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph status -c /var/lib/rook/openshift-storage/openshift-storage.config
cluster:
id: xxxxxxxx-xxx-xxx-xxxx-xxxxxxxxxxx
health: HEALTH_WARN
2 nearfull osd(s)
11 pool(s) nearfull
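To see which OSDs are nearfull and how full each one is, the same invocation pattern as above can run ceph osd df and ceph osd dump (a sketch reusing the rook-ceph-operator pod and config path shown in the previous command):

```shell
# Per-OSD utilization: the %USE column shows which OSDs have crossed
# the nearfull ratio.
oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) \
  -n openshift-storage -- ceph osd df \
  -c /var/lib/rook/openshift-storage/openshift-storage.config

# Current cluster ratio settings (full, backfillfull, nearfull):
oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) \
  -n openshift-storage -- ceph osd dump \
  -c /var/lib/rook/openshift-storage/openshift-storage.config | grep ratio
```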
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.