Why does a RHCS/Ceph cluster report the status of "HEALTH_WARN x near full osd(s)"?


Environment

  • Red Hat Ceph Storage

Issue

  • Red Hat Ceph Storage ceph -s reports:
# ceph -s
 HEALTH_WARN 
  x near full osd(s)
  • Why does deleting images fail when the Ceph storage becomes full, with the following messages in the OSD logs?
 log_channel(cluster) log [WRN] : OSD near full (95%)

Resolution

  • When a Red Hat Ceph Storage cluster gets close to its maximum capacity (i.e., mon_osd_full_ratio), the cluster blocks writes to the underlying OSD disks as a safety measure to prevent data loss.

  • The maximum capacity is controlled by the tunables mon_osd_full_ratio and mon_osd_nearfull_ratio. The default values are 0.95 and 0.85 respectively.
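As an illustrative sketch (hypothetical numbers, not Ceph's actual code path), the monitor's check amounts to comparing each OSD's used/total ratio against the two tunables:

```shell
# Illustrative only: compare one OSD's utilization against the default
# nearfull (0.85) and full (0.95) ratios. USED/TOTAL are hypothetical GB values.
USED=870
TOTAL=1000
RATIO=$(awk -v u="$USED" -v t="$TOTAL" 'BEGIN { printf "%.2f", u / t }')
STATUS=$(awk -v r="$RATIO" 'BEGIN {
    if (r >= 0.95)      print "full"      # writes are blocked
    else if (r >= 0.85) print "nearfull"  # HEALTH_WARN: x near full osd(s)
    else                print "ok"
}')
echo "utilization ${RATIO} -> ${STATUS}"
```

With the hypothetical figures above (870 GB used of 1000 GB), the OSD crosses the 0.85 nearfull threshold but not the 0.95 full threshold, which matches the HEALTH_WARN state described in the Issue section.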

  • Check the OSDs' utilization and verify that they are using the storage space evenly:

# ceph df
# rados df
# ceph osd df  (available from RHCS 1.3.2 onwards)
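To spot uneven utilization in the ceph osd df output, the %USE column can be scanned for outliers. The following is a sketch against a saved, hypothetical capture; the column layout (here, %USE as field 7) is an assumption and should be verified against the output of your RHCS version:

```shell
# Hypothetical 'ceph osd df' capture; the %USE column position (field 7 here)
# can vary between RHCS versions, so adjust the awk field accordingly.
cat > osd_df.txt <<'EOF'
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
0  1.00000 1.00000  931G  791G  140G  85.01 1.12 210
1  1.00000 1.00000  931G  522G  409G  56.08 0.74 198
2  1.00000 1.00000  931G  605G  326G  65.03 0.86 203
EOF
# Flag OSDs at or above the default nearfull ratio (85%).
awk 'NR > 1 && $7 >= 85 { print "osd." $1, "is nearfull at", $7 "%" }' osd_df.txt
```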
  • This problem can also happen if some OSDs are smaller than others. The smaller OSDs may hit the warning earlier than the rest. In such cases, it is better to reduce the CRUSH weight of the smaller OSDs so that they are chosen less frequently for data writes.
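Reducing the weight of a smaller OSD is done with ceph osd crush reweight; by convention, the CRUSH weight reflects the device size in TB. A sketch (the OSD id and weight are hypothetical) that only prints the command it would run:

```shell
# Hypothetical: a 1 TB OSD among 2 TB peers; weight it by its size in TB
# so that CRUSH selects it proportionally less often.
OSD_ID=12
NEW_WEIGHT=1.0
CMD="ceph osd crush reweight osd.${OSD_ID} ${NEW_WEIGHT}"
echo "$CMD"   # review, then run on a MON/admin node
```

Note that reweighting triggers data movement, so it is best done while the cluster is otherwise healthy.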

  • Check that mon_osd_full_ratio and mon_osd_nearfull_ratio have not been changed from their defaults. If these values have been changed, make sure both are configured with reasonably high values in /etc/ceph/ceph.conf.

    • To check the values, execute the following command from a MON node:
    #  ceph daemon /var/run/ceph/ceph-mon.*.asok config show | egrep "mon_osd_full_ratio|mon_osd_nearfull_ratio"
    
    • To set a new value, use :
    # ceph tell mon.* injectargs '--mon_osd_full_ratio 0.<XY>'
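Because values set with injectargs do not persist across daemon restarts, the same values should also be written into /etc/ceph/ceph.conf on the MON nodes. A sketch that writes an example fragment (the 0.97/0.90 values are hypothetical; in practice the lines belong under the [global] section of the live file):

```shell
# Write an example fragment; in a real cluster these lines go under the
# [global] section of /etc/ceph/ceph.conf on each MON node.
cat > ceph.conf.example <<'EOF'
[global]
mon_osd_full_ratio = 0.97
mon_osd_nearfull_ratio = 0.90
EOF
grep 'ratio' ceph.conf.example
```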
    

NOTE: If the OSD has already hit the default 95% full ratio, it is better to add new OSDs rather than increase mon_osd_full_ratio. There is a high chance the new threshold will be reached quickly and writes will fail again.

  • Refer to the following articles for more information:
  1. What are the different full and nearfull ratios in Ceph?
  2. How to change a Ceph configuration dynamically?
  3. Why does an online change of the Ceph tunables "mon_osd_full_ratio" and "mon_osd_nearfull_ratio" not take effect?

Root Cause

  • RHCS clusters prevent writes to the underlying OSD disks when the used space hits the percentage set in mon_osd_full_ratio.
