ceph df MAX AVAIL is incorrect for simple replicated pool
Environment
- Red Hat Ceph Storage 1.3.z
- Upstream Hammer release
- Red Hat Ceph Storage 2.y
- Upstream Jewel release
- Red Hat Ceph Storage 3.y
- Upstream Luminous release
Issue
- ceph df MAX AVAIL is incorrect for simple replicated pool
Resolution
The MAX AVAIL value does not represent the amount of free space. Rather it represents the amount that can be written until the highest used OSD will get full. It is a complicated function of the replication or erasure code used, the CRUSH rule that maps storage to devices, the utilization of those devices, and the configured mon_osd_full_ratio.
Throughout its history, Ceph has used different formulas to calculate this value:
- Ceph uses the following formula to calculate the MAX AVAIL value:
min(osd.avail for osd in OSD_up) * len(osd.avail for osd in OSD_up) / pool.size()
- In the Ceph Hammer/Jewel code, the function PGMonitor::dump_pool_stats() calculates this value; its definition is in https://github.com/ceph/ceph/blob/jewel/src/mon/PGMonitor.cc (search for PGMonitor::dump_pool_stats). This function in turn calls get_rule_avail().
- In the Ceph Luminous code, the function PGMapDigest::dump_pool_stats_full() calculates this value; its definition is in https://github.com/ceph/ceph/blob/luminous/src/mon/PGMap.cc (search for PGMapDigest::dump_pool_stats_full). This function in turn calls get_rule_avail().
- min(osd.avail for osd in OSD_up) : The minimum space left on any OSD in the up set of the pool's CRUSH ruleset; as Sage suggested in upstream tracker #13844, "your usage is bounded by osd.X".
- len(osd.avail for osd in OSD_up) : The number of OSDs in the up set of the pool's CRUSH ruleset.
- pool.size() : The pool replication size.
- Ceph counts every OSD that can be selected by the CRUSH map according to the rule used by the pool in question.
- In RHCS 2.5, the MAX AVAIL calculation was changed so that 'mon_osd_full_ratio' is taken into account too:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - mon_osd_full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
- mon_osd_full_ratio = 0.95 by default.
- The change "mon/PGMap: factor mon_osd_full_ratio into MAX AVAIL calc" was backported to Jewel 10.2.10, on which RHCS 2.5 is based; see upstream tracker #20036.
- In RHCS 3, the MAX AVAIL calculation was changed so that the 'full_ratio' (default value 0.95) is taken into account:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
- You can check your current 'full_ratio' value with:
# ceph osd dump | grep full_ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
- and change it with the following command:
# ceph osd set-full-ratio 0.9
# ceph osd dump | grep full_ratio
full_ratio 0.9
backfillfull_ratio 0.9
nearfull_ratio 0.85
- See the GitHub commits "mon/OSDMonitor: implement new 'osd set-[near]full-ratio ...' commands" and "mon: Use currently configured full ratio to determine available space".
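The formulas above can be sketched in Python. This is a simplified illustration, not the actual Ceph source; the function names and the OSD capacity values are hypothetical:

```python
# Simplified sketch of the MAX AVAIL formulas described above.
# Not the actual Ceph implementation; all values below are hypothetical.

def max_avail_hammer(osd_avail_gb, pool_size):
    """Hammer/Jewel: min(avail) * number of OSDs / replication size."""
    return min(osd_avail_gb) * len(osd_avail_gb) / pool_size

def max_avail_with_full_ratio(osd_avail_gb, osd_total_gb, pool_size, full_ratio=0.95):
    """RHCS 2.5 / RHCS 3: reserve (1 - full_ratio) of the fullest OSD's capacity."""
    # Index of the OSD with the least space left.
    i = min(range(len(osd_avail_gb)), key=lambda k: osd_avail_gb[k])
    reserved = osd_total_gb[i] * (1 - full_ratio)
    return (osd_avail_gb[i] - reserved) * len(osd_avail_gb) / pool_size

# Hypothetical cluster: three 1000G OSDs with different free space, pool size 3.
avail = [800, 600, 700]
total = [1000, 1000, 1000]
print(max_avail_hammer(avail, 3))                          # 600.0
print(round(max_avail_with_full_ratio(avail, total, 3)))   # 550
```

Note how the newer formula is always smaller: it subtracts the headroom that the full ratio reserves on the most-utilized OSD before scaling out to the whole pool.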
Root Cause
- Ceph pool usage can be bounded by osd.X, as described in upstream tracker #13844.
- For more information, please check the Resolution and Diagnostic Steps sections.
Diagnostic Steps
- ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    3195T    2436T         759T         23.77
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      57     252T      7.90          345T     66220069
- Now check the rbd pool's CRUSH ruleset:
pool 57 'rbd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 205704 flags hashpspool stripe_width 0
# rules
rule mkt_ext_ruleset {
ruleset 1 <===========================================
type replicated
min_size 1
max_size 7
step take default
step chooseleaf firstn 0 type host
step emit
}
root default {
id -2 # do not change unnecessarily
# weight 2096.640
alg straw
hash 0 # rjenkins1
item node_a weight 174.720
item node_b weight 174.720
item node_c weight 174.720
item node_d weight 174.720
item node_e weight 174.720
item node_f weight 174.720
item node_g weight 174.720
item node_h weight 174.720
item node_i weight 174.720
item node_j weight 174.720
item node_k weight 174.720
item node_l weight 174.720
}
- Not all of the host_group and host bucket lists for this rule are shown here.
- This rule covers OSDs from OSD.0 through OSD.575, for a total of 576 OSDs.
- Now find the OSD with the least space left among OSD.0 through OSD.575 by checking ceph osd df:
31 3.64000 1.00000 3723G 1881G 1842G 50.53 2.13 <------------- OSD.31 has the highest variance (2.13) and only *1842G* left
- The rbd pool has replication size 3.
- Now let us apply the formula:
min(osd.avail for osd in OSD_up) * len(osd.avail for osd in OSD_up) / pool.size()
(1842 * 576) / 3 = 1060992 / 3 = 353664 GB; converted to TB: 353664 / 1024 = 345.375 TB
- This matches the current MAX AVAIL value shown by ceph df.
- For RHCS 2.5 and RHCS 3:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - mon_osd_full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
((1842 - (3723 * (1 - 0.95))) * 576) / 3 ≈ ((1842 - 186) * 576) / 3 = 317952 GB ≈ 310.5 TB
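The arithmetic above can be double-checked with a short Python snippet; the input values are taken directly from this example (1842G free on the fullest OSD, 3723G total size for that OSD, 576 OSDs, pool size 3):

```python
# Verify the MAX AVAIL arithmetic from the diagnostic example above.
min_avail_gb = 1842   # least free space in the up set (OSD.31)
osd_total_gb = 3723   # total size of that OSD
num_osds = 576        # OSDs selected by the pool's CRUSH rule
pool_size = 3         # replication size
full_ratio = 0.95

# Hammer/Jewel formula: min(avail) * number of OSDs / pool size, in TB.
old_tb = min_avail_gb * num_osds / pool_size / 1024
print(round(old_tb, 3))   # 345.375

# RHCS 2.5 / RHCS 3 formula: subtract the full-ratio reserve first.
reserved = osd_total_gb * (1 - full_ratio)   # ~186G held back on the fullest OSD
new_tb = (min_avail_gb - reserved) * num_osds / pool_size / 1024
print(round(new_tb, 1))   # 310.5
```

Both results agree with the values derived by hand above (345.375 TB for the old formula, roughly 310.5 TB once the full ratio is factored in).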
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.