ceph df MAX AVAIL is incorrect for simple replicated pool
Environment
- Red Hat Ceph Storage 1.3.z
- Upstream Hammer release
- Red Hat Ceph Storage 2.y
- Upstream Jewel release
- Red Hat Ceph Storage 3.y
- Upstream Luminous release
Issue
- ceph df MAX AVAIL is incorrect for simple replicated pool
Resolution
The MAX AVAIL value does not represent the amount of free space. Rather it represents the amount that can be written until the highest used OSD will get full. It is a complicated function of the replication or erasure code used, the CRUSH rule that maps storage to devices, the utilization of those devices, and the configured mon_osd_full_ratio.
Throughout its history, Ceph has used different formulas to calculate this value:
- Ceph uses the following formula to calculate the MAX AVAIL value:
min(osd.avail for osd in OSD_up) * len(osd.avail for osd in OSD_up) / pool.size()
- In the Ceph Hammer/Jewel code, the function PGMonitor::dump_pool_stats() calculates this value; its definition is in https://github.com/ceph/ceph/blob/jewel/src/mon/PGMonitor.cc (search for PGMonitor::dump_pool_stats). This function in turn calls get_rule_avail().
- In the Ceph Luminous code, the function PGMapDigest::dump_pool_stats_full() calculates this value; its definition is in https://github.com/ceph/ceph/blob/luminous/src/mon/PGMap.cc (search for PGMapDigest::dump_pool_stats_full). This function in turn calls get_rule_avail().
- min(osd.avail for osd in OSD_up) : The minimum space left on any OSD in the up set of the pool's CRUSH ruleset; as Sage suggested in upstream tracker #13844, "your usage is bounded by osd.X".
- len(osd.avail for osd in OSD_up) : The number of OSDs in the up set of the pool's CRUSH ruleset.
- pool.size() : The pool replication size.
- Ceph counts every OSD that can be selected by the CRUSH map according to the rule used by the pool in question.
- In RHCS 2.5, the MAX AVAIL calculation was changed so that 'mon_osd_full_ratio' is taken into account too:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - mon_osd_full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
- mon_osd_full_ratio = 0.95 by default.
- The change "mon/PGMap: factor mon_osd_full_ratio into MAX AVAIL calc" was backported to Jewel 10.2.10, on which RHCS 2.5 is based; see upstream tracker #20036.
- In RHCS 3, the MAX AVAIL calculation was changed so that the 'full_ratio' (default value 0.95) is taken into account:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
- You can check your current 'full_ratio' value with:
# ceph osd dump | grep full_ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
- and change it with the following command:
# ceph osd set-full-ratio 0.9
# ceph osd dump | grep full_ratio
full_ratio 0.9
backfillfull_ratio 0.9
nearfull_ratio 0.85
- See the GitHub commits "mon/OSDMonitor: implement new 'osd set-[near]full-ratio ...' commands" and "mon: Use currently configured full ratio to determine available space".
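The formulas above can be sketched in Python. This is a simplified illustration, not the actual Ceph source; the function names and the OSD capacity values are hypothetical:

```python
# Simplified sketch of the MAX AVAIL formulas described above.
# Not the actual Ceph implementation; all values below are hypothetical.

def max_avail_hammer(osd_avail_gb, pool_size):
    """Hammer/Jewel: min(avail) * number of OSDs / replication size."""
    return min(osd_avail_gb) * len(osd_avail_gb) / pool_size

def max_avail_with_full_ratio(osd_avail_gb, osd_total_gb, pool_size, full_ratio=0.95):
    """RHCS 2.5 / RHCS 3: reserve (1 - full_ratio) of the fullest OSD's capacity."""
    # Index of the OSD with the least space left.
    i = min(range(len(osd_avail_gb)), key=lambda k: osd_avail_gb[k])
    reserved = osd_total_gb[i] * (1 - full_ratio)
    return (osd_avail_gb[i] - reserved) * len(osd_avail_gb) / pool_size

# Hypothetical cluster: three 1000G OSDs with different free space, pool size 3.
avail = [800, 600, 700]
total = [1000, 1000, 1000]
print(max_avail_hammer(avail, 3))                          # 600.0
print(round(max_avail_with_full_ratio(avail, total, 3)))   # 550
```

Note how the newer formula is always smaller: it subtracts the headroom that the full ratio reserves on the most-utilized OSD before scaling out to the whole pool.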
Root Cause
- Ceph pool usage can be bounded by osd.X, as described in upstream tracker #13844.
- For more information, please check the Resolution and Diagnostic Steps sections.
Diagnostic Steps
- ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    3195T    2436T         759T         23.77
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      57     252T      7.90          345T     66220069
- Now check the rbd pool's CRUSH ruleset:
pool 57 'rbd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 205704 flags hashpspool stripe_width 0
# rules
rule mkt_ext_ruleset {
ruleset 1 <===========================================
type replicated
min_size 1
max_size 7
step take default
step chooseleaf firstn 0 type host
step emit
}
root default {
id -2 # do not change unnecessarily
# weight 2096.640
alg straw
hash 0 # rjenkins1
item node_a weight 174.720
item node_b weight 174.720
item node_c weight 174.720
item node_d weight 174.720
item node_e weight 174.720
item node_f weight 174.720
item node_g weight 174.720
item node_h weight 174.720
item node_i weight 174.720
item node_j weight 174.720
item node_k weight 174.720
item node_l weight 174.720
}
- Not all of the host_group and host bucket lists for this rule are shown here.
- This rule covers OSDs from OSD.0 through OSD.575, for a total of 576 OSDs.
- Now find the OSD with the least space left among OSD.0 through OSD.575 by checking ceph osd df:
31 3.64000 1.00000 3723G 1881G 1842G 50.53 2.13 <------------- OSD.31 has the highest variance (2.13) and only *1842G* left
- The rbd pool has replication size 3.
- Now let us apply the formula:
min(osd.avail for osd in OSD_up) * len(osd.avail for osd in OSD_up) / pool.size()
(1842 * 576) / 3 = 1060992 / 3 = 353664 GB; converted to TB: 353664 / 1024 = 345.375 TB
- This matches the current MAX AVAIL value shown by ceph df.
- For RHCS 2.5 and RHCS 3:
[min(osd.avail for osd in OSD_up) - (min(osd.avail for osd in OSD_up).total_size * (1 - mon_osd_full_ratio))] * len(osd.avail for osd in OSD_up) / pool.size()
((1842 - (3723 * (1 - 0.95))) * 576) / 3 ≈ ((1842 - 186) * 576) / 3 = 317952 GB ≈ 310.5 TB
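The arithmetic above can be double-checked with a short Python snippet; the input values are taken directly from this example (1842G free on the fullest OSD, 3723G total size for that OSD, 576 OSDs, pool size 3):

```python
# Verify the MAX AVAIL arithmetic from the diagnostic example above.
min_avail_gb = 1842   # least free space in the up set (OSD.31)
osd_total_gb = 3723   # total size of that OSD
num_osds = 576        # OSDs selected by the pool's CRUSH rule
pool_size = 3         # replication size
full_ratio = 0.95

# Hammer/Jewel formula: min(avail) * number of OSDs / pool size, in TB.
old_tb = min_avail_gb * num_osds / pool_size / 1024
print(round(old_tb, 3))   # 345.375

# RHCS 2.5 / RHCS 3 formula: subtract the full-ratio reserve first.
reserved = osd_total_gb * (1 - full_ratio)   # ~186G held back on the fullest OSD
new_tb = (min_avail_gb - reserved) * num_osds / pool_size / 1024
print(round(new_tb, 1))   # 310.5
```

Both results agree with the values derived by hand above (345.375 TB for the old formula, roughly 310.5 TB once the full ratio is factored in).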
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.