Data distribution among OSDs is uneven
Environment
- Red Hat Ceph Storage 1.3.x
- Red Hat Ceph Storage 2.x
Issue
- Why is the distribution of data among OSDs uneven?
- Why is the OSD filesystem utilization not balanced among the OSD nodes?
- ceph osd df shows imbalance
- reweight-by-utilization for an OSD imbalance issue, or tweaking CRUSH is needed
- Advance the CRUSH profile to hammer
- nearfull warnings and balancing the CRUSH map
Resolution
- Ceph distributes data with the help of CRUSH in the best possible way, but it does not guarantee equal distribution.
- Check the general reasons mentioned in the Root Cause section and take the required actions mentioned below.
- Take appropriate action to make the number of OSDs balanced among the OSD nodes.
- Before applying any changes discussed below, do not forget to throttle backfill and recovery for the OSDs.
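For reference, backfill and recovery can be throttled at runtime with injectargs. The values below are only illustrative; tune them for your cluster and make them persistent in ceph.conf if needed:

```shell
# Reduce backfill/recovery concurrency so client I/O is less affected
# (example values; adjust for your cluster)
$ ceph tell osd.* injectargs '--osd-max-backfills 1'
$ ceph tell osd.* injectargs '--osd-recovery-max-active 1'
$ ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
```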
- A couple of recommendations that can help you get the best out of the CRUSH algorithm:
- Calculate the correct Placement Group count for your cluster with the help of the placement group calculator tool, and if the pools need an increase, then use the article Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster.
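As a simplified sketch of the formula the calculator is based on (hypothetical values: 40 OSDs, 3-way replication, target of 100 PGs per OSD; the real calculator also weighs per-pool data percentages):

```shell
# Hypothetical cluster: 40 OSDs, 3-way replication, target of 100 PGs per OSD.
osds=40
replicas=3
target_pgs_per_osd=100

# Raw PG count = (number of OSDs * target PGs per OSD) / replica count
raw=$(( osds * target_pgs_per_osd / replicas ))

# Round up to the next power of two, as is conventional for pg_num
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do
    pg_num=$(( pg_num * 2 ))
done
echo "$pg_num"
```

With these example numbers the raw count is 1333, which rounds up to a pg_num of 2048.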
- It is always better to use the optimal tunable if your clients and cluster can support it (Warning: to use the optimal tunable, the Ceph clients and cluster should be on the same version).
- For example, if you are using the hammer release, then in this release the optimal tunable would be hammer.
- You can check this via the below command:
$ ceph osd crush show-tunables
    "profile": "hammer",    <======= the profile field will have the correct value
- If you are changing tunables to optimal, which will make the profile hammer or above, then you can also change the straw algorithm to straw2, which is supported from the hammer release. Reference articles for migrating the straw algorithm to straw2:
- If you are planning to change tunables from legacy to optimal, then please note that this change will move a lot of data (data rebalance). You should schedule a change window for these changes, for example at the weekend when there is less load on the cluster. You can go directly from legacy (argonaut) to optimal (for the hammer release the profile is hammer with straw2) in one go, or you can choose multiple windows: argonaut -> bobtail -> firefly -> (hammer with straw2) -> (jewel), waiting for the data rebalance to finish after each tunable change, including the straw2 change, and letting the cluster come back to HEALTH_OK. It is better to plan one big change window and go directly from legacy (argonaut or bobtail) to firefly, and then from firefly to optimal (for hammer the optimal profile is hammer, and for jewel the optimal profile is jewel).
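For reference, the tunables profile itself is changed with a single command. It triggers data movement, so run it only inside the planned change window:

```shell
# Set the CRUSH tunables profile (moves data; do this in a change window).
# Use the profile matching your release, e.g. firefly or hammer:
$ ceph osd crush tunables hammer
# or go straight to the optimal profile for the running release:
$ ceph osd crush tunables optimal
```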
- For more information about CRUSH tunables, please check crush-map-tunables.
- Recommended: Before implementing tunable changes in the Ceph cluster, it is recommended to check the new PG distribution with the help of this article: How can I test the impact CRUSH map tunable modifications will have on my pg distribution across OSDs in Red Hat Ceph Storage?
- If you are running the optimal tunable for your cluster and you still see a big difference in OSD usage, then you can take the help of the test-reweight-by-utilization and reweight-by-utilization commands. The test-reweight-by-utilization command was added in Red Hat Ceph Storage 1.3.2, which has the version 0.94.5-12.el7cp.
- test-reweight-by-utilization:
- Warning: Red Hat recommends a weight change value of .05 or smaller, and reweighting no more than 10 OSDs at one time.
- Red Hat also recommends testing these reweight changes before applying them to the OSDs. You can use the below command:
$ ceph osd test-reweight-by-utilization {threshold} {weight_change_amount} {number_of_OSDs}
For example:
$ ceph osd test-reweight-by-utilization 120 .05 10
- To minimize the impact to the storage cluster, implement the OSD weight changes in small increments. For more information, please check the reweight-by-utilization section.
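The capped-change behavior can be illustrated with a small shell sketch. The numbers are hypothetical (cluster average utilization 60%, one OSD at 78%, current reweight 1.0, overload threshold 120, maximum change 0.05), and this only mimics the general idea; the real command computes this internally:

```shell
# Hypothetical numbers: cluster average utilization 60%, one OSD at 78%,
# current reweight 1.0, overload threshold 120, maximum change 0.05.
avg=60
util=78
cur=1.0
threshold=120
max_change=0.05

new=$(awk -v a="$avg" -v u="$util" -v c="$cur" -v t="$threshold" -v m="$max_change" '
BEGIN {
    if (u * 100 > a * t) {        # OSD sits above the overload threshold
        n = c * a / u             # scale the reweight down toward the average
        if (c - n > m) n = c - m  # cap the change at max_change (here 0.05)
        printf "%.4f\n", n
    } else {
        printf "%.4f\n", c        # OSD not overloaded: weight unchanged
    }
}')
echo "$new"
```

Here the proportional reweight (about 0.77) would exceed the .05 cap, so the weight only drops from 1.0000 to 0.9500.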
- Note that $ ceph osd reweight-by-utilization now also takes the {weight_change_amount} and {number_of_OSDs} arguments, like test-reweight-by-utilization:
$ ceph osd reweight-by-utilization --help
osd reweight-by-utilization {} {} {} {--no-increasing}
    reweight OSDs by utilization [overload-percentage-for-consideration, default 120]
- Also, we have the --no-increasing option with these commands. --no-increasing is off by default, so increasing the OSD weight is allowed when using the reweight-by-utilization or test-reweight-by-utilization commands. If this option is used with these commands, it will prevent increasing the OSD weight even when the OSD is underutilized.
Root Cause
- There are four main reasons identified for the imbalance in OSD filesystem utilization:
- The number of OSDs is not balanced among the OSD nodes in the cluster.
- The PG count is not appropriate for the number of OSDs, the use case, the target PGs per OSD, and the utilization.
- Inappropriate CRUSH tunables.
- The OSDs' backend storage is nearfull and more OSDs need to be added to the cluster.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.