Data distribution among OSDs is uneven

Solution Verified - Updated

Environment

  • Red Hat Ceph Storage 1.3.x
  • Red Hat Ceph Storage 2.x

Issue

  • Why is the distribution of data among OSDs uneven?
  • Why is the OSD filesystem utilization not balanced among the OSD nodes?
  • ceph osd df imbalance
  • reweight-by-utilization for OSD imbalance issue or tweaking crush is needed
  • Advance CRUSH profile to hammer
  • nearfull warnings and balancing CRUSH map

Resolution

  • Ceph distributes data with the help of CRUSH as evenly as it can, but it does not guarantee equal distribution.

  • Check the general reasons mentioned in the Root Cause section and take the required actions mentioned below.

  • Take appropriate action to make the number of OSDs balanced among the OSD nodes.
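
    As a quick check of whether OSDs are balanced among nodes, the per-host layout and per-OSD utilization can be inspected with the following commands (column layout varies by release):

    ```shell
    # Show the CRUSH hierarchy: how many OSDs sit under each host/rack.
    ceph osd tree

    # Show per-OSD utilization and variance within the same hierarchy.
    ceph osd df tree
    ```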

  • Before applying any changes discussed below, do not forget to throttle backfill and recovery for the OSDs.
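
    A hedged sketch of such throttling (the option names are standard OSD settings; the values shown are conservative assumptions, not mandated numbers):

    ```shell
    # Reduce backfill/recovery concurrency before making reweight or CRUSH changes.
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # After the cluster returns to HEALTH_OK, restore your previous values with
    # another injectargs call.
    ```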

  • A couple of recommendations that can help you get the best out of the CRUSH algorithm:

  • Please calculate the correct Placement Group count for your cluster with the help of the placement group calculator tool, and if the count needs to be increased, follow the article Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster.
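
    The rule of thumb behind the calculator can be sketched in plain shell arithmetic; num_osds, target_pgs_per_osd, and replica_count below are assumed example values, not values from your cluster:

    ```shell
    # total PGs ~= (OSDs x target PGs per OSD) / replica count,
    # rounded up to the next power of two.
    num_osds=40
    target_pgs_per_osd=100
    replica_count=3

    raw=$(( num_osds * target_pgs_per_osd / replica_count ))   # 1333

    # Round up to the next power of two.
    pg_count=1
    while [ "$pg_count" -lt "$raw" ]; do
        pg_count=$(( pg_count * 2 ))
    done

    echo "$pg_count"   # 2048
    ```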

  • It is always better to use the optimal tunables if your clients and cluster can support them (Warning: to use the optimal tunables, the Ceph clients and cluster should be on the same version).

  • For example, if you are using the hammer release, the optimal tunables profile is hammer.

  • You can check this with the following command:

          $ ceph osd crush show-tunables 
          "profile": "hammer", <======= profile field will have correct value.
    
  • If changing the tunables to optimal makes the profile hammer or later, you can also change the bucket algorithm from straw to straw2, which is supported as of the hammer release.

    Reference articles for migrating the straw algorithm to straw2:

  • If you are planning to change the tunables from legacy to optimal, note that this change will rebalance a large amount of data. Schedule a change window for these changes, for example a weekend when the cluster load is low. You can go directly from legacy (argonaut) to optimal (for the hammer release, the profile is hammer with straw2) in one go, or you can use multiple windows, argonaut -> bobtail -> firefly -> (hammer with straw2) -> (jewel), waiting for the data rebalance to finish after each tunables change, including the straw2 change, and letting the cluster return to HEALTH_OK. It is better to plan one big change window and go directly from legacy (argonaut or bobtail) to firefly, and then from firefly to optimal (for hammer the optimal profile is hammer, and for jewel the optimal profile is jewel).

  • For more information about CRUSH tunables, please check crush-map-tunables.
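
    A hedged sketch of such a staged migration, combining the tunables steps with the straw-to-straw2 edit (the crushmap file names are placeholders; wait for the rebalance to finish and for HEALTH_OK after every step):

    ```shell
    # Step the tunables profile forward one release at a time.
    ceph osd crush tunables firefly
    # ... wait for HEALTH_OK ...
    ceph osd crush tunables hammer

    # Convert straw buckets to straw2 by editing the decompiled CRUSH map.
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    sed -i 's/alg straw$/alg straw2/' crushmap.txt
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin
    ```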

  • If you are running the optimal tunables for your cluster and still see a big difference in OSD usage, you can use the test-reweight-by-utilization and reweight-by-utilization commands. The test-reweight-by-utilization command was added in Red Hat Ceph Storage 1.3.2 (version 0.94.5-12.el7cp).
  • test-reweight-by-utilization :
    - Warning: Red Hat recommends a weight change value of .05 or smaller, and reweighting no more than 10 OSDs at one time.
    - Red Hat also recommends testing these reweight changes before applying them to the OSDs. You can use the following command:

           $ ceph osd test-reweight-by-utilization {threshold} {weight_change_amount} {number_of_OSDs}
            For example: $ ceph osd test-reweight-by-utilization 120 .05 10
    
  • To minimize the impact to the storage cluster, implement the OSD weight changes in small increments. For more information please check reweight-by-utilization section.
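
    Putting these recommendations together, one conservative cycle might look like this (the threshold, increment, and OSD count are the example values from above):

    ```shell
    # Dry run: report which OSDs would change and by how much, without applying.
    ceph osd test-reweight-by-utilization 120 .05 10

    # Apply the same small change, then let backfill finish before repeating.
    ceph osd reweight-by-utilization 120 .05 10
    ceph -s   # watch until HEALTH_OK before the next increment
    ```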

  • The $ ceph osd reweight-by-utilization command now also accepts the {weight_change_amount} and {number_of_OSDs} arguments, just like test-reweight-by-utilization:

          $ ceph osd reweight-by-utilization --help
          osd reweight-by-utilization {<int>} {<float>} {<int>} {--no-increasing}
              reweight OSDs by utilization [overload-percentage-for-consideration, default 120]
    
  • These commands also accept a --no-increasing option:

  • --no-increasing is off by default, so reweight-by-utilization and test-reweight-by-utilization are allowed to increase OSD weights. When this option is passed, the commands will not increase an OSD's weight even if the OSD is underutilized.
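
    For example, to only lower the weight of overloaded OSDs and leave underutilized ones untouched (same example parameters as above):

    ```shell
    # Dry run first, then apply; --no-increasing prevents any weight increases.
    ceph osd test-reweight-by-utilization 120 .05 10 --no-increasing
    ceph osd reweight-by-utilization 120 .05 10 --no-increasing
    ```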

Root Cause

  • There are four main reasons identified for the imbalance in OSD filesystem utilization:
    1. The number of OSDs is not balanced among the OSD nodes in the cluster.
    2. The PG count is not appropriate for the number of OSDs, the use case, the target PGs per OSD, and the utilization.
    3. Inappropriate CRUSH tunables.
    4. The OSDs' backend storage is nearfull and more OSDs need to be added to the cluster.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.