Ceph - When adding failure domains to a CRUSH map data movement is seen even when new failure domains have no weight set
Environment
- Red Hat Ceph Storage 1.0
- Red Hat Ceph Storage 1.1
- Any upstream release prior to Firefly (0.80.x)
Issue
- When adding new failure domains to a CRUSH map in a cluster running a release prior to RHCS 1.2.3 (Firefly), data movement can be seen when the CRUSH map is updated, even when the newly added failure domains have no weight associated with them.
- In the example below, racks 4 and 5 are added to an existing CRUSH map in which racks 1, 2 and 3 already exist and have CRUSH weights assigned. Racks 4 and 5 correctly have a weight of 0 set, so updating the CRUSH map should not trigger any data movement at this time.
Example from an exported CRUSH map where racks 4 and 5 were added:
root default {
id -1 # do not change unnecessarily
# weight 1155.999
alg straw
hash 0 # rjenkins1
item rack1 weight 405.280
item rack2 weight 375.360
item rack3 weight 375.360
item rack4 weight 0
item rack5 weight 0
}
- After editing the CRUSH map and injecting it with the new failure domains configured, the cluster can be seen rebalancing data. You can use 'ceph -s' to track the data movement as placement groups are remapped.
- Since the newly configured racks have a weight of 0, adding them should not have triggered any data movement in the cluster.
- Details on de-compiling and editing CRUSH maps can be found here: How to edit a CRUSH map
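The decompile/verify/re-inject cycle from the bullets above can be sketched as follows. The live-cluster commands are shown as comments; the heredoc below is an illustrative stand-in for a real decompiled map, and the awk check confirms the newly added buckets really carry zero weight before the map is injected:

```shell
# On a live cluster, first export and decompile the CRUSH map:
#   ceph osd getcrushmap -o map.bin
#   crushtool -d map.bin -o crushmap.txt
# Sample stand-in for the decompiled map (illustrative content):
cat > crushmap.txt <<'EOF'
root default {
        item rack1 weight 405.280
        item rack4 weight 0
        item rack5 weight 0
}
EOF
# List buckets whose weight is zero, to confirm the new racks are weightless:
awk '$1 == "item" && $4 == 0 { print $2 }' crushmap.txt
# Then recompile and inject on the live cluster:
#   crushtool -c crushmap.txt -o map.new
#   ceph osd setcrushmap -i map.new
```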
Resolution
- Upgrade the cluster to the latest Red Hat supported version of Ceph. At the time of writing this is RHCS 1.3.1 (0.94.3, Hammer).
- Upgrade the Ceph client libraries to 1.3.1 as well so that clients support the new tunables.
- Set the tunables profile to optimal and enable hashpspool on the Ceph pools. (This will incur a large amount of data movement in the cluster, but will lead to less painful changes in the future. Both changes can be made together to prevent a separate round of data movement for each feature, but note that hashpspool must be set on each Ceph pool individually.)
# admin@ceph ~ ceph osd crush tunables optimal ; ceph osd pool set <name> hashpspool true
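Because hashpspool must be set per pool, a small loop makes this less error-prone. A minimal sketch follows; the sample pool list stands in for `rados lspools` (which prints one pool name per line), and the real `ceph` command is kept as a comment:

```shell
# Stand-in pool list; on a live cluster use: pools=$(rados lspools)
pools='rbd
data
metadata'
for pool in $pools; do
    # On a live cluster: ceph osd pool set "$pool" hashpspool true
    echo "ceph osd pool set $pool hashpspool true"
done
```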
- It is also recommended, once both the cluster and clients are on 1.3.1, to switch the cluster to the straw2 bucket algorithm, which helps prevent small changes in the cluster from incurring massive data movement. This can be done by decompiling the CRUSH map, changing the alg of each bucket from straw to straw2, and re-injecting the new CRUSH map. Note that this change will also incur data movement on the cluster!
host Ceph003 {
id -35 # do not change unnecessarily
# weight 27.200
alg straw # <----------------------- change to 'alg straw2'
hash 0 # rjenkins1
item osd.0 weight 2.720
item osd.1 weight 2.720
item osd.2 weight 2.720
item osd.4 weight 2.720
item osd.6 weight 2.720
item osd.11 weight 2.720
item osd.12 weight 2.720
item osd.13 weight 2.720
item osd.155 weight 2.720
item osd.169 weight 2.720
}
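The straw-to-straw2 edit can be scripted rather than done by hand. A minimal sketch, with the heredoc standing in for the file produced by `crushtool -d` and the live-cluster commands kept as comments:

```shell
# On a live cluster, first export and decompile:
#   ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o crushmap.txt
# Stand-in for the decompiled map (illustrative content):
cat > crushmap.txt <<'EOF'
host Ceph003 {
        alg straw
        hash 0
}
EOF
# Rewrite every straw bucket to straw2 (the $ anchor avoids touching
# lines that already read 'alg straw2'):
sed -i 's/alg straw$/alg straw2/' crushmap.txt
grep 'alg' crushmap.txt
# Then recompile and inject on the live cluster:
#   crushtool -c crushmap.txt -o map.new && ceph osd setcrushmap -i map.new
```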
Root Cause
- Running Bobtail tunables on a cluster can lead to strange behavior with CRUSH updates, such as data movement occurring because the new failure domains are considered by the placement algorithm even though they have no weight and no OSDs under them.
Bobtail tunables seen below:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
The currently active tunables profile can also be validated with 'ceph osd crush show-tunables':
# admin@ceph ~ ceph osd crush show-tunables
{ "choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"profile": "bobtail",
"optimal_tunables": 0,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1}
- Running a Ceph version below Firefly also likely means that 'hashpspool' is not enabled on the Ceph data pools, which can likewise lead to unexpected data movement in the cluster when the CRUSH map changes. The 'hashpspool' flag mixes a per-pool value into the seed of the PG placement algorithm, giving each pool a more randomized PG-to-OSD mapping relative to other pools. The flag was introduced in Firefly; any pool created in Firefly or later has it enabled by default, while pools created prior to Firefly do not.
You can verify whether hashpspool is set by reviewing:
# admin@ceph ~ ceph osd dump | grep pool
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 15 flags hashpspool stripe_width 0
This flag can be set in the ceph.conf file under the OSD section:
osd pool default flag hashpspool = true
Or can be set manually via the CLI on each pool:
# admin@ceph ~ ceph osd pool set <poolname> hashpspool true
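To find the pools that still need the flag, you can filter the dump output for pool lines that do not mention hashpspool. A minimal sketch; the sample file below stands in for live `ceph osd dump` output, and the pool names are illustrative:

```shell
# Stand-in for: ceph osd dump | grep pool  (illustrative output)
cat > osd_dump.txt <<'EOF'
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 flags hashpspool stripe_width 0
pool 3 'data' replicated size 3 min_size 2 crush_ruleset 0 stripe_width 0
EOF
# Pool lines lacking the hashpspool flag still need it set:
grep '^pool ' osd_dump.txt | grep -v hashpspool
```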
- The addition of the chooseleaf_vary_r setting in Firefly (and possibly hashpspool) addresses the data movement flaws seen here with CRUSH.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.