Why doesn't changing pg_num on a pool that uses a custom CRUSH ruleset change the total number of PGs in the cluster?


Environment

  • Red Hat Ceph Storage 1.2.3

  • Red Hat Ceph Storage 1.3

Issue

  • Changing the placement group number on a pool doesn't increase the total number of placement groups in the cluster.

  • The new pool on which pg_num was changed uses a custom ruleset.

  • The following steps were followed to reproduce this:

1. Create a new CRUSH ruleset

rule new_ruleset {
        ruleset 50
        type replicated
        min_size 1
        max_size 10
        step take sdd
        step chooseleaf firstn 0 type host
        step emit
}

2. Create a new pool and set the CRUSH ruleset to 'new_ruleset'

# ceph osd pool create <pool_name> 64 64 replicated new_ruleset

# ceph osd pool set <pool_name> crush_ruleset 50

3. Check the pool status

# ceph osd dump | grep new_pool
pool 9 'new_pool' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1574 flags hashpspool stripe_width 0

4. Increase the PG and PGP number on the new pool

# ceph osd pool set <pool_name> pg_num 128
set pool 9 pg_num to 128

# ceph osd pool set <pool_name> pgp_num 128
set pool 9 pgp_num to 128

5. The PG and PGP numbers change for the pool, but the overall PG count for the cluster does not increase.

# ceph osd dump | grep new_pool
pool 9 'new_pool' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 128 pgp_num 128 last_change 1581 flags hashpspool stripe_width 0

# ceph -s
    cluster <cluster_id>
    health HEALTH_OK
    monmap e2: 3 mons at 
    {mon-1=10.0.1.100:6789/0,mon-2=10.0.1.101:6789/0,mon-3=10.0.1.102:6789/0},
    election epoch 84, quorum 0,1,2 mon-1,mon-2,mon-3
    osdmap e1582: 21 osds: 15 up, 15 in
    pgmap v560709: 1472 pgs, 5 pools, 285 GB data, 73352 objects
    80158 MB used, 695 GB / 779 GB avail
    1472 active+clean
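The mismatch can be confirmed by summing the per-pool pg_num values from 'ceph osd dump' and comparing the total against the 'pgs' figure reported by 'ceph -s'. The following is a minimal sketch; the pool lines in the here-document are hypothetical sample output, not taken from this cluster.

```shell
# Sample 'ceph osd dump' pool lines (hypothetical values); on a live cluster,
# generate this file with:  ceph osd dump | grep '^pool' > /tmp/pools.txt
cat > /tmp/pools.txt <<'EOF'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512
pool 9 'new_pool' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins pg_num 128 pgp_num 128
EOF

# Sum pg_num across all pools; this total should match the 'X pgs' figure
# in 'ceph -s'. If it does not, some pg_num change has not taken effect.
awk '{for (i = 1; i <= NF; i++) if ($i == "pg_num") sum += $(i+1)} END {print sum}' /tmp/pools.txt
# prints 640 for the sample above (512 + 128)
```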

Resolution

  • This problem occurs when the CRUSH rulesets are not numbered sequentially.

  • The new ruleset in this case was numbered '50', which is not sequential with the previous 'default' ruleset, numbered '0'.

  • A Bugzilla has been opened to track this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1258953

  • The workaround is to download the CRUSH map, change the custom ruleset number so that it is consecutive with the previous ruleset, and inject the modified CRUSH map back into the cluster.

  • The steps are:

1. Get the current CRUSH map

# sudo ceph osd getcrushmap -o /tmp/crushmap.bin

2. Decode the original binary CRUSH map file to a text file

# sudo crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

3. Edit the CRUSH map to set the new ruleset number
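For example, to renumber ruleset 50 to 1 (the next consecutive number after the default ruleset 0), a sed one-liner can be used. This is a sketch: the here-document recreates a sample of the decompiled map for illustration, whereas on a real system the file is the /tmp/crushmap.txt produced by 'crushtool -d' in step 2.

```shell
# Sample decompiled CRUSH map fragment (hypothetical); on a real system this
# file comes from 'crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt'
cat > /tmp/crushmap.txt <<'EOF'
rule new_ruleset {
        ruleset 50
        type replicated
        min_size 1
        max_size 10
        step take sdd
        step chooseleaf firstn 0 type host
        step emit
}
EOF

# Renumber 'ruleset 50' to 'ruleset 1', the next number after the default ruleset 0
sed -i 's/ruleset 50/ruleset 1/' /tmp/crushmap.txt
grep 'ruleset' /tmp/crushmap.txt    # should now show: ruleset 1
```

Note that any pool still referencing the old ruleset number may also need to be pointed at the new one (e.g. with 'ceph osd pool set <pool_name> crush_ruleset 1', as in the reproduction steps above); verify this against your cluster before injecting the map.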

4. Recompile the CRUSH map back to the binary format

# crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new

5. Test the CRUSH map rules

# crushtool --test -i /tmp/crushmap.new --num-rep 3 --rule 1 --show-statistics

6. Re-inject the CRUSH map

# sudo ceph osd setcrushmap -i /tmp/crushmap.new

7. After this change, the increased PG count should be visible in 'ceph -s'.
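Before re-injecting, it can be useful to verify that all ruleset IDs in the decompiled map are sequential starting from 0. The following awk check is a sketch; it assumes one 'ruleset N' line per rule, as in the decompiled format shown above, and the sample map it builds is hypothetical.

```shell
# Sample decompiled map fragment with two rules (hypothetical IDs)
cat > /tmp/crushmap.txt <<'EOF'
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule new_ruleset {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take sdd
        step chooseleaf firstn 0 type host
        step emit
}
EOF

# Extract every ruleset ID, sort numerically, and confirm they run 0,1,2,... with no gaps
awk '$1 == "ruleset" {print $2}' /tmp/crushmap.txt | sort -n \
  | awk '$1 != NR - 1 {bad = 1} END {if (bad) print "non-sequential"; else print "sequential"}'
# prints: sequential
```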

Root Cause

  • Creating a new ruleset that is not numbered consecutively can cause problems such as the overall PG count not increasing for the cluster.

  • No other problems have been observed so far, but other side effects of non-sequential ruleset numbering cannot be ruled out.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.