Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster
Environment
All versions of Red Hat Ceph Storage
Issue
Note:
- This is the most intensive process that can be performed on a Ceph cluster, and can have drastic performance impact if not done in a slow and methodical fashion.
- Once the data starts moving for a chunk of Placement Groups (PGs) (in the pgp_num increase section), it cannot be stopped or reversed and must be allowed to complete.
- It is advised that this process be performed off-hours, and all clients alerted to the potential performance impact well ahead of time.
Overview:
Having proper Placement Group (PG) count is a critical part of ensuring top performance and best data distribution in your Ceph cluster.
- The Ceph PG calc tool should be referenced for optimal values.
- Care should be taken to maintain between 100 and 200 PGs per OSD ratio as detailed in the Ceph PG calc tool.
- The current PG count per OSD can be viewed in the PGS column of the ceph osd df tree command.
- Increasing the PG count is only required if you expand your cluster with more OSDs, such that the ratio drops to or below 100 PGs per OSD, or if the initial PG count was not properly planned.
- Because this operation is so intensive, plan carefully: if a PG increase is needed, increase to a level that will also cover any foreseeable cluster expansion, so the process does not need to be repeated.
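As a rough illustration of the sizing logic behind the Ceph PG calc tool, the per-pool pg_num can be approximated as (target PGs per OSD x OSD count x the pool's share of data) / replica size, rounded to a power of two. The sketch below is a simplified approximation, not the tool itself; the real calculator's rounding rules are more nuanced, so always confirm values with the PG calc tool.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of 2 greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def suggested_pg_num(osd_count: int, target_pgs_per_osd: int,
                     percent_data: float, replica_size: int) -> int:
    """Approximate the PG calc sizing formula:
    (target PGs/OSD * OSD count * pool's data share) / replica size,
    rounded up here to the next power of two for simplicity."""
    raw = (target_pgs_per_osd * osd_count * percent_data) / replica_size
    return next_power_of_two(max(1, round(raw)))

# Example: 20 OSDs, targeting 100 PGs per OSD, a single pool holding
# all data (share = 1.0), 3-way replication:
# (100 * 20 * 1.0) / 3 ~= 667, rounded up -> 1024
print(suggested_pg_num(20, 100, 1.0, 3))
```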
Resolution
Note for Red Hat Ceph Storage (RHCS) versions 4.x and 5.x
RHCS 4.x and 5.x do not require the pgp_num value to be set; this is done automatically by ceph-mgr. For RHCS 4.x and 5.x, only pg_num needs to be incremented for the necessary pools. The final value of pg_num should be a power of 2 (for example, 8, 16 or 32).
CAUTION:
Please ensure the cluster is in a HEALTH_OK state (or possibly HEALTH_WARN due to too few PGs) prior to commencing.
The PG increase process takes place in two major steps, each of which will be detailed below.
- Increase the pg_num value in small increments until the desired PG count is reached.
- In this part of the process, the cluster creates the new PGs, but does not enable them for data placement.
- Increase the pgp_num value in small increments until the desired PG count is reached.
- In this part of the process, you are increasing the subset of previously created PGs that will be used for data placement.
- Data will move during this part of the evolution.
Note:
- For RHCS 4.x and later, the pgp_num value is increased automatically by the Ceph Manager (MGR) process.
- The PG count settings (pg_num and pgp_num) are configured per-pool.
- Each pool which has a low PG count must be adjusted separately.
- It would be prudent to adjust the smaller, less active pools first to get a feel for how the cluster will react during the process.
- Due to the amount of data movement, and potential client IO impact, it is recommended to throttle the cluster's backfill and recovery values to minimize client IO impact.
- This will cause the overall process to take longer as data will not move as quickly.
- It is also advised to disable scrub and deep scrub operations during the process to limit the potentially compounding IO load.
- Increasing pg_num or pgp_num by 1 or 2 each time may generate the Ceph warning pool(s) have non-power-of-two. This can be safely ignored, as the final value will be a power of 2.
Recommended initial throttle values:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
# ceph osd set noscrub
# ceph osd set nodeep-scrub
- The injectargs command updates the settings on all running OSD processes in the cluster.
- These values can be increased if faster processing is desired at the expense of client IO performance.
The default values can be restored with:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
Begin the Process:
- Increase the pg_num value in small increments until the desired PG count is reached.
a. Determine your starting PG step size.
- It is recommended to keep this very low at first, then increase the step size to a level which is comfortable for your cluster.
- The final pg_num value should be a power of 2.
- pg_num should be incremented by a small value at first to determine how the cluster handles the change. Start low (4, 8, 16), and increase the step size as the cluster impact is determined.
- The optimal step size will vary based on pool size, OSD count, and client IO load, and should strike a comfortable trade-off between expedited progress and client impact.
b. Increment the pool pg_num:
# ceph osd pool set <pool_name> pg_num <new_value>
c. Monitor ceph -s
- PGs will move from creating to active+clean state.
- Wait for ALL PGs to exit the creating state.
d. Once all PGs are out of creating state, repeat steps 1.a through 1.c until the desired PG count is reached.
- For RHCS 3.x and earlier, increase the pgp_num value in small increments until the desired PG count is reached.
a. Determine your starting PG step size.
- It is recommended to keep this very low at first, then increase the step size to a level which is comfortable for your cluster.
- The final pgp_num value should be a power of 2.
- pgp_num should be incremented by a small value at first to determine how the cluster handles the change. Start low (4, 8, 16), and increase the step size as the cluster impact is determined.
- The optimal step size will vary based on pool size, OSD count, and client IO load, and should strike a comfortable trade-off between expedited progress and client impact.
b. Increment the pool pgp_num:
# ceph osd pool set <pool_name> pgp_num <new_value>
c. Monitor ceph -s
- PGs will move through various states, possibly including peering, wait_backfill, backfilling, recovering, etc., and will eventually reach the active+clean state as the cluster data reaches its new optimal placement.
- Wait for ALL PGs to be active+clean.
d. Once all PGs are in the active+clean state, repeat steps 2.a through 2.c until the desired PG count is reached.
Repeat steps 1.a->1.c and 2.a->2.c for each pool needing to have the PG count increased.
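The incremental loop in steps 1 and 2 amounts to walking the pool from its current value to the target in bounded steps. A minimal sketch of that schedule is below; each emitted value would be applied with ceph osd pool set <pool_name> pg_num <value> (and pgp_num on RHCS 3.x), waiting for the cluster to settle before applying the next. The function name and step sizes are illustrative, not part of Ceph.

```python
def pg_increase_schedule(current: int, target: int, step: int):
    """Yield intermediate pg_num values from current up to target,
    moving by at most `step` per iteration. The final value is always
    exactly target, which should itself be a power of two."""
    value = current
    while value < target:
        value = min(value + step, target)
        yield value

# Example: grow a pool from 64 to 256 PGs in steps of 32.
print(list(pg_increase_schedule(64, 256, 32)))
# [96, 128, 160, 192, 224, 256]
```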
Once all operations are complete (or when suspending operations for a specific off-hours period), ensure all previously adjusted throttle values are returned to your desired settings.
The default values can be restored with:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
Notes on this process:
- Any PG in a peering state will not serve IO requests.
- Thus, during this peering time, client IO is likely to be impacted, and slow requests may be logged (any request taking more than 30 seconds to complete).
- After pgp_num is increased, each OSD will re-evaluate the contents of all PGs for the incremented pool to ensure every object is in the proper location given the new PG count.
- Due to the need to re-evaluate the contents of every PG in the pool, even with the backfill and recovery throttles in place, client IO can experience performance degradation.
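After finishing, it is worth confirming that every adjusted pool ended up with a power-of-two pg_num and a matching pgp_num. A small sketch of such a sanity check is below; the input dict is hypothetical and would be filled in by hand from the output of ceph osd pool ls detail.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (bit-trick check)."""
    return n > 0 and (n & (n - 1)) == 0

def check_pool_pgs(pools: dict) -> list:
    """Given {pool_name: (pg_num, pgp_num)}, e.g. transcribed from
    `ceph osd pool ls detail`, return warnings for any pool whose
    final values look wrong."""
    warnings = []
    for name, (pg_num, pgp_num) in pools.items():
        if not is_power_of_two(pg_num):
            warnings.append(f"{name}: pg_num {pg_num} is not a power of two")
        if pgp_num != pg_num:
            warnings.append(f"{name}: pgp_num {pgp_num} != pg_num {pg_num}")
    return warnings

# Example with one correct pool and one that still needs work:
print(check_pool_pgs({"rbd": (256, 256), "cephfs_data": (200, 128)}))
```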