Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster
Environment
All versions of Red Hat Ceph Storage
Issue
Note:
- This is the most intensive process that can be performed on a Ceph cluster, and can have drastic performance impact if not done in a slow and methodical fashion.
- Once the data starts moving for a chunk of Placement Groups (PGs) (in the pgp_num increase section), it cannot be stopped or reversed and must be allowed to complete.
- It is advised that this process be performed off-hours, and all clients alerted to the potential performance impact well ahead of time.
Overview:
Having proper Placement Group (PG) count is a critical part of ensuring top performance and best data distribution in your Ceph cluster.
- The Ceph PG calc tool should be referenced for optimal values.
- Care should be taken to maintain between 100 and 200 PGs per OSD ratio as detailed in the Ceph PG calc tool.
- The current PG count per OSD can be viewed in the PGS column of the ceph osd df tree command.
- Increasing the PG count is only required if you expand your cluster with more OSDs, such that the ratio drops to or below 100 PGs per OSD, or if the initial PG count was not properly planned.
- Because this operation is so intensive, plan carefully: if a PG increase is needed, increase to a level that will also cover any foreseeable cluster expansion, so the process does not need to be repeated.
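As a rough illustration of the sizing logic behind the Ceph PG calc tool, the per-pool pg_num can be approximated as (target PGs per OSD x OSD count x the pool's share of data) / replica size, rounded to a power of two. The sketch below is a simplified approximation, not the tool itself; the real calculator's rounding rules are more nuanced, so always confirm values with the PG calc tool.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of 2 greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def suggested_pg_num(osd_count: int, target_pgs_per_osd: int,
                     percent_data: float, replica_size: int) -> int:
    """Approximate the PG calc sizing formula:
    (target PGs/OSD * OSD count * pool's data share) / replica size,
    rounded up here to the next power of two for simplicity."""
    raw = (target_pgs_per_osd * osd_count * percent_data) / replica_size
    return next_power_of_two(max(1, round(raw)))

# Example: 20 OSDs, targeting 100 PGs per OSD, a single pool holding
# all data (share = 1.0), 3-way replication:
# (100 * 20 * 1.0) / 3 ~= 667, rounded up -> 1024
print(suggested_pg_num(20, 100, 1.0, 3))
```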
Resolution
Note for Red Hat Ceph Storage (RHCS) versions 4.x and 5.x
RHCS 4.x and 5.x do not require the pgp_num value to be set; this is done automatically by ceph-mgr. For RHCS 4.x and 5.x, only pg_num needs to be incremented for the necessary pools. The final value of pg_num should be a power of 2 (for example, 8, 16 or 32).
CAUTION:
Please ensure the cluster is in a HEALTH_OK state (or possibly HEALTH_WARN due to too few PGs) prior to commencing.
The PG increase process takes place in two major steps, each of which will be detailed below.
- Increase the pg_num value in small increments until the desired PG count is reached.
- In this part of the process, the cluster creates the new PGs, but does not enable them for data placement.
- Increase the pgp_num value in small increments until the desired PG count is reached.
- In this part of the process, you are increasing the subset of previously created PGs that will be used for data placement.
- Data will move during this part of the evolution.
Note:
- For RHCS 4.x and later, the pgp_num value is increased automatically by the Ceph Manager (MGR) process.
- The PG count settings (pg_num and pgp_num) are configured per-pool.
- Each pool which has a low PG count must be adjusted separately.
- It would be prudent to adjust the smaller, less active pools first to get a feel for how the cluster will react during the process.
- Due to the amount of data movement, and potential client IO impact, it is recommended to throttle the cluster's backfill and recovery values to minimize client IO impact.
- This will cause the overall process to take longer as data will not move as quickly.
- It is also advised to disable scrub and deep scrub operations during the process to limit the potentially compounding IO load.
- Increasing pg_num or pgp_num by 1 or 2 each time may generate the Ceph warning pool(s) have non-power-of-two. This can be safely ignored, as the final value will be a power of 2.
Recommended initial throttle values:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
# ceph osd set noscrub
# ceph osd set nodeep-scrub
- The injectargs command updates the settings on all running OSD processes in the cluster.
- These values can be increased if faster processing is desired at the expense of client IO performance.
The default values can be restored with:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
Begin the Process:
- Increase the pg_num value in small increments until the desired PG count is reached.
a. Determine your starting PG step size.
- It is recommended to keep this very low at first, then increase the step size to a level which is comfortable for your cluster.
- The final pg_num value should be a power of 2.
- pg_num should be incremented by a small value at first to determine how the cluster handles the change. Start low (4, 8, 16), and increase the step size as the cluster impact is determined.
- The optimal step size will vary based on pool size, OSD count, and client IO load, and should strike a comfortable trade-off between expedited progress and client impact.
b. Increment the pool pg_num:
# ceph osd pool set <pool_name> pg_num <new_value>
c. Monitor ceph -s
- PGs will move from creating to active+clean state.
- Wait for ALL PGs to exit the creating state.
d. Once all PGs are out of creating state, repeat steps 1.a through 1.c until the desired PG count is reached.
- For RHCS 3.x and earlier, increase the pgp_num value in small increments until the desired PG count is reached.
a. Determine your starting PG step size.
- It is recommended to keep this very low at first, then increase the step size to a level which is comfortable for your cluster.
- The final pgp_num value should be a power of 2.
- pgp_num should be incremented by a small value at first to determine how the cluster handles the change. Start low (4, 8, 16), and increase the step size as the cluster impact is determined.
- The optimal step size will vary based on pool size, OSD count, and client IO load, and should strike a comfortable trade-off between expedited progress and client impact.
b. Increment the pool pgp_num:
# ceph osd pool set <pool_name> pgp_num <new_value>
c. Monitor ceph -s
- PGs will move through various states, possibly including peering, wait_backfill, backfilling, recovering, etc., and will eventually reach the active+clean state as the cluster data reaches its new optimal placement.
- Wait for ALL PGs to be active+clean.
d. Once all PGs are in the active+clean state, repeat steps 2.a through 2.c until the desired PG count is reached.
Repeat steps 1.a->1.c and 2.a->2.c for each pool needing to have the PG count increased.
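The incremental loop in steps 1 and 2 amounts to walking the pool from its current value to the target in bounded steps. A minimal sketch of that schedule is below; each emitted value would be applied with ceph osd pool set <pool_name> pg_num <value> (and pgp_num on RHCS 3.x), waiting for the cluster to settle before applying the next. The function name and step sizes are illustrative, not part of Ceph.

```python
def pg_increase_schedule(current: int, target: int, step: int):
    """Yield intermediate pg_num values from current up to target,
    moving by at most `step` per iteration. The final value is always
    exactly target, which should itself be a power of two."""
    value = current
    while value < target:
        value = min(value + step, target)
        yield value

# Example: grow a pool from 64 to 256 PGs in steps of 32.
print(list(pg_increase_schedule(64, 256, 32)))
# [96, 128, 160, 192, 224, 256]
```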
Once all operations are complete (or when suspending operations for a specific off-hours period), ensure all previously adjusted throttle values are returned to your desired settings.
The default values can be restored with:
# ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
Notes on this process:
- Any PG in a peering state will not serve IO requests.
- Thus, during this peering time, client IO is likely to be impacted, and slow requests may be logged (any request taking more than 30 seconds to complete).
- After pgp_num is increased, each OSD will re-evaluate the contents of all PGs for the incremented pool to ensure every object is in the proper location given the new PG count.
- Due to the need to re-evaluate the contents of every PG in the pool, even with the backfill and recovery throttles in place, client IO can experience performance degradation.
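After finishing, it is worth confirming that every adjusted pool ended up with a power-of-two pg_num and a matching pgp_num. A small sketch of such a sanity check is below; the input dict is hypothetical and would be filled in by hand from the output of ceph osd pool ls detail.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (bit-trick check)."""
    return n > 0 and (n & (n - 1)) == 0

def check_pool_pgs(pools: dict) -> list:
    """Given {pool_name: (pg_num, pgp_num)}, e.g. transcribed from
    `ceph osd pool ls detail`, return warnings for any pool whose
    final values look wrong."""
    warnings = []
    for name, (pg_num, pgp_num) in pools.items():
        if not is_power_of_two(pg_num):
            warnings.append(f"{name}: pg_num {pg_num} is not a power of two")
        if pgp_num != pg_num:
            warnings.append(f"{name}: pgp_num {pgp_num} != pg_num {pg_num}")
    return warnings

# Example with one correct pool and one that still needs work:
print(check_pool_pgs({"rbd": (256, 256), "cephfs_data": (200, 128)}))
```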