Ceph/ODF: Slow backfill and slow scrub/deep-scrub under mClock I/O scheduler.
Environment
- Red Hat OpenShift Container Platform (OCP) 4.x
- Red Hat OpenShift Container Storage (OCS) 4.x
- Red Hat OpenShift Data Foundation (ODF) 4.x
- Red Hat Ceph Storage (RHCS) 6.x
- Red Hat Ceph Storage (RHCS) 7.x
- Red Hat Ceph Storage (RHCS) 8.x
- Ceph Object Storage Daemon (OSD)
Issue
- Slow backfill, slow scrub/deep-scrub or slow snaptrims with the mClock I/O scheduler.
- Very low or very high values of osd_mclock_max_capacity_iops_hdd or osd_mclock_max_capacity_iops_ssd for certain OSDs. See the Diagnostic Steps section for an example.
Resolution
- The first step is to check which mClock profile is in use. The default is balanced; if more performance for recovery or maintenance operations is desired, the profile can be changed to high_recovery_ops. The cluster should be monitored to ensure that client operations do not suffer while the high_recovery_ops profile is active. (For completeness, high_client_ops can also be configured if, on the contrary, client operations should be favored over recovery operations.) See the documentation [1] for more details on mClock profiles and how to change them.
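As a short sketch, the active profile can be checked and, if needed, switched with the following commands (osd.0 is only an example daemon used to read back the running value):
# ceph config get osd osd_mclock_profile
# ceph config set osd osd_mclock_profile high_recovery_ops
# ceph config show osd.0 osd_mclock_profile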
- If maintenance and backfill operations still do not run at the expected speed, check the mClock configuration settings for maximum IOPS capacity per OSD (see the Diagnostic Steps section for an example of mis-identified capacities). Newer versions (see the Artifacts table) contain a fix that correctly sets default values when the benchmark is unable to collect realistic values. If your cluster runs older versions, or if your devices have been mis-identified (SSD as HDD or vice versa), configure osd_mclock_max_capacity_iops_ssd or osd_mclock_max_capacity_iops_hdd manually with the correct maximum IOPS capacity (see documentation [3]). If proper benchmarking cannot be done (the server is under too much load, or the benchmark results are still unrealistic), set those values to their defaults.
ⓘ If your devices have been wrongly detected as HDD or SSD, the configuration options below can provide immediate relief by setting the corresponding maximum capacities, but ensure that the drive type is properly detected by the OS (a quick check is sketched below), because other performance-impacting OSD settings are also derived from the detected device class. See KCS Ceph solution 3937321 on how to fix this in a standalone Ceph environment and KCS ODF solution 6547891 on how to fix it in ODF.
ⓘ Currently the default is 315 for HDD and 21500 for SSD (see the Diagnostic Steps section for reviewing the default values in your cluster).
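A sketch of checking how a device was detected before overriding any capacity value; osd.12 is used purely as an example ID:
# ceph osd tree | grep -E 'CLASS|osd\.12'
# ceph osd metadata 12 | grep rotational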
- Configuring the correct mClock settings for all OSDs. Replace SSD_MAX_IOPS and HDD_MAX_IOPS with the determined values:
# for OSD in $(ceph osd ls); do ceph config rm osd.${OSD} osd_mclock_max_capacity_iops_ssd; sleep 0.2; done
# for OSD in $(ceph osd ls); do ceph config rm osd.${OSD} osd_mclock_max_capacity_iops_hdd; sleep 0.2; done
# ceph config set osd osd_mclock_max_capacity_iops_ssd SSD_MAX_IOPS
# ceph config set osd osd_mclock_max_capacity_iops_hdd HDD_MAX_IOPS
- Configuring the correct mClock settings for individual OSDs. Replace SSD_MAX_IOPS and HDD_MAX_IOPS, and use whichever of the following commands matches the device class of the OSD:
# ceph config set osd.OSD-ID osd_mclock_max_capacity_iops_ssd SSD_MAX_IOPS
# ceph config set osd.OSD-ID osd_mclock_max_capacity_iops_hdd HDD_MAX_IOPS
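To confirm that an override took effect, the value can be read back; a sketch, with osd.12 as an example ID:
# ceph config get osd.12 osd_mclock_max_capacity_iops_hdd
# ceph config show osd.12 osd_mclock_max_capacity_iops_hdd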
- Configuring the mClock profile to high_client_ops to ensure client I/O is always prioritized. Depending on the situation, the profile can also be set to balanced or high_recovery_ops:
# ceph config set osd osd_mclock_profile high_client_ops
However, the balanced profile does have the benefit of providing more IOPS to background and maintenance tasks when those IOPS are not needed for client operations. Proper testing should be done to see whether running with high_client_ops provides an actual benefit over the balanced profile.
Correcting those values should provide immediate relief.
- For HDD OSDs, see documentation [2], "Shard configuration for HDD OSDs when using mClock", and make sure the settings in the table below are correctly applied (a sketch of applying them follows the table).
ⓘ Changing the shard and threads-per-shard configuration requires an OSD restart to take effect.
⚠ This configuration change must only be applied to actual HDD OSDs. Do not configure these settings for SSDs, NVMe devices, or devices wrongly detected as HDD by the operating system.
| Parameter | Setting |
|----------------------------------|---------|
| osd_op_num_shards_hdd | 1 |
| osd_op_num_threads_per_shard_hdd | 5 |
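A sketch of applying the table above cluster-wide (the _hdd variants are only consulted for OSDs that Ceph treats as rotational; restart the OSDs afterwards for the change to take effect):
# ceph config set osd osd_op_num_shards_hdd 1
# ceph config set osd osd_op_num_threads_per_shard_hdd 5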
- Perceptions of slow backfilling.
  - A recovery operation (caused by some kind of hardware issue) is given high priority because data is at risk. This is true across all current mClock profiles. The high_recovery_ops profile can be used to improve recovery rates a bit further.
  - A backfill operation (caused by an OSD being set out and subsequently set back in, scale-out operations, or PG count adjustments) is different from a recovery operation (caused by some kind of hardware issue). Data is moved to new locations in addition to all current copies, so the data is not at risk. With mClock you can therefore expect backfills to be generally slower compared to degraded recovery operations. This may cause customer concerns, but it is important to know that the data is not at risk, and the lower priority for backfill ensures that client operations are not impacted during the process.
To tune the backfilling for either case, the following options are available (see the sketch after this list):
- Use the high_recovery_ops mClock profile if you are backfilling missing data.
- If higher rates are still desired, set osd_mclock_override_recovery_settings to true and then override osd_max_backfills (for backfilling with data not at risk) or osd_recovery_max_active_[hdd|ssd] (for backfilling missing data) with higher values.
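A hedged sketch of such an override; the values 3 and 5 are illustrative only, and client latency should be monitored while they are raised:
# ceph config set osd osd_mclock_override_recovery_settings true
# ceph config set osd osd_max_backfills 3
# ceph config set osd osd_recovery_max_active_hdd 5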
Root Cause
The parameters osd_mclock_max_capacity_iops_ssd and osd_mclock_max_capacity_iops_hdd may be set too low or too high, and those unrealistic values impact certain I/O operations on the OSDs. The parameters are set by a quick benchmark run on OSD startup when no osd_mclock_max_capacity_iops_[ssd|hdd] values are already set. Future versions will have a better way to benchmark an OSD and obtain more realistic IOPS capabilities for disk drives (see BZ #2370442 in the Artifacts section).
⚠ The following root cause comment refers only to actual HDD OSDs.
For HDD-based OSDs on versions below 8.0z2, 7.1z3, and 6.1z9, osd_op_num_shards_hdd and osd_op_num_threads_per_shard_hdd need to be reconfigured to provide more consistent performance [2].
Artifacts
Bugzilla / Jira
| Product | Link | Errata | Version | Comment |
|---|---|---|---|---|
| RHCS 8.0 | BZ #2294594 | RHBA-2025:2457 | 8.0z2 | New shard configuration for HDD clusters |
| RHCS 8.0 | BZ #2292517 | RHBA-2025:2457 | 8.0z2 | Configure defaults if benchmark provides unrealistic data |
| RHCS 7.1 | BZ #2299480 | RHBA-2025:1770 | 7.1z3 | New shard configuration for HDD clusters |
| RHCS 7.1 | BZ #2330755 | RHBA-2025:1770 | 7.1z3 | Configure defaults if benchmark provides unrealistic data |
| RHCS 6.1 | BZ #2299482 | RHSA-2025:4238 | 6.1z9 | New shard configuration for HDD clusters |
| RHCS 9.0 | BZ #2370442 | TBD | TBD | Reimplementation of capacity benchmarking |
Documentation Links
Diagnostic Steps
The IBM documentation [4] has additional troubleshooting scenarios for mClock. See the Ceph Upstream documentation [5] for a detailed configuration reference.
Example of unrealistic max capacity IOPS configured by the on-startup benchmark for OSDs
# sudo ceph config dump | grep -E "(WHO|mclock_max_capacity)"
WHO MASK LEVEL OPTION VALUE
osd.107 basic osd_mclock_max_capacity_iops_hdd 2.985434
osd.112 basic osd_mclock_max_capacity_iops_hdd 0.198104
osd.12 basic osd_mclock_max_capacity_iops_hdd 404.875447
osd.3 basic osd_mclock_max_capacity_iops_hdd 256.640426
osd.36 basic osd_mclock_max_capacity_iops_hdd 253.152337
osd.41 basic osd_mclock_max_capacity_iops_hdd 211.954762
osd.55 basic osd_mclock_max_capacity_iops_hdd 263.840410
osd.60 basic osd_mclock_max_capacity_iops_hdd 419.974024
osd.75 basic osd_mclock_max_capacity_iops_hdd 404.058225
osd.80 basic osd_mclock_max_capacity_iops_hdd 66.958426
osd.85 basic osd_mclock_max_capacity_iops_hdd 0.198044
Review of default settings for MAX Capacity
# sudo ceph config help osd_mclock_max_capacity_iops_ssd
osd_mclock_max_capacity_iops_ssd - Max random write IOPS capacity (at 4 KiB block size) to consider per OSD (for solid state media)
(float, basic)
Default: 21500.000000
Can update at runtime: true
Services: [osd]
# sudo ceph config help osd_mclock_max_capacity_iops_hdd
osd_mclock_max_capacity_iops_hdd - Max random write IOPS capacity (at 4KiB block size) to consider per OSD (for rotational media)
(float, basic)
Default: 315.000000
Can update at runtime: true
Services: [osd]
Determining the speed of your OSDs
# for X in $(ceph osd ls); do echo -n "Look at osd.${X}: "; ceph tell osd.${X} cache drop; ceph tell osd.${X} bench 12288000 4096 4194304 100 2>&1 | grep iops; done
To test a subset of OSDs, build a list (/tmp/OSDs) with one OSD number per line and run this benchmark
# for X in $(cat /tmp/OSDs); do echo -n "Look at osd.${X}: "; ceph tell osd.${X} cache drop; ceph tell osd.${X} bench 12288000 4096 4194304 100 2>&1 | grep iops; done
Based on the results, and given that the system is in use during testing, round up by 5000 and apply that number as the value for osd_mclock_max_capacity_iops_ssd. We do not want a value that is so low that it leads to mClock "cutting the tall grass", nor a ridiculously large number that would misconfigure mClock.
The same is true for OSDs backed by HDD devices, but the scale is much different. Round up by 100 and apply that number as the value for osd_mclock_max_capacity_iops_hdd. Support has never seen an HDD that warranted a value of more than 800.
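As a hypothetical worked example of the guidance above: if the bench output for osd.12 (an SSD OSD) reports roughly 18000 IOPS, rounding up by 5000 gives 23000, which would then be applied as:
# ceph config set osd.12 osd_mclock_max_capacity_iops_ssd 23000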