Ceph/ODF: Slow backfill and slow scrub/deep-scrub under mClock I/O scheduler.

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (OCP) 4.x
  • Red Hat OpenShift Container Storage (OCS) 4.x
  • Red Hat OpenShift Data Foundation (ODF) 4.x
  • Red Hat Ceph Storage (RHCS) 6.x
  • Red Hat Ceph Storage (RHCS) 7.x
  • Red Hat Ceph Storage (RHCS) 8.x
  • Ceph Object Storage Daemon (OSD)

Issue

  • Slow backfill, slow scrub/deep-scrub, or slow snaptrim operations with the mClock I/O scheduler.
  • Very low or very high values of osd_mclock_max_capacity_iops_hdd or osd_mclock_max_capacity_iops_ssd for certain OSDs. See the Diagnostic Steps section for an example.

Resolution

  1. The first step is to check which mClock profile is in use. The default is balanced; if more performance for recovery or maintenance operations is desired, the profile can be changed to high_recovery_ops. Monitor the cluster to ensure that client operations do not suffer while the high_recovery_ops profile is active. (For completeness, high_client_ops can be configured if, conversely, client operations should be favored over recovery operations.) See the documentation [1] for more details on mClock profiles and how to change them.
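    For reference, the profile can be inspected and changed with commands like the following (a sketch; osd.0 is an arbitrary example daemon, and the commands assume a reachable cluster with an admin keyring):

    ```shell
    # Show the mClock profile currently in effect for one OSD (osd.0 as an example)
    ceph config show osd.0 osd_mclock_profile

    # Switch all OSDs to high_recovery_ops for the maintenance window
    ceph config set osd osd_mclock_profile high_recovery_ops

    # Revert to the default balanced profile afterwards
    ceph config set osd osd_mclock_profile balanced
    ```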

  2. If maintenance and backfill operations still do not run at the expected speed, check the mClock configuration settings for the maximum IOPS capacity per OSD (see the Diagnostic Steps section for an example of misidentified capacities). Newer versions (see the Artifacts table) contain a fix that correctly sets default values when the benchmark is unable to collect realistic values. If your cluster runs older versions, or if your devices have been misidentified (SSD as HDD or vice versa), configure osd_mclock_max_capacity_iops_ssd or osd_mclock_max_capacity_iops_hdd manually with the correct maximum IOPS capacity (see documentation [3]). If proper benchmarking cannot be done (server load is too high, or benchmarks are still unrealistic), set those values to their defaults.

    If your devices have been wrongly detected as HDD or SSD, the configuration options below provide immediate relief by setting the corresponding maximum capacities. However, ensure that the drive type is properly detected by the operating system, because other performance-impacting OSD settings are also derived from the detected device class. See KCS solution 3937321 on how to fix this in a standalone Ceph environment and KCS solution 6547891 on how to fix it in ODF.

    Currently the default is 315 for HDD and 21500 for SSD (see the Diagnostic Steps section on reviewing the default values in your cluster).

    1. Configuring the correct mClock settings for all OSDs. Replace SSD_MAX_IOPS and HDD_MAX_IOPS with your measured values:

      # for OSD in $(ceph osd ls); do ceph config rm osd.${OSD} osd_mclock_max_capacity_iops_ssd; sleep 0.2; done
      # for OSD in $(ceph osd ls); do ceph config rm osd.${OSD} osd_mclock_max_capacity_iops_hdd; sleep 0.2; done
      # ceph config set osd osd_mclock_max_capacity_iops_ssd SSD_MAX_IOPS
      # ceph config set osd osd_mclock_max_capacity_iops_hdd HDD_MAX_IOPS
      
    2. Configuring the correct mClock settings for individual OSDs. Replace SSD_MAX_IOPS and HDD_MAX_IOPS with your measured values:

      Use either of the following commands to set the max capacity according to the device class of the OSD:

      # ceph config set osd.OSD-ID osd_mclock_max_capacity_iops_ssd SSD_MAX_IOPS
      # ceph config set osd.OSD-ID osd_mclock_max_capacity_iops_hdd HDD_MAX_IOPS
      
    3. Configuring the mClock profile to high_client_ops to ensure client I/O is always prioritized:

      Depending on the situation, the profile can also be set to balanced or high_recovery_ops.

      # ceph config set osd  osd_mclock_profile high_client_ops
      

      However, the balanced profile has the benefit of providing more IOPS to background and maintenance tasks when those IOPS are not needed for client operations. Proper testing should be done to determine whether running with high_client_ops provides actual benefits over the balanced profile.

    Correcting those values should provide immediate relief.
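    After making changes, the effective values can be verified with commands like the following (a sketch; osd.0 is an arbitrary example daemon):

    ```shell
    # List any per-OSD max-capacity overrides still present in the config database
    ceph config dump | grep mclock_max_capacity

    # Show the value an individual OSD is actually using (osd.0 as an example)
    ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
    ```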

  3. For HDD OSDs, see documentation [2] for the shard configuration to use with mClock, and make sure those settings are correctly applied.

    Changing the shard and threads per shard configuration does need an OSD restart to take effect.

    This configuration change must only be executed for actual HDD OSDs. Do not configure this setting for SSDs, NVMe devices, or devices wrongly detected as HDD by the operating system.

    | Parameter | Setting |
    |----------------------------------|---------|
    | osd_op_num_shards_hdd | 1 |
    | osd_op_num_threads_per_shard_hdd | 5 |
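    Assuming your cluster runs a version below the fixed releases in the Artifacts table, the settings from the table above can be applied as follows (a sketch; the restart command assumes a cephadm/orchestrator-managed cluster and uses osd.3 as an arbitrary example daemon):

    ```shell
    # Apply the HDD shard settings from the table above (actual HDD OSDs only)
    ceph config set osd osd_op_num_shards_hdd 1
    ceph config set osd osd_op_num_threads_per_shard_hdd 5

    # The new shard layout only takes effect after an OSD restart, e.g. per daemon:
    ceph orch daemon restart osd.3
    ```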

     

  4. Perceptions of slow backfilling.

    • A recovery operation (caused by some kind of hardware issue) is given high priority because Data Is At Risk. This is true across all current mClock profiles. The high_recovery_ops profile can be used to improve recovery rates a bit further.

    • A backfill operation (caused by an OSD being set out and subsequently set back in, scale-out operations, or PG count adjustments) is different from a recovery operation (caused by some kind of hardware issue). Data is moved to new locations while all current copies remain intact, so the Data Is Not At Risk. Therefore, with mClock, you can expect backfills to be generally slower than degraded recovery operations. This may cause customer concern, but it is important to know that Data Is Not At Risk; the lower backfill priority ensures that client operations are not impacted during the process.

    To tune the backfilling for either case there are the following options available:

    1. Use the high_recovery_ops mClock profile if you are backfilling missing data.
    2. If higher rates are still desired, set osd_mclock_override_recovery_settings to true, then override osd_max_backfills (for backfilling with data not at risk) or osd_recovery_max_active_[hdd|ssd] (for backfilling missing data) with higher values.
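    The override described in step 2 can be sketched as follows (the numeric values are illustrative only; monitor client I/O while these overrides are active and remove them when the backfill completes):

    ```shell
    # Allow manual override of the mClock-managed recovery/backfill limits
    ceph config set osd osd_mclock_override_recovery_settings true

    # Example: raise concurrent backfills (data not at risk) ...
    ceph config set osd osd_max_backfills 3

    # ... or raise active recovery operations for HDD OSDs (missing data)
    ceph config set osd osd_recovery_max_active_hdd 10
    ```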

Root Cause

The parameters osd_mclock_max_capacity_iops_ssd and osd_mclock_max_capacity_iops_hdd may be set too low or too high, and those unrealistic values impact certain I/O operations on OSDs. The parameters are set by a quick benchmark run at OSD startup when no osd_mclock_max_capacity_iops_[ssd|hdd] values are already set. Future versions will have a better way to benchmark an OSD and obtain more realistic IOPS capabilities for disk drives (see BZ #2370442 in the Artifacts section).

The following root cause comment applies only to actual HDD OSDs.

For HDD-based OSDs on versions below 8.0z2, 7.1z3, and 6.1z9, osd_op_num_shards_hdd and osd_op_num_threads_per_shard_hdd need to be reconfigured to provide more consistent performance [2].

Artifacts

Bugzilla / Jira

| Product  | Link        | Errata         | Version | Comment                                                   |
|----------|-------------|----------------|---------|-----------------------------------------------------------|
| RHCS 8.0 | BZ #2294594 | RHBA-2025:2457 | 8.0z2   | New shard configuration for HDD clusters                  |
| RHCS 8.0 | BZ #2292517 | RHBA-2025:2457 | 8.0z2   | Configure defaults if benchmark provides unrealistic data |
| RHCS 7.1 | BZ #2299480 | RHBA-2025:1770 | 7.1z3   | New shard configuration for HDD clusters                  |
| RHCS 7.1 | BZ #2330755 | RHBA-2025:1770 | 7.1z3   | Configure defaults if benchmark provides unrealistic data |
| RHCS 6.1 | BZ #2299482 | RHSA-2025:4238 | 6.1z9   | New shard configuration for HDD clusters                  |
| RHCS 9.0 | BZ #2370442 | TBD            | TBD     | Reimplementation of capacity benchmarking                 |
| Reference | Documentation                                                      |
|-----------|--------------------------------------------------------------------|
| [1]       | Red Hat Ceph Storage Administration Guide, mClock                  |
| [2]       | Ceph upstream OSD shard configuration for HDD OSDs                 |
| [3]       | Ceph upstream mitigation of unrealistic OSD capacity / benchmarking |
| [4]       | IBM mClock Administration and Troubleshooting Guide                |
| [5]       | Ceph upstream mClock configuration reference                       |

Diagnostic Steps

The IBM documentation [4] has additional troubleshooting scenarios for mClock. See the Ceph Upstream documentation [5] for a detailed configuration reference.

Example of unrealistic max capacity IOPS values configured by the startup benchmark for OSDs

# sudo ceph config dump | grep -E "(WHO|mclock_max_capacity)"
WHO                         MASK  LEVEL     OPTION                                  VALUE
osd.107                           basic     osd_mclock_max_capacity_iops_hdd           2.985434
osd.112                           basic     osd_mclock_max_capacity_iops_hdd           0.198104
osd.12                            basic     osd_mclock_max_capacity_iops_hdd         404.875447
osd.3                             basic     osd_mclock_max_capacity_iops_hdd         256.640426
osd.36                            basic     osd_mclock_max_capacity_iops_hdd         253.152337
osd.41                            basic     osd_mclock_max_capacity_iops_hdd         211.954762
osd.55                            basic     osd_mclock_max_capacity_iops_hdd         263.840410
osd.60                            basic     osd_mclock_max_capacity_iops_hdd         419.974024
osd.75                            basic     osd_mclock_max_capacity_iops_hdd         404.058225
osd.80                            basic     osd_mclock_max_capacity_iops_hdd          66.958426
osd.85                            basic     osd_mclock_max_capacity_iops_hdd           0.198044
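As an illustration, a small filter like the following can flag suspicious per-OSD HDD values in such a config dump. The thresholds (below 50 or above 1000 IOPS for an HDD) and the sample lines are assumptions for demonstration, not support-endorsed limits.

```shell
# Print OSD name and capacity for HDD entries with implausible IOPS values
flag_unrealistic() {
  awk '/osd_mclock_max_capacity_iops_hdd/ && ($NF < 50 || $NF > 1000) { print $1, $NF }'
}

# Feed it sample "ceph config dump" lines (hypothetical data)
flag_unrealistic <<'EOF'
osd.107  basic  osd_mclock_max_capacity_iops_hdd  2.985434
osd.12   basic  osd_mclock_max_capacity_iops_hdd  404.875447
osd.85   basic  osd_mclock_max_capacity_iops_hdd  0.198044
EOF
```

Here only osd.107 and osd.85 are printed; osd.12 falls inside the plausible range and passes.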

Review of default settings for MAX Capacity

# sudo ceph config help osd_mclock_max_capacity_iops_ssd
osd_mclock_max_capacity_iops_ssd - Max random write IOPS capacity (at 4 KiB block size) to consider per OSD (for solid state media)
  (float, basic)
  Default: 21500.000000
  Can update at runtime: true
  Services: [osd]
# sudo ceph config help osd_mclock_max_capacity_iops_hdd
osd_mclock_max_capacity_iops_hdd - Max random write IOPS capacity (at 4KiB block size) to consider per OSD (for rotational media)
  (float, basic)
  Default: 315.000000
  Can update at runtime: true
  Services: [osd]

Determining the speed of your OSDs

# for X in $(ceph osd ls); do echo -n "Look at osd.${X}:    "; ceph tell osd.${X} cache drop; ceph tell osd.${X} bench 12288000 4096 4194304 100 2>&1 | grep iops; done

To test a subset of OSDs, build a list (/tmp/OSDs) with one OSD number per line and run this benchmark

# for X in $(cat /tmp/OSDs); do echo -n "Look at osd.${X}:    "; ceph tell osd.${X} cache drop; ceph tell osd.${X} bench 12288000 4096 4194304 100 2>&1 | grep iops; done

Based on the results, and the fact that the system is in use during testing, round up to the next multiple of 5000 and apply that number as the value for osd_mclock_max_capacity_iops_ssd. We do not want a value that is too low, which leads to mClock "cutting the tall grass", nor a ridiculously large number that would misconfigure mClock.

The same is true for OSDs backed by HDD devices, but the scale is much different. Round up to the next multiple of 100 and apply that number as the value for osd_mclock_max_capacity_iops_hdd. Support has never seen an HDD that warranted a value of more than 800.
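The rounding described above can be sketched as a small helper (the input figures are hypothetical bench results, not real measurements):

```shell
# Round a measured IOPS figure up to the next multiple of a step
# (5000 for SSD OSDs, 100 for HDD OSDs, per the guidance above)
round_up() { awk -v v="$1" -v s="$2" 'BEGIN { print int((v + s - 1) / s) * s }'; }

round_up 17342.76 5000   # SSD example → 20000
round_up 312.4 100       # HDD example → 400
```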


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.