Ceph: How to reduce scrub impact in a Red Hat Ceph Storage cluster
Environment
- Red Hat Ceph Storage (RHCS) 2.x
- Red Hat Ceph Storage (RHCS) 3.x
- Red Hat Ceph Storage (RHCS) 4.x
- Red Hat Ceph Storage (RHCS) 5.x
Issue
- Deep scrubs on a Red Hat Ceph Storage cluster can sometimes have a negative impact on the client I/O. How can it be controlled?
- Since object scrub is an essential part of the well-being of an RHCS/Ceph cluster, is it possible to run it manually at specific periods?
- Is it possible to disable object scrub altogether and run it manually?
- What are some of the methods available to reduce the impact of a deep-scrub in a Red Hat Ceph Storage cluster?
Resolution
- Scrubbing is an operation analogous to a filesystem check. It is performed on the objects in Placement Groups (PGs) and is an important part of maintaining data integrity within the Ceph cluster.
- There are two types of scrubs: a soft scrub and a deep scrub.
- A soft scrub happens every day, while a deep scrub happens every week.
- The scrub process generates a list of all objects in the PG being scrubbed, locks a chunk of objects (25 by default), and compares the copy on the primary OSD (in the acting set) with the copies present on the replicas. The lock prevents any I/O to the locked objects.
- The soft scrub checks the object size and its attributes, while the deep scrub reads the data and verifies the checksums to make sure the bits are indeed identical.
- Deep scrubs are I/O-intensive processes and may slow down client requests.
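To illustrate what a deep scrub verifies, the following sketch compares per-object checksums on the primary against the replicas. This is purely illustrative (the function name and data layout are hypothetical; Ceph performs this internally, per locked chunk, using its own checksumming):

```python
import hashlib

# Hypothetical illustration of a deep-scrub style comparison; not Ceph code.
def find_inconsistent(primary, replicas):
    """Return names of objects whose replica data does not match the primary."""
    bad = []
    for name, data in primary.items():
        digest = hashlib.sha256(data).hexdigest()
        if any(hashlib.sha256(r[name]).hexdigest() != digest for r in replicas):
            bad.append(name)
    return bad

primary = {"obj1": b"hello", "obj2": b"world"}
replicas = [{"obj1": b"hello", "obj2": b"world"},
            {"obj1": b"hello", "obj2": b"w0rld"}]   # bit flip on one replica
print(find_inconsistent(primary, replicas))          # ['obj2']
```

A soft scrub, by contrast, would only compare sizes and attributes, which is why it is far cheaper than reading and hashing every byte.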
Workarounds:
Depending on the scenario, any of the following methods can be used:
1. Set a time window within which the deep scrub is allowed to run.
NOTE:
- From RHCS 1.3 onwards, it is possible to set the hours between which a deep scrub may be triggered.
- These are controlled by the osd_scrub_begin_hour and osd_scrub_end_hour tunables.
- To view the current values on the OSDs in a node, use:
# for i in /var/run/ceph/ceph-osd*; do ceph daemon $i config show | egrep "osd_scrub_begin_hour|osd_scrub_end_hour"; done
- By default, osd_scrub_begin_hour and osd_scrub_end_hour are set to 0 and 24 respectively. This means the scrub can start at any hour of the day. Setting them to values between 0 and 24 will restrict scrub activity to those specific hours.
- This works for a wrap-around range as well. For example, setting osd_scrub_begin_hour to 18 and osd_scrub_end_hour to 5 allows the scrub to start any time between 06:00 PM and 05:00 AM the next day.
- You can use the ceph tell command to change these settings at runtime:
ceph tell osd.* injectargs "--osd-scrub-begin-hour 0"
ceph tell osd.* injectargs "--osd-scrub-end-hour 6"
To make the change persistent, add the same settings to ceph.conf on all the OSD nodes:
osd_scrub_begin_hour = 0
osd_scrub_end_hour = 6
2. Increase osd_scrub_max_interval to a higher value, and trigger the scrub manually during idle hours.
- osd_scrub_max_interval defaults to 7 days, which means a PG can be skipped from scrubbing for up to a maximum of seven days if the system load is high. Any PG that has gone longer than osd_scrub_max_interval without a scrub will be scrubbed irrespective of the system load.
- The scrub process will not start if the system load is at or above osd_scrub_load_threshold. This defaults to 0.5 and can be tuned as per requirements.
IMPORTANT:
- There is a risk that scrubs will start automatically once osd_scrub_max_interval is reached.
- Make sure the administrator is aware of the newly set value and manually scrubs the placement groups outside business hours or at times of low I/O, so as not to be caught off-guard when osd_scrub_max_interval hits.
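The load gate described above can be sketched as a simple comparison. This is a deliberately simplified illustration (the function name is hypothetical, and the actual OSD logic is more involved than a single comparison against the 1-minute load average):

```python
# Simplified illustration of the osd_scrub_load_threshold gate; not Ceph code.
# On a live host the input would come from the load average, e.g. os.getloadavg()[0].
def scrub_allowed(loadavg, threshold=0.5):
    """A new scrub may start only while the load is below the threshold."""
    return loadavg < threshold

print(scrub_allowed(0.3))   # True  -> a scheduled scrub may start
print(scrub_allowed(0.9))   # False -> scheduled scrubs are deferred
```

Note that a load exactly at the threshold also defers the scrub, matching the "at or above" wording above; only osd_scrub_max_interval overrides this gate.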
3. Disable the deep scrub process altogether and run it manually at a time when fewer clients are connected, or outside business hours.
IMPORTANT:
- Disabling deep scrub on a Ceph cluster is not recommended, since it is analogous to a filesystem check on a local filesystem and is needed to maintain data integrity.
- Disabling deep-scrub will also mark the cluster as HEALTH_WARN.
- In order to disable deep-scrub, use:
# ceph osd set nodeep-scrub
- Once disabled, make sure it's run manually at a convenient time, using:
# ceph pg deep-scrub <PG.NUM> // This executes the deep-scrub on the PG
OR
# ceph osd deep-scrub <OSD#> // This executes the deep-scrub on the OSD
- Disabling deep-scrub should not be permanent. To enable it again, use:
# ceph osd unset nodeep-scrub
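A manual maintenance window along these lines can be scripted. The sketch below (the helper name is hypothetical, and it assumes the PG IDs were collected beforehand, e.g. with `ceph pg ls`) only builds the command sequence an administrator would run, in the order described above:

```python
# Hypothetical helper: assemble the commands for a manual deep-scrub window.
def manual_deep_scrub_plan(pg_ids):
    cmds = ["ceph osd set nodeep-scrub"]           # stop automatic deep scrubs
    cmds += [f"ceph pg deep-scrub {pg}" for pg in pg_ids]
    cmds.append("ceph osd unset nodeep-scrub")     # re-enable once done
    return cmds

for cmd in manual_deep_scrub_plan(["1.0", "1.1", "2.3f"]):
    print(cmd)
```

Keeping the unset as the final step ensures the cluster does not stay in HEALTH_WARN after the maintenance window ends.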
IMPORTANT:
> Lowering the I/O priority of the scrub process will supplement the above points. Refer to the article at https://access.redhat.com/solutions/1479843 for details.
MORE INFORMATION
osd_scrub_begin_hour and osd_scrub_end_hour restrict the scrubs (both deep and soft) to specific hours in a day. These take a value between 0 and 24. These integers map to:
- 0 - 12:00AM to 01:00AM
- 1 - 01:00AM to 02:00AM
- 2 - 02:00AM to 03:00AM
.... and so on..
By default, these are set to 0 and 24 respectively, which means the scrub can start at any time. By restricting these values, a window of time can be specified; the scrub can start at any time within this window.
As an example, setting the start and end hour to 1 and 6 respectively does not mean that the scrub will start at 1:00 AM and stop at 06:00 AM. It only means the scrub is allowed to start at any time between 01:00 AM and 06:00 AM, and not outside that window.
The osd_scrub_end_hour will not stop an ongoing scrub that was started before osd_scrub_end_hour; it only means that new PGs will not be queued for scrubbing after osd_scrub_end_hour. Hence, any chunk of objects already queued or being scrubbed when osd_scrub_end_hour hits will still be serviced.
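The window semantics above, including the wrap-around case, can be sketched as follows (the function name is hypothetical; this checks only whether a scrub may start, not whether one keeps running):

```python
# Illustrative sketch of the scrub-window check; not the actual OSD code.
# A scrub may *start* inside the window; an ongoing scrub is not stopped by it.
def in_scrub_window(hour, begin, end):
    if begin == 0 and end == 24:
        return True                      # defaults: any hour is allowed
    if begin <= end:
        return begin <= hour < end       # plain window, e.g. begin=1, end=6
    return hour >= begin or hour < end   # wrap-around, e.g. begin=18, end=5

print(in_scrub_window(23, 18, 5))   # True  (inside the wrap-around window)
print(in_scrub_window(12, 18, 5))   # False (noon is outside 18:00-05:00)
print(in_scrub_window(3, 1, 6))     # True
```

This mirrors the examples above: a 18-to-5 window admits 11:00 PM but rejects noon, and a 1-to-6 window admits 3:00 AM.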
- If the system load is above the value of the tunable osd_scrub_load_threshold, the scrub does not start, even inside a window set by osd_scrub_begin_hour and osd_scrub_end_hour.
- However, if a PG has not been scrubbed for more than 7 days (set by the tunable osd_scrub_max_interval), it is scrubbed regardless of the system load or the scrub start/end hours.
- Hence, in some cases, the scrub may happen outside the window set by osd_scrub_begin_hour and osd_scrub_end_hour.
- Along with the above, there are a few more tunables which can affect how a scrub behaves:
* osd_scrub_sleep - A forced delay between scrub chunks (defaults to `0`).
* osd_scrub_chunk_min - The minimum number of objects a scrub locks at a time.
* osd_scrub_chunk_max - The maximum number of objects a scrub locks at a time.
- The object store is partitioned into chunks which end on hash boundaries. A chunk is the set of objects that are locked and scrubbed at once. Once a chunk of objects is locked, any I/O to those objects is blocked until the scrub of that chunk finishes.
- The scrub process locks osd_scrub_chunk_max (default 25) objects, scrubs them, releases the lock, and moves on to the next osd_scrub_chunk_max objects until it reaches the end of the PG. Once the scrub thread nears the end of the PG, it locks osd_scrub_chunk_min objects or fewer, depending on how many objects remain.
- osd_scrub_sleep is the time in seconds the scrub process waits between releasing one chunk and locking the next. By default this is set to 0, which means the scrub is continuous for as long as the OSD can read data. This is generally bad for other OSD workloads such as peering, backfilling, and rebalancing.
- In a normal environment, it is good to set osd_scrub_sleep to 0.1 or 0.2 and leave osd_scrub_chunk_min and osd_scrub_chunk_max at their default values.
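The chunking behaviour can be sketched as a simple loop (illustrative only; the function name is hypothetical, and in Ceph the chunk boundaries fall on hash boundaries rather than fixed slices):

```python
import time

# Illustrative chunked-scrub loop: lock up to chunk_max objects, scrub them,
# release the lock, then optionally sleep (osd_scrub_sleep) before the next chunk.
def scrub_pg(objects, chunk_max=25, scrub_sleep=0.0):
    chunks = 0
    for i in range(0, len(objects), chunk_max):
        chunk = objects[i:i + chunk_max]   # locked: client I/O to these waits
        assert len(chunk) <= chunk_max     # the final chunk may be smaller
        # ... compare sizes/attributes (soft) or read data and checksums (deep)
        chunks += 1
        if scrub_sleep:
            time.sleep(scrub_sleep)        # gives client I/O room to breathe
    return chunks

print(scrub_pg(list(range(60)), chunk_max=25))   # 3 chunks: 25 + 25 + 10
```

With scrub_sleep left at 0 the loop never pauses, which is why a nonzero osd_scrub_sleep (0.1 or 0.2) smooths the impact on concurrent client I/O.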
- Refer to the RHCS documentation on scrubbing.
WARNING:
In case an RBD pool has a lot of snapshots, there is a chance that more than osd_scrub_chunk_max objects are locked for scrubbing, and the scrub on that chunk takes a long time, which makes slow requests stack up and impacts client I/O.
This can happen when an object has a lot of snapshots/clones. The way the scrub code is written, all the clones along with the original object have to be scrubbed at the same time. Hence, the larger the number of clones of an object, the larger the number of objects that have to be locked for the scrub. Locking a large number of objects, and thereby preventing I/O to them, can make client I/O wait and pile up slow requests.
The fix, being a major change, would only be available in Jewel and most probably in Red Hat Ceph Storage 2.2, and won't be backported to Hammer.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1373653 for more details.
Root Cause
Deep scrubs can interfere with client I/O, and in some cases they need to be disabled and run manually.
The steps above describe some of the methods to work around this.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.