Ceph: How to reduce scrub impact in a Red Hat Ceph Storage cluster
Environment
- Red Hat Ceph Storage (RHCS) 2.x
- Red Hat Ceph Storage (RHCS) 3.x
- Red Hat Ceph Storage (RHCS) 4.x
- Red Hat Ceph Storage (RHCS) 5.x
Issue
- Deep scrubs on a Red Hat Ceph Storage cluster can sometimes have a negative impact on the client I/O. How can it be controlled?
- Since object scrub is an essential part of the well-being of an RHCS/Ceph cluster, is it possible to run it manually at specific periods?
- Is it possible to disable object scrub altogether and run it manually?
- What are some of the methods available to reduce the impact of a deep-scrub in a Red Hat Ceph Storage cluster?
Resolution
- Scrubbing is an operation analogous to a filesystem check. It is performed on the objects in Placement Groups (PGs) and is an important part of maintaining data integrity within the Ceph cluster.
- There are two types of scrubs: a soft scrub and a deep scrub.
- A soft scrub happens every day, while a deep scrub happens every week.
- The scrub process generates a list of all objects in the PG being scrubbed, locks a chunk of objects (25 by default), and compares the copy on the primary OSD (in the acting set) with the copies present on the replicas. The lock prevents any I/O to the locked objects.
- The soft scrub checks the object size and its attributes, while the deep scrub reads the data and verifies the checksums to make sure the bits are indeed identical.
- Deep scrubs are I/O-intensive processes and may slow down client requests.
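To illustrate what a deep scrub verifies, the following sketch compares per-object checksums on the primary against the replicas. This is purely illustrative (the function name and data layout are hypothetical; Ceph performs this internally, per locked chunk, using its own checksumming):

```python
import hashlib

# Hypothetical illustration of a deep-scrub style comparison; not Ceph code.
def find_inconsistent(primary, replicas):
    """Return names of objects whose replica data does not match the primary."""
    bad = []
    for name, data in primary.items():
        digest = hashlib.sha256(data).hexdigest()
        if any(hashlib.sha256(r[name]).hexdigest() != digest for r in replicas):
            bad.append(name)
    return bad

primary = {"obj1": b"hello", "obj2": b"world"}
replicas = [{"obj1": b"hello", "obj2": b"world"},
            {"obj1": b"hello", "obj2": b"w0rld"}]   # bit flip on one replica
print(find_inconsistent(primary, replicas))          # ['obj2']
```

A soft scrub, by contrast, would only compare sizes and attributes, which is why it is far cheaper than reading and hashing every byte.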
Workarounds:
Depending on the scenario, any of the following methods can be used:
1. Set a time window within which the deep scrub is allowed to run.
NOTE:
- From RHCS 1.3 onwards, it is possible to set the hours between which a deep scrub may be triggered.
- These are controlled by the osd_scrub_begin_hour and osd_scrub_end_hour tunables.
- To view the current values on the OSDs in a node, use:
# for i in /var/run/ceph/ceph-osd*; do ceph daemon $i config show | egrep "osd_scrub_begin_hour|osd_scrub_end_hour"; done
- By default, osd_scrub_begin_hour and osd_scrub_end_hour are set to 0 and 24 respectively. This means the scrub can start at any hour of the day. Setting them to values between 0 and 24 will restrict scrub activity to those specific hours.
- This works for a wrap-around range as well. For example, setting osd_scrub_begin_hour to 18 and osd_scrub_end_hour to 5 allows the scrub to start any time between 06:00 PM and 05:00 AM the next day.
- You can use the ceph tell command to change these settings at runtime:
ceph tell osd.* injectargs "--osd-scrub-begin-hour 0"
ceph tell osd.* injectargs "--osd-scrub-end-hour 6"
To make the change persistent, add the same settings to ceph.conf on all the OSD nodes:
osd_scrub_begin_hour = 0
osd_scrub_end_hour = 6
2. Increase osd_scrub_max_interval to a higher value, and trigger the scrub manually during idle hours.
- osd_scrub_max_interval defaults to 7 days, which means a PG can be skipped from scrubbing for up to a maximum of seven days if the system load is high. Any PG that has gone longer than osd_scrub_max_interval without a scrub will be scrubbed irrespective of the system load.
- The scrub process will not start if the system load is at or above osd_scrub_load_threshold. This defaults to 0.5 and can be tuned as per requirements.
IMPORTANT:
- There is a risk that scrubs will start automatically once osd_scrub_max_interval is reached.
- Make sure the administrator is aware of the newly set value and manually scrubs the placement groups outside business hours or at times of low I/O, so as not to be caught off-guard when osd_scrub_max_interval hits.
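The load gate described above can be sketched as a simple comparison. This is a deliberately simplified illustration (the function name is hypothetical, and the actual OSD logic is more involved than a single comparison against the 1-minute load average):

```python
# Simplified illustration of the osd_scrub_load_threshold gate; not Ceph code.
# On a live host the input would come from the load average, e.g. os.getloadavg()[0].
def scrub_allowed(loadavg, threshold=0.5):
    """A new scrub may start only while the load is below the threshold."""
    return loadavg < threshold

print(scrub_allowed(0.3))   # True  -> a scheduled scrub may start
print(scrub_allowed(0.9))   # False -> scheduled scrubs are deferred
```

Note that a load exactly at the threshold also defers the scrub, matching the "at or above" wording above; only osd_scrub_max_interval overrides this gate.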
3. Disable the deep scrub process altogether and run it manually at a time when fewer clients are connected, or outside business hours.
IMPORTANT:
- Disabling deep scrub on a Ceph cluster is not recommended, since it is analogous to a filesystem check on a local filesystem and is needed to maintain data integrity.
- Disabling deep-scrub will also mark the cluster as HEALTH_WARN.
- In order to disable deep-scrub, use:
# ceph osd set nodeep-scrub
- Once disabled, make sure it's run manually at a convenient time, using:
# ceph pg deep-scrub <PG.NUM> // This executes the deep-scrub on the PG
OR
# ceph osd deep-scrub <OSD#> // This executes the deep-scrub on the OSD
- Disabling deep-scrub should not be permanent. To enable it again, use:
# ceph osd unset nodeep-scrub
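A manual maintenance window along these lines can be scripted. The sketch below (the helper name is hypothetical, and it assumes the PG IDs were collected beforehand, e.g. with `ceph pg ls`) only builds the command sequence an administrator would run, in the order described above:

```python
# Hypothetical helper: assemble the commands for a manual deep-scrub window.
def manual_deep_scrub_plan(pg_ids):
    cmds = ["ceph osd set nodeep-scrub"]           # stop automatic deep scrubs
    cmds += [f"ceph pg deep-scrub {pg}" for pg in pg_ids]
    cmds.append("ceph osd unset nodeep-scrub")     # re-enable once done
    return cmds

for cmd in manual_deep_scrub_plan(["1.0", "1.1", "2.3f"]):
    print(cmd)
```

Keeping the unset as the final step ensures the cluster does not stay in HEALTH_WARN after the maintenance window ends.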
IMPORTANT:
> Lowering the I/O priority of the scrub process will supplement the above points. Refer to the article at https://access.redhat.com/solutions/1479843 for details.
MORE INFORMATION
osd_scrub_begin_hour and osd_scrub_end_hour restrict the scrubs (both deep and soft) to specific hours in a day. These take a value between 0 and 24. These integers map to:
- 0 - 12:00AM to 01:00AM
- 1 - 01:00AM to 02:00AM
- 2 - 02:00AM to 03:00AM
.... and so on..
By default, these are set to 0 and 24 respectively, which means the scrub can start at any time. By restricting these values, a window of time can be specified; the scrub can start at any time within this window.
As an example, setting the start and end hour to 1 and 6 respectively does not mean that the scrub will start at 1:00 AM and stop at 06:00 AM. It only means the scrub is allowed to start at any time between 01:00 AM and 06:00 AM, and not outside that window.
The osd_scrub_end_hour will not stop an ongoing scrub that was started before osd_scrub_end_hour; it only means that new PGs will not be queued for scrubbing after osd_scrub_end_hour. Hence, any chunk of objects already queued or being scrubbed when osd_scrub_end_hour hits will still be serviced.
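The window semantics above, including the wrap-around case, can be sketched as follows (the function name is hypothetical; this checks only whether a scrub may start, not whether one keeps running):

```python
# Illustrative sketch of the scrub-window check; not the actual OSD code.
# A scrub may *start* inside the window; an ongoing scrub is not stopped by it.
def in_scrub_window(hour, begin, end):
    if begin == 0 and end == 24:
        return True                      # defaults: any hour is allowed
    if begin <= end:
        return begin <= hour < end       # plain window, e.g. begin=1, end=6
    return hour >= begin or hour < end   # wrap-around, e.g. begin=18, end=5

print(in_scrub_window(23, 18, 5))   # True  (inside the wrap-around window)
print(in_scrub_window(12, 18, 5))   # False (noon is outside 18:00-05:00)
print(in_scrub_window(3, 1, 6))     # True
```

This mirrors the examples above: a 18-to-5 window admits 11:00 PM but rejects noon, and a 1-to-6 window admits 3:00 AM.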
- If the system load is above the value of the tunable osd_scrub_load_threshold, the scrub does not start, even inside a window set by osd_scrub_begin_hour and osd_scrub_end_hour.
- However, if a PG has not been scrubbed for more than 7 days (set by the tunable osd_scrub_max_interval), it is scrubbed regardless of the system load or the scrub start/end hours.
- Hence, in some cases, the scrub may happen outside the window set by osd_scrub_begin_hour and osd_scrub_end_hour.
- Along with the above, there are a few more tunables which can affect how a scrub behaves:
* osd_scrub_sleep - A forced delay between scrub chunks (defaults to `0`).
* osd_scrub_chunk_min - The minimum number of objects a scrub locks at a time.
* osd_scrub_chunk_max - The maximum number of objects a scrub locks at a time.
- The object store is partitioned into chunks which end on hash boundaries. A chunk is the set of objects that are locked and scrubbed at once. Once a chunk of objects is locked, any I/O to those objects is blocked until the scrub of that chunk finishes.
- The scrub process locks osd_scrub_chunk_max (default 25) objects, scrubs them, releases the lock, and moves on to the next osd_scrub_chunk_max objects until it reaches the end of the PG. Once the scrub thread nears the end of the PG, it locks osd_scrub_chunk_min objects or fewer, depending on how many objects remain.
- osd_scrub_sleep is the time in seconds the scrub process waits between releasing one chunk and locking the next. By default this is set to 0, which means the scrub is continuous for as long as the OSD can read data. This is generally bad for other OSD workloads such as peering, backfilling, and rebalancing.
- In a normal environment, it is good to set osd_scrub_sleep to 0.1 or 0.2 and leave osd_scrub_chunk_min and osd_scrub_chunk_max at their default values.
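The chunking behaviour can be sketched as a simple loop (illustrative only; the function name is hypothetical, and in Ceph the chunk boundaries fall on hash boundaries rather than fixed slices):

```python
import time

# Illustrative chunked-scrub loop: lock up to chunk_max objects, scrub them,
# release the lock, then optionally sleep (osd_scrub_sleep) before the next chunk.
def scrub_pg(objects, chunk_max=25, scrub_sleep=0.0):
    chunks = 0
    for i in range(0, len(objects), chunk_max):
        chunk = objects[i:i + chunk_max]   # locked: client I/O to these waits
        assert len(chunk) <= chunk_max     # the final chunk may be smaller
        # ... compare sizes/attributes (soft) or read data and checksums (deep)
        chunks += 1
        if scrub_sleep:
            time.sleep(scrub_sleep)        # gives client I/O room to breathe
    return chunks

print(scrub_pg(list(range(60)), chunk_max=25))   # 3 chunks: 25 + 25 + 10
```

With scrub_sleep left at 0 the loop never pauses, which is why a nonzero osd_scrub_sleep (0.1 or 0.2) smooths the impact on concurrent client I/O.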
- Refer to the RHCS documentation on scrubbing.
WARNING:
In case an RBD pool has a lot of snapshots, there is a chance that more than osd_scrub_chunk_max objects are locked for scrubbing, and the scrub on that chunk takes a long time, which makes slow requests stack up and impacts client I/O.
This can happen when an object has a lot of snapshots/clones. The way the scrub code is written, all the clones along with the original object have to be scrubbed at the same time. Hence, the larger the number of clones of an object, the larger the number of objects that have to be locked for the scrub. Locking a large number of objects, and thereby preventing I/O to them, can make client I/O wait and pile up slow requests.
The fix, being a major change, would only be available in Jewel and most probably in Red Hat Ceph Storage 2.2, and won't be backported to Hammer.
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1373653 for more details.
Root Cause
Deep scrubs can interfere with client I/O, and in some cases they need to be disabled and run manually.
The steps above describe some of the methods to work around this.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.