Ceph Cluster Scheduled deep-scrubbing


Environment

  • Ceph Dumpling earlier than 0.67.12

  • Ceph Firefly earlier than 0.80.8

Issue

  • How can a deep-scrub be scheduled manually in order to reduce its impact on client operations?

  • In Ceph versions prior to Firefly (v0.80.8), no tunables were available to reduce the impact of a deep-scrub on client operations, such as lowering the I/O priority or the frequency of the deep-scrub. How can deep-scrubbing be tuned manually to reduce this impact?

  • How can a deep-scrub on a Ceph cluster be scheduled for a particular day/time? This can be useful to run the deep-scrub outside business hours, or at a time when fewer clients are reading from or writing to the cluster.

Resolution

NOTE: This is not applicable to Red Hat Ceph Enterprise 1.x, since RHCS 1.x ships with versions later than Firefly 0.80.8.

  • Disable the automatic deep-scrubbing operation on the cluster, which by default runs every week:
# ceph osd set nodeep-scrub
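The flag can be confirmed from the same admin node, and it must be cleared again when reverting to automatic deep-scrubbing. These are standard Ceph CLI commands (they require a working cluster connection):

```shell
# Confirm the flag is set: the 'flags' line of the OSD map should
# now include 'nodeep-scrub'
ceph osd dump | grep flags

# When reverting to automatic deep-scrubbing, clear the flag again
ceph osd unset nodeep-scrub
```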
  • Schedule the script below using either cron or any commercial scheduling tool. The script may need customization for your environment.

a) For Dumpling earlier than 0.67.12

while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' | sort | head -1 | tee ./lastpg.txt >/dev/null
do read date time apg < ./lastpg.txt
  #
  # Insert some code logic here to check the date and time the PG was
  # deep-scrubbed against the current date and time
  #
  echo "Processing ${apg}, last deep-scrubbed on ${date} at ${time}"
  ceph pg deep-scrub ${apg}
  while ceph status | grep scrubbing+deep
  do sleep 5
  done
  sleep 30
done

b) For Firefly earlier than 0.80.8

while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $22, $23, $1}' | sort | head -1 | tee ./lastpg.txt >/dev/null
do read date time apg < ./lastpg.txt
  #
  # Insert some code logic here to check the date and time the PG was
  # deep-scrubbed against the current date and time
  #
  echo "Processing ${apg}, last deep-scrubbed on ${date} at ${time}"
  ceph pg deep-scrub ${apg}
  while ceph status | grep scrubbing+deep
  do sleep 5
  done
  sleep 30
done
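The placeholder comment in both scripts can be filled with a simple age check on the timestamp read from lastpg.txt. A minimal sketch is below; the helper name scrub_age_days and the 7-day threshold are assumptions for illustration, and GNU date with the -d option is assumed:

```shell
#!/bin/bash
# Hypothetical helper: print the whole number of days since a PG's last
# deep-scrub, given the date and time fields read from lastpg.txt.
# Requires GNU date (-d).
scrub_age_days() {
    local last_epoch now_epoch
    last_epoch=$(date -d "$1 $2" +%s)
    now_epoch=$(date +%s)
    echo $(( (now_epoch - last_epoch) / 86400 ))
}

# Example use inside the loop body, just after the read:
#   if [ "$(scrub_age_days "$date" "$time")" -lt 7 ]; then
#       break   # oldest PG was deep-scrubbed recently; stop for this run
#   fi
```

Because the outer loop always re-reads the oldest-scrubbed PG, a check like this is what eventually terminates the run once every PG has a recent enough deep-scrub stamp.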
  • Note that this script must be scheduled on a server that has network access to all MON hosts in the cluster. The machine should also have a copy of the cluster's /etc/ceph/ceph.conf file and a copy of the admin keyring.
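As an illustration, the loop could be launched nightly with cron; the script path, log path, and schedule below are assumptions to be adapted to the local maintenance window:

```shell
# Hypothetical /etc/cron.d entry: start the deep-scrub loop at 01:00
# every night, logging its progress
0 1 * * * root /usr/local/bin/ceph-deep-scrub-loop.sh >> /var/log/ceph-deep-scrub.log 2>&1
```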

Root Cause

  • During a deep-scrub, I/O requests from client machines may be delayed because of the additional load and the priority given to the deep-scrub operations.

Diagnostic Steps

  • While deep-scrubbing operations are being processed on the placement groups by the OSDs, a 'ceph -s' or 'ceph health detail' command may show 'Slow requests' warning messages.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.