Why does a degraded Ceph cluster (on Firefly) stop recovering and get stuck with degraded PGs after an OSD goes down?


Environment

  • Red Hat Ceph Enterprise 1.2.3

  • Inktank Ceph Enterprise 1.2

Issue

  • Why does a degraded Ceph cluster (on Firefly) stop recovering and get stuck with degraded PGs after an OSD goes down?

  • After removing a failed OSD on a three-node Ceph cluster, data movement/rebalancing started between the remaining OSDs but then stalled, leaving the Ceph cluster stuck with degraded PGs.

  • The 'osd_pool_default_size' is set to 3 and 'osd_pool_default_min_size' to 2.

  • A 'ceph -s' shows the following:

# ceph -s
    cluster 16ce9ce1-aa5f-445f-b994-5699730f364a
     health HEALTH_WARN 326 pgs degraded; 366 pgs stuck unclean; recovery 975/83301 objects degraded (1.170%)
     monmap e1: 3 mons at {mon-01=172.28.225.72:6789/0,mon-02=172.28.225.73:6789/0,mon-03=172.28.225.74:6789/0}, election epoch 18, quorum 0,1,2 mon-01,mon-02,mon-03
     osdmap e540: 29 osds: 29 up, 29 in
      pgmap v2727568: 9408 pgs, 19 pools, 135 GB data, 27767 objects
            403 GB used, 80437 GB / 80840 GB avail
            975/83301 objects degraded (1.170%)
                9042 active+clean
                 326 active+degraded
                  40 active+remapped
  client io 20363 B/s wr, 1 op/s
  • The above output shows the current state; no further recovery is taking place.

  • A 'ceph osd tree' shows:

# ceph osd tree
# id    weight  type name       up/down reweight
-1      81.6    root default
-2      27.2            host node-c01
0       2.72                    osd.0   DNE
1       2.72                    osd.1   up      1
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1
4       2.72                    osd.4   up      1
5       2.72                    osd.5   up      1
6       2.72                    osd.6   up      1
7       2.72                    osd.7   up      1
8       2.72                    osd.8   up      1
9       2.72                    osd.9   up      1
-3      27.2            host node-02
10      2.72                    osd.10  up      1
11      2.72                    osd.11  up      1
12      2.72                    osd.12  up      1
13      2.72                    osd.13  up      1
14      2.72                    osd.14  up      1
15      2.72                    osd.15  up      1
16      2.72                    osd.16  up      1
17      2.72                    osd.17  up      1
18      2.72                    osd.18  up      1
19      2.72                    osd.19  up      1
-4      27.2            host node3-03
20      2.72                    osd.20  up      1
21      2.72                    osd.21  up      1
22      2.72                    osd.22  up      1
23      2.72                    osd.23  up      1
24      2.72                    osd.24  up      1
25      2.72                    osd.25  up      1
26      2.72                    osd.26  up      1
27      2.72                    osd.27  up      1
28      2.72                    osd.28  up      1
29      2.72                    osd.29  up      1

Resolution

  • Remove the problematic OSD (osd.0, shown as 'DNE' in the 'ceph osd tree' output above) from the CRUSH map:
# ceph osd crush remove osd.0
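
Once the OSD has been removed from the CRUSH map, CRUSH can select another OSD for the affected PGs and recovery should resume on its own. As a minimal check using standard ceph status commands (the exact output will differ per cluster):

# ceph -w                # watch recovery progress live until the cluster returns to HEALTH_OK
# ceph -s                # the degraded / stuck unclean PG counts should steadily decrease
# ceph osd tree          # osd.0 should no longer appear in the CRUSH tree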

Root Cause

  • The legacy CRUSH tunable defaults in Firefly cause CRUSH to keep retrying the mapping against the same specific OSD, even when that OSD is out.

  • Because that OSD was out, the objects could not be placed on it, so the affected PGs remained stuck and degraded. (A quick way to confirm which tunables are active is shown at the end of this section.)

  • There are two workarounds for this:

a) The quick workaround is to remove the problematic OSD from the CRUSH map, so that CRUSH no longer tries (and fails) to map PGs to it. Removing it forces CRUSH to find another OSD and place the objects there.

b) The suggested workaround is to change the tunable that causes CRUSH to keep retrying the same OSD, 'chooseleaf_vary_r'. Setting it to '4' or '5' is sufficient to let CRUSH find a valid mapping, even though the optimal value is '1'. A CRUSH map editing sketch is included at the end of this section.

NOTE: Setting it to '1' will cause a large amount of data movement between the OSDs. Make sure this is planned for before making the change.

c) From the upstream CRUSH map documentation (http://ceph.com/docs/giant/rados/operations/crush-map/#placing-different-pools-on-different-osds):

chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with a non-zero value of r, based on how many attempts the parent has already made. Legacy default is 0, but with this value CRUSH is sometimes unable to find a mapping. The optimal value (in terms of computational cost and correctness) is 1. However, for legacy clusters that have lots of existing data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will allow CRUSH to find a valid mapping but will make less data move.
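
To confirm that the cluster is still running with the legacy CRUSH tunables, the active values can be listed. A minimal check, assuming the 'ceph osd crush show-tunables' subcommand is available in the installed release (it reports the tunables as JSON, including 'chooseleaf_vary_r'):

# ceph osd crush show-tunables   # with the legacy profile, chooseleaf_vary_r is reported as 0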
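
As a rough sketch of workaround (b), the tunable can be changed by exporting, editing, and re-injecting the CRUSH map. The file names under /tmp below are only examples:

# ceph osd getcrushmap -o /tmp/crushmap.bin
# crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

Edit /tmp/crushmap.txt and add (or change) the line 'tunable chooseleaf_vary_r 5' in the tunables section at the top of the file, then compile and inject the new map:

# crushtool -c /tmp/crushmap.txt -o /tmp/crushmap-new.bin
# ceph osd setcrushmap -i /tmp/crushmap-new.bin

Alternatively, 'ceph osd crush tunables firefly' switches the cluster to the full Firefly tunables profile, which sets chooseleaf_vary_r to its optimal value of '1'; as noted above, this triggers significantly more data movement.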
