Why does the Ceph peering process stall when an OSD is down, leaving the cluster unable to recover to a healthy state?

Solution Unverified

Issue

  • During the peering process Ceph may require information from an OSD which is currently down or has been removed from the cluster.

  • When this happens, Ceph will wait for the OSD to return and the peering process will stall, leaving placement groups in an inactive state.

Resolution

  • Follow these steps to resolve the peering problem:

1: Identify which placement groups are being affected.

# ceph pg dump | grep peering
  • The output of the above command will contain one or more lines which look something like this:
3.17d   2639    0       0       0       10277834167     3001    3001    peering    2015-02-22 14:28:20.782434      22555'11703     22564:67511    [4,6]   4       [4,6,12]   4       22539'11691     2015-02-22 14:28:20.782372      22539'11691     2015-02-22 14:28:20.782372
  • The first item on each line is the placement group ID. These IDs will be used in the next step.
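Extracting the first column from that output can be sketched in Python. This is a minimal illustration using the sample line from this article, not a supported tool; the `peering_pg_ids` helper is hypothetical:

```python
# Sketch: extract the PG IDs (first column) from the output of
# "ceph pg dump | grep peering". The sample line is taken from this
# article; in practice the output would come from the command itself.

sample = (
    "3.17d   2639    0       0       0       10277834167     3001    3001    "
    "peering    2015-02-22 14:28:20.782434      22555'11703     22564:67511    "
    "[4,6]   4       [4,6,12]   4       22539'11691     "
    "2015-02-22 14:28:20.782372      22539'11691     2015-02-22 14:28:20.782372"
)

def peering_pg_ids(dump_output: str) -> list[str]:
    """Return the first column (the PG ID) of each line mentioning 'peering'."""
    return [line.split()[0]
            for line in dump_output.splitlines()
            if "peering" in line]

print(peering_pg_ids(sample))  # ['3.17d']
```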

2: Identify the OSD which needs to be probed.

  • Now that there is a list of PG IDs stalled in the peering state, the OSD which needs to be probed can be identified by running:
# ceph pg <pgid> query
  • The output of this command is a large amount of JSON. Near the bottom of that output is the recovery section, which contains a set of lines that read:
"down_osds_we_would_probe": [
    85],
  • In this case, OSD.85 needs to be brought back up. To determine the status of the OSD:

a. Check the status of the host where the OSD resides.

b. Check the OSD process, and start it if it is not running. A restart won't hurt even if it is running.

NOTE: However, if the OSD has been removed from the cluster, it will need to be re-added to the CRUSH map and marked as lost in order to allow peering to continue.
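Pulling the OSD IDs out of the query output can be sketched as follows, assuming the JSON shape shown above (the `recovery_state` and `down_osds_we_would_probe` keys as they appear in `ceph pg <pgid> query` output); the `osds_to_probe` helper is hypothetical:

```python
import json

# Minimal document mimicking the relevant part of `ceph pg <pgid> query`
# output, based on the excerpt shown in this article.
query_output = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Peering",
      "down_osds_we_would_probe": [85]
    }
  ]
}
""")

def osds_to_probe(pg_query: dict) -> list[int]:
    """Collect every OSD ID listed under down_osds_we_would_probe."""
    osds = []
    for state in pg_query.get("recovery_state", []):
        osds.extend(state.get("down_osds_we_would_probe", []))
    return osds

print(osds_to_probe(query_output))  # [85]
```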

3: Marking a removed OSD as lost

  • If the process cannot be started and the OSD has been removed from the cluster, the OSD should be marked as lost. Please note that this should never be done if that OSD held the last remaining copy of any PG in the cluster. Ideally this should not be the case, as the cluster automatically creates replicas based on the size setting of each pool.
  1. First, run 'ceph osd create' until you see the ID for the OSD you need to mark lost.

  2. Then execute:

# ceph osd lost 85 --yes-i-really-mean-it
  • This marks OSD.85 as lost and allows peering to continue. After the peering process has completed, the unused OSD ID can be removed using the following two commands:
# ceph osd crush remove osd.85
# ceph osd rm 85
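The "run 'ceph osd create' until you see the ID" step works because Ceph allocates the lowest unused OSD ID. A small illustration of that assumption follows; the `next_osd_id` helper is hypothetical and only models the ID allocation, it is not a Ceph API:

```python
def next_osd_id(existing_ids: set[int]) -> int:
    """Lowest unused OSD ID -- the ID a `ceph osd create` call would hand
    out, assuming Ceph always allocates the lowest free ID."""
    candidate = 0
    while candidate in existing_ids:
        candidate += 1
    return candidate

# With OSDs 0-84 and 86 present and OSD 85 removed, the next
# `ceph osd create` call would re-create ID 85 -- the ID that must
# then be marked lost so peering can continue.
existing = set(range(85)) | {86}
print(next_osd_id(existing))  # 85
```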

Root Cause

  • During the peering process, Ceph may require information from an OSD which is currently down or has been removed from the cluster.

  • When this happens, Ceph will wait for the OSD to return and the peering process will stall, leaving placement groups in an inactive state.

  • Generally, the root cause of this issue is an OSD being down or completely removed from the cluster before it could be probed for the data required for backfilling to complete.

Diagnostic Steps

  • Utilizing commands such as 'ceph -w' or 'ceph -s' will show that one or more placement groups remain in a 'peering' status, as in the following example:
    cluster 00000000-0000-0000-0000-000000000000
     health HEALTH_WARN 17 pgs peering; 22 pgs stuck inactive; 22 pgs stuck unclean; 2 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
     monmap e5: 5 mons at {ceph1-001=10.0.0.1:6789/0,ceph1-002=10.0.0.2:6789/0,ceph1-018=10.0.0.3:6789/0,ceph1-035=10.0.0.4:6789/0,ceph1-036=10.0.0.5:6789/0}, election epoch 38778, quorum 0,1,2,3,4 ceph1-001,ceph1-002,ceph1-018,ceph1-035,ceph1-036
     osdmap e990992: 349 osds: 349 up, 349 in
            flags noscrub,nodeep-scrub
      pgmap v15216171: 45328 pgs, 16 pools, 8615 GB data, 4102 kobjects
            52029 GB used, 897 TB / 948 TB avail
                   5 inactive
               45306 active+clean
                  16 peering
                   1 remapped+peering
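Tallying the pgmap state lines from that status output can be sketched as below. The sample data is copied from the output above, and the `pg_state_counts` helper is hypothetical:

```python
# Sketch: tally PG states from the pgmap section of `ceph -s`. Each line
# pairs a count with a state name; summing the states that contain
# "peering" shows how many PGs are stalled.

status = """\
       5 inactive
   45306 active+clean
      16 peering
       1 remapped+peering
"""

def pg_state_counts(pgmap_lines: str) -> dict[str, int]:
    counts = {}
    for line in pgmap_lines.splitlines():
        count, state = line.split()
        counts[state] = int(count)
    return counts

counts = pg_state_counts(status)
stalled = sum(n for s, n in counts.items() if "peering" in s)
print(stalled)  # 17, matching the "17 pgs peering" in the HEALTH_WARN line
```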

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.