Ceph - OSD reboots every few minutes with FAILED assert(clone_size.count(clone))

Solution Verified - Updated

Environment

  • Red Hat Ceph Storage 1.2
  • Red Hat Ceph Storage 1.2.3
  • Red Hat Ceph Storage 1.3
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • Ubuntu Precise 12.04
  • Ubuntu Trusty 14.04

Issue

  • A single OSD reboots every few minutes. When this OSD is marked as "OUT" and another OSD backfills/takes its place the new OSD also begins to reboot continuously.
  • In the OSD log (/var/log/ceph/ceph-osd.*.log) prior to the OSD logging the assert, we can see that Ceph is scrubbing a PG:
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -4> 2015-10-08 11:00:55.862853 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub   osd.91 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -3> 2015-10-08 11:00:55.862877 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 32 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -2> 2015-10-08 11:00:55.862885 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub replica 69 has 24 items
Oct  8 11:00:55 str-yyz-02-01 ceph-osd:     -1> 2015-10-08 11:00:55.863074 7f33c3e26700  2 osd.91 pg_epoch: 66350 pg[5.2d6( v 66302'153650 (48156'150649,66302'153650] local-les=66350 n=1553 ec=65 les/c 66350/66350 66349/66349/66349) [91,69,32] r=0 lpr=66349 crt=66302'153647 lcod 0'0 mlcod 0'0 active+clean+scrubbing] 
Oct  8 11:02:31 str-yyz-02-01 ceph-osd:      0> 2015-10-08 11:02:31.048794 7f5782775700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f5782775700 time 2015-10-08 11:02:31.047352
osd/osd_types.cc: 3543: FAILED assert(clone_size.count(clone))

ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x707b46]
2: (ReplicatedPG::_scrub(ScrubMap&)+0x9e8) [0x7c0198]
3: (PG::scrub_compare_maps()+0x5b6) [0x755306]
4: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1d9) [0x758999]
5: (PG::scrub(ThreadPool::TPHandle&)+0x19a) [0x75b96a]
6: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x657309]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa56dd1]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa57cc0]
9: (()+0x8182) [0x7f579ce6e182]
10: (clone()+0x6d) [0x7f579b5e0fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  • Also prior to the assert, the following error, which identifies the actual object and snapset with the issue, may be observed in the OSD log:
Oct  8 06:46:59 str-yyz-02-01 ceph-osd: 2015-10-08 06:46:59.844990 7fa3158a3700  0 log [ERR] : 5.2d6 shard 91: soid 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5 size 0 != known size 4194304

Resolution

  • If the OSD log already contains the following error identifying the corrupted object and snapset, skip ahead to the "Removing an object from an OSD with Ceph Objectstore tool" section. If no such error is present in the log, debugging will need to be enabled on the OSD to capture this information. Debugging can be set by injecting the arguments into the running OSD or by setting it in the ceph.conf file for the OSD in question. If the OSD is restarting continuously, set it in the ceph.conf file, as the OSD may not stay up long enough for the injected arguments to take effect.
Oct  8 06:46:59 str-yyz-02-01 ceph-osd: 2015-10-08 06:46:59.844990 7fa3158a3700  0 log [ERR] : 5.2d6 shard 91: soid 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5 size 0 != known size 4194304
  • In the above error, the object and snap ID are: 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5
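The soid string in the error can be pulled apart in the shell; a minimal sketch (the hash/object/snapid/key/pool field layout is inferred from the log format above):

```shell
# soid from the [ERR] log line above
soid='4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5'

# Split on '/' into its fields.
# Assumed layout (inferred from the log): hash/object/snapid/key/pool
oldIFS=$IFS; IFS=/
set -- $soid
IFS=$oldIFS

echo "object=$2 snapid=0x$3 pool=$5"
# → object=rbd_data.ed0c0f561c681.0000000000000a53 snapid=0x966f pool=5
```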

Setting debug on an OSD:

Example of setting debug on an OSD via command line (where * is replaced with the OSD number):

[osd-host-a@admin]#  ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'

Example of setting debug on an OSD via the ceph.conf file (this sets the values cluster-wide, but only the OSD that is restarting will pick up the logging change):

[global]
        debug ms = 1/5  

[osd]
        debug osd = 1/5
        debug filestore = 1/5
  • Once debugging has been enabled and the OSD has rebooted again, review the log for the following which will occur right before the assert:
Oct 15 11:05:50 str-yyz-02-01 ceph-osd:     -1> 2015-10-15 11:05:50.888786 7f05fc0aa700 20 osd.91 pg_epoch: 86297 pg[5.2d6( v 86291'157317 (70796'154316,86291'157317] local-les=86297 n=1416 ec=65 les/c 86297/86297 86296/86296/86296) [91,69,32] r=0 lpr=86296 crt=86279'157313 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub  4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5(43933'135273 client.15590667.0:6709 [966f] dirty s 4194304 uv124869)
  • In the example above with debugging enabled on the OSD, the object in the PG being scrubbed when the assert occurs can be found. The object is rbd_data.ed0c0f561c681.0000000000000a53 and the snap ID is 0x966f, which is 38511 in decimal.
  • Now that the information on the object causing the scrubber thread to crash and assert the OSD has been gathered, the object can be removed from the OSD with the ceph-objectstore-tool.
  • WARNING: Using the ceph-objectstore-tool is a risky process. All ceph-objectstore-tool commands need to be run as root or with the sudo command. Do not attempt this on a production cluster without engaging Red Hat Ceph Storage support. You could cause irreversible data loss in your cluster. For all commands except import-rados the given ceph-osd daemon MUST NOT be running; otherwise you will receive an "OSD has the store locked" error.
  • For additional information on the ceph-objectstore-tool, see the upstream documentation on ceph.com.
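The hex-to-decimal conversion of the snap ID (0x966f above) can be done directly in the shell, for example:

```shell
# ceph-objectstore-tool lists snap IDs in decimal, while the OSD log
# prints them in hexadecimal; printf converts between the two.
printf '%d\n' 0x966f   # hex snap ID from the log
# → 38511
printf '%x\n' 38511    # and back to hex
# → 966f
```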

Removing an object from an OSD with Ceph Objectstore tool:

  • Set the noout flag on the cluster
ceph osd set noout
  • Stop the OSD service for the OSD hosting the PG with the corrupted object on the OSD node
  • Export the current PG to a backup folder
[osd-host-a@admin]# ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op export --pgid <PG id> --file xxxxdestinationxxxx
  • Get the JSON for the object in question. Note that the snap ID, shown in hexadecimal in the log, is listed in decimal here. Only the entry with the matching "snapid" and "namespace":"" in the JSON is needed.
[osd-host-a@admin]# ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op list <object name>
  • Remove the object from the OSD, then repeat the above steps to remove it from each of its replicas.
[osd-host-a@admin]# ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> '["<PG id>",{<object JSON for the proper snap ID>}]' remove
  • Unset the noout flag
ceph osd unset noout
  • Restart the OSD.
  • The PG will need to be repaired to rebuild the snapset from the valid header information.
[osd-host-a@admin]# ceph pg repair {placement-group-ID}

  • Verify no further OSD crashes occur during scrubbing.

Root Cause

  • The problem to focus on can be observed in the OSD logs:
Oct 15 11:05:50 str-yyz-02-01 ceph-osd:     -1> 2015-10-15 11:05:50.888786 7f05fc0aa700 20 osd.91 pg_epoch: 86297 pg[5.2d6( v 86291'157317 (70796'154316,86291'157317] local-les=86297 n=1416 ec=65 les/c 86297/86297 86296/86296/86296) [91,69,32] r=0 lpr=86296 crt=86279'157313 lcod 0'0 mlcod 0'0 active+clean+scrubbing] scrub  4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5 4993e2d6/rbd_data.ed0c0f561c681.0000000000000a53/966f//5(43933'135273 client.15590667.0:6709 [966f] dirty s 4194304 uv124869)
  • In the snippet above, a corrupted snapset was found during scrubbing. This is incorrectly handled by the scrubbing logic, and the OSD crashes every time the scrubber detects the corrupted snapset. This logic has already been re-written so the scrub does not crash, but the fix has not been added to any release at the time of this writing. In the future, scrub will report an unexpected clone in the ObjectStore; a tracker is open upstream at tracker.ceph.com.
  • Bugzilla 1273127 has been opened to address this issue downstream.

Diagnostic Steps

  • Below is an example of the above actions for the error occurring on OSD.30:
2015-09-03 10:52:15.164562 7fb1f4afe700 20 osd.30 pg_epoch: 227938 pg[3.f9( v 227870'10797633 (211962'10794623,227870'10797633] local-les=227938 n=11612 ec=101 les/c 227938/227938 227937/227937/227937) [30,10] r=0 lpr=227937 crt=227623'10797625 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] deep-scrub c55800f9/rb.0.8c2990.238e1f29.000000008cc0/23ed//3 c55800f9/rb.0.8c2990.238e1f29.000000008cc0/23ed//3(11044'10472430 client.9186139.0:268115 [23ed] dirty|omap_digest s 4194304 uv 9525846 od ffffffff)
0> 2015-09-03 10:52:15.170668 7fb1f4afe700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fb1f4afe700 time 2015-09-03 10:52:15.164585
osd/osd_types.cc: 4076: FAILED assert(clone_size.count(clone))
  • The message preceding the assert is in the same thread, so rb.0.8c2990.238e1f29.000000008cc0/23ed//3 contains the object name. "23ed" is the RADOS clone/snap ID.
  • Object information:
    Object: rb.0.8c2990.238e1f29.000000008cc0/23ed//3

23ed is the RADOS clone/snap ID
Hex: 23ed = Dec: 9197

  • Actions:
  1. ceph osd set noout
  2. service ceph stop osd.30
  3. Get a backup by exporting the PG using the ceph-objectstore-tool. Specify a --file destination with enough disk space.
# ceph-objectstore-tool --data-path xxxxxx --journal-path xxxxxx --op export --pgid 3.f9 --file xxxxdestinationxxxx


<Example Output>
Exporting 3.f9
....
Read 3/c55800f9/rb.0.8c2990.238e1f29.000000008cc0/23ed
....
Export successful
  4. Now the JSON of the object in question is needed. The third line of the export output above shows the clone 23ed, which is 9197 in decimal.
# ceph-objectstore-tool --data-path xxxxxx --journal-path xxxxxx --op list rb.0.8c2990.238e1f29.000000008cc0 


<Example output>

["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
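Since --op list prints one JSON entry per clone, the entry for the corrupt snap ID can be filtered with grep; a sketch (the variable below stands in for the real tool output, with the elided hash values shown as 0 purely for illustration):

```shell
# Stand-in for `ceph-objectstore-tool ... --op list <object>` output;
# hash values are illustrative, not from a real cluster.
list_output='["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9196,"hash":0,"max":0,"pool":3,"namespace":""}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9197,"hash":0,"max":0,"pool":3,"namespace":""}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":-2,"hash":0,"max":0,"pool":3,"namespace":""}]'

# Keep only the entry whose snapid matches the decimal clone ID (9197):
printf '%s\n' "$list_output" | grep '"snapid":9197'
```

The surviving line is exactly the argument to paste, in single quotes, into the remove command in the next step.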
  5. To remove it, cut and paste the output line with snapid 9197 inside single quotes:
# ceph-objectstore-tool --data-path xxxxxx --journal-path xxxxxx '["3.f9",{"oid":"rb.0.8c2990.238e1f29.000000008cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' remove


<Example output>
remove 3/c55800f9/rb.0.8c2990.238e1f29.000000008cc0/23ed
  6. Remove the object from the replicas using the same method as with the first object.
  7. Repair the PG
# ceph pg repair 3.f9

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.