What are the possible Placement Group states in an RHCS/Ceph cluster?
Environment
- Red Hat Ceph Storage 1.3.x
- Red Hat Ceph Storage 2.x
Issue
- What are some of the possible states the Placement Groups in an RHCS/Ceph cluster can be in?
Resolution
- When checking a cluster's status (e.g., running ceph -s or ceph -w), Ceph will report the status of placement groups.
- A placement group can have one or more states. The optimum state for placement groups in the PG map is active+clean.
- Placement groups in active+<other-state> should ideally still serve data.
- Placement groups with a status of down will not serve data. Use ceph health detail to map the backing OSDs for such PGs and investigate the OSD states further.
- An example of a HEALTHY cluster:
$ ceph -s
cluster <UUID>
health HEALTH_OK
monmap e3: 1 mons at {<MONITOR-NODE-1>=<IP-ADDRESS>:<PORT>,...}
election epoch 5, quorum 0 hp-m300-4
osdmap e1850: 18 osds: 18 up, 18 in
pgmap v1411049: 1312 pgs, 12 pools, 91017 kB data, 52 objects
591 GB used, 308 GB / 899 GB avail
             1312 active+clean <== (number of placement groups and their state)
- An example of a cluster that is not HEALTHY:
# ceph -s
cluster <UUID>
health HEALTH_WARN
230 pgs backfill
65 pgs backfilling
216 pgs degraded
7 pgs peering
35 pgs recovering
118 pgs recovery_wait
37 pgs stuck inactive
472 pgs stuck unclean
17 pgs undersized
recovery 1562/194386 objects degraded (0.804%)
recovery 14899/194386 objects misplaced (7.665%)
monmap e3: 1 mons at {<MONITOR-NODE-1>=<IP-ADDRESS>:<PORT>,...}
osdmap e1021: 36 osds: 36 up, 36 in; 266 remapped pgs
pgmap v12274993: 5056 pgs, 17 pools, 560 GB data, 92756 objects
1170 GB used, 39031 GB / 40201 GB avail
1562/194386 objects degraded (0.804%)
14899/194386 objects misplaced (7.665%)
4508 active+clean
210 active+remapped+wait_backfill
69 active+recovery_wait+degraded
49 active+recovery_wait+degraded+remapped
48 active+remapped+backfilling
37 activating
23 active+recovering+degraded
15 active+remapped
14 active+undersized+degraded+remapped+wait_backfill
14 activating+remapped
14 active+degraded+remapped+backfilling
14 activating+degraded+remapped
12 active+recovering+degraded+remapped
9 activating+degraded
7 peering
6 active+degraded+remapped+wait_backfill
3 active+undersized+degraded+remapped+backfilling
3 active+degraded+remapped
1 inactive
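As a sanity check, the degraded and misplaced percentages reported above are plain ratios of object counts (affected object replicas divided by total object replicas), so they can be recomputed by hand. A minimal sketch using awk and the figures shown in the example output:

```shell
# Recompute the percentages reported by 'ceph -s' above:
# objects degraded (or misplaced) / total object replicas * 100
awk 'BEGIN { printf "degraded:  %.3f%%\n", 1562  / 194386 * 100 }'
awk 'BEGIN { printf "misplaced: %.3f%%\n", 14899 / 194386 * 100 }'
```

This reproduces the 0.804% and 7.665% values shown in the status output.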
IMPORTANT:
- The network is a critical aspect of a distributed system, so make sure the network configuration is consistent throughout the cluster, on both the cluster_network and public_network interfaces.
- For example, a uniform MTU (either 9000 or 1500) should be used across the network interfaces on the OSD and MON nodes.
- Mismatched MTUs in a cluster can cause unexpected behaviours such as OSD flapping, heartbeat_check failures, OSDs being wrongly marked down, placement groups stuck in peering, etc.
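As a quick consistency check, the MTU of each interface can be collected from every node (for example with ip -o link show) and compared across the cluster. A minimal sketch over saved sample data; the node and interface names below are illustrative:

```shell
# MTUs collected from the cluster nodes (illustrative sample; on each node,
# 'ip -o link show' prints lines containing '... mtu <value> ...')
cat <<'EOF' > /tmp/mtus.txt
node1 eth0 mtu 9000
node2 eth0 mtu 9000
node3 eth0 mtu 1500
EOF
# Print the distinct MTU values; more than one line indicates a mismatch
awk '{print $4}' /tmp/mtus.txt | sort -u
```

Here two values (1500 and 9000) are printed, flagging the mismatch on node3.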
- The ceph -s example output above shows a few of the various states a placement group can be in.
- An explanation of each PG state is given below:
Creating
- Ceph is still creating the placement group.
Activating
- The placement group is peered but not yet active.
Active
- Ceph will process requests to the placement group. Active Placement Groups will serve data.
Clean
- Ceph has replicated all objects in the placement group the correct number of times. active+clean is the ideal PG state.
Down
- A replica with necessary data is down, so the placement group is offline.
- A PG with fewer than min_size replicas will be marked as down. Use ceph health detail to understand the backing OSD state.
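ceph health detail lists each problem PG together with its current state and last acting OSD set, which can then be filtered for down PGs. A minimal sketch over a saved sample; the PG IDs, timings and OSD numbers below are illustrative:

```shell
# Saved 'ceph health detail' sample (illustrative); real output lists each
# problem PG with its current state and last acting OSD set
cat <<'EOF' > /tmp/health-detail.txt
pg 2.5 is stuck inactive for 614.147, current state down+peering, last acting [1,4]
pg 3.a is stuck unclean for 308.120, current state active+degraded, last acting [0,2]
EOF
# Pick out PGs in a down state; the acting OSDs are the ones to investigate
grep 'state down' /tmp/health-detail.txt
```

In this sample only pg 2.5 is printed, pointing at OSDs 1 and 4 for further investigation.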
Laggy
- An OSD replica is not acknowledging new leases from the primary OSD in a timely manner. I/O is temporarily paused.
Wait
The set of OSDs for this PG has just changed and I/O is temporarily paused until the previous interval’s leases expire.
Replay
The placement group is waiting for clients to replay operations after an OSD crashed.
Splitting
Ceph is splitting the placement group into multiple placement groups.
Scrubbing
Ceph is checking the placement group for inconsistencies.
Deep
Ceph is checking the placement group data against stored checksums.
Degraded
Ceph has not replicated some objects in the placement group the correct number of times yet.
Inconsistent
Ceph detects inconsistencies in one or more replicas of an object in the placement group (e.g. objects are the wrong size, objects are missing from one replica after recovery finished, etc.).
Peering (peering)
- The placement group is undergoing the peering process.
- Peering should complete without much delay; if the number of PGs in the peering state does not decrease over time, the peering may be stuck.
- To understand why a PG is stuck in peering, query the placement group and check if it is waiting on any other OSDs. To query a PG, use:
# ceph pg <pg.id> query
- If the PG is waiting on another OSD for the peering to finish, bringing up that OSD should solve this.
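The recovery_state section of the query output includes a blocked_by field naming the OSDs the PG is waiting on. A minimal sketch over a saved fragment of the output; the state name and OSD ID shown are illustrative:

```shell
# Illustrative fragment of 'ceph pg <pg.id> query' output saved to a file;
# the real command needs a live cluster
cat <<'EOF' > /tmp/pg-query.json
    "recovery_state": [
        { "name": "Started/Primary/Peering",
          "blocked_by": [ 2 ] }
    ],
EOF
# OSD IDs listed under "blocked_by" are the peers holding up peering
grep '"blocked_by"' /tmp/pg-query.json
```

Here the PG is waiting on osd.2; bringing that OSD up should let peering finish.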
Repair
Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
Recovering
Ceph is migrating/synchronising objects and their replicas.
Backfill
Ceph is scanning and synchronising the entire contents of a placement group instead of inferring what contents need to be synchronised from the logs of recent operations. Backfill is a special case of recovery.
Wait-backfill
The placement group is waiting in line to start backfill.
Backfill-toofull (backfill_toofull)
- A backfill operation is waiting because the destination OSD is over its full ratio.
- Placement groups in a backfill_toofull state have backing OSDs that have hit the osd_backfill_full_ratio (0.85 by default).
- Any OSD hitting this threshold will prevent data from backfilling onto itself from other OSDs.
- NOTE: PGs hitting osd_backfill_full_ratio will still serve reads and writes, and also rebalance. Only backfill is blocked, to prevent the OSD from reaching the full_ratio faster.
- To check the osd_backfill_full_ratio of the OSDs, use:
# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep backfill_full_ratio
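The check the OSD applies is a simple threshold comparison of its utilisation against the ratio. A minimal sketch of that comparison with illustrative values:

```shell
# Illustrative values: OSD at 87% utilisation vs. the 0.85 default ratio
used=0.87; ratio=0.85
awk -v u="$used" -v r="$ratio" \
    'BEGIN { s = (u > r) ? "backfill blocked (backfill_toofull)" : "backfill allowed"; print s }'
```

With these numbers the OSD is over the ratio, so backfill onto it is blocked until utilisation drops (or the ratio is raised).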
Backfill-unfound (backfill_unfound)
- Backfill has stopped due to unfound objects.
Incomplete
Ceph detects that a placement group is missing information about writes that may have occurred, or does not have any healthy copies. If any of the Placement Groups are in this state, try starting any failed OSDs that may contain the needed information or temporarily adjust min_size to allow recovery.
Remapped
The placement group is temporarily mapped to a different set of OSDs from what CRUSH specified.
Undersized
The placement group has fewer copies than the configured pool replication level.
When the number of replicas falls below the pool's 'size' setting, the PG state will show something similar to 'active+undersized'.
However, when the number of replicas falls below the pool's 'min_size' setting, the PG state will show 'undersized' without the 'active+' prefix and the PG will be in a read-only state.
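Per the rules above, the reported state follows from comparing the number of available replicas with the pool's size and min_size. A minimal decision sketch with illustrative numbers:

```shell
# Illustrative values: pool size 3, min_size 2, only 1 replica available
size=3; min_size=2; avail=1
awk -v a="$avail" -v s="$size" -v m="$min_size" 'BEGIN {
    if (a >= s)      print "active+clean"
    else if (a >= m) print "active+undersized"   # still serving I/O, short on copies
    else             print "undersized (not active, read-only)"
}'
```

With one replica of a size-3/min_size-2 pool, the PG drops below min_size and stops serving writes.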
Peered
The placement group has peered but cannot serve client IO due to not having enough copies to reach the pool’s configured min_size parameter. Recovery may occur in this state, so the pg may heal up to min_size eventually.
IMPORTANT
A placement group can be in any of the above states; a PG that is not active+clean does not necessarily indicate a problem. It should ultimately reach an active+clean state automatically, but manual intervention may sometimes be needed. Placement groups in active+<some-state-other-than-clean> should still serve data, since the PG is active.
Usually, Ceph tries to fix/repair placement groups and bring them back to active+clean, but PGs can end up in a stuck state in certain cases. The stuck states include:
Inactive
Placement groups in the inactive state won't accept any I/O. They are usually waiting for an OSD with the most up-to-date data to come back up. If the UP set and the ACTING set are the same, and the OSDs are not blocked on any other OSDs, this can be a problem with peering. Manually marking the primary OSD down will force peering to restart, since Ceph automatically brings the primary OSD back up and the peering process is kickstarted once an OSD comes up.
Stale
The placement group is in an unknown state because the OSDs that host it have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).
Unclean
Placement groups contain objects that are not replicated the desired number of times. A very common reason is OSDs that are down, or OSDs with a CRUSH weight of 0, which prevents the PGs from replicating data onto those OSDs and thus achieving a clean state.
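OSDs with a CRUSH weight of 0 can be spotted in ceph osd tree output, where the weight is the second column in RHCS-era releases. A minimal sketch over a saved sample; the OSD IDs and weights are illustrative:

```shell
# Saved 'ceph osd tree' fragment (illustrative); column 2 is the CRUSH weight
cat <<'EOF' > /tmp/osd-tree.txt
 0 0.90999         osd.0       up  1.00000          1.00000
 1       0         osd.1       up  1.00000          1.00000
EOF
# OSDs with weight 0 receive no data and can leave PGs stuck unclean
awk '$2 == 0 {print $3}' /tmp/osd-tree.txt
```

In this sample osd.1 is printed; reweighting it (or removing it from the CRUSH map) would let the PGs reach a clean state.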
The following two PG states were added in the Jewel release for the snapshot trimming feature.
snaptrim:
The PGs are currently being trimmed.
snaptrim_wait:
The PGs are waiting to be trimmed.
- To identify stuck placement groups, execute the following:
# ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
Note:
For a more detailed explanation of placement group states, please check the Red Hat Ceph Storage documentation section on monitoring placement group states.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.