How to recover from 'Failed (blocked)'/'Failed (unmanaged)' state when using `on-fail=block`?

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux 6 and 7 with the High Availability or Resilient Storage Add-On
  • resources using the on-fail=block operation option

Issue

  • We use on-fail=block for some cluster resource operations, and when such an operation fails we need to recover the resource. How can this be achieved?

  • When we use on-fail=block, we see resources in the following states. How do we recover from them?

      # pcs status
      ...
      xxx     (ocf::heartbeat:Dummy): FAILED (unmanaged)[ node1 ]
      ...
    

    or

      # pcs status
      ...
      xxx     (ocf::heartbeat:Dummy): FAILED (blocked)[ node1 ]
      ...
    

Resolution

Operations other than 'stop' with on-fail=block

  1. Inspect the state of the resource on the node where the operation failed and collect any data you need for troubleshooting the resource.

  2. Clean up the resource and let the cluster re-detect its state.

     # pcs resource cleanup xxx
    

    or

     # pcs resource refresh xxx
    
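Before running the cleanup, step 1 can look like the following sketch. The resource name `xxx` and node name `node1` are placeholders; cleanup discards the failure record, so collect data first.

```shell
# On the node that reported the failure (node1 here), capture the
# cluster's view of the resource before the failure record is cleared.
pcs status --full                    # shows the FAILED (blocked)/(unmanaged) state
pcs resource failcount show xxx     # per-node failure counts for the resource

# Collect recent Pacemaker logs for troubleshooting.
# RHEL 7 (systemd); on RHEL 6 check /var/log/cluster/corosync.log instead.
journalctl -u pacemaker --since "1 hour ago"
```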

'stop' operation with on-fail=block

IMPORTANT: on-fail for the 'stop' operation defaults to 'fence' to prevent a node that failed to stop a resource from accessing shared data. Changing on-fail for the 'stop' operation to 'block' exposes shared data for a longer time to the node that reported the stop failure! To recover from this situation:

  1. Inspect the state of the resource on the node where the operation failed and collect any data you need for troubleshooting the resource.

  2. Proceed with fencing the node where the failure happened, for example by rebooting the system or by cutting off that node's access to shared data.
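Assuming a stonith device is already configured, step 2 can be performed manually with pcs (the node name `node1` is a placeholder):

```shell
# Manually fence the node that reported the stop failure.
# Depending on the stonith device configuration this reboots or
# powers off the node, cutting its access to shared data.
pcs stonith fence node1
```

Once the node has been fenced and rejoins the cluster, the resource can be recovered with the cleanup/refresh procedure described above.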


If fencing the node is not possible, you can attempt the following steps, but note that continuing puts shared data at RISK of DATA CORRUPTION and you continue at your own risk from here.

Make sure that the failed resource is stopped by manually stopping its processes, or take any other steps necessary to ensure it no longer accesses shared data.
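For a typical resource, this verification might look like the sketch below. The process name `dummy-daemon` and the mount point `/mnt/shared` are illustrative placeholders; check whatever your resource agent actually manages.

```shell
# Confirm that no processes of the failed resource are still running.
# (The bracket trick excludes the grep process itself from the output.)
ps -ef | grep '[d]ummy-daemon'      # no output means nothing is running

# If something is still running, stop it manually.
pkill -TERM dummy-daemon

# Ensure the node no longer touches shared storage, e.g. by
# unmounting a shared filesystem the resource used.
umount /mnt/shared
```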

A. To re-integrate the resource back into the cluster, you can directly follow the procedure above for operations other than 'stop' and clean up/refresh the resource.

B. Alternatively, you can recover only the monitoring of the resource first, keeping it in an unmanaged state but with monitor operations enabled, as described below.

B.1. Make the resource unmanaged with the pcs resource unmanage command. (While you will not see any difference in pcs status output, there is a functional difference that can be seen in the details of the cluster resource.)

# pcs resource unmanage xxx

B.2. Check that the resource now has the meta attribute is-managed=false.

# pcs resource show xxx
  Resource: xxx (class=ocf provider=pacemaker type=Dummy)
   Meta Attrs: is-managed=false <-------------
   Operations: start interval=0s timeout=20 (xxx-start-interval-0s)
               stop interval=0s timeout=20 (xxx-stop-interval-0s)
               monitor interval=10 timeout=20 (xxx-monitor-interval-10)

B.3. Clean up/refresh the resource.

# pcs resource cleanup xxx

or

# pcs resource refresh xxx

B.4. Check the state of the resource and, when ready, make the resource managed by the cluster again.
NOTE: At this point the cluster will monitor the resource but will not execute any other operations against it. This is described in more detail in the article Pacemaker running monitor operations on 'unmanaged' resource.

# pcs status
...
xxx     (ocf::heartbeat:Dummy): Stopped (unmanaged) [ node1 ]
...

# pcs resource manage xxx

Root Cause

Pacemaker 1.14 and earlier used the Failed (unmanaged) message for the same situation that Pacemaker 1.15 and later reports as Failed (blocked) when an operation with on-fail=block fails.

Commit 4657ad0567 seems to be the one that started to distinguish between the 'blocked' and 'unmanaged' states, as shown below.

common_print(resource_t * rsc, const char *pre_text, const char *name, node_t *node, long options, void *print_data)
{
...
-    if(is_not_set(rsc->flags, pe_rsc_managed)) {
-        offset += snprintf(buffer + offset, LINE_MAX - offset, " (unmanaged)");
+    if (is_set(rsc->flags, pe_rsc_block)) {
+        flagOffset += snprintf(flagBuffer + flagOffset, LINE_MAX - flagOffset, "%sblocked", flagOffset?", ":"");
+        rsc->cluster->blocked_resources++;
+
+    } else if (is_not_set(rsc->flags, pe_rsc_managed)) {
+        flagOffset += snprintf(flagBuffer + flagOffset, LINE_MAX - flagOffset, "%sunmanaged", flagOffset?", ":"");
...
}

The on-fail=block action can be used to debug complicated situations where another on-fail action would prevent collecting the needed data from the state in which the resource failed an operation. Using 'block' as the on-fail action, however, reduces the availability of the cluster resource: the resource effectively stops being managed by the cluster and requires manual intervention for recovery instead of, for example, restarting itself.
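For reference, on-fail is set per operation. A hypothetical example for an existing resource (the resource name `xxx` is a placeholder):

```shell
# Set on-fail=block on the monitor operation of an existing resource.
pcs resource update xxx op monitor interval=10s on-fail=block

# Verify the operation definition (on RHEL 6/7 syntax).
pcs resource show xxx
```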


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.