Pacemaker resource becomes FAILED (blocked)

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 6
  • Red Hat Enterprise Linux (RHEL) 7
  • Red Hat Enterprise Linux (RHEL) 8
  • Red Hat Enterprise Linux (RHEL) 9
  • High-Availability or Resilient Storage Add-on
  • Pacemaker Cluster

Issue

  • Running pcs status reveals a FAILED (blocked) resource:

     Full list of resources:
    
     VIP     (ocf::heartbeat:IPaddr):        Started node1
     DB      (lsb::startdb):        FAILED node1 (blocked)
    

Resolution

Resolving the issue that led to the FAILED state requires further troubleshooting and diagnostics to determine the source of the error within the resource agent. See the "Diagnostic Steps" section for a possible troubleshooting process and steps to identify the source of the issue.

The FAILED (blocked) state may also indicate errors in the cluster's stonith or on-fail configuration. See the "Root Cause" section for further information on possible contributors to this issue.

After resolving any resource operation errors and/or configuration issues, the cleanup command can be used to clear the recorded failures and let the cluster attempt to start the resource again:

$ pcs resource cleanup
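
To clear the failure history of a single resource rather than the whole cluster, the resource ID can be passed to the cleanup command. The following is a minimal sketch and not part of the original solution: the function name `cleanup_and_check`, the `SETTLE_SECONDS` delay, and the parsing of the `pcs status` output are assumptions for illustration, and the layout of the status lines differs between pcs versions.

```shell
#!/bin/sh
# Hypothetical helper: clean up one resource, then check whether
# "pcs status" still reports it as FAILED.
# Usage: cleanup_and_check <resource>
cleanup_and_check() {
    res="$1"
    [ -n "$res" ] || { echo "usage: cleanup_and_check <resource>" >&2; return 2; }

    # Clear the failure history for this resource only.
    pcs resource cleanup "$res" || return 2

    # Arbitrary pause (assumption) so the cluster can re-probe the resource.
    sleep "${SETTLE_SECONDS:-5}"

    # Find the resource's status line; newer pcs versions prefix it with "*".
    if pcs status | awk -v r="$res" '
            { sub(/^[ \t]*\*?[ \t]*/, "") }
            $1 == r && /FAILED/ { found = 1 }
            END { exit(found ? 0 : 1) }'; then
        echo "$res is still FAILED; see the Diagnostic Steps section"
        return 1
    fi
    echo "$res is no longer reported as FAILED"
}
```

For example, `cleanup_and_check DB` would clean up the DB resource from the Issue section and return non-zero if it is still shown as FAILED afterwards.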

A support case can be opened with Red Hat if further assistance is needed.

Root Cause

Pacemaker will track the status of every resource operation against the cluster node it is running on. To confirm that a resource is successfully running on any node, Pacemaker will capture the return code when starting, stopping or performing other operations through its OCF or LSB script.

Pacemaker will mark the resource as FAILED (blocked) under any of the following conditions:

  • The resource script returns a non-zero value during any operation that has an on-fail value of fence, and fencing is not working or is disabled.
    • The default on-fail action for the stop operation is fence, so a failure to stop without working fencing will result in this status.
    • Note: A working stonith device with cluster fencing enabled is a requirement for Pacemaker support, and issues affecting stonith will need to be corrected for full support of Pacemaker clusters.
  • The on-fail action is set to block for a resource operation and a failure is observed.
  • In some cases, if the resource is unable to run on any node, the FAILED status will be reported.
  • Other fail conditions may apply.
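
The conditions above depend on two pieces of configuration: whether fencing is enabled, and which on-fail value each operation carries. As a minimal sketch (the function names are hypothetical, and it assumes the Pacemaker command-line tools `crm_attribute` and `crm_resource` are available on the node), both can be inspected as follows:

```shell
#!/bin/sh
# Hypothetical helpers for inspecting fencing and on-fail settings.

# Succeeds if the stonith-enabled cluster property is true.
# An unset property is treated as true, which is Pacemaker's default.
fencing_enabled() {
    value=$(crm_attribute --type crm_config --name stonith-enabled \
                          --query --quiet 2>/dev/null |
            tr '[:upper:]' '[:lower:]')
    case "${value:-true}" in
        true|yes|on|1|y) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the <op> definitions of a resource that set on-fail explicitly.
# No output means every operation uses its default on-fail behaviour.
explicit_on_fail() {
    crm_resource --resource "$1" --query-xml 2>/dev/null |
        grep '<op .*on-fail='
}
```

Note that `fencing_enabled` only reports the cluster property. It does not prove that a stonith device is configured or that a fence action would succeed.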

Diagnostic Steps

To further troubleshoot issues affecting resource start and/or stop operations, the following process can be used:

  1. Disable the resource and reboot the node prior to taking any further troubleshooting steps:
  • Attempting to start a resource that is in this blocked state following a fenceable issue can lead to unexpected behavior or even data corruption, so it is important to reboot the node before running the debug-start or debug-stop steps.
# Stop the resource from automatically starting in the cluster on boot:
$ pcs resource disable <resource>

# (If there are pending fence actions) Reboot the node to recover from a possible unclean state:
$ reboot now
  2. To start and/or stop the resource with additional debug output, the commands below can be run:
  • Note: These commands will not start the resource under Pacemaker's management. It is recommended to perform these actions only while the resource is in a "disabled" state to avoid conflicts:
# Run from the node you wish to start the resource on to collect more info on start operations:
# Optionally include the `--full` option for full traces.
$ pcs resource debug-start <resource> --full

# Run from the node you wish to stop the resource on to collect more info on stop operations:
# Optionally include the `--full` option for full traces.
$ pcs resource debug-stop <resource> --full
  3. A return code of 0 means the "start" or "stop" operations were successful. Non-zero return codes will require further troubleshooting specific to the cause of the error.
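
The return code reported by the debug commands follows the exit-code conventions that OCF resource agents and LSB init scripts share for codes 0 through 7. The sketch below maps those codes to their conventional meaning and checks that a resource is disabled before it is debugged. The function names `rc_meaning` and `resource_disabled` are hypothetical, and the second function assumes the Pacemaker `crm_resource` tool is available. Whether the exit status of the pcs debug command itself mirrors the agent's return code may vary by version, so the code is best read from the command output.

```shell
#!/bin/sh
# Hypothetical helper: translate a resource agent return code into the
# conventional meaning used by OCF agents and LSB init scripts.
rc_meaning() {
    case "$1" in
        0) echo "success" ;;
        1) echo "generic error" ;;
        2) echo "invalid arguments" ;;
        3) echo "action not implemented" ;;
        4) echo "insufficient privileges" ;;
        5) echo "program not installed" ;;
        6) echo "not configured" ;;
        7) echo "not running" ;;
        *) echo "agent-specific or unknown code" ;;
    esac
}

# Succeeds if the resource's target-role meta attribute is "Stopped",
# which is the value that "pcs resource disable" sets.
resource_disabled() {
    role=$(crm_resource --resource "$1" --meta \
                        --get-parameter target-role 2>/dev/null)
    [ "$role" = "Stopped" ]
}
```

A possible use is `resource_disabled DB && pcs resource debug-start DB --full`, followed by `rc_meaning <code>` with the return code shown in the debug output.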

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.