An 'unmanaged' pacemaker resource monitor failure can trigger recovery to occur


Environment

  • Red Hat Enterprise Linux 7 with High Availability Add-Ons
  • Red Hat Enterprise Linux 6.5 with High Availability Add-Ons
  • pacemaker cluster

Issue

  • The cluster has a SAPDatabase resource that was set as unmanaged, but pacemaker still performed monitor operations on it, which caused a resource failure and triggered a service relocation.
  • Resource is unmanaged but cluster still moved it when it failed 3 times
  • After marking a resource as unmanaged, is it also required to manually disable monitoring on it?

Resolution

Red Hat Enterprise Linux 7
  • The issue (bz1501505) has been resolved with errata RHBA-2018:3055 with the following package(s): pacemaker-1.1.19-8.el7 or later.
Workaround

If you need to perform maintenance on a resource and want any errors on that resource to be ignored while the maintenance is in progress, you can disable the monitor action using the methods described in: [How can I leave a resource in the configuration but no longer monitor it or have the cluster manage it in a RHEL 6 or 7 High Availability cluster with pacemaker?](https://access.redhat.com/solutions/1259763)
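
In the CIB, a recurring operation is disabled by setting `enabled="false"` on its `<op>` element. As a sketch only (the `id`, `interval`, and `timeout` values below are taken from the SAP_DB resource shown in the Diagnostic Steps; substitute your own), the monitor operation with monitoring disabled would look like:

    ```xml
    <operations>
      <!-- enabled="false" stops pacemaker from scheduling this recurring monitor -->
      <op id="SAP_DB-monitor-interval-60s-timeout-50s" interval="60s" name="monitor" timeout="50s" enabled="false"/>
    </operations>
    ```

With `pcs` this attribute can typically be set with a command along the lines of `pcs resource update SAP_DB op monitor interval=60s timeout=50s enabled=false`; the exact syntax may vary between pcs releases, so consult the linked solution for the steps verified for your version.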

Root Cause

  • When a resource is unmanaged (by setting is-managed=false, or by running pcs resource unmanage <resource>), the monitor operation is not disabled by default. So if the resource is marked as unmanaged and a monitor operation on it fails, pacemaker will still attempt recovery actions on it once the resource is re-managed.

If a resource is marked as unmanaged and any further monitoring or recovery actions on it must be prevented, disable monitoring on the resource using the steps in the Resolution section.
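As an illustration of the combined steps above (a sketch only: SAP_DB is the resource name from this case, the interval/timeout values mirror its configured monitor op, and pcs syntax can vary between releases), a maintenance window might look like:

    ```shell
    # Stop pacemaker from managing the resource.
    # Note: the recurring monitor still runs by default.
    pcs resource unmanage SAP_DB

    # Also disable the recurring monitor so failures are not recorded
    # and cannot trigger recovery once the resource is re-managed.
    pcs resource update SAP_DB op monitor interval=60s timeout=50s enabled=false

    # ... perform the maintenance activity ...

    # Re-enable monitoring and management afterwards.
    pcs resource update SAP_DB op monitor interval=60s timeout=50s enabled=true
    pcs resource manage SAP_DB
    ```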

Diagnostic Steps

  • The cib.xml file shows the following SAPDatabase resource configured on the cluster nodes. The resource has the is-managed meta attribute set to false, which indicates it is unmanaged by pacemaker:

      <primitive class="ocf" id="SAP_DB" provider="heartbeat" type="SAPDatabase">
        <instance_attributes id="SAP_DB-instance_attributes">
          <nvpair id="SAP_DB-instance_attributes-DBTYPE" name="DBTYPE" value="SYB"/>
          <nvpair id="SAP_DB-instance_attributes-SID" name="SID" value="P13"/>
          <nvpair id="SAP_DB-instance_attributes-STRICT_MONITORING" name="STRICT_MONITORING" value="true"/>
          <nvpair id="SAP_DB-instance_attributes-AUTOMATIC_RECOVER" name="AUTOMATIC_RECOVER" value="true"/>
        </instance_attributes>
        <operations>
          <op id="SAP_DB-start-timeout-120s" interval="0s" name="start" timeout="120s"/>
          <op id="SAP_DB-stop-timeout-120s" interval="0s" name="stop" timeout="120s"/>
          <op id="SAP_DB-monitor-interval-60s-timeout-50s" interval="60s" name="monitor" timeout="50s"/>
        </operations>
        <meta_attributes id="SAP_DB-meta_attributes">
          <nvpair id="SAP_DB-meta_attributes-is-managed" name="is-managed" value="false"/>		<----------
        </meta_attributes>
      </primitive>
    
  • The /var/log/messages file on the cluster node shows that pacemaker was still periodically performing monitor (status check) operations on the SAP_DB resource above, and these checks were failing with a return code of 1:

      node2 pengine[4822]:  warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]:  warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]:  warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]:  warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
    
  • At the following timestamps, the status check on the database resource timed out, resulting in these error messages:

      node2 lrmd[4820]:  warning: child_timeout_callback: SAP_DB_monitor_60000 process (PID 46047) timed out
      node2 lrmd[4820]:  warning: operation_finished: SAP_DB_monitor_60000:46047 - timed out after 50000ms
      node2 crmd[4823]:    error: process_lrm_event: Operation SAP_DB_monitor_60000: Timed Out (node=node2h, call=166, timeout=50000ms)
      [...]
      node2 pengine[4822]:  warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
    
  • After the SAP_DB resource had failed on node node2 at least 3 times, pacemaker tried to relocate it to another node:

      node2 pengine[4822]:  warning: common_apply_stickiness: Forcing SAP_DB away from node2h after 3 failures (max=3)
      [...]
    
