An 'unmanaged' pacemaker resource monitor failure can trigger recovery to occur
Environment
- Red Hat Enterprise Linux 7 with High Availability Add-Ons
- Red Hat Enterprise Linux 6.5 with High Availability Add-Ons
- pacemaker cluster
Issue
- The cluster has a SAPDatabase resource that was set as unmanaged, but pacemaker was still performing monitor operations on it, which caused a resource failure and triggered a service relocation.
- Resource is unmanaged but cluster still moved it when it failed 3 times
- After marking a resource as unmanaged, is it also required to manually disable monitoring on it?
Resolution
Red Hat Enterprise Linux 7
- The issue (bz1501505) has been resolved with errata RHBA-2018:3055 in the following package(s): pacemaker-1.1.19-8.el7 or later.
Workaround
If a resource needs maintenance and any errors on that resource should be ignored while the maintenance is in progress, you can disable its monitor action using the methods described in: [How can I leave a resource in the configuration but no longer monitor it or have the cluster manage it in a RHEL 6 or 7 High Availability cluster with pacemaker?](https://access.redhat.com/solutions/1259763)
Root Cause
- When a resource is unmanaged (by setting is-managed=false, or running pcs resource unmanage <resource>), its monitor operation is not disabled by default. So, if the resource is marked as unmanaged and the monitor operation on it fails, pacemaker will still try to perform a recovery action on it when you re-manage the resource.
If a resource is marked as unmanaged and further monitoring and recovery actions on it must also be stopped, disable monitoring on the resource using the steps in the Resolution section.
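The condition described above (a resource that is unmanaged while its monitor operation is still enabled) can be checked offline against a saved copy of the CIB. The following is a minimal sketch, not a supported tool. It assumes the CIB XML is available as a string, looks only at is-managed in the primitive's own meta_attributes, and treats a monitor op as enabled unless it carries enabled="false". The helper name unmanaged_but_monitored and the trimmed CIB_FRAGMENT are illustrative assumptions; resource defaults and cluster-wide maintenance mode are not considered.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SAP_DB primitive from this article, for illustration only.
CIB_FRAGMENT = """
<primitive class="ocf" id="SAP_DB" provider="heartbeat" type="SAPDatabase">
  <operations>
    <op id="SAP_DB-start-timeout-120s" interval="0s" name="start" timeout="120s"/>
    <op id="SAP_DB-monitor-interval-60s-timeout-50s" interval="60s" name="monitor" timeout="50s"/>
  </operations>
  <meta_attributes id="SAP_DB-meta_attributes">
    <nvpair id="SAP_DB-meta_attributes-is-managed" name="is-managed" value="false"/>
  </meta_attributes>
</primitive>
"""


def unmanaged_but_monitored(xml_text):
    """Return IDs of primitives that are unmanaged but still have an enabled monitor op."""
    root = ET.fromstring(xml_text)
    flagged = []
    # iter() also yields the root element itself when it is a <primitive>.
    for prim in root.iter("primitive"):
        unmanaged = any(
            nv.get("name") == "is-managed" and nv.get("value", "").lower() == "false"
            for nv in prim.findall("meta_attributes/nvpair")
        )
        monitored = any(
            op.get("name") == "monitor" and op.get("enabled", "true").lower() != "false"
            for op in prim.findall("operations/op")
        )
        if unmanaged and monitored:
            flagged.append(prim.get("id"))
    return flagged


print(unmanaged_but_monitored(CIB_FRAGMENT))
```

A resource reported by this check is in the state this article describes: the cluster will not start or stop it, but monitor results are still being collected for it.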
Diagnostic Steps
- The cib.xml file shows that the following SAPDatabase resource is configured on the cluster nodes. This resource has the is-managed attribute set to false, which indicates that it is unmanaged by pacemaker:

      <primitive class="ocf" id="SAP_DB" provider="heartbeat" type="SAPDatabase">
        <instance_attributes id="SAP_DB-instance_attributes">
          <nvpair id="SAP_DB-instance_attributes-DBTYPE" name="DBTYPE" value="SYB"/>
          <nvpair id="SAP_DB-instance_attributes-SID" name="SID" value="P13"/>
          <nvpair id="SAP_DB-instance_attributes-STRICT_MONITORING" name="STRICT_MONITORING" value="true"/>
          <nvpair id="SAP_DB-instance_attributes-AUTOMATIC_RECOVER" name="AUTOMATIC_RECOVER" value="true"/>
        </instance_attributes>
        <operations>
          <op id="SAP_DB-start-timeout-120s" interval="0s" name="start" timeout="120s"/>
          <op id="SAP_DB-stop-timeout-120s" interval="0s" name="stop" timeout="120s"/>
          <op id="SAP_DB-monitor-interval-60s-timeout-50s" interval="60s" name="monitor" timeout="50s"/>
        </operations>
        <meta_attributes id="SAP_DB-meta_attributes">
          <nvpair id="SAP_DB-meta_attributes-is-managed" name="is-managed" value="false"/>   <----------
        </meta_attributes>
      </primitive>

- The /var/log/messages file on the cluster node shows that pacemaker was periodically performing monitor (status check) operations on the above SAP_DB resource, and that these status checks were failing with a return code of 1:

      node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]
      node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]

- Later, a status check on the above database resource timed out, which resulted in the following error messages:

      node2 lrmd[4820]: warning: child_timeout_callback: SAP_DB_monitor_60000 process (PID 46047) timed out
      node2 lrmd[4820]: warning: operation_finished: SAP_DB_monitor_60000:46047 - timed out after 50000ms
      node2 crmd[4823]: error: process_lrm_event: Operation SAP_DB_monitor_60000: Timed Out (node=node2h, call=166, timeout=50000ms)
      [...]
      node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)
      [...]

- After finding that resource SAP_DB had failed on node node2 at least 3 times, pacemaker tried to relocate it to another node:

      node2 pengine[4822]: warning: common_apply_stickiness: Forcing SAP_DB away from node2h after 3 failures (max=3)
      [...]
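When reviewing a long /var/log/messages file, the messages shown in the steps above can be picked out with a short script. This is a rough sketch that assumes the message wording matches the excerpts in this article exactly; the regular expressions, the summarize helper, and the sample LOG_LINES are illustrative and may need adjusting for other pacemaker versions. Note that the "Processing failed op" warning may be logged repeatedly for the same failure, so the number of such lines should not be read as the fail count; the "Forcing ... away" message reports the actual count and limit.

```python
import re
from collections import Counter

# Sample lines copied from the excerpts in this article (timestamps omitted).
LOG_LINES = [
    "node2 pengine[4822]: warning: unpack_rsc_op_failure: Processing failed op monitor for SAP_DB on node2h: unknown error (1)",
    "node2 lrmd[4820]: warning: operation_finished: SAP_DB_monitor_60000:46047 - timed out after 50000ms",
    "node2 pengine[4822]: warning: common_apply_stickiness: Forcing SAP_DB away from node2h after 3 failures (max=3)",
]

# Patterns are assumptions based on the log wording shown above.
FAILED_OP = re.compile(r"Processing failed op (\S+) for (\S+) on (\S+): (.+) \((\d+)\)")
FORCED_AWAY = re.compile(r"Forcing (\S+) away from (\S+) after (\d+) failures \(max=(\d+)\)")


def summarize(lines):
    """Return (failed-op line counts, forced-away events) found in the given log lines."""
    failed = Counter()  # keyed by (resource, node, operation)
    forced = []         # tuples of (resource, node, failures, max)
    for line in lines:
        match = FAILED_OP.search(line)
        if match:
            op, rsc, node = match.group(1), match.group(2), match.group(3)
            failed[(rsc, node, op)] += 1
            continue
        match = FORCED_AWAY.search(line)
        if match:
            forced.append(
                (match.group(1), match.group(2), int(match.group(3)), int(match.group(4)))
            )
    return failed, forced


failed, forced = summarize(LOG_LINES)
for (rsc, node, op), count in failed.items():
    print(f"{rsc} on {node}: {count} 'failed op {op}' message(s)")
for rsc, node, failures, limit in forced:
    print(f"{rsc} forced away from {node} after {failures} failures (max={limit})")
```

If a "forced away" event is reported for a resource that was unmanaged at the time, the cluster is showing the behavior described in this article.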