pcs shows several "Failed actions" for my fence_vmware_soap devices with "'unknown error' (1)" and "status=Timed Out" in RHEL High Availability or Resilient Storage Cluster with Pacemaker

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability Add On
  • Red Hat Enterprise Linux (RHEL) 7 with the High Availability Add On
  • pacemaker
  • One or more stonith devices using fence_vmware_soap as the agent

Issue

  • What do the "Failed actions" in the output of pcs status mean for my stonith devices?
  • How do I clean up the "Failed actions:" shown in pcs status output for fence_vmware_soap devices?
  • With fence_vmware_soap, I see multiple failed start actions in pcs output:
# pcs status

[..... ]

Full list of resources:

 vmfence1	(stonith:fence_vmware_soap):	Started node1.example.com 
 vmfence2	(stonith:fence_vmware_soap):	Started node2.example.com

Failed actions:
    vmfence1_start_0 on node1.example.com 'unknown error' (1): call=13, status=Timed Out, last-rc-change='Fri Jan 10 19:36:00 2014', queued=20189ms, exec=0ms
    vmfence2_start_0 on node2.example.com 'unknown error' (1): call=15, status=Timed Out, last-rc-change='Fri Jan 10 19:36:00 2014', queued=20103ms, exec=0ms
    vmfence1_start_0 on node1.example.com 'unknown error' (1): call=13, status=Timed Out, last-rc-change='Fri Jan 10 19:36:35 2014', queued=22046ms, exec=0ms
    vmfence2_start_0 on node2.example.com 'unknown error' (1): call=15, status=Timed Out, last-rc-change='Fri Jan 10 19:36:36 2014', queued=21042ms, exec=0ms

Resolution

  • To clean up the "Failed actions" in the pcs output, use pcs resource cleanup:

    # pcs resource cleanup vmfence1
    

    NOTE: This only cleans up previously encountered errors. If pcs status continues to show new failures, the failures are still occurring, and the steps below may be necessary.
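To check whether failures are still accumulating after a cleanup, the failure count for the device can be inspected. A sketch, using the vmfence1 device name from the example above:

```shell
# Clear previously recorded failures for the device
pcs resource cleanup vmfence1

# Show the per-node failure count for the device; a count that grows again
# after the cleanup indicates the start/monitor failures are still occurring
pcs resource failcount show vmfence1
```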

  • If there is only a single vCenter host for all nodes, and the configuration was set up with a separate stonith device for each node, then consider replacing the multiple devices with just a single device. For example:

    # pcs stonith delete vmfence1
    # pcs stonith delete vmfence2
    # pcs stonith create vmfence fence_vmware_soap pcmk_host_map="node1.example.com:vm-node1,node2.example.com:vm-node2"  ipaddr=10.10.10.10 ssl=1 login=root passwd=redhat
    

    NOTE: The previous example utilizes a pcmk_host_map to map node names to VM names, as is often required with virtualized nodes. This may be omitted if the node name and the VM name match.
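Before creating the combined device, the VM names needed on the right-hand side of pcmk_host_map can be confirmed by querying vCenter directly with the fence agent. A sketch, assuming the same address and credentials as the example above:

```shell
# List the VMs that the vCenter at 10.10.10.10 can manage over SSL;
# the names printed here are the values to use in pcmk_host_map
fence_vmware_soap -a 10.10.10.10 -l root -p redhat -z -o list
```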

  • In addition, or as an alternative, it may be useful to adjust the timeouts for the device(s) in question so that operations do not time out. For example, when creating the device:

    # pcs stonith create vmfence fence_vmware_soap pcmk_monitor_timeout=120s pcmk_host_map="node1.example.com:vm-node1,node2.example.com:vm-node2"  ipaddr=10.10.10.10 ssl=1 login=root passwd=redhat
    

    There are various timeouts (pcmk_monitor_timeout, pcmk_reboot_timeout, and so on; see the Root Cause section) that can be used to control the behavior of stonith devices.
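The timeouts can also be adjusted on an already-configured device rather than deleting and recreating it. A sketch, using the vmfence device name from above:

```shell
# Raise the monitor and list timeouts on the existing device so that slow
# vCenter sessions do not exceed the 20s default action timeout
pcs stonith update vmfence pcmk_monitor_timeout=120s pcmk_list_timeout=120s
```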


If a shared stonith device is used for multiple cluster nodes, make sure a mechanism exists to prevent multiple cluster nodes from fencing each other simultaneously, which could leave both nodes down. On RHEL 7, an attribute called auto_tie_breaker can be defined in /etc/corosync/corosync.conf.
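auto_tie_breaker is enabled in the quorum section of /etc/corosync/corosync.conf on every node; a minimal sketch of that section (in a two-node cluster, note that auto_tie_breaker and the two_node option are mutually exclusive):

```
quorum {
    provider: corosync_votequorum
    # On an even split, the partition containing the node with the lowest
    # nodeid retains quorum; the other partition loses quorum and can be
    # fenced, preventing the nodes from fencing each other simultaneously
    auto_tie_breaker: 1
}
```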


There is a Bugzilla open to investigate whether the time it takes fence_vmware_soap to perform a monitor operation can be reduced: Bug 1436429 – fence_vmware_soap: Make monitor less prone to failure/timeout. The fence_vmware_soap agent is known to be slow because of the amount of data that is downloaded when logging into the fence device.

There is another fencing agent for VMware, fence_vmware_rest, that is much faster and more responsive. For more information, see the following article: How do I configure a stonith device using agent fence_vmware_rest in a RHEL 7 or 8 High Availability cluster with pacemaker?

Root Cause

When stonith-ng starts up, it attempts a start operation on each configured stonith device, performing list and monitor operations to determine which nodes can control which other nodes through which devices. With a configuration that uses one device per node but where all devices connect to the same underlying host, performing all of these operations simultaneously may slow each of them down.

With clusters comprised of VMware VMs as the nodes, there is typically a single vCenter host that manages all of the nodes. If each node has its own separate stonith device but all of them specify that same vCenter ipaddr or hostname, all of these devices may connect to the vCenter at once when the cluster starts. Each session may be slower when there are multiple concurrent connections, and depending on the configuration a timeout may be reached before the operation completes.

stonith-ng allows one to specify a timeout value for each of the available operations: pcmk_reboot_timeout, pcmk_monitor_timeout, pcmk_status_timeout, pcmk_list_timeout, pcmk_off_timeout, and pcmk_on_timeout. These default to the cluster property default-action-timeout, which defaults to 20s. So if all devices connect to the vCenter at once (possibly multiple times each), the sessions may last longer than 20 seconds, leading to the timeout errors seen in pcs.

Switching to a single device that manages all nodes (so that fewer sessions are started simultaneously), and/or adjusting the timeouts, can help avoid this.

stonith-ng will retry the start a number of times before giving up, so the device may still eventually reach a Started state. However, the above recommendations can still be useful to avoid unnecessary delays.

Diagnostic Steps

  • Manually call fence_vmware_soap -o status -n <vmname> [...] for one node and time how long it takes. Now perform the same operation (on one or multiple VMs) multiple times from different nodes simultaneously. Observe whether the time is longer for each than in the single-session case (it often is). This demonstrates the value of having a single device managing all nodes.
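The timing comparison described above can be run as follows. A sketch, assuming the same address, credentials, and VM names as the earlier examples:

```shell
# Time a single status query against vCenter from one node
time fence_vmware_soap -a 10.10.10.10 -l root -p redhat -z -o status -n vm-node1

# Then run the same query from another node at roughly the same moment
# (here via ssh, in the background) and compare the elapsed times with
# the single-session case
ssh node2.example.com \
    'time fence_vmware_soap -a 10.10.10.10 -l root -p redhat -z -o status -n vm-node2' &
time fence_vmware_soap -a 10.10.10.10 -l root -p redhat -z -o status -n vm-node1
wait
```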
  • Does the failed monitor fail at the same time there is increased load on the host running VMware vCenter? If there are periods when the response time from the vCenter SOAP API for the list action is slow, this could be due to network issues, load on the vCenter server, or various other factors outside the cluster nodes' control.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.