A stonith device is failing to start and/or reporting "Timed Out" errors in a RHEL High Availability cluster with pacemaker
Environment
- Red Hat Enterprise Linux 6, 7, or 8 (with the High Availability Add-on)
- Pacemaker
Issue
- The pcs status command shows "Timed Out" errors for one or more stonith devices.
fence_node1_start_0 on node2.example.com 'unknown error' (1): call=48, status=Timed Out, last-rc-change='Fri Sep 5 15:50:46 2014', queued=21022ms, exec=0ms
- Stonith device monitor or start operations are timing out and reporting errors similar to those shown below.
Jun 01 11:36:07 node1.example.com crmd[2807]: notice: process_lrm_event: Operation fence_node_5356_monitor_0: not running (node=node1.example.com, call=311, rc=7, cib-upda...nfirmed=true)
Jun 01 11:36:27 node1.example.com stonith-ng[2803]: notice: stonith_action_async_done: Child process 3114 performing action 'monitor' timed out with signal 15
Jun 01 11:36:27 node1.example.com stonith-ng[2803]: notice: log_operation: Operation 'monitor' [3114] for device 'fence_node2' returned: -62 (Timer expired)
Jun 01 11:36:28 node1.example.com crmd[2807]: error: process_lrm_event: Operation fence_node_node2_start_0: Timed Out (node=node1.example.com, call=312, timeout=20000ms)
- Stonith devices are stuck in a Stopped state with "Timed Out" errors.
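To see which operations are timing out and which timeout was in effect, the crmd "Timed Out" messages shown above can be filtered out of the logs. The following is a minimal sketch, not an official tool: the helper name list_timed_out_ops is hypothetical, and it assumes the message format shown above, which may differ between Pacemaker versions.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on standard input and print
# "<operation> <timeout>" for each crmd "Timed Out" message.
# Assumes the message format shown in the Issue section above.
list_timed_out_ops() {
    sed -n 's/.*Operation \([^:]*\): Timed Out (.*timeout=\([0-9]*ms\)).*/\1 \2/p'
}

# Example usage (the log location is an assumption and depends on the
# system's logging configuration):
#   list_timed_out_ops < /var/log/messages
```

For the sample crmd error message above, this prints fence_node_node2_start_0 20000ms, which shows that the start operation ran into the 20-second timeout.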
Resolution
Several options and parameters control how long a stonith device has to complete its operations; use the one that matches the operation that is timing out. The examples below show how to configure these options. Where two examples are given, the first creates a new stonith device and the second updates an existing one.
- If a specific operation is timing out for one device or just a few devices, configure pcmk_<operation>_timeout in each device's attributes (e.g., pcmk_monitor_timeout, pcmk_list_timeout, pcmk_status_timeout). See also: Table 5.2. Advanced Properties of Fencing Devices.
# # Creation command format:
# pcs stonith create <device> <agent> <attributes> pcmk_<operation>_timeout=<time>
# # Example:
# pcs stonith create node1_ipmi fence_ipmilan ipaddr=node1-ipmi.example.com lanplus=1 login=admin passwd='a2@7czD44#pQrs7UX.' pcmk_monitor_timeout=120s
# # Update command format:
# pcs stonith update <device> pcmk_monitor_timeout=<time>
# # Example:
# pcs stonith update node1_ipmi pcmk_monitor_timeout=120s
- To change the timeout for the start and/or stop operations for a specific device, update the stonith device as follows:
# # Command format:
# pcs stonith update <device> op <operation> timeout=<time>
# # Example:
# pcs stonith update node1_ipmi op start timeout=120s
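- To change the default timeout that all stonith devices will use for all operations, adjust the resource operation default value for timeout. Note that this default applies to the operations of all cluster resources, not only stonith devices.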
# # Command format:
# pcs resource op defaults timeout=<time>
# # Example:
# pcs resource op defaults timeout=120s
- To change how long a fencing operation (reboot or off when a node must be removed from the cluster) has to complete before timing out, change the cluster property stonith-timeout.
# # Command format:
# pcs property set stonith-timeout=<time>
# # Example:
# pcs property set stonith-timeout=120s
- If the agent is reporting an error such as "Connection timed out" rather than stonithd or pacemaker-fenced reporting that an operation timed out, then attributes that control the internal behavior of the agent may need to be adjusted. power_timeout would be the most relevant for such a "Connection timed out" error, but there are other attributes as well, such as login_timeout and shell_timeout. Check the man page for the specific agent for more details.
# # Creation command format:
# pcs stonith create <device> <agent> <attributes> power_timeout=<seconds>
# # Example:
# pcs stonith create node1_ipmi fence_ipmilan ipaddr=node1-ipmi.example.com lanplus=1 login=admin passwd='a2@7czD44#pQrs7UX.' power_timeout=120
# # Update command format:
# pcs stonith update <device> power_timeout=<seconds>
# # Example:
# pcs stonith update node1_ipmi power_timeout=120
- Alternatively, identify reasons why the operation may not be completing as quickly as desired and address them.
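When choosing a value for one of the timeouts above, or when checking whether a device is responding slowly, it can help to measure how long the agent actually takes when run by hand. The following is a minimal sketch under stated assumptions: the helper names elapsed_seconds and suggest_timeout are hypothetical, the rule for adding headroom is an arbitrary illustration, and the fence_ipmilan options in the usage comment should be checked against the man page for the installed version.

```shell
#!/bin/sh
# Hypothetical helper: run a command, discard its output, and print how many
# whole seconds it took.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: turn a measured duration into a timeout value.
# Doubling the measurement and never going below 20 seconds is an arbitrary
# illustration, not a Red Hat recommendation.
suggest_timeout() {
    t=$(( $1 * 2 ))
    if [ "$t" -lt 20 ]; then
        t=20
    fi
    echo "${t}s"
}

# Example usage (assumes fence_ipmilan accepts these short options on the
# installed version; check the agent's man page):
#   secs=$(elapsed_seconds fence_ipmilan -a node1-ipmi.example.com -P -l admin -p '<password>' -o monitor)
#   pcs stonith update node1_ipmi pcmk_monitor_timeout="$(suggest_timeout "$secs")"
```

A single measurement can be misleading if the device is only slow intermittently, so repeating the measurement a few times before settling on a value is worthwhile.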
Root Cause
Several settings control how long stonithd (or pacemaker-fenced) will give a stonith device operation before considering it timed out, and the agents themselves also support several timeout options that control their internal behavior. When operations time out, either tune these settings or adjust the device itself to address its slow responses.