How can I diagnose fence_ipmilan failures in RHEL 5, 6, 7, 8, or 9?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9 with the High Availability Add-On
  • One or more cluster nodes with hardware that supports IPMI power management via fence_ipmilan

Issue

  • fence_ipmilan fencing fails when a node goes missing
  • fence_ipmilan start or monitor checks fail with the following error:
fence_ipmilan: Failed: Unable to obtain correct plug status or plug is not available
  • A node leaves the cluster and another node attempts to fence it with fence_ipmilan, but the attempt fails:
Jul 23 15:45:28 node2 fenced[12831]: fencing node node1
Jul 23 15:45:28 node2 fenced[12831]: fence node1 dev 0.0 agent fence_ipmilan result: error from agent
Jul 23 15:45:28 node2 fenced[12831]: fence node1 failed
Jul 23 15:45:28 node2 fenced[12831]: fence node1 dev 0.0 agent fence_ipmilan result: error from agent
Jul 23 15:45:28 node2 fenced[12831]: fence node1 failed
  • How can I diagnose failures to fence a node with fence_ipmilan?
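
To see how often and when these failures occur, the system log can be filtered for the agent's error lines. The following is a minimal sketch, assuming the fenced message format shown in the log excerpt above. The function name fence_errors and the log path are illustrative, and pacemaker-based clusters log different messages.

```shell
# fence_errors: read syslog lines on stdin and print only the fenced
# "error from agent" results for fence_ipmilan.
# Assumption: the message format matches the fenced lines shown above.
fence_errors() {
    grep -E 'fenced\[[0-9]+\]: fence .* agent fence_ipmilan result: error from agent'
}

# Example (the log path is an assumption; adjust it for your system):
#   fence_errors < /var/log/messages
```

The timestamps of the matching lines can then be compared with network or hardware events around the same time.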

Resolution

fence_ipmilan may fail for a number of different reasons, and the specific resolution depends on the circumstances. See Diagnostic Steps below for steps to help identify the cause and resolve the failure.

Resolutions to common problems:

  • Ensure that ipmitool is installed on all nodes and that the IPMI device can be remotely accessed with ipmitool. The fencing agent fence_ipmilan uses ipmitool to access the IPMI device. If ipmitool does not work, then fence_ipmilan will not work either.
  • Ensure the host details and credentials being passed to the fence_ipmilan agent (either via the command line, /etc/cluster/cluster.conf for cman-based clusters, or the CIB for pacemaker-based clusters) are accurate.
  • Ensure that all nodes have connectivity to the hardware device in question, and are not blocked by firewalls or other network routing problems.
  • Ensure the user configured for use by fence_ipmilan has a privilege level sufficient to control the power state of the host.
  • For iLO version 3 and 4 hardware, make sure the settings correctly account for delays that might take place in changing power state.
  • If the failure or "error from agent" occurs only after a 20-second delay, the device is usually not responding quickly enough and the timeout (-t, default of 20 seconds) is being reached. Try a higher timeout (-t) value in the fence_ipmilan parameters.
  • Ensure that any configured (or default) timeout values are sufficient. Try experimenting with longer timeout values, which are described in more detail in the fence_ipmilan man page.
  • Ensure IPMI is enabled on the console.
  • When using onoff as the fencing method, the fenced server might stay powered off. In such cases, the power_wait parameter of the fence device might need to be increased.
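
Several of the items above come down to whether the device answers within the configured timeout, so it can help to measure a status call and compare the result with that limit. The following is a minimal sketch and not part of the agent. The helper names elapsed_seconds and reaches_timeout are hypothetical, and the value 20 is the default timeout mentioned above.

```shell
# elapsed_seconds: run the given command, discard its output, and print
# how many whole seconds it took. The subshell body keeps the variables local.
elapsed_seconds() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
)

# reaches_timeout: succeed if the elapsed time ($1) reaches the limit ($2).
reaches_timeout() {
    [ "$1" -ge "$2" ]
}

# Example (placeholders as used elsewhere in this article):
#   t=$(elapsed_seconds fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P)
#   reaches_timeout "$t" 20 && echo "status took ${t}s; consider a longer timeout"
```

A measurement close to the limit suggests the timeout is the likely cause, while a fast failure points toward credentials, privileges, or connectivity instead.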

Diagnostic Steps

  • Try running fence_ipmilan from the command line with the appropriate options to see if a more verbose error is provided. It can be useful to run with -o status if the host in question should not be powered off at this time.
# fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P
  • It is also worth measuring how long the connection takes, since this indicates the approximate timeouts to set for the STONITH device. To do so, prefix the diagnostic command with time.
# time fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P
  • If this fails, it may be useful to run ipmitool directly with the corresponding options, to see if any additional output is given, or to determine how long any necessary timeouts should be. The following steps are advice for troubleshooting with ipmitool.
  1. Confirm the hardware IP address:
# ipmitool lan print
  2. Check whether the IPMI port (UDP 623) is open on the device:
# nmap -sU -p 623 <Hardware IP>
  3. Check the connection using an admin user:
# ipmitool -H <Hardware IP> -I lanplus -U admin -P <password> chassis power status
  4. Compare the result with the connection using the fence user:
# ipmitool -H <Hardware IP> -I lanplus -U <fence-user> -P <password> chassis power status

Please note: Check that IPMI over LAN is enabled in the hardware settings and in the privileges of the specific user. Also check whether the fence user has the administrator role or the operator role. If it has the operator role, add "-L operator" to the above command.
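
When comparing the admin user with the fence user, it can be convenient to reduce the ipmitool output to a single word. The following is a minimal sketch, assuming ipmitool prints "Chassis Power is on" or "Chassis Power is off" on success. The function name power_state is hypothetical.

```shell
# power_state: read "ipmitool ... chassis power status" output on stdin
# and print on, off, or unknown.
power_state() {
    case "$(cat)" in
        *"Chassis Power is on"*)  echo on ;;
        *"Chassis Power is off"*) echo off ;;
        *)                        echo unknown ;;
    esac
}

# Example (placeholders as above; -L operator applies only if the fence
# user has the operator role):
#   ipmitool -H <Hardware IP> -I lanplus -U admin -P <password> chassis power status | power_state
#   ipmitool -H <Hardware IP> -I lanplus -U <fence-user> -P <password> -L operator chassis power status | power_state
```

If the admin user reports on or off while the fence user reports unknown, the fence user's credentials or privilege level are the likely cause.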

  • Running the agent manually is useful when the issue is consistent. If the failures are seemingly random, setting up more verbose logging on the agent can prove useful. For example, verbose logging can be configured for the agents in a Red Hat OpenStack environment this way:
# pcs stonith update ipmilan-osp-ctr01-north debug=/var/log/ipmilan-osp-ctr01-north.log verbose=1
# pcs stonith update ipmilan-osp-ctr02-north debug=/var/log/ipmilan-osp-ctr02-north.log verbose=1
# pcs stonith update ipmilan-osp-ctr03-north debug=/var/log/ipmilan-osp-ctr03-north.log verbose=1
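
With more than a few stonith devices, the same update can be generated in a loop. The following is a minimal sketch that only prints the commands for review, using the same debug and verbose parameters as the commands above. The function name debug_update_cmd is hypothetical, and the device names are the ones from this example environment.

```shell
# debug_update_cmd: print the pcs command that enables verbose logging
# for one stonith device, with one log file per device.
debug_update_cmd() {
    echo "pcs stonith update $1 debug=/var/log/$1.log verbose=1"
}

# Print the commands for review. Nothing is changed on the cluster here.
for dev in ipmilan-osp-ctr01-north ipmilan-osp-ctr02-north ipmilan-osp-ctr03-north; do
    debug_update_cmd "$dev"
done
```

Once the printed commands look correct, they can be run as root on a cluster node.
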
  • Manually log into the IPMI device's web interface (if one is accessible) using the username/password supplied to the agent and the hostname or IP supplied, and ensure a connection can be established and the credentials are valid. If this step fails, then the correct details need to be determined and supplied to the agent.

  • Check whether the privilege level of the login account supplied is sufficient to control power state.

  • Try running the agent with and without lanplus (with and without -P from the command line, or with lanplus="1" and lanplus="0" in the configuration settings), and see whether one works and the other does not.
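
The comparison with and without lanplus can be scripted so that both variants run against the same device back to back. The following is a minimal sketch. The function name compare_lanplus is hypothetical, and the script only reports the agent's exit code, where 0 means the agent reported success. Other values should be interpreted with the fence_ipmilan man page for the installed version.

```shell
# compare_lanplus: run a status check without and with -P and print the
# agent's exit code for each variant.
# Arguments: <ip addr/hostname of ipmi device> <login> <password>
compare_lanplus() (
    for lanplus in 0 1; do
        flag=""
        if [ "$lanplus" = "1" ]; then
            flag="-P"
        fi
        rc=0
        # $flag is unquoted on purpose so an empty value adds no argument.
        fence_ipmilan -a "$1" -l "$2" -p "$3" -o status $flag >/dev/null 2>&1 || rc=$?
        echo "lanplus=$lanplus: exit code $rc"
    done
)

# Example:
#   compare_lanplus <ip addr/hostname of ipmi device> <login> <password>
```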

  • Try running the agent with a higher power_wait setting (-T <seconds> from the command line, or power_wait="<seconds>" from within the configuration) or power_timeout (RHEL 7 Update 1 and later; --power-timeout=<seconds> or -g <seconds> from the command line).

  • If the failures occur when multiple verbosity options (-vvv) are used, try running the agent without any verbosity options.

  • To test actions such as off, on and reboot you can specify those instead of status as shown below:

# fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o <reboot|off|on> -P
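
Because off, on and reboot change the power state of a live host, a small guard can reduce the risk of running them by accident during testing. The following is a minimal sketch and not part of the agent. The function names is_disruptive and run_action and the CONFIRM environment variable are hypothetical.

```shell
# is_disruptive: succeed if the action changes the node's power state.
is_disruptive() {
    case "$1" in
        off|on|reboot) return 0 ;;
        *)             return 1 ;;
    esac
}

# run_action: run fence_ipmilan with the given action, but refuse
# disruptive actions unless CONFIRM=yes is set in the environment.
# Arguments: <ip addr/hostname of ipmi device> <login> <password> <action>
run_action() {
    if is_disruptive "$4" && [ "${CONFIRM:-no}" != "yes" ]; then
        echo "refusing '$4' without CONFIRM=yes" >&2
        return 1
    fi
    fence_ipmilan -a "$1" -l "$2" -p "$3" -o "$4" -P
}

# Example:
#   run_action <ip addr/hostname of ipmi device> <login> <password> status
#   CONFIRM=yes run_action <ip addr/hostname of ipmi device> <login> <password> reboot
```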

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.