How can I diagnose fence_ipmilan failures in RHEL 5, 6, 7, 8, or 9?
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9 with the High Availability Add-On
- One or more cluster nodes with hardware that supports IPMI power management via fence_ipmilan
Issue
- fence_ipmilan fencing fails when a node goes missing.
- fence_ipmilan start or monitor checks fail with the error below:
fence_ipmilan: Failed: Unable to obtain correct plug status or plug is not available
- A node is leaving the cluster and another node attempts to fence it with fence_ipmilan, but fails.
Jul 23 15:45:28 node2 fenced[12831]: fencing node node1
Jul 23 15:45:28 node2 fenced[12831]: fence node1 dev 0.0 agent fence_ipmilan result: error from agent
Jul 23 15:45:28 node2 fenced[12831]: fence node1 failed
Jul 23 15:45:28 node2 fenced[12831]: fence node1 dev 0.0 agent fence_ipmilan result: error from agent
Jul 23 15:45:28 node2 fenced[12831]: fence node1 failed
- How can I diagnose failures to fence a node with fence_ipmilan?
Resolution
fence_ipmilan may fail for a number of different reasons, and the specific resolution depends on the circumstances. See Diagnostic Steps below for steps that help identify the cause and resolve the failure.
Resolutions to common problems:
- Ensure that ipmitool is installed on all nodes and that the IPMI device can be remotely accessed with ipmitool. The fencing agent fence_ipmilan uses ipmitool to access the IPMI device. If ipmitool does not work, then fence_ipmilan will not work either.
- Ensure the host details and credentials being passed to the fence_ipmilan agent (either via the command line, /etc/cluster/cluster.conf for cman-based clusters, or the CIB for pacemaker-based clusters) are accurate.
- Ensure that all nodes have connectivity to the hardware device in question, and are not blocked by firewalls or other network routing problems.
- Ensure the user configured for use by fence_ipmilan has sufficient privileges to control the power state of the host.
- For iLO version 3 and 4 hardware, make sure the settings correctly account for delays that might take place in changing power state.
- If the failure or "error from agent" occurs only after a 20-second delay, it usually indicates the device is not responding quickly enough and the timeout (-t, default of 20 seconds) is being reached. Try a higher timeout/-t value in the fence_ipmilan parameters.
- Ensure that any configured (or default) timeout values are sufficient. Try experimenting with longer timeout values, which are described in more detail in the fence_ipmilan man page.
- Ensure IPMI is enabled on the console.
- When using onoff as the fencing method, the fenced server might stay powered off. In such cases, the power_wait parameter of the fence device might need to be increased.
Diagnostic Steps
- Try running fence_ipmilan from the command line with the appropriate options to see if a more verbose error is provided. It can be useful to run with -o status if the host in question should not be powered off at this time.
# fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P
- It is also worth measuring how long the connection takes; this indicates the approximate timeouts that need to be set for the STONITH device. This can be achieved by prefixing the diagnostic commands with time.
# time fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P
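A single timing can be misleading if the device responds slowly only some of the time. The following is a minimal sketch, not part of the fence agents, of a hypothetical helper named slowest_run that repeats any command and prints the slowest wall-clock duration in whole seconds. It assumes a POSIX shell and a date command that supports +%s.

```shell
# slowest_run is a hypothetical helper, not shipped with fence-agents.
# Usage: slowest_run <runs> <command> [args...]
# Runs the command <runs> times and prints the slowest duration in whole seconds.
slowest_run() (
    runs=$1
    shift
    max=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        # Output and exit status are ignored; only the elapsed time matters here.
        "$@" >/dev/null 2>&1 || true
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -gt "$max" ]; then
            max=$elapsed
        fi
        i=$(( i + 1 ))
    done
    echo "$max"
)
```

For example, slowest_run 5 fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o status -P would print the slowest of five status checks. A value close to the configured timeout suggests that a higher timeout is worth trying.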
- If this fails, it may be useful to run ipmitool directly with the corresponding options, to see if any additional output is given, or to determine how long any necessary timeouts should be. The following steps are advised for ipmitool troubleshooting.
- Confirm the Hardware IP:
# ipmitool lan print
- Check whether the port is open and the IPMI protocol is enabled:
# nmap -sU -p 623 <Hardware IP>
- Use an admin user for checking the connection:
# ipmitool -H <Hardware IP> -I lanplus -U admin -P <password> chassis power status
- Compare the connection using the fence user:
# ipmitool -H <Hardware IP> -I lanplus -U <fence-user> -P <password> chassis power status
Please note: Check that IPMI over LAN is enabled in the hardware settings and in the privileges of the specific user. Also check whether the fence user has the administrator role or the operator role. If it is an operator, add "-L operator" to the above command.
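The privilege check described in the note above can be sketched as a small loop that tries the same power status query at each privilege level. This is only an illustration: check_priv_levels and the IPMITOOL variable are hypothetical names introduced here so the loop can be exercised without real hardware, and the level names follow the ipmitool man page (-L accepts ADMINISTRATOR and OPERATOR, among others).

```shell
# IPMITOOL is a hypothetical override; by default the real ipmitool is used.
IPMITOOL=${IPMITOOL:-ipmitool}

# check_priv_levels is a hypothetical helper.
# Usage: check_priv_levels <hardware ip> <user> <password>
# Prints "<level>: ok" or "<level>: failed" for each privilege level tried.
check_priv_levels() (
    host=$1
    user=$2
    pass=$3
    for level in ADMINISTRATOR OPERATOR; do
        if "$IPMITOOL" -H "$host" -I lanplus -U "$user" -P "$pass" \
            -L "$level" chassis power status >/dev/null 2>&1; then
            echo "$level: ok"
        else
            echo "$level: failed"
        fi
    done
)
```

If only the OPERATOR line reports ok, the fence user is likely an operator, and the privilege level needs to be passed explicitly as described in the note.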
- Running the agent manually is useful when the issue is consistent. If the failures are seemingly random, setting up more verbose logging on the agent can prove useful. For example, the agents can be configured for verbose logging in a Red Hat OpenStack environment this way:
# pcs stonith update ipmilan-osp-ctr01-north debug=/var/log/ipmilan-osp-ctr01-north.log verbose=1
# pcs stonith update ipmilan-osp-ctr02-north debug=/var/log/ipmilan-osp-ctr02-north.log verbose=1
# pcs stonith update ipmilan-osp-ctr03-north debug=/var/log/ipmilan-osp-ctr03-north.log verbose=1
- Manually log into the IPMI device's web interface (if one is accessible) using the username/password supplied to the agent and the hostname or IP supplied, and ensure a connection can be established and the credentials are valid. If this step fails, then the correct details need to be determined and supplied to the agent.
- Check whether the privilege level of the login account supplied is sufficient to control power state.
- Try running the agent with and without lanplus (with and without -P from the command line, or with lanplus="1" and lanplus="0" in the configuration settings), and see if one works over the other.
- Try running the agent with a higher power_wait setting (-T <seconds> from the command line, or power_wait="<seconds>" from within the configuration) or power_timeout (RHEL 7 Update 1 and later; --power-timeout=<seconds> or -g <seconds> from the command line).
- If the failures are occurring with multiple verbosity options (-vvv), try running it without verbosity options.
- To test actions such as off, on and reboot you can specify those instead of status as shown below:
# fence_ipmilan -a <ip addr/hostname of ipmi device> -l <login> -p <password> -o <reboot|off|on> -P
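The lanplus comparison from the steps above can be sketched as a short helper that runs the status action once without -P and once with it, then prints each exit status. This is an illustration only: compare_lanplus and the FENCE_AGENT variable are hypothetical names introduced here, and the meaning of nonzero exit statuses should be checked against the fence_ipmilan man page for the installed version.

```shell
# FENCE_AGENT is a hypothetical override; by default the real agent is used.
FENCE_AGENT=${FENCE_AGENT:-fence_ipmilan}

# compare_lanplus is a hypothetical helper.
# Usage: compare_lanplus <ip addr/hostname of ipmi device> <login> <password>
# Runs "-o status" without and with lanplus and prints both exit statuses.
compare_lanplus() (
    addr=$1
    login=$2
    pass=$3

    rc=0
    "$FENCE_AGENT" -a "$addr" -l "$login" -p "$pass" -o status \
        >/dev/null 2>&1 || rc=$?
    echo "without lanplus: exit status $rc"

    rc=0
    "$FENCE_AGENT" -a "$addr" -l "$login" -p "$pass" -o status -P \
        >/dev/null 2>&1 || rc=$?
    echo "with lanplus (-P): exit status $rc"
)
```

Only the status action is used, so the host is not powered off by this comparison. If one variant consistently succeeds and the other does not, set the lanplus parameter in the cluster configuration to match.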