Collecting supplemental system utilization statistics for fence events or performance problems in RHEL High Availability or Resilient Storage clusters
Environment
- Red Hat Enterprise Linux 5, 6, 7, 8 or 9
- High Availability or Resilient Storage Add On
Issue
- My cluster is experiencing repeated fence events, and I need a way to collect system utilization data during the event, similar to what is described in: "What can I do to determine what caused the token to be lost in my RHEL 5 cluster?" or "What can I do to determine what caused "A processor failed, forming new configuration" and/or a node to be fenced in a High Availability cluster with Pacemaker?".
- My Technical Support Engineer has requested that I set up the watcher-cron.bsh script on my cluster.
- How to set up the ha-resourcemon.sh (formerly called fencemon-cron.bsh) script for capturing utilization data on cluster nodes?
- What information is required for troubleshooting performance issues on cluster nodes or with a clustered filesystem?
- How can I collect system utilization statistics on the nodes during fence events or slow performance in a RHEL cluster?
- My GFS2 filesystem is experiencing slowness or hangs.
Resolution
Note: The script attached to this article for download was formerly named fencemon-cron.bsh, but has since been renamed to ha-resourcemon.sh.
Caution: The filename upon download is ha-resourcemon_1.sh. The file must be given executable permissions:
chmod +x ha-resourcemon_1.sh
Note: This script will produce large amounts of data in the directory specified, which may require special attention to prevent it from filling up the file system. Ensure there is adequate free space available in the specified directory. With the timing variables listed below, this script often produces 30-50 MB per hour of collection on typical systems, and may produce much larger data sets on systems that run many processes, handle large workloads, have many network connections, or otherwise produce lengthier output for any of the monitored commands. Please contact Red Hat Global Support Services for assistance with accounting for the space requirements if this is a concern.
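Because the output can fill a filesystem over time, a small housekeeping job can cap retention. The sketch below is a hypothetical example, not part of ha-resourcemon.sh: the log directory, 7-day retention window, and 80% threshold are all assumptions to adjust for your environment.

```shell
#!/bin/bash
# Hypothetical housekeeping sketch: prune old ha-resourcemon output and warn
# when the log filesystem fills up. LOGDIR, the retention window, and the
# threshold are assumptions -- match them to the directory you pass to
# ha-resourcemon.sh.
LOGDIR=${LOGDIR:-/var/log/ha-resourcemon}

# Delete collection files older than 7 days (no-op if the dir is missing).
[ -d "$LOGDIR" ] && find "$LOGDIR" -type f -mtime +7 -delete

# Warn if the filesystem holding LOGDIR is more than 80% full.
if [ -d "$LOGDIR" ]; then
    usage=$(df --output=pcent "$LOGDIR" | tail -1 | tr -dc '0-9')
    [ "$usage" -gt 80 ] && echo "WARNING: $LOGDIR filesystem is ${usage}% full"
fi
```

An entry like this could run from the same root crontab as the collector itself.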
- Install sysstat, ethtool, net-tools, and procps on the host that will have ha-resourcemon.sh installed on it.
yum install -y sysstat ethtool net-tools procps
Note: On Red Hat Enterprise Linux (RHEL) 7 and 8 the procps-ng package will be installed.
- Find out which interface the cluster is using for cluster communication, because the script is edited below to save data for that network interface. This is the network over which each node resolves the other nodes' node names (not hostnames). Use one of the following commands to output the IP address used for cluster communication, then use that IP address to find the corresponding interface in the output of ip addr (execute the command applicable to your cluster environment). The commands below can show either the IP addresses of the host running the command or an IP address for each cluster node.
For a RHEL 6/7/8/9 pacemaker cluster utilizing corosync, use the command below:
# corosync-quorumtool -il
For RHEL 6 cluster utilizing cman, use the command below:
# cman_tool status | grep "Node addresses"
# cman_tool nodes -a -F "id,name,addr"
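Once the cluster-communication IP is known, the owning interface can be looked up in the ip addr output. The snippet below is a hypothetical helper (the CLUSTER_IP value is an example placeholder), not part of the attached script:

```shell
#!/bin/bash
# Hypothetical helper: print the interface that carries the cluster
# communication IP reported by corosync-quorumtool or cman_tool.
# CLUSTER_IP is an example placeholder; substitute your node's address.
CLUSTER_IP=${CLUSTER_IP:-192.168.1.10}
ip -o -4 addr show | awk -v ip="$CLUSTER_IP" '$4 ~ ip"/" { print $2 }'
```

The printed interface name is what goes into the ifaces variable in the next step.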
- Save the attached file to a file named ha-resourcemon.sh. After the file is saved, edit the ifaces variable to refer to the physical interface(s) associated with the heartbeat network that was found in step 2.
- Add the executable flag to the script:
chmod +x /path/to/ha-resourcemon.sh
- Create a new crontab entry for your root user to call the ha-resourcemon.sh script:
crontab -u root -e
1 * * * * /bin/bash /path/to/ha-resourcemon.sh 20 181 /<logdir> 2
- Replace /path/to with the script's path in your environment.
- Your support technician may ask you to adjust the timing variables within this line.
- Replace /<logdir> with the directory where you would like to store the log files.
- If crontab is unavailable, it is also possible to use a systemd timer to collect this information: "Using a systemd timer to collect supplemental system utilization statistics in RHEL High Availability or Resilient Storage Clusters".
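As a rough sketch of the systemd-timer alternative, a oneshot service plus an hourly timer could mirror the crontab entry above. The unit names below are hypothetical, and /path/to and /<logdir> are the same placeholders used elsewhere in this article; see the linked article for the supported procedure.

```
# /etc/systemd/system/ha-resourcemon.service (hypothetical unit name)
[Unit]
Description=Collect supplemental HA cluster utilization statistics

[Service]
Type=oneshot
ExecStart=/bin/bash /path/to/ha-resourcemon.sh 20 181 /<logdir> 2

# /etc/systemd/system/ha-resourcemon.timer (hypothetical unit name)
[Unit]
Description=Run ha-resourcemon.sh at minute 1 of every hour

[Timer]
OnCalendar=*-*-* *:01:00

[Install]
WantedBy=timers.target
```

After creating the units, run systemctl daemon-reload and then systemctl enable --now ha-resourcemon.timer.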
- Allow the script to run until a fence event is experienced or until the GFS2 performance issue has occurred (or has gone on long enough for data to be captured). You can then tar up the data by running the following command:
tar -cvjf /tmp/$(hostname)-ha-resourcemon.sh.tar.bz2 /<logdir>
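Before attaching the archive to the case, its contents can be sanity-checked. This listing step is a suggestion, not part of the documented procedure:

```shell
# Hypothetical check: list the first few entries in the archive to confirm
# the collected logs were actually captured before uploading it.
tar -tjf /tmp/$(hostname)-ha-resourcemon.sh.tar.bz2 | head
```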
- Post the following from all nodes to the support case:
- The ha-resourcemon.sh (formerly called fencemon-cron.bsh) files, archived and compressed with the tar utility, from all cluster nodes.
- An sos report from all of the cluster nodes.
- A description of what was happening on the cluster node(s) or what job was running when the issue occurred.
- The name of the host or hosts that experienced the issue.
- How long the issue lasted, if this was a performance issue.
- The date and time when the issue occurred.
Root Cause
The script ha-resourcemon.sh (formerly called fencemon-cron.bsh) captures additional data for diagnosing recurring fence events, membership issues such as token loss, or GFS2 performance issues. Here are some articles that go into detail about troubleshooting those issues:
- "What can I do to determine what caused the token to be lost and/or a node to be fenced in my RHEL 5 High Availability cluster?"
- "What can I do to determine what caused "A processor failed, forming new configuration" and/or a node to be fenced in a RHEL 6 or 7 High Availability cluster?"
- "Is my GFS2 slowdown a file system problem or a storage problem?"
- "My GFS2 filesystem is slow. How can I diagnose and make it faster?"
The script uses standard Linux utilities that are typically already installed on most systems, which avoids having to install additional packages on the host.
Diagnostic Steps
The following information will be captured by the script:
- vmstat
- mpstat
- iostat -tkx
- top -b
- ps aux
- netstat -s
- ethtool -S
- (optional) pidstack output, only available if the --pidstack option is included as the final option:
$ /path/to/ha-resourcemon.sh 20 181 /<logdir> 2 --pidstack
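To illustrate how such an interval-based collector typically works, here is a simplified sketch. This is not the attached ha-resourcemon.sh; the interval, iteration count, and log-directory defaults are illustrative assumptions, and only two of the utilities listed above are shown.

```shell
#!/bin/bash
# Simplified sketch of an interval-based collector (NOT ha-resourcemon.sh):
# run a couple of the utilities listed above at a fixed interval and append
# timestamped output to per-command log files.
# Arguments: interval (seconds), iteration count, log directory.
interval=${1:-20}
count=${2:-3}
logdir=${3:-/tmp/ha-resourcemon-demo}
mkdir -p "$logdir"

for i in $(seq 1 "$count"); do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    { echo "== $ts =="; vmstat 1 2; } >> "$logdir/vmstat.log" 2>&1
    { echo "== $ts =="; ps aux;     } >> "$logdir/ps.log"     2>&1
    [ "$i" -lt "$count" ] && sleep "$interval"
done
```

The timestamp headers make it possible to line the samples up with the time of a fence event when reviewing the logs afterwards.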
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.