Collecting supplemental system utilization statistics for fence events or performance problems in RHEL High Availability or Resilient Storage clusters

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5, 6, 7, 8 or 9
  • High Availability or Resilient Storage Add On

Issue

Resolution

Note  The script attached to this article for download was formerly named fencemon-cron.bsh, but has since been renamed to ha-resourcemon.sh.
Caution!  The file is named ha-resourcemon_1.sh when downloaded. Add executable permissions to the file before running it:
  • chmod +x ha-resourcemon_1.sh
Note  This script produces large amounts of data in the specified directory, which may require special attention to prevent it from filling up the file system. Ensure there is adequate free space available in that directory. With the timing values used in the crontab entry below, the script often produces 30 - 50 MB per hour of collection on typical systems, and may produce much larger data sets on systems that run many processes, handle large workloads, have many network connections, or otherwise produce lengthier output for any of the monitored commands. Please contact Red Hat Global Support Services for assistance with accounting for the space requirements if this is a concern.
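
The free-space check described in the note above can be sketched as follows. This is only an illustration: the helper names free_mb and has_free_mb, the LOGDIR variable and the 1200 MB threshold (24 hours at the upper estimate of 50 MB per hour) are assumptions and are not part of ha-resourcemon.sh.

```shell
#!/bin/bash
# Hypothetical helper: print the free space, in whole MB, of the file system
# holding the given directory. "df -Pk" reports 1024-byte blocks, with the
# available space in column 4 of the second output line.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Hypothetical helper: succeed only if at least $2 MB are free under $1.
has_free_mb() {
    local avail
    avail=$(free_mb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Assumption: LOGDIR stands in for the log directory passed to the script.
logdir=${LOGDIR:-/tmp}
if has_free_mb "$logdir" 1200; then
    echo "OK: at least 1200 MB free under $logdir"
else
    echo "WARNING: less than 1200 MB free under $logdir"
fi
```

Adjust the threshold to the planned collection time and to the data volume actually observed on the system.
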
  1. Install sysstat, ethtool, net-tools and procps on the host where ha-resourcemon.sh will be installed.

yum install -y sysstat ethtool net-tools procps

Note: On Red Hat Enterprise Linux (RHEL) 7 and 8 the procps-ng package will be installed.
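
To confirm that the utilities are available after installation, a check along these lines may help. It is a minimal sketch: missing_commands is a hypothetical helper name, and the command list is taken from the "Diagnostic Steps" section of this article (mpstat and iostat come from sysstat, netstat from net-tools, and vmstat, top and ps from procps or procps-ng).

```shell
#!/bin/bash
# Hypothetical helper: print the name of each listed command that is not
# found in PATH, one per line; succeed only if none are missing.
missing_commands() {
    local cmd rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Commands listed under "Diagnostic Steps" in this article.
missing_commands vmstat mpstat iostat top ps netstat ethtool \
    || echo "Some utilities are missing; install the packages listed above." >&2
```
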

  2. Determine which network interface the cluster uses for cluster communication, because the script is edited in the next step to collect data for that interface. This is the network over which each node resolves the other nodes' cluster node names (not their hostnames). Run whichever of the following commands applies to your cluster environment to display the IP address used for cluster communication, then find the interface that holds that address in the output of ip addr. Depending on the command, the output shows either the IP addresses of the host running the command or an IP address for each cluster node.

For a RHEL 6/7/8/9 Pacemaker cluster utilizing corosync, use the command below:

# corosync-quorumtool -il

For a RHEL 6 cluster utilizing cman, use the commands below:

# cman_tool status | grep "Node addresses"
# cman_tool nodes -a -F "id,name,addr"
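
Once the cluster IP address is known, matching it to an interface can be sketched as below. This assumes the one-line-per-address output format of ip -o addr show; the helper name iface_for_ip and the CLUSTER_IP variable are hypothetical, and 192.0.2.10 is only a placeholder address.

```shell
#!/bin/bash
# Hypothetical helper: read "ip -o addr show" output on stdin and print the
# name of every interface that holds exactly the address given as $1.
# In that output, field 2 is the interface name, field 3 the address family
# (inet or inet6) and field 4 the address with its prefix length.
iface_for_ip() {
    awk -v ip="$1" '
        ($3 == "inet" || $3 == "inet6") {
            split($4, parts, "/")        # strip the /prefix suffix
            if (parts[1] == ip) print $2
        }'
}

# Assumption: CLUSTER_IP holds the address reported by the commands above.
cluster_ip=${CLUSTER_IP:-192.0.2.10}
if command -v ip >/dev/null 2>&1; then
    ip -o addr show | iface_for_ip "$cluster_ip"
fi
```

If the address is held by a bond, bridge or VLAN device rather than a physical interface, the underlying physical interfaces still have to be identified separately; for a bond named bond0, for example, they are listed in /sys/class/net/bond0/bonding/slaves.
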
  3. Save the file attached to this article as ha-resourcemon.sh. Then edit the ifaces variable in the script so that it refers to the physical interface(s) associated with the heartbeat network found in step 2.

  4. Add the executable flag to the script:

chmod +x /path/to/ha-resourcemon.sh

  5. Create a new crontab entry for the root user that calls the ha-resourcemon.sh script:

crontab -u root -e

1 * * * * /bin/bash /path/to/ha-resourcemon.sh 20 181 /<logdir> 2
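
The five leading fields of a crontab entry are minute, hour, day of month, month and day of week, so 1 * * * * starts the script at one minute past every hour. To confirm that the entry was saved and is not commented out, a check like the following could be used. It is a sketch only; active_entries is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read a crontab on stdin and print the lines that are
# not comments and that mention the fixed string given as $1. The exit
# status is non-zero when no such line exists.
active_entries() {
    grep -v '^[[:space:]]*#' | grep -F -- "$1"
}

# List root's crontab (requires root privileges) and look for the script.
if command -v crontab >/dev/null 2>&1; then
    crontab -u root -l 2>/dev/null | active_entries ha-resourcemon.sh \
        || echo "No active ha-resourcemon.sh entry found in root's crontab."
fi
```
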
  6. Allow the script to run until a fence event occurs or a GFS2 performance problem has occurred (or has gone on long enough for data to be captured). You can then archive the collected data by running the following command:

tar -cvjf /tmp/$(hostname)-ha-resourcemon.sh.tar.bz2 /
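
The final argument to tar selects what is archived. As a sketch, assuming the collected data lives in the log directory given in the crontab entry (shown there as /<logdir>), the archive step could be wrapped as below so that only that directory is included. The helper name archive_logdir and the LOGDIR variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: archive one directory into a bzip2-compressed tarball.
# $1 = directory to archive, $2 = output file.
archive_logdir() {
    local dir=$1 out=$2
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -C switches to the parent directory first, so the archive contains
    # only the log directory itself rather than its full absolute path.
    tar -cjf "$out" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Assumption: LOGDIR is set to the log directory used in the crontab entry.
logdir=${LOGDIR:-}
if [ -n "$logdir" ]; then
    archive_logdir "$logdir" "/tmp/$(hostname)-ha-resourcemon.sh.tar.bz2"
fi
```

Write the archive to a location outside the log directory, as here, so that it is not included in itself.
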

  7. Attach the following from all nodes to the support case:
  • The ha-resourcemon.sh (formerly called fencemon-cron.bsh) output files from all cluster nodes, archived and compressed with the tar utility.
  • An sos report from all of the cluster nodes.
  • A description of what was happening on the cluster node(s) when the issue occurred or what job was running when the issue occurred.
  • The name of the host or hosts that experienced an issue.
  • How long the issue lasted, if it was a performance issue.
  • The date and time when the issue occurred.

Root Cause

The script ha-resourcemon.sh (formerly called fencemon-cron.bsh) captures additional data for diagnosing recurring fence events, membership issues such as lost tokens, or GFS2 performance issues. Here are some articles that go into detail about troubleshooting those issues:

The script uses standard Linux utilities that are already installed on most systems, so in most cases few or no additional packages need to be installed on the host.

Diagnostic Steps

The following information will be captured by the script:

  • vmstat
  • mpstat
  • iostat -tkx
  • top -b
  • ps aux
  • netstat -s
  • ethtool -S
  • (optional) pidstack output
    • Only available if the --pidstack option is included as the final option:
      $ /path/to/ha-resourcemon.sh 20 181 /<logdir> 2 --pidstack


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.