[Troubleshooting] Gathering system baseline resource usage for IO performance issues


Issue

  • How do I gather the information needed to create an overview of system IO performance for a Red Hat support case?
  • My Technical Support Engineer has requested that I set up the watcher-cron.bsh script on my server.

Environment

  • Red Hat Enterprise Linux 5, 6, or 7

Resolution

  • Red Hat support technicians may ask you to enable the watcher-cron.bsh script on your system to help diagnose performance-related issues.
  1. Install the sysstat package on the machine in question, if it is not already installed:
# yum -y install sysstat
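Before proceeding, it may be worth confirming that all of the collectors the script invokes are available. The quick check below is a sketch, not part of the procedure; iostat and mpstat come from sysstat, while vmstat and top are provided by procps and are normally present on any RHEL installation.

```shell
# Verify the collectors used by watcher-cron.bsh are on PATH.
# iostat and mpstat come from sysstat; vmstat and top from procps.
for tool in iostat mpstat vmstat top; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```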
  2. Create a text file named watcher-cron.bsh with the following contents:
#!/bin/bash
#*
#*----------------------------------------------------------------------------------------
#*
#* All software provided below is unsupported and provided as-is, without warranty 
#* of any kind.
#*
#* To the extent possible under law, Red Hat, Inc. has dedicated all copyright
#* to this software to the public domain worldwide, pursuant to the CC0 Public
#* Domain Dedication. This software is distributed without any warranty.
#* See <http://creativecommons.org/publicdomain/zero/1.0/>.
#*
#*----------------------------------------------------------------------------------------
#*
#* Maintainer: bubrown@redhat.com
#
# watcher-cron.bsh V02.801
#
# $1 = interval
# $2 = iterations
# $3 = log directory
# $4 = days of logs to keep
# $5 = compression level (optional)
#      0 - no compression           (previous default)
#      1 - end-of-batch compression (new default, about a 10:1 compression ratio observed)
#      2 - '1' + use gzip, if available, inline with top output creation
#      3 - '2' + use gzip all commands except vmstat
#
# crontab -u root -e
# 21 * * * * bash  /<path>/watcher-cron.bsh 30 120 /<logdir>  2  1
#
if [ "$4" == "" ]
then
echo " "
echo "missing arguments:"
echo "------------------"
echo " arg1 = interval,    e.g. 1             (in seconds)"
echo " arg2 = iterations,  e.g. 720           (count of samples to take)"
echo "                                        (from cron arg1 x arg2 = 3600 seconds, 1hr)"
echo " arg3 = log path,    e.g. /tmp/watcher  (where to create log files)"
echo " arg4 = days,        e.g. 2             (anything older than this is purged from log path)"
echo "------------------"
echo " "
exit
fi
# -p is exclusive of -x on RHEL5*
# -p is required along with -x on RHEL6* to get partitions included
# -p is required along with -x on RHEL7* to get partitions included
_pflg=""
if [[ `uname -a` == *2.6.32* || `uname -a` == *3.10.0* ]]
then
 _pflg="-p"
fi
_compress_hourly=1
_compress_top=""
_compress_all=""
_ext_top=""
_ext_all=""
if [ "$5" != "" ]
then
  if [ $5 -eq 0 ]
  then
     _compress_hourly=0
  fi
  if [ -e "/usr/bin/gzip" ]
  then
    if [ $5 -ge 2 ]
    then
      _compress_top=" | gzip - "
      _ext_top=".gz"
    fi
    if [ $5 -ge 3 ]
    then
      _compress_all=" | gzip - "
      _ext_all=".gz"
     fi
  fi
fi
_wait_til_complete=0
if [ $_compress_hourly -gt 0 ]
then
   _wait_til_complete=1
fi
_time=$(date +%Y%m%d-%H%M%S)
_name=$(uname -a)
#echo $1 seconds x $2 intervals, at ${_time}
echo '#TIME=     '${_time}         >  $3/${_time}-info-vmstat.log
echo '#INTERVAL= '$1               >> $3/${_time}-info-vmstat.log
echo '#COUNT=    '$2               >> $3/${_time}-info-vmstat.log
echo '#UNAME=   "'${_name}'"'      >> $3/${_time}-info-vmstat.log
#
echo '#TIME=     '${_time}         >  $3/${_time}-info-vmstat-d.log
echo '#INTERVAL= '$1               >> $3/${_time}-info-vmstat-d.log
echo '#COUNT=    '$2               >> $3/${_time}-info-vmstat-d.log
echo '#UNAME=   "'${_name}'"'      >> $3/${_time}-info-vmstat-d.log
#
 eval "echo '#UNAME=   \"${_name}\"'      ${_compress_all} >  $3/${_time}-info-iostat.log${_ext_all}"
#eval "iostat $1 $2 -t -k -x ${_pflg} -n  ${_compress_all} >> $3/${_time}-info-iostat.log${_ext_all} &"
 eval "iostat $1 $2 -t -k -x ${_pflg}     ${_compress_all} >> $3/${_time}-info-iostat.log${_ext_all} &" ; iopid=$!
       vmstat $1 $2                                        >> $3/${_time}-info-vmstat.log &               vmpid=$!
       vmstat $1 $2 -d                                     >> $3/${_time}-info-vmstat-d.log &             vdpid=$!
 eval "top  -b -d $1 -n $2                ${_compress_top} >  $3/${_time}-info-top.log${_ext_top} &"    ; tppid=$!
 eval "mpstat $1 $2 -P ALL                ${_compress_all} >  $3/${_time}-info-mpstat.log${_ext_all} &" ; mppid=$!
#
if [ $_wait_til_complete -gt 0 ]
then
  sleep $(( $1 * $2 ))
fi
find $3/*-info-*.log* -mtime +$4 -exec rm -f {} \;
if [ $_compress_hourly -gt 0 ]
then
  while sleep 1; do ps -o pid= -p $iopid || break; done
  while sleep 1; do ps -o pid= -p $vmpid || break; done
  while sleep 1; do ps -o pid= -p $vdpid || break; done
  while sleep 1; do ps -o pid= -p $tppid || break; done
  while sleep 1; do ps -o pid= -p $mppid || break; done
  ls -lh     $3/${_time}-info*.log*                    > $3/${_time}-info-files.log
  ls -1c     $3 | grep ${_time}-info | grep log        > $3/${_time}-files.txt
  tar -C $3 -T $3/${_time}-files.txt --remove-files -czf $3/${_time}-info-watcher.tar.gz
  find $3/*-info-watcher.tar.gz -mtime +$4 -exec rm -f {} \;
  rm -f $3/${_time}-files.txt
fi
  3. Add the executable flag to the script:
# chmod +x /path/to/watcher-cron.bsh
  • Note: replace /path/to with the script's path in your environment.
  4. Create a new crontab entry for the root user to call the watcher-cron.bsh script:
# crontab -u root -e

21 * * * * bash  /path/to/watcher-cron.bsh 10 361 <logdir> 4  1
  • Note: replace /path/to with the script's path in your environment.
  • Note: replace <logdir> with the directory where you would like to store the log files.
  • It is recommended that neither the script nor the log directory be located on the storage exhibiting the suspected problem.
  • Your support technician may ask you to adjust the input variables on the command line.
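As a sanity check on the example values above (an illustrative calculation, not an additional setup step): the interval multiplied by the iteration count should cover slightly more than the cron period, so consecutive runs overlap rather than leave sampling gaps. With the example arguments of a 10-second interval and 361 samples:

```shell
# Coverage of one watcher run with the example arguments (10s x 361).
# 3610 seconds slightly exceeds the 3600-second gap between hourly
# cron invocations, so sampling remains continuous across runs.
interval=10
iterations=361
echo "coverage: $(( interval * iterations )) seconds"   # prints: coverage: 3610 seconds
```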
  5. Allow the script to run long enough for the diagnostic information to be gathered. You can then archive the data by running the following command:
# tar -cvjf /tmp/watcher-cron.tar.bz2 <logdir>
  6. Attach the archived files to the case, along with any specific times when performance issues were noticed.

Root Cause

The watcher-cron.bsh script is a set of commands, evolved over time, that collects finer-grained, time-correlated information about the system than other methods such as cron-based sar.

The script uses standard utilities that are normally already installed on Linux systems, avoiding the need for change management requests.

Performance issues may be constant and ongoing, or may only be noticed periodically. In either case, system background information should be gathered as a baseline before taking any action. This makes it possible to see whether changes have had a measurable impact, and whether that impact was positive, negative, or absent altogether.

Diagnostic Steps

This is a passive data collection process: it simply records data from the running system. If we believe we understand what triggers the performance issue, we could instead perform an active data collection: introduce a program or script that generates a point-load simulation of the suspected trigger, then gather data to show whether the simulated load causes similar behavior.
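As a sketch of what an active collection might look like, the dd-based load below is a hypothetical example (not a Red Hat-provided tool); the target path, sizes, and pattern should be adjusted to match whatever has been identified as the trigger. It would be run while watcher-cron.bsh is collecting.

```shell
# Generate a burst of sequential write load while watcher-cron.bsh is
# collecting, then clean up.  conv=fsync forces the data to disk so the
# load shows up in the iostat/vmstat output.
for i in 1 2 3; do
    dd if=/dev/zero of=/tmp/loadtest.$i bs=1M count=64 conv=fsync 2>/dev/null
done
rm -f /tmp/loadtest.*
```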

The size of the files created varies with the number of storage devices present, the level of system activity, the sample interval, and how long watcher is run. As a guide, with ~35 LUNs, a moderate amount of production activity, a 10-second sample interval, and a nominal run of one hour, the total size of all the log files is roughly 30-50 Megabytes per hour of logging. This is just a ball-park figure; your system may generate significantly larger files. With the default compression of files at the end of each watcher run, a compressed file of hourly logs will be in the 5-8 Megabyte range.
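Those figures can be turned into a quick capacity estimate before enabling the collection. The calculation below is illustrative only, using the assumed numbers above; substitute rates observed on your own system.

```shell
# Back-of-envelope disk usage for a 24-hour collection, assuming the
# upper end of the observed range (~50 MB/hour uncompressed) and the
# ~10:1 end-of-batch compression ratio mentioned above.
hours=24
mb_per_hour=50
echo "uncompressed: $(( hours * mb_per_hour )) MB"       # prints: uncompressed: 1200 MB
echo "compressed:   ~$(( hours * mb_per_hour / 10 )) MB" # prints: compressed:   ~120 MB
```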

If your top file, the largest file collected, is on the large side, then adding a final argument (argument number 5) of 2 to the command line will cause top command output to be compressed as it is collected. This can reduce the size of top's log file footprint on the system; typically, in-line compression reduces the file size roughly tenfold. The downside is that compressed data is not written out after each sample but at some later time, so data can be lost if there is a sudden shutdown or panic. Still, using compression option 2 or even 3 can reduce the amount of IO to disk during data collection at a slight increase in CPU usage.
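The roughly tenfold reduction is plausible because batch-mode top output is highly repetitive (the same headers and process names recur every sample). The sketch below demonstrates the effect with a synthetic, repetitive stand-in file rather than real top output; actual ratios will vary with your workload.

```shell
# Compress a repetitive sample (a stand-in for batch-mode top output)
# and compare sizes.  Repeated column headers and process lines are
# what make inline gzip so effective on top logs.
seq 1 10000 | awk '{print "PID USER PR NI VIRT RES SHR S %CPU %MEM COMMAND", $1}' > /tmp/demo.log
gzip -c /tmp/demo.log > /tmp/demo.log.gz
orig=$(stat -c %s /tmp/demo.log)
comp=$(stat -c %s /tmp/demo.log.gz)
echo "ratio: $(( orig / comp )):1"
rm -f /tmp/demo.log /tmp/demo.log.gz
```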
