[Troubleshooting] Gathering system baseline resource usage for IO performance issues
Issue
- How do I gather the information needed to create an overview of system IO performance for a Red Hat support case?
- My Technical Support Engineer has requested that I set up the watcher-cron.bsh script on my server.
Environment
- Red Hat Enterprise Linux 5, 6, or 7
Resolution
- Red Hat support technicians may ask you to enable the watcher-cron.bsh script on your system to help diagnose performance-related issues.
- Install sysstat on the machine in question (if not already installed):
# yum -y install sysstat
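If you want to confirm the required tools are present before scheduling the script, a quick pre-flight check such as the following can be used (this is an illustrative sketch, not part of the official procedure):

```shell
# Report any of the collectors used by watcher-cron.bsh that are missing;
# iostat and mpstat come from sysstat, vmstat and top from procps.
for cmd in iostat mpstat vmstat top; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
```

If the loop prints nothing, all of the collectors are available on the PATH.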
- Create the following script as a text file named watcher-cron.bsh:
#!/bin/bash
#*
#*----------------------------------------------------------------------------------------
#*
#* All software provided below is unsupported and provided as-is, without warranty
#* of any kind.
#*
#* To the extent possible under law, Red Hat, Inc. has dedicated all copyright
#* to this software to the public domain worldwide, pursuant to the CC0 Public
#* Domain Dedication. This software is distributed without any warranty.
#* See <http://creativecommons.org/publicdomain/zero/1.0/>.
#*
#*----------------------------------------------------------------------------------------
#*
#* Maintainer: bubrown@redhat.com
#
# watcher-cron.bsh V02.801
#
# $1 = interval
# $2 = iterations
# $3 = log directory
# $4 = days of logs to keep
# $5 = compression level (optional)
# 0 - no compression (previous default)
# 1 - end-of-batch compression (new default, about a 10:1 compression ratio observed)
# 2 - '1' + use gzip, if available, inline with top output creation
# 3 - '2' + use gzip all commands except vmstat
#
# crontab -u root -e
# 21 * * * * bash /<path>/watcher-cron.bsh 30 120 /<logdir> 2 1
#
if [ "$4" == "" ]
then
    echo " "
    echo "missing arguments:"
    echo "------------------"
    echo " arg1 = interval, e.g. 1 (in seconds)"
    echo " arg2 = iterations, e.g. 720 (count of samples to take)"
    echo " (from cron arg1 x arg2 = 3600 seconds, 1hr)"
    echo " arg3 = log path, e.g. /tmp/watcher (where to create log files)"
    echo " arg4 = days, e.g. 2 (anything older than this is purged from log path)"
    echo "------------------"
    echo " "
    exit
fi
# -p is exclusive of -x on RHEL5*
# -p is required along with -x on RHEL6* to get partitions included
# -p is required along with -x on RHEL7* to get partitions included
_pflg=""
if [[ `uname -a` == *2.6.32* || `uname -a` == *3.10.0* ]]
then
    _pflg="-p"
fi
_compress_hourly=1
_compress_top=""
_compress_all=""
_ext_top=""
_ext_all=""
if [ "$5" != "" ]
then
    if [ $5 -eq 0 ]
    then
        _compress_hourly=0
    fi
    if [ -e "/usr/bin/gzip" ]
    then
        if [ $5 -ge 2 ]
        then
            _compress_top=" | gzip - "
            _ext_top=".gz"
        fi
        if [ $5 -ge 3 ]
        then
            _compress_all=" | gzip - "
            _ext_all=".gz"
        fi
    fi
fi
_wait_til_complete=0
if [ $_compress_hourly -gt 0 ]
then
    _wait_til_complete=1
fi
_time=$(date +%Y%m%d-%H%M%S)
_name=$(uname -a)
#echo $1 seconds x $2 intervals, at ${_time}
echo '#TIME= '${_time} > $3/${_time}-info-vmstat.log
echo '#INTERVAL= '$1 >> $3/${_time}-info-vmstat.log
echo '#COUNT= '$2 >> $3/${_time}-info-vmstat.log
echo '#UNAME= "'${_name}'"' >> $3/${_time}-info-vmstat.log
#
echo '#TIME= '${_time} > $3/${_time}-info-vmstat-d.log
echo '#INTERVAL= '$1 >> $3/${_time}-info-vmstat-d.log
echo '#COUNT= '$2 >> $3/${_time}-info-vmstat-d.log
echo '#UNAME= "'${_name}'"' >> $3/${_time}-info-vmstat-d.log
#
eval "echo '#UNAME= \"${_name}\"' ${_compress_all} > $3/${_time}-info-iostat.log${_ext_all}"
#eval "iostat $1 $2 -t -k -x ${_pflg} -n ${_compress_all} >> $3/${_time}-info-iostat.log${_ext_all} &"
eval "iostat $1 $2 -t -k -x ${_pflg} ${_compress_all} >> $3/${_time}-info-iostat.log${_ext_all} &" ; iopid=$!
vmstat $1 $2 >> $3/${_time}-info-vmstat.log & vmpid=$!
vmstat $1 $2 -d >> $3/${_time}-info-vmstat-d.log & vdpid=$!
eval "top -b -d $1 -n $2 ${_compress_top} > $3/${_time}-info-top.log${_ext_top} &" ; tppid=$!
eval "mpstat $1 $2 -P ALL ${_compress_all} > $3/${_time}-info-mpstat.log${_ext_all} &" ; mppid=$!
#
if [ $_wait_til_complete -gt 0 ]
then
    sleep $(( $1 * $2 ))
fi
find $3/*-info-*.log* -mtime +$4 -exec rm -f {} \;
if [ $_compress_hourly -gt 0 ]
then
    while sleep 1; do ps -o pid= -p $iopid || break; done
    while sleep 1; do ps -o pid= -p $vmpid || break; done
    while sleep 1; do ps -o pid= -p $vdpid || break; done
    while sleep 1; do ps -o pid= -p $tppid || break; done
    while sleep 1; do ps -o pid= -p $mppid || break; done
    ls -lh $3/${_time}-info*.log* > $3/${_time}-info-files.log
    ls -1c $3 | grep ${_time}-info | grep log > $3/${_time}-files.txt
    tar -C $3 -T $3/${_time}-files.txt --remove-files -czf $3/${_time}-info-watcher.tar.gz
    find $3/*-info-watcher.tar.gz -mtime +$4 -exec rm -f {} \;
    rm -f $3/${_time}-files.txt
fi
- Add the executable flag to the script:
# chmod +x /path/to/watcher-cron.bsh
- Note: replace /path/to with the script's path in your environment.
- Create a new crontab entry for the root user to call the watcher-cron.bsh script:
# crontab -u root -e
21 * * * * bash /path/to/watcher-cron.bsh 10 361 <logdir> 4 1
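As a rough check of the sampling window, interval times iterations should cover the cron period; with the example values above, each run collects data for slightly more than an hour, so hourly invocations leave no gap:

```shell
# interval (arg1) x iterations (arg2) = seconds of data each run collects;
# the example crontab entry samples every 10 seconds, 361 times per run.
interval=10
iterations=361
echo "each run covers $(( interval * iterations )) seconds"   # 3610: just over one hour
```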
- Note: replace /path/to with the script's path in your environment.
- Note: replace <logdir> with the directory where you would like to store the log files.
- It is recommended that neither the script nor the log directory be located on the storage suspected of having the problem.
- Your support technician may ask you to adjust the input variables on the command line.
- Allow the script to run for some time so the diagnostic information can be gathered. You can then tar up the data by running the following command:
# tar -cvjf /tmp/watcher-cron.tar.bz2 <logdir>
- Attach the tarred file to the case, along with any specific times when the performance issues were noticed.
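Before uploading, it can be worth confirming the archive lists cleanly and actually contains the logs (the path below matches the example tar command; adjust it for your environment):

```shell
# List the first entries in the archive to confirm the watcher logs
# were captured before attaching the file to the support case.
tar -tvjf /tmp/watcher-cron.tar.bz2 | head
```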
Root Cause
The watcher-cron.bsh script is a set of commands, evolved over time, that collects finer-grained, time-series information about the system than other methods (such as cron-based sar).
The script uses standard utilities that are nominally already installed on Linux systems, avoiding change management requests.
Performance issues may be constant and on-going, or may be noticed or seen on a periodic basis. In both cases, system background information should be gathered as a baseline before taking any action. This allows us to see if changes have had a measurable impact and whether the impact was positive or negative (or none at all).
Diagnostic Steps
This is a passive data collection process, simply collecting the data from the running system. If we believe we understand what triggers the performance issue then we could do an active data collection, namely introduce a program or script that generates a point load simulation of what has been identified as triggering the issue and gathering data to show the simulation/point load is causing similar behavior.
The size of the files created varies with the number of storage devices present, the level of system activity, the number of seconds in each sample, and how long watcher is run. However, with ~35 LUNs, a moderate amount of production activity, a 10-second sample size, and a nominal run of 1 hour, the total size of all the log files is roughly 30-50 megabytes per hour of logging. This is just a ball-park figure; your system may generate significantly larger files. With the default compression of files at the end of each watcher run, a compressed file of hourly logs will be in the 5-8 megabyte range.
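These ball-park figures can be validated on your own system by checking the on-disk footprint of the log directory directly (/tmp/watcher below is a hypothetical stand-in for the <logdir> passed to the script):

```shell
# Show total space used by the watcher logs, plus the largest files;
# /tmp/watcher stands in for the log directory given as arg3.
du -sh /tmp/watcher
ls -lhS /tmp/watcher | head
```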
If your top file, the largest file collected, is on the large side, then adding a final argument (argument number 5) of 2 to the command line causes the top command output to be compressed as it is collected. This can reduce the size of top's log file footprint on the system; with in-line compression the file size is typically reduced tenfold. The downside is that compressed data is not written out after each sample but at some later time, so data can be lost if there is a sudden shutdown or panic. Still, using compression option 2 (or even 3) can reduce the amount of IO to disk during data collection at a slight increase in CPU usage.
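The large compression ratios come from how repetitive monitoring output is. The sketch below illustrates the effect with generated sample lines rather than real top output (the ~10:1 figure for top itself is the ratio observed in the script header; synthetic data may compress even further):

```shell
# Generate repetitive sampler-style output and compare raw vs gzip'd size;
# highly repetitive text like this compresses dramatically.
seq 1 5000 | sed 's/^/sample monitoring line /' > /tmp/sampler.log
gzip -c /tmp/sampler.log > /tmp/sampler.log.gz
wc -c /tmp/sampler.log /tmp/sampler.log.gz
```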