[Troubleshooting] Understanding iostat -x output
Issue
- What does iostat -x do?
- What do the different iostat values mean?
- How does one read and understand the output provided by the iostat -x command?
- What is the meaning of all of the attributes found in the output of the "iostat -x" command?
- How can total iops (I/O commands per second) be calculated or determined using iostat output?
- Do the iostat rsec/s and wsec/s output columns mean read IOPS and write IOPS, and is the combination of those equal to total IOPS?
- Which fields within /proc/diskstats are used for which columns of iostat -x?
- Where does iostat get its data from?
- What package needs to be installed to supply the iostat command?
Environment
- Red Hat Enterprise Linux (RHEL) 5-10
Resolution
- What does iostat -x do?
- Which fields within /proc/diskstats are used for which columns of iostat -x?
- What does iostat output look like, what are the different fields?
- What do the different columns of iostat -x output mean?
- How can total iops (io commands per second) be calculated or determined using iostat output?
- What package provides iostat?
##Summary: What does iostat -x do?##
Adding the -x flag to iostat will output extended block device I/O statistics. For example, instead of just tps (I/O per second), the extended statistics output separate r/s (reads per second) and w/s (writes per second) columns, among others. In general the extended block device statistics are more useful for understanding and monitoring the I/O load on the system.
The iostat utility reads /proc/diskstats and computes the difference between two samples taken at the specified interval, for example every 5 seconds. Since many of the columns are expressed as quantity per second, most calculated differences between now and the last sample time are divided by the sample interval time to generate the reported output of value per second.
For example, if the number of completed reads was 150 at the last sample time and 200 in the current sample, with a sample interval of 2 seconds, then r/s (reads per second) is (200-150)/2 = 50/2 = 25.00 reads per second.
When /proc/diskstats is first read, there is no previous sample to compare against. In this case, iostat performs a comparison against an all-zeros diskstats line for each device and uses the system's uptime as the interval. So the first output from iostat is an average from boot time until now. Subsequent samples are then taken at whatever interval is specified on the iostat command line. For this reason, iostat -x output on its own is not very useful, as it only reports one set of data and stops: the boot-until-now averages. At a minimum, something like iostat -y -x 5 1 should be used, as this will skip the boot-until-now output (-y flag) and then output one sample averaged over the last 5 seconds.
See the portions of iostat.c code that read in the data from /proc/diskstats and format/output the calculated values for additional details.
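As a rough illustration of that delta calculation (this is a minimal sketch, not how iostat itself is implemented), the same r/s value can be approximated from two manual reads of /proc/diskstats. The device name sda and the 5 second interval below are example values only.
```
# Approximate r/s for one device the same way iostat does: diff field #1
# (reads completed, the 4th column of /proc/diskstats) across an interval.
DEV=sda
INTERVAL=5

r1=$(awk -v d="$DEV" '$3 == d {print $4}' /proc/diskstats)
sleep "$INTERVAL"
r2=$(awk -v d="$DEV" '$3 == d {print $4}' /proc/diskstats)

awk -v a="$r1" -v b="$r2" -v t="$INTERVAL" 'BEGIN { printf "r/s = %.2f\n", (b - a) / t }'
```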
## Data Source: Which fields within /proc/diskstats are used for which columns of iostat -x? ##
The common data source for iostat, sar, collectl, PCP (Performance Co-Pilot), and similar tools that gather and report disk statistics is /proc/diskstats.
The /proc/diskstats file has the format shown below across RHEL versions 5-8; fields #12-#15 were added starting in RHEL 8. The individual field descriptions for iostat output include references to which fields are used by the calculation for that column.
$ cat /proc/diskstats
:
.
+------------------------------------------------------------------ # 1: reads : total number of reads completed r/s --+
| +------------------------------------------------------------ # 2: reads : total number of read merges rrqm/s |
| | +----------------------------------------------------------- # 3: reads : total sectors read rkB/s --+--> avgrq-sz
| | | +-------------------------------------------------- # 4: reads : total ms spent doing reads r_await |
| | | | +------------------------------------------- # 5: writes: total number of writes completed w/s --+
| | | | | +------------------------------------ # 6: writes: total number of write merges wrqm/s |
| | | | | | +---------------------------------- # 7: writes: total sectors written wkB/s --+
| | | | | | | +------------------------- # 8: writes: total ms spent doing writes w_await
| | | | | | | | +------------------ # 9: in-progress counter, 0 at idle
| | | | | | | | | +---------------- #10: total ms spent doing I/O (includes in-progress io) %util
| | | | | | | | | | +--------- #11: weighted total count, avgqu-sz
Maj Minor Dev 1 2 3 4 5 6 7 8 9 10 11
8 0 sda 59862 0 25918681 116996 390632 0 27158537 257702 0 314866 374690 0 0 0 0
A B C 12 13 14 15
| | | | | | +-- #12: total successful discards
| | +----- Device (disk) name | | +----- #13: merged discards
| +--------- Device minor number | +-------- #14: total successful sectors discarded
+----------------- Device major number +----------- #15: total ms spent doing discards
Field numbers 1-15 are defined within the /usr/share/doc/kernel*/Documentation/iostats.txt file.
This requires that the kernel-doc package be installed.
The only field not called out within the above is svctm. The svctm is calculated as %util * 1000ms / (r/s + w/s), so it involves fields #10, #1, and #5.
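For reference, something like the following can dump the raw counters behind these columns for a single device. This is just a convenience sketch; sda is an example device name, and on RHEL 8 and later the line also carries the discard fields #12-#15, which are omitted here.
```
# Print the raw /proc/diskstats counters for one device, labelled with the
# field numbers used in the diagram above ('sda' is an example device).
awk '$3 == "sda" {
    printf "#1  reads completed : %s\n", $4
    printf "#2  reads merged    : %s\n", $5
    printf "#3  sectors read    : %s\n", $6
    printf "#4  ms reading      : %s\n", $7
    printf "#5  writes completed: %s\n", $8
    printf "#6  writes merged   : %s\n", $9
    printf "#7  sectors written : %s\n", $10
    printf "#8  ms writing      : %s\n", $11
    printf "#9  I/O in progress : %s\n", $12
    printf "#10 ms doing I/O    : %s\n", $13
    printf "#11 weighted ms     : %s\n", $14
}' /proc/diskstats
```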
##Output: Examples of iostat -x output##
- The disk statistics output portion of iostat -x output looks like the following samples:
RHEL10 {iostat -V => sysstat version 12.7.6}
[root@hostname ~]# iostat -x 5| egrep 'vda |Device'
Device   r/s   rkB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s     dkB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
vda     7.38  222.29    1.92  20.69     1.44     30.14  1.68  27.96    3.17  65.35     1.91     16.63  0.77  63458.36    0.00   0.00     0.04  81983.60  0.24     0.53    0.01   0.71
Device   r/s   rkB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s     dkB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
vda     0.00    0.00    0.00   0.00     0.00      0.00  0.20   0.80    0.00   0.00     0.00      4.00  0.00      0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.02
RHEL9 {iostat -V => sysstat version 12.5.4}[1]
[root@hostname ~]# iostat -x 5| egrep 'sda |sdb |Device'
Device   r/s   w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     0.00  0.62   0.17  143.60    0.00    0.04   0.18   5.67     6.82    15.28    0.01     48.38    233.32   1.15   0.07
sdb     0.00  0.00   0.00    0.00    0.00    0.00   0.00   0.00     6.52     0.00    0.00     34.81      0.00   0.94   0.00
Device   r/s   w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The RHEL 9 iostat output is the same as RHEL 8.
RHEL8 {iostat -V => sysstat version 11.7.3}[1]
[root@hostname ~]# iostat -x 5| egrep 'sda |sdb |Device'
Device   r/s   w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     0.00  0.62   0.17  143.60    0.00    0.04   0.18   5.67     6.82    15.28    0.01     48.38    233.32   1.15   0.07
sdb     0.00  0.00   0.00    0.00    0.00    0.00   0.00   0.00     6.52     0.00    0.00     34.81      0.00   0.94   0.00
Device   r/s   w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
RHEL 8 iostat notes:
- [1] In RHEL 8, a number of changes in the output of iostat were made including:
- some columns appear in a different order/location than previous versions,
- added new columns: %rrqm, %wrqm, rareq-sz and wareq-sz,
- renamed `avgqu-sz` to `aqu-sz`,
- dropped the combined `await` and `avgrq-sz` columns as these are now covered by the individual read and write columns,
- the default for data per second fields is kB/s instead of sectors as in previous releases.
- the default unit for average request sizes is now kB (kilobytes), in previous releases this was sectors.
- dropped the ':' at the end of `Device` column name.
RHEL 7 {iostat -V => sysstat version 10.1.5} ::
[root@hostname ~]# iostat -x 5| egrep 'sdaa|sdau|Device'
Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await[1]  svctm  %util
sdaa       0.00    0.00  0.00  0.00   0.00   0.00      8.11      0.00   5.50     5.50        0.00   5.29   0.00
sdau       0.74    0.00  0.02  0.00   3.11   0.00    254.08      0.00   2.36     2.36        0.00   1.24   0.00
Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await[1]  svctm  %util
sdaa 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdau 0.00 7161.00 231.00 231.00 29568.00 29568.00 256.00 0.99 2.15 1.91 2.39 2.15 99.40
RHEL 7 iostat notes:
- [1] While /proc/diskstats has included the necessary fields for calculating separate read and write await times since at least RHEL 5, iostat prior to version 10 only calculated the combined read+write await time. With the release of sysstat version 10 in RHEL 7, iostat now calculates and presents all three variants of await time:
  - await : the combined read+write await time (the same as prior versions of iostat),
  - r_await : read await time, and
  - w_await : write await time.
- Since the separate fields within /proc/diskstats have existed since at least RHEL 5, it is possible to recompile the version 10 sysstat sources to get the separate await fields on RHEL 5 and RHEL 6 if so desired. Doing so is left as an exercise for the reader. Version 10 sysstat is not supplied nor supported by Red Hat on versions prior to RHEL 7.
RHEL 6 {iostat -V => sysstat version 9.0.4}, and RHEL 5 {iostat -V => sysstat version 7.0.2}::
[root@hostname ~]# iostat -x 5| egrep 'sdaa|sdau|Device'
Device:  rrqm/s  wrqm/s    r/s   w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdaa       0.48    0.00   0.04  0.00     8.12     0.00    203.83      0.00   4.36   1.95   0.01
sdau       8.41   73.92  17.25  4.32  4418.60  1440.98    271.57      0.03   1.62   0.72   1.56
Device:  rrqm/s  wrqm/s    r/s   w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdaa 292.79 16.82 250.49 24.50 69175.98 3081.04 262.76 5.17 18.79 0.73 20.06
sdau 73159.00 3.00 2361.00 1.00 604160.00 32.00 255.80 3.01 1.28 0.42 100.10
##Fields: Iostat Field Definitions & Notes##
See the output of man iostat for a general overview of the fields.
While most of the fields are fairly self-explanatory, others benefit from additional context and explanation. Each of the fields is described below, along with additional context where appropriate.
Within the following discussion, the value of the field being referenced is the difference, or delta, between the current sample of /proc/diskstats and the previous sample values.
For example, if the previous sample had a value within field #1 (total reads by the system since boot) of 150, and the current sample one second later has a value of 200 -- then field #1's value for the purpose of the discussion is 50 -- the difference between the two samples. That is 50 reads have completed during the current sample time.
As previously stated, many of the fields are straightforward, like r/s (reads per second) and similar. These are just the delta difference for the referenced /proc/diskstats field divided by the interval time. For these fields, no additional information or quoted code from iostat is necessary and thus is not included. The code for calculating r/s within iostat.c is the S_VALUE() macro. This macro is utilized for a significant number of the fields that are output within the extended statistics report.
S_VALUE(ioj->rd_ios, ioi->rd_ios, itv)
where ioi :: the current data from /proc/diskstats
ioj :: the previous data from /proc/diskstats
rd_ios :: the contents of field #1
itv :: interval in 1/100 second increments
#define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * 100)
or
value_dT = value_now - value_last
display-value-per-second = value_dT / interval-in-seconds, where interval-in-seconds = itv / 100 (itv is in hundredths of a second)
in the case of reads, value_dT = total-reads_dT, where dT denotes the change (delta) over the sample time
For example, 150 reads first sample, 200 reads next sample 2 seconds later:
25.00 r/s = (200 - 150) / 2
Many of the extended statistics fields represent data that is measured at I/O completion time, the back end of the I/O stack: r/s, w/s, rkB/s, wkB/s, avgrq-sz, %rrqm, %wrqm, await, r_await, w_await, svctm. Others are measured at the front end of the scheduler: rrqm/s, wrqm/s. And a final group is measured more or less continuously, every time an I/O is submitted, completed, or /proc/diskstats is read: avgqu-sz, %util.
Note that all code samples listed under the field explanations are from RHEL 8 unless otherwise specified. The following are the standard headers output within each of the different RHEL versions when the -x option is specified.
The iostat header order when -x option used:
RHEL8: r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
RHEL7: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
RHEL6: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
RHEL5: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
For the purposes of this discussion, the highlighted numbers within the following sample are referenced within some of the field descriptions below.
Device:    rrqm/s  wrqm/s      r/s    w/s     rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm   %util
sdaa       292.79   16.82   250.49  24.50   69175.98  3081.04    262.76      5.17  18.79   0.73   20.06
sdau     73159.00    3.00  2361.00   1.00  604160.00    32.00    255.80      3.01   1.28   0.42  100.10

RHEL 8 sample
Device   r/s   w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda 0.00 0.62 0.18 143.81 0.00 0.04 0.64 5.92 5.98 15.29 0.01 39.62 233.03 1.15 0.07
IOSTAT Fields
- rrqm/s = 292.79 (/proc/diskstats field: #2)
  - The number of read requests merged per second that were queued into the I/O scheduler for the device.
  - Code: S_VALUE(ioj->rd_merges, ioi->rd_merges, itv)
    - rrqm/s = read-merges_dT / interval-in-seconds
  - This rrqm/s plus r/s, i.e. 'rrqm/s + r/s', is the total number of reads submitted to the I/O scheduler for this device.
  - Merges are counted as they happen (at I/O submit time), while r/s is measured when the I/O completes.
  - Little or no read merging is often an indicator of a) random I/O or b) direct/synchronous I/O, or a combination of both, especially if the average read request size is on the small side (<16 KiB).
- wrqm/s = 16.82 (/proc/diskstats field: #6)
  - The number of write requests merged per second that were queued into the I/O scheduler for the device.
  - Code: S_VALUE(ioj->wr_merges, ioi->wr_merges, itv)
    - wrqm/s = write-merges_dT / interval-in-seconds
  - This wrqm/s plus w/s, i.e. 'wrqm/s + w/s', is the total number of writes submitted to the I/O scheduler for this device.
  - Merges are counted as they happen (at I/O submit time), while w/s is measured when the I/O completes.
  - Little or no write merging is often an indicator of a) random I/O or b) direct/synchronous I/O, or a combination of both, especially if the average write request size is on the small side (<16 KiB).
- %rrqm = 0.64 (/proc/diskstats fields: #1, #2)
  - The percentage of read requests merged together before being sent to the device, as compared to completed reads.
  - Code:
        xios.rrqm_pc = (ioi->rd_merges - ioj->rd_merges) + (ioi->rd_ios - ioj->rd_ios) ?
                       (double) ((ioi->rd_merges - ioj->rd_merges)) /
                       ((ioi->rd_merges - ioj->rd_merges) + (ioi->rd_ios - ioj->rd_ios)) * 100 : 0.0;
    - %rrqm = (read-merges_dT / (read-merges_dT + reads_dT)) * 100 (or 0.00 if no reads or merges)
  - This field is RHEL 8 and later only.
- %wrqm = 5.92 (/proc/diskstats fields: #5, #6)
  - The percentage of write requests merged together before being sent to the device, as compared to completed writes.
  - Code:
        xios.wrqm_pc = (ioi->wr_merges - ioj->wr_merges) + (ioi->wr_ios - ioj->wr_ios) ?
                       (double) ((ioi->wr_merges - ioj->wr_merges)) /
                       ((ioi->wr_merges - ioj->wr_merges) + (ioi->wr_ios - ioj->wr_ios)) * 100 : 0.0;
    - %wrqm = (write-merges_dT / (write-merges_dT + writes_dT)) * 100 (or 0.00 if no writes or merges)
  - This field is RHEL 8 and later only.
- r/s = 250.49 (/proc/diskstats field: #1)
  - The number of read requests, after merges, that were completed by the device per second.
  - Code: S_VALUE(ioj->rd_ios, ioi->rd_ios, itv)
    - r/s = reads-completed_dT / interval-in-seconds
  - Note that (rrqm/s + r/s) = the number of read requests per second issued to the I/O scheduler for this device. The r/s is the number of I/O commands, after merging, that were issued to and completed by the device.
  - IOPS, or I/O per second, can be calculated as the sum of (r/s + w/s). If using iostat without the -x option, the IOPS value is output under the tps (transactions per second) column.
  - This number is only increased when an I/O actually completes.
  - If multiple samples show neither read nor write I/O completions, but I/O is present as evidenced by %util being high -- typically close to 100% -- and avgqu-sz being non-zero, then this is nominally an indicator of a stalled disk. Typically when this happens, the await time jumps to a very large value once the stall condition within storage resolves itself.
- rsec/s = 69175.98 (/proc/diskstats field: #3)
  - The number of sectors (-k => kilobytes (KiB), -m => megabytes (MiB)) read from the device per second.
  - Code: S_VALUE(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr, where fctr is the conversion factor from sectors to the desired display units (2 for kB, for example)
    - rsec/s = (read-sectors_dT / interval-in-seconds) / 1
    - rkB/s = (read-sectors_dT / interval-in-seconds) / 2 ; KiB is the default display unit in RHEL 8, in RHEL 6 and 7 use the -k flag
  - The number of read sectors is only increased when an I/O actually completes.
  - There are iostat options that can be used to convert this field from rsec/s into other units. For example, the -k option to iostat will convert these values to kilobytes (KiB) instead of sectors. There is no option in RHEL 8 to convert from the default of kB/s back to sectors per second.
  - Output in KiB is typically preferred, as output in MiB can end up hiding smaller amounts of I/O bandwidth due to the precision of the MB/s numbers (e.g. 0.00 MB/s could mean the I/O rate was 4 kB/s rather than none).
  - The throughput (bytes per second moved between host and storage) of the device is measured as the sum of (rsec/s + wsec/s).
- w/s = 24.50 (/proc/diskstats field: #5)
  - The number of write requests, after merges, that were completed by the device per second.
  - Code: S_VALUE(ioj->wr_ios, ioi->wr_ios, itv)
    - w/s = writes-completed_dT / interval-in-seconds
  - Note that (wrqm/s + w/s) = the number of write requests per second issued to the I/O scheduler for this device. The w/s is the number of I/O commands, after merging, that were issued to and completed by the device.
  - IOPS, or I/O per second, can be calculated as the sum of (r/s + w/s). If using iostat without the -x option, the IOPS value is output under the tps (transactions per second) column.
  - This number is only increased when an I/O actually completes.
  - If multiple samples show neither read nor write I/O completions, but I/O is present as evidenced by %util being high -- typically close to 100% -- and avgqu-sz being non-zero, then this is nominally an indicator of a stalled disk. Typically when this happens, the await time jumps to a very large value once the stall condition within storage resolves itself.
- wsec/s = 3081.04 (/proc/diskstats field: #7)
  - The number of sectors (-k => kilobytes (KiB), -m => megabytes (MiB)) written to the device per second.
  - Code: S_VALUE(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr, where fctr is the conversion factor from sectors to the desired display units (2 for kB, for example)
    - wsec/s = (written-sectors_dT / interval-in-seconds) / 1
    - wkB/s = (written-sectors_dT / interval-in-seconds) / 2 ; KiB is the default display unit in RHEL 8, in RHEL 6 and 7 use the -k flag
  - The number of written sectors is only increased when an I/O actually completes.
  - There are iostat options that can be used to convert this field from wsec/s into other units. For example, the -k option to iostat will convert these values to kilobytes (KiB) instead of sectors. There is no option in RHEL 8 to convert from the default of kB/s back to sectors per second.
  - Output in KiB is typically preferred, as output in MiB can end up hiding smaller amounts of I/O bandwidth due to the precision of the MB/s numbers (e.g. 0.00 MB/s could mean the I/O rate was 4 kB/s rather than none).
  - The throughput (bytes per second moved between host and storage) of the device is measured as the sum of (rsec/s + wsec/s).
- avgrq-sz = 262.76 (/proc/diskstats fields: #1, #3, #5, #7)
  - The average size, in sectors, of the requests that were completed by the device.
  - Code:
        xds->arqsz = (sdc->nr_ios - sdp->nr_ios) ?
                     ((sdc->rd_sect - sdp->rd_sect) + (sdc->wr_sect - sdp->wr_sect)) /
                     ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
    - avgrq-sz = (read-sectors_dT + written-sectors_dT) / (total-reads_dT + total-writes_dT)
    - 262.76 = (69175.98 + 3081.04) / (250.49 + 24.50) within the highlighted sample line found at the beginning of this section.
  - This code is from RHEL 7. This field is split into the separate rareq-sz and wareq-sz columns in RHEL 8, and is output in KiB therein.
  - This field is only updated when I/O actually complete.
  - This field is always displayed in sectors; neither -k nor -m will affect the displayed units used by this column.
- rareq-sz = 39.62 (/proc/diskstats fields: #1, #3)
  - The average size, in kilobytes (KiB), of the read requests that were issued to the device.
  - The code for this is in two parts, calculating the average in sectors then converting to kilobytes:
        xios.rarqsz = (ioi->rd_ios - ioj->rd_ios) ?
                      (ioi->rd_sectors - ioj->rd_sectors) / ((double) (ioi->rd_ios - ioj->rd_ios)) : 0.0;

        /* rareq-sz wareq-sz (in kB, not sectors) */
        cprintf_f(DISPLAY_UNIT(flags) ? UNIT_KILOBYTE : NO_UNIT, 2, 8, 2,
                  xios->rarqsz / 2, xios->warqsz / 2);
    - rareq-sz = (sectors-read_dT / reads-completed_dT) / 2 (the divide by two converts sectors to KiB)
  - This field is available in the iostat version within RHEL 8 and later only.
  - This field is only updated when I/O actually complete.
  - This field is always displayed in KiB; neither -k nor -m will affect the displayed units used by this column.
- wareq-sz = 233.03 (/proc/diskstats fields: #5, #7)
  - The average size, in kilobytes (KiB), of the write requests that were issued to the device.
  - The code for this is in two parts, calculating the average in sectors then converting to kilobytes:
        xios.warqsz = (ioi->wr_ios - ioj->wr_ios) ?
                      (ioi->wr_sectors - ioj->wr_sectors) / ((double) (ioi->wr_ios - ioj->wr_ios)) : 0.0;

        /* rareq-sz wareq-sz (in kB, not sectors) */
        cprintf_f(DISPLAY_UNIT(flags) ? UNIT_KILOBYTE : NO_UNIT, 2, 8, 2,
                  xios->rarqsz / 2, xios->warqsz / 2);
    - wareq-sz = (sectors-written_dT / writes-completed_dT) / 2 (the divide by two converts sectors to KiB)
  - This field is available in the iostat version within RHEL 8 and later only.
  - This field is only updated when I/O actually complete.
  - This field is always displayed in KiB; neither -k nor -m will affect the displayed units used by this column.
- avgqu-sz = 5.17 (/proc/diskstats field: #11)
  - The average queue length of the requests that were issued to the device.
  - Code: S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0
    - avgqu-sz = (weighted-time-in-queue_dT / interval-in-seconds) / 1000-ms-per-second
    - weighted-time-in-queue = number-of-io-present * dT-since-last-calculation; see below for more information
  - In RHEL 8 and later this field was renamed to `aqu-sz`.
  - This field is based upon a running summation within field #11; as such it is updated when I/O start, while they are in progress, and when they complete.
  - The average queue size includes any I/O contained within the I/O scheduler's sort queue as well as any in-progress I/O within storage.
  - The nominal maximum value that avgqu-sz can reach is (nr_requests * 2) + max-lun-queue-depth. The nr_requests value is the maximum number of requests that can be queued within the I/O scheduler. That value is applied separately to reads and writes, hence the "* 2" within the above formula. The maximum number of I/O that can be present (in-flight to storage) within the driver/disk is the specified maximum lun queue depth for that driver (or disk). For example, the default lun queue depth for many fibre channel drivers is 32, which combined with the default nr_requests value of 128 means, in that case, the maximum avgqu-sz will be 288. Any I/O submitted above that limit will block outside of the I/O scheduler and will not be included within the I/O statistics. (A sketch showing where these limits can be read from sysfs follows this entry.)
  - When either a large amount of I/O is continuously being submitted or when I/O latency within storage occurs, this calculated average queue length will increase. The latter can typically be seen when I/O await times climb, whereas the former can be seen when the queue depth increases but await time remains steady.
  - Field #11 is defined as follows within /usr/share/doc/kernel-doc-*/Documentation/iostats.txt:
    - "Field 11 -- weighted # of milliseconds spent doing I/Os
      This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress
      (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure
      of both I/O completion time and the backlog that may be accumulating."
  - What this means is that an accumulation of number-of-ios-present x dT (delta-time) is calculated and saved within field #11. Or, more simply, something like the area under the following graph is calculated as a running summation:

           ^
       #io |        +----+              +--+
           |        |    |              |  |
           |        |    +----+         |  +---+
           |  +-----+         |         |      |
           |  |               |         |      |
           +--+---------------+---------+------+------> time -->

    Every time the number of I/O present changes, a new calculation is performed as within the kernel code below: the number of I/O present times a dT value, which is the time difference in ticks since the last calculation.

    ```
    if (inflight) {
            __part_stat_add(cpu, part, time_in_queue,       << add #inflight-io x dTime to time_in_queue (field #11)
                            inflight * (now - part->stamp));
            :
    }
    part->stamp = now;
    ```

  - The area under the above graph represents the ticks (delta-time) multiplied by the number of I/O present at any given time. The number of I/O present changes when new I/O are started (submitted, which includes merging) and when existing I/O complete. The calculation is also forced when /proc/diskstats is read so that the accumulated value reflects the appropriate value for that moment in time. The calculated area is what is referred to as the "weighted sum" within field #11's description; its units are milliseconds x number-of-io, which is why there is the "/ 1000.0" within the iostat calculation. When the accumulated area is taken at two different points in time -- the sample interval -- the difference between the two quantities represents the change of the weighted sum over the sample time. Dividing that by the number of ticks within the sample time yields the average number of I/O present over that period, that is, the average number of I/O that were present within the device, where "device" includes the combined I/O scheduler, driver, and storage subsystems. That calculated value is the reported average queue depth (the average number of I/O present against the "device").
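The two limits referenced in the nominal-maximum formula above can usually be read directly from sysfs. A quick sketch follows; sda is an example device name, and the queue_depth path only exists for SCSI-type devices.
```
# Where the avgqu-sz limits typically live in sysfs ('sda' is an example device):
cat /sys/block/sda/queue/nr_requests     # I/O scheduler queue size (applied to reads and writes separately)
cat /sys/block/sda/device/queue_depth    # LUN queue depth (SCSI devices only)

# Nominal maximum avgqu-sz per the formula above, e.g. nr_requests=128, queue_depth=32:
awk 'BEGIN { print (128 * 2) + 32 }'     # -> 288
```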
- await = 18.79 (/proc/diskstats fields: #1, #4 (reads) & #5, #8 (writes))
  - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in the I/O scheduler queue and the time spent servicing them within the storage device.
    - It is recommended to read "What exactly is the meaning of value "await" reported by iostat?" for a more in-depth discussion, explanation and interpretation of the values within the await column. What follows is a brief overview of the await column.
  - Code: The following code calculates await time (a worked /proc/diskstats sketch follows this entry):
        xds->await = (sdc->nr_ios - sdp->nr_ios) ?
                     ((sdc->rd_ticks - sdp->rd_ticks) + (sdc->wr_ticks - sdp->wr_ticks)) /
                     ((double) (sdc->nr_ios - sdp->nr_ios)) : 0.0;
    - await = (time-spent-performing-reads_dT + time-spent-performing-writes_dT) / (reads-performed_dT + writes-performed_dT)
  - Await is calculated by adding fields #4 and #8 and dividing by the sum of fields #1 and #5: the total number of milliseconds spent performing reads+writes divided by the total number of reads+writes that completed:
    - await = (total time to complete I/O in ms) / (total number of I/O that completed)
  - This field is a combination of both read and write await time merged into one value.
  - This field is not output within RHEL 8. Instead, just the separate read await and write await columns are output.
  - This value is measured on a per I/O basis from the point at which the I/O is submitted into the scheduler until I/O done time.
  - Await time will tend to spike upwards once the average queue size nears and exceeds the underlying lun queue depth of the device. This happens because I/O will then start spending significant amounts of time sitting within the I/O scheduler queue awaiting dispatch to the underlying disk; furthermore, I/O cannot be dispatched to a disk whose queue is already full.
    - The underlying LUN queue can become full for two reasons, which may happen separately or together:
      - Increased amounts of I/O are issued to the device. Nominally this is a direct or indirect result of the set of applications running on the system. Very little I/O is "manufactured" within the kernel except for things like I/O associated with swapping, and the filesystem metadata and journaling writes associated with applications interacting with files.
      - Increased load on the device results in storage latency issues. This most often occurs when the host is a virtual guest -- the physical resources of the system are shared with a number of guests, and overlapping/increased loads from multiple guests can generate storage latency issues within one or more guests. The reason, increased load on a physical resource, is hidden from the individual guest. The other most common cause is similar, but with shared storage, for example FC SAN storage. While the disks themselves are not necessarily shared, the storage controller ports and I/O controllers are often shared across multiple physical systems. An increase in I/O on one or more of these storage controllers can affect storage latency on an individual host, even though that host has not changed its I/O load.
  - While the average queue size (avgqu-sz) remains low, or at least lower than the lun queue depth of the underlying device, there is very little queueing time accumulated within the scheduler, as in this example. Therefore the await time as seen here represents, near enough, the average I/O completion time of storage, that is, the storage latency. The svctm field is often confused with this, but svctm is essentially a calculated "effective" I/O completion rate. The svctm field is simply calculated as ( %util * 1000ms / #completed I/O per second ).
    - For example, using values from the highlighted sample at the start of this section, ( 20.06% * 1000ms / 274.99 ) = 0.729ms per I/O, which matches the number in the highlighted sample line where it is rounded up to 0.73ms. The svctm calculation in effect takes into account parallel I/O being processed by storage, versus await's more directly measured per-I/O approach.
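As a cross-check of the formulas above (this is a sketch, not the iostat code itself), await, r_await and w_await can be approximated directly from two /proc/diskstats samples. The device name sda and the 5 second interval are placeholders.
```
# Approximate await, r_await and w_await for one device from /proc/diskstats
# deltas: fields #4/#8 are ms spent on reads/writes, #1/#5 are completed I/Os.
DEV=sda
INTERVAL=5

s1=$(awk -v d="$DEV" '$3 == d {print $4, $7, $8, $11}' /proc/diskstats)
sleep "$INTERVAL"
s2=$(awk -v d="$DEV" '$3 == d {print $4, $7, $8, $11}' /proc/diskstats)

echo "$s1 $s2" | awk '{
    rd_ios = $5 - $1; rd_ms = $6 - $2; wr_ios = $7 - $3; wr_ms = $8 - $4
    if (rd_ios + wr_ios) printf "await   = %.2f ms\n", (rd_ms + wr_ms) / (rd_ios + wr_ios)
    if (rd_ios)          printf "r_await = %.2f ms\n", rd_ms / rd_ios
    if (wr_ios)          printf "w_await = %.2f ms\n", wr_ms / wr_ios
}'
```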
- r_await = 1.91 {RHEL 7/version 10 or later only - /proc/diskstats fields: #1, #4}
  - The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in the I/O scheduler queue and the time spent servicing them within the storage device.
    - It is recommended to read "What exactly is the meaning of value "await" reported by iostat?" for a more in-depth discussion and explanation of the await column. What follows is a brief explanation of the r_await column.
  - The following code calculates r_await time:
        xios.r_await = (ioi->rd_ios - ioj->rd_ios) ?
                       (ioi->rd_ticks - ioj->rd_ticks) / ((double) (ioi->rd_ios - ioj->rd_ios)) : 0.0;
    - r_await = time-spent-performing-reads_dT / reads-performed_dT
  - Having separate read and write await times can help show whether any potential storage latency issue is read or write sensitive.
  - This value is measured on a per I/O basis from the point at which the I/O is submitted into the scheduler until I/O done time.
  - Await time will tend to spike upwards once the average queue size nears and exceeds the underlying lun queue depth of the device. This happens because I/O will then start spending significant amounts of time sitting within the I/O scheduler queue awaiting dispatch to the underlying disk; furthermore, I/O cannot be dispatched to a disk whose queue is already full.
  - While the average queue size (avgqu-sz) remains low, or at least lower than the lun queue depth of the underlying device, there is very little queueing time accumulated within the scheduler, as in this example. Therefore the await time as seen here represents, near enough, the average I/O completion time of storage, that is, the storage latency.
- w_await = 2.39 {RHEL 7/version 10 or later only - /proc/diskstats fields: #5, #8}
  - The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in the I/O scheduler queue and the time spent servicing them within the storage device.
    - It is recommended to read "What exactly is the meaning of value "await" reported by iostat?" for a more in-depth discussion and explanation of the await column. What follows is a brief explanation of the w_await column.
  - The following code calculates w_await time:
        w_await = (ioi->wr_ios - ioj->wr_ios) ?
                  (ioi->wr_ticks - ioj->wr_ticks) / ((double) (ioi->wr_ios - ioj->wr_ios)) : 0.0;
    - w_await = time-spent-performing-writes_dT / writes-performed_dT
  - Having separate read and write await times can help show whether any potential I/O latency issue is read or write sensitive.
  - This value is measured on a per I/O basis from the point at which the I/O is submitted into the scheduler until I/O done time.
  - Await time will tend to spike upwards once the average queue size nears and exceeds the underlying lun queue depth of the device. This happens because I/O will then start spending significant amounts of time sitting within the I/O scheduler queue awaiting dispatch to the underlying disk; furthermore, I/O cannot be dispatched to a disk whose queue is already full.
  - While the average queue size (avgqu-sz) remains low, or at least lower than the lun queue depth of the underlying device, there is very little queueing time accumulated within the scheduler, as in this example. Therefore the await time as seen here represents, near enough, the average I/O completion time of storage, that is, the storage latency.
- svctm = 0.73 {/proc/diskstats fields: #1, #5, #10}
  - The effective average service time, in milliseconds, for completed I/O requests that were issued to the device.
    - See "Is the svctm's output of iostat unreliable as described within the man page?" for a more detailed discussion of svctm. And yes, it is not wholly reliable, but that is more related to what we think of as "service time" from a disk hardware perspective. Just remember this is a pure calculation within iostat. Typically the await values are much more useful to focus on.
  - The following code shows how svctm is calculated:
        sdc.nr_ios = ioi->rd_ios + ioi->wr_ios;
        sdp.nr_ios = ioj->rd_ios + ioj->wr_ios;

        tput = ((double) (sdc->nr_ios - sdp->nr_ios)) * 100 / itv;    [1]
        xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itv);     [2]
        xds->svctm = tput ? xds->util / tput : 0.0;
    - svctm = (total-ms-performing-io_dT / interval) / (number-of-io-performed_dT / interval), i.e. total-ms-performing-io_dT / number-of-io-performed_dT
    - [1] itv is in hundredths of a second, so with a 1 second sample interval itv=100, with a 2 second interval itv=200.
    - [2] tot_ticks is field #10, the total ms spent doing I/O (includes in-progress I/O).
  - The svctm is not a measure of time from an I/O perspective, that is storage latency, but from a wall clock perspective.
  - This is different from the read and write await fields, which are calculated from the measured time to complete I/O. The svctm is just the percentage of the sample time when any I/O was present (%util) divided by the number of I/O that completed during that interval sample.
    - For example, 1 I/O is issued and is outstanding for 10ms, at which time it completes:
      - await = 10ms, the time it took the I/O to complete.
      - svctm = 10ms, the time used within the current sample during which I/O was present, divided by 1.
    - For example, 100 simultaneous I/O (in parallel) are issued, all are outstanding for 10ms, at which time they all complete together/simultaneously:
      - await = 10ms, the time it took each I/O to complete.
      - svctm = 0.1ms, the time used within the current sample during which I/O was present (10ms), divided by 100 I/O.
  - This value, as seen above, is just a calculation based upon the busy time of the device (%util) and the number of I/O completed within that time. It has no relationship in any form to published disk service time specifications.
  - For example, from the highlighted sample (see also the quick check after this entry):
    - svctm = ( %util * 1000ms ) / ( r/s + w/s ), i.e. device "busy" time / number of completed I/O
    - 0.729ms = ( 20.06% * 1000ms ) / ( 250.49 + 24.50 ), which is the highlighted sample line value for svctm, displayed as the rounded up value of 0.73.
  - The difference between await and svctm is that await directly measures the amount of time taken for each I/O and calculates an average, whereas svctm is an indirect calculation based upon %util (the amount of the sample interval during which any I/O was present) and the number of completed I/O per second. In essence you can think of svctm as taking into account parallel I/O operations whereas await does not. The await time is the directly measured time it has taken each I/O to complete, whereas svctm is just a rough calculation of the effective I/O completion time.
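A quick arithmetic check of that svctm relationship using the highlighted sample values; this is plain shell arithmetic, nothing iostat-specific.
```
# svctm = ( %util * 1000ms ) / ( r/s + w/s ), using the highlighted sdaa sample values
awk 'BEGIN { printf "svctm = %.3f ms\n", (20.06 / 100 * 1000) / (250.49 + 24.50) }'
# -> svctm = 0.729 ms, which iostat displays rounded to 0.73
```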
- %util = 20.06 (/proc/diskstats field: #10)
  - Percentage of elapsed time (wall clock time) during a sample interval during which I/O requests were issued to the device. Device %busy time.
    - Example: if field #10 has a difference of 1200ms between the start and end of a 5 second interval, then %util is calculated as 24% (1200ms/5000ms).
    - As %util approaches 100%, does this represent device saturation?
      - The most common answer is "No", as %util is just a measure of wall clock time, as will be seen below -- it is just a measure of device %busy time.
      - Specifically, %util is simply a measure of the percentage of the sample interval during which there was any I/O present on the device. Any interpretation of %util other than as a device %busy time is based upon one or more assumptions that would need to be carefully documented and explained as to when and why that interpretation of wall clock time means something more.
      - See "What exactly is the meaning of value "%util" reported by iostat?" for a more in-depth discussion and explanation of the %util column.
    - By way of an example, a 50+ disk RAID6 logical device is created within FC SAN storage and presented to the host. It is attached as sdzz. An application is running that sends I/O to sdzz. The iostat output shows %util=100% and avgqu-sz=1 (one I/O is submitted at a time). Under this type of I/O load with that hardware configuration, only 1 of the 50+ physical disks will have any I/O to perform at any given time. The other 49+ disks will be idle. So at most 1/50, or 2%, of the total performance limit/capacity of the logical device needed to create device saturation is present. This is akin to many SSD and NVMe devices, which can have 8-32 internal channels, each channel able to process I/O in parallel with, and independently from, the other channels. This is one of the main reasons that it is generally not true that a value of %util ~= 100% represents device saturation. The interpretation of how much device capacity is being used will change depending on both the current I/O load profile as well as the type of storage, the configuration of that storage, etc. And in many of those cases the internal details of the number and type of physical disks that back the logical device presented to the host will be unknown from the host's perspective.
  - The code within iostat for calculating %util (a shell sketch of the same delta calculation follows this entry):
        xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itv);               [1]

        /*
         * %util
         * Again: Ticks in milliseconds.
         * In the case of a device group (option -g), shi->used is the number of
         * devices in the group. Else shi->used equals 1.
         */
        cprintf_pc(DISPLAY_UNIT(flags), 1, 6, 2,
                   shi->used ? xds->util / 10.0 / (double) shi->used            [2]
                             : xds->util / 10.0); /* shi->used should never be zero here */
    - %util = (clock-time-with-io-present-on-device_dT / interval) / 10
    - [1] tot_ticks is field #10, the total ms spent doing I/O (includes in-progress I/O). The calculation at [1] above determines the amount of clock time during which there was at least one I/O present on the device, then divides that by the interval to get a relative value per second. Note that field #10 is different from the read and write fields #4 and #8. Fields #4 and #8 accumulate time per completed I/O, but field #10 does not measure time from a per-I/O perspective, it measures it from a clock perspective (see the example below). For example, one I/O outstanding for the last 10ms means field #10 will be increased by 10ms. If there are one hundred I/O present in parallel (avgqu-sz=100), then field #10 is again only increased by 10ms, not 100 x 10ms. The 10ms is the measured wall clock time that any I/O (1 or 100) was present against the device. But for fields #4 and #8 the values added would eventually be 10ms and 1000ms respectively.
    - [2] The calculation in [1] yields the number of ms per sample interval during which there was any I/O present. The "/ 10.0" converts down from 1000ms-per-second to a percentage. If the calculated value was 1000ms -- I/O present throughout a one second interval -- then %util equals 100%, hence dividing the 1000ms by 10 to yield the value of 100%.
  - The %util is just a measure of wall clock time, and calling it device %busy is a more apt description of this field. All it represents is the percentage of the sample time interval during which there was any I/O activity on the device. So if a device is 100% busy (%util=100), it just means there was at least one read or write I/O always present against the device throughout the sample interval. The avgqu-sz should be approximately 1 or more when %util is near or at 100% (it is 3.01 for device sdau in the initial highlighted sample).
  - During 200.6ms of each 1000ms sample interval there was at least one I/O outstanding to device sdaa within the initial highlighted sample shown at the top of this field description section. This represents the device being busy 20.06% of the time, as listed under %util in that output.
  - The %util is calculated by tracking when I/O is submitted into the scheduler and when I/O completes. The counter is also updated when /proc/diskstats is read. An internal accumulator is used to track the number of jiffies (timer ticks) during which I/O, any I/O, was present within the lower portion of the kernel's I/O stack. As long as the count of outstanding I/O within the scheduler/driver/device is greater than 0 (field #9 > 0), then time is accumulated towards %util. For example:

            ^
        #io |        +----+              +--+
            |        |    |              |  |
            |        |    +----+         |  +---+
            |  +-----+         |         |      |
            |  |               |         |      |
            +--+---------------+---------+------+------> time -->
               avgqu-sz = field #11, weighted ms performing I/O

            ^
      %util |  a               b         c      d
            |  +---------------+         +------+
            |  |               |         |      |
            +--+---------------+---------+------+------> time -->
                  dTime(1)                 dTime(2)
               %util = field #10, number of ms any I/O is present

    In the above case, the simple elapsed wall clock times in milliseconds from a->b and c->d are accumulated into the tot_ticks counter (field #10 of /proc/diskstats).
  - The key piece of kernel code is below. If there are any outstanding I/O 'in_flight' (field #9), then the io_ticks accumulator has the delta time amount of (now minus last) added to the io_ticks count. The 'last', disk->stamp, is just the last time this calculation was performed. The disk structure referenced in the following code is a struct gendisk. The 'now' time is saved into disk->stamp ('last') for the next time the calculation is made. This code is called whenever an I/O is submitted, completed, or /proc/diskstats is read.

        if (disk->in_flight) {
                :
                __disk_stat_add(disk, io_ticks, (now - disk->stamp));
        }
        disk->stamp = now;

    So %util is just another field with a simple delta of its value divided by the interval time. That calculation results in a number of milliseconds spent doing I/O within a single interval. If the delta of field #10 is 20ms between the beginning and end of a 1 second sample, then %util is 2.00% (20ms/1000ms = 0.02 = 2.00%) -- the device is busy just 2% of the time within this one second interval.
  - Also see "How to determine how much I/O load is coming currently with the value of %util appeared by running 'iostat -x' in RHEL?" for additional information.
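A minimal shell sketch of the same field #10 delta calculation; sda and the 5 second interval are placeholders, and this ignores the device-group divisor that iostat applies when -g is used.
```
# Approximate %util for one device: delta of field #10 (ms doing I/O, the 13th
# column of /proc/diskstats) divided by the interval length in ms.
DEV=sda
INTERVAL=5

t1=$(awk -v d="$DEV" '$3 == d {print $13}' /proc/diskstats)
sleep "$INTERVAL"
t2=$(awk -v d="$DEV" '$3 == d {print $13}' /proc/diskstats)

awk -v a="$t1" -v b="$t2" -v t="$INTERVAL" 'BEGIN { printf "%%util = %.2f%%\n", (b - a) / (t * 1000) * 100 }'
```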
##IOPS: How can the total iops (I/O commands per second) be calculated or determined using iostat output? ##
For an individual device, adding r/s and w/s together provides the total IOPS for that device. Using iostat without arguments will provide a tps (transactions per second) value, which is this sum.
The iostat rsec/s and wsec/s output columns are the amount of data moved rather than IOPS. The amount of data per second processed by a device is usually referred to as throughput, whereas IOPS is the number of I/O commands processed per second. Each I/O command typically reads or writes some amount of data, so the two counters measure different things.
To determine the IOPS for the whole system, on RHEL 8 you can run iostat -g summary /dev/sd\*[a-z] | egrep "Device|summary", which will create a group called 'summary' and sum all the sd* devices into that group within the output, as in:
# iostat -g summary /dev/sd*[a-z] | egrep "Device|summary"
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
Summary 0.66 1.29 144.43 6522876 732226949
The -g option can be used to group all devices, only the devices attached via a specific HBA or storage target, or all paths to a LUN into one statistic.
The -g option is only available starting in RHEL 8. For prior versions see, for example, "How to check and measure FC HBA utilization?" which discusses how to manually do this or, optionally, use an as-is program to automatically do this for related groups of disks (by HBA/scsi host, by storage target, by LUN for example).
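On releases without the -g option, the per-device r/s and w/s columns can simply be summed. A hedged awk sketch over a single extended-statistics sample follows; the column positions ($4 and $5) match the RHEL 6/7 layout shown earlier, so adjust them for RHEL 8 and later where r/s and w/s are columns 2 and 3.
```
# Sum r/s + w/s across all sd* devices in the final report of a two-report run
# (the first report is the since-boot average). Adjust $4/$5 for other layouts.
iostat -x 5 2 | awk '/^Device/ { sample++; iops = 0 }
                     sample == 2 && /^sd/ { iops += $4 + $5 }
                     END { printf "total IOPS: %.2f\n", iops }'
```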
##Package: Which package provides iostat? ##
The iostat utility is available from the sysstat package.
yum install sysstat
- RHEL 8 : sysstat version 11.7.3*
- RHEL 7 : sysstat version 10.1.5*
- RHEL 6 : sysstat version 9.0.4*
- RHEL 5 : sysstat version 7.0.2*
For example, on RHEL 8:
# yum whatprovides */iostat
:
sysstat-11.7.3-2.el8.x86_64 : Collection of performance monitoring tools for Linux
Repo : rhel-8-for-x86_64-appstream-rpms
Matched from:
Filename : /usr/bin/iostat
##Code: What does the iostat code look like that reads in the data from /proc/diskstats and outputs values?##
Within iostat.c, /proc/diskstats values are pulled in and assigned to variables here:
iostat.c::
676 /*
677 ***************************************************************************
678 * Read stats from /proc/diskstats.
679 *
680 * IN:
681 * @curr Index in array for current sample statistics.
682 ***************************************************************************
683 */
684 void read_diskstats_stat(int curr)
685 {
:
.
705 /* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
706 i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
707 &major, &minor, dev_name,
708 &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
709 &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
The read-in values are used to compute the change within each field between the two sample times taken 'N' seconds apart, where 'N' is specified by the [interval] period (see man iostat). This set of delta values is used to output the disk statistics seen. The common S_VALUE macro computes the difference between the specified now and last time samples, then divides by the interval time. So, for example, if /proc/diskstats field #1, total reads performed since boot time, changed from 100 last sample to 150 this sample and the sample interval was set to 2 seconds, then 25.0 reads/second would be calculated and output.
The iostat.c code outputs the values here:
RHEL 8
iostat.c::
1233 /*
1234 ***************************************************************************
1235 * Display extended stats, read from /proc/{diskstats,partitions} or /sys,
1236 * in plain or JSON format.
1237 *
1238 * IN:
1239 * @itv Interval of time.
1240 * @fctr Conversion factor.
1241 * @shi Structures describing the devices and partitions.
1242 * @ioi Current sample statistics.
1243 * @ioj Previous sample statistics.
1244 * @tab Number of tabs to print (JSON output only).
1245 ***************************************************************************
1246 */
1247 void write_ext_stat(unsigned long long itv, int fctr,
1248 struct io_hdr_stats *shi, struct io_stats *ioi,
1249 struct io_stats *ioj, int tab)
1250 {
:
1104 /* r/s w/s */
1105 cprintf_f(NO_UNIT, 2, 7, 2,
1106 S_VALUE(ioj->rd_ios, ioi->rd_ios, itv),
1107 S_VALUE(ioj->wr_ios, ioi->wr_ios, itv));
1108 /* rkB/s wkB/s */
1109 if (!DISPLAY_UNIT(flags)) {
1110 xios->rsectors /= fctr;
1111 xios->wsectors /= fctr;
1112 }
1113 cprintf_f(DISPLAY_UNIT(flags) ? UNIT_SECTOR : NO_UNIT, 2, 9, 2,
1114 xios->rsectors, xios->wsectors);
1115 /* rrqm/s wrqm/s */
1116 cprintf_f(NO_UNIT, 2, 8, 2,
1117 S_VALUE(ioj->rd_merges, ioi->rd_merges, itv),
1118 S_VALUE(ioj->wr_merges, ioi->wr_merges, itv));
1119 /* %rrqm %wrqm */
1120 cprintf_pc(DISPLAY_UNIT(flags), 2, 6, 2,
1121 xios->rrqm_pc, xios->wrqm_pc);
1122 /* r_await w_await */
1123 cprintf_f(NO_UNIT, 2, 7, 2,
1124 xios->r_await, xios->w_await);
1125 /* aqu-sz */
1126 cprintf_f(NO_UNIT, 1, 6, 2,
1127 S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);
1128 /* rareq-sz wareq-sz (in kB, not sectors) */
1129 cprintf_f(DISPLAY_UNIT(flags) ? UNIT_KILOBYTE : NO_UNIT, 2, 8, 2,
1130 xios->rarqsz / 2, xios->warqsz / 2);
1131 /* svctm - The ticks output is biased to output 1000 ticks per second */
1132 cprintf_f(NO_UNIT, 1, 6, 2, xds->svctm);
1133 /*
1134 * %util
1135 * Again: Ticks in milliseconds.
1136 * In the case of a device group (option -g), shi->used is the number of
1137 * devices in the group. Else shi->used equals 1.
1138 */
1139 cprintf_pc(DISPLAY_UNIT(flags), 1, 6, 2,
1140 shi->used ? xds->util / 10.0 / (double) shi->used
1141 : xds->util / 10.0); /* shi->used should never be zero here */
RHEL7,6
iostat.c::
/* rrq/s wrq/s r/s w/s rsec wsec rqsz qusz await r_await w_await svctm %util */
printf(" %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %7.2f %7.2f %6.2f %6.2f %d\n",
S_VALUE(ioj->rd_merges, ioi->rd_merges, itv), <== rrqm/s
S_VALUE(ioj->wr_merges, ioi->wr_merges, itv), <== wrqm/s
S_VALUE(ioj->rd_ios, ioi->rd_ios, itv), <== r/s
S_VALUE(ioj->wr_ios, ioi->wr_ios, itv), <== w/s
ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr, <== rsec/s
ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr, <== wsec/s
xds.arqsz, <== avgrq-sz
S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0, <== avgqu-sz
xds.await, <== await
r_await, <== r_await
w_await, <== w_await
/* The ticks output is biased to output 1000 ticks per second */
xds.svctm, <== svctm
/*
* Again: Ticks in milliseconds.
* In the case of a device group (option -g), shi->used is the number of
* devices in the group. Else shi->used equals 1.
*/
shi->used ? xds.util / 10.0 / (double) shi->used
: xds.util / 10.0); /* shi->used should never be null here */ <== %util
## Stalled Disk Examples ##
The following shows a partial and a full storage hardware stall condition.
Partial Storage Hardware Stall Condition Example
In this case there was a fairly steady read I/O load against the device. The application that is reading from the device
is issuing a stream of read I/O (one at a time). So the device should always be kept busy with I/O. We can see that near the
end of the sample once the partial stall condition within the storage hardware resolves itself, the expected I/O rate of
around 1MB/s returns.
Which is good for us, because it shows stalled I/Os fairly easily. The scsi specification allows up to 256 soft retries within a
device. For hard disks, these "soft retries" typically involve re-reads and attempts at head offset, etc. while attempting to
read the data. Re-read attempts on cdroms are especially slow, so they make an ideal medium to show the profile of stalled I/O
as follows:
$ iostat -tkx 1 | egrep "sda |Device"
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 60.24 0.00 1.90 0.00 144.11 0.00 151.98 0.12 65.77 65.77 0.00 34.81 6.60
sda 0.00 0.00 1.00 0.00 2.00 0.00 4.00 1.00 543.00 543.00 0.00 1000.00 100.00 [1] few/slow io
sda 0.00 0.00 2.00 0.00 4.00 0.00 4.00 1.00 515.00 515.00 0.00 500.50 100.10
sda 0.00 0.00 66.00 0.00 132.00 0.00 4.00 1.00 17.23 17.23 0.00 15.09 99.60 [2] burst of io, device catches up
sda 0.00 0.00 1.00 0.00 2.00 0.00 4.00 1.00 579.00 579.00 0.00 999.00 99.90 [3] few/slow io
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 100.00 [4] no I/O completing, but device has I/O to work
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 100.00 [5] ditto
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 100.00 [6] ditto
sda 0.00 0.00 1.00 0.00 2.00 0.00 4.00 1.00 4431.00 4431.00 0.00 1000.00 100.00 [7] stalled I/O finally completes
sda 0.00 0.00 2.00 0.00 4.00 0.00 4.00 1.00 525.00 525.00 0.00 500.00 100.00
sda 0.00 0.00 46.00 0.00 92.00 0.00 4.00 0.97 26.61 26.61 0.00 21.13 97.20
sda 0.00 0.00 306.00 0.00 612.00 0.00 4.00 0.86 2.82 2.82 0.00 2.82 86.20 [8] hardware has worked through whatever the issue was
sda 0.00 0.00 300.00 0.00 600.00 0.00 4.00 0.93 3.08 3.08 0.00 3.08 92.50
sda 0.00 0.00 300.00 0.00 600.00 0.00 4.00 0.89 2.97 2.97 0.00 2.98 89.30
sda 0.00 0.00 252.00 0.00 504.00 0.00 4.00 0.80 3.19 3.19 0.00 3.19 80.30
sda 0.00 0.00 264.00 0.00 528.00 0.00 4.00 0.90 2.33 2.33 0.00 3.40 89.80
sda 0.00 0.00 464.00 0.00 928.00 0.00 4.00 0.90 2.38 2.38 0.00 1.95 90.40
sda 0.00 0.00 655.00 0.00 1310.00 0.00 4.00 0.91 1.51 1.51 0.00 1.38 90.50
sda 0.00 0.00 887.00 0.00 1774.00 0.00 4.00 0.83 0.94 0.94 0.00 0.94 83.30
- [1] "Slow" I/O present, just 1 I/O completed in this 1 second sample. The %util indicates that at least 1 I/O was outstanding to this device the whole 1 second sample time.
- [2] A burst of reads complete.
- [3] "Slow" I/O starts again, only 2 reads complete. The svctim approaching sample time often happens in a partial stall scenario as in this sample with a svctm of 999ms in a 1s sample time.
- [4] No I/O complete -- r/s = w/s = 0 -- but both %util and avgqu-sz indicate there is 1 I/O present against this device the whole 1 second sample. Its just the storage device was unable to complete the read for some reason.
- [5] No I/O complete, but I/O has been sent to the disk. System continues to wait for I/O to complete.
- [6] Same again.
- [7] The I/O that had been sent to the device some 4 samples earlier finally completes. The I/O r_await time of 4.4s reflects the time the disk was stalled and not completing any I/O. That is, I/O submitted at [3] or even earlier has only just completed during this sample period.
- [8] We can see from [7] onward that the device finally starts completing more and more read I/O as r/s column shows. Also, the r_await time (storage latency) continues to drop and by [8] is at normal levels for this device -- around 2ms per read. From this point forward, the device is operating as expected. However, this behavior was repeatable and cyclical and happened multiple times, always with the same above footprint.
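If it helps, the stall footprint above can be watched for automatically. The following is a minimal sketch (the device name "sda" and the 90% threshold are illustrative assumptions) that prints any extended iostat sample where no reads or writes completed while %util stayed high. It assumes the RHEL 6/7 style column layout shown above, where $4, $5 and $NF are the r/s, w/s and %util columns; newer releases order the columns differently.
$ iostat -kx 1 | awk '$1 == "sda" && $4+0 == 0 && $5+0 == 0 && $NF+0 > 90 {print "possible stall:", $0}'
If the output appears delayed through the pipe, prefixing iostat with stdbuf -oL can help.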
Full Storage Hardware Stall Condition Example
In this case there was a known firmware issue within a backplane raid controller. Under certain configuration settings
a full hardware stall condition would happen. This made it easy to reproduce this footprint for documentation purposes.
This type of full hardware stall can happen, for example, when the storage hardware has to rebuild a RAID set and the configuration
settings give the rebuild precedence over I/O requests from the host.
So in this case we started a couple of simple dd commands writing to the raw device. The stall only occurs when there is
parallel (simultaneous) I/O present, and the easiest way to accomplish that was to invoke multiple dd processes. These
processes just perform sequential writes to the disk. The stall happens with more predictability under these circumstances.
$ ps aux | grep dd
root 9450 0.0 0.0 105176 672 pts/0 D 14:28 0:00 dd if=/dev/zero of=/dev/sde bs=8k oflag=direct
root 9456 0.0 0.0 105176 672 pts/0 D 14:28 0:00 dd if=/dev/zero of=/dev/sde bs=8k oflag=direct
root 9457 0.0 0.0 105176 668 pts/0 D 14:28 0:00 dd if=/dev/zero of=/dev/sde bs=8k oflag=direct
root 9458 0.0 0.0 105176 672 pts/0 D 14:28 0:00 dd if=/dev/zero of=/dev/sde bs=8k oflag=direct
root 9459 0.0 0.0 105176 672 pts/0 D 14:28 0:00 dd if=/dev/zero of=/dev/sde bs=8k oflag=direct
root 9466 2.3 0.0 105700 1172 pts/0 D 14:28 0:05 dd if=/dev/zero of=/dev/sde bs=512k
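For reference, a parallel write load similar to the one in the ps listing above could be started with something along these lines. This is only a sketch: /dev/sde is the scratch test device from this example, and writing to a raw device destroys whatever is on it, so only ever run this against a disposable test disk.
# WARNING: destructive -- overwrites the target device; scratch/test disks only
DEV=/dev/sde                 # test device from this example; substitute your own
for i in 1 2 3 4 5; do
    dd if=/dev/zero of="$DEV" bs=8k oflag=direct &
done
dd if=/dev/zero of="$DEV" bs=512k &
wait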
So with the dd commands running and writing to the disk, run iostat and wait until the stall happens.
$ iostat -tkx 1 | grep sde
sde 0.00 0.00 0.00 107.00 0.00 856.00 16.00 174.00 43.65 9.35 100.00
sde 0.00 0.00 0.00 115.00 0.00 920.00 16.00 174.00 47.58 8.70 100.00
sde 0.00 0.00 0.00 116.00 0.00 928.00 16.00 173.99 43.09 8.62 100.00
sde 0.00 0.00 0.00 115.00 0.00 920.00 16.00 173.99 43.77 8.70 100.00
sde 0.00 0.00 0.00 114.00 0.00 912.00 16.00 173.99 43.31 8.77 100.00
sde 0.00 0.00 0.00 112.00 0.00 896.00 16.00 174.00 43.86 8.93 100.00
sde 0.00 0.00 0.00 117.00 0.00 936.00 16.00 174.00 43.50 8.55 100.00
sde 0.00 0.00 0.00 111.00 0.00 888.00 16.00 174.00 44.85 9.01 100.00
sde 0.00 0.00 0.00 109.00 0.00 872.00 16.00 174.00 46.14 9.17 100.00
sde 0.00 0.00 0.00 122.00 0.00 9040.00 148.20 173.49 1704.41 8.20 100.10
sde 0.00 3937.00 0.00 108.00 0.00 23036.00 426.59 152.08 5423.20 9.26 100.00 [1]
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 145.00 0.00 0.00 100.00 [2]
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 145.00 0.00 0.00 100.00 [2]
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 145.00 0.00 0.00 100.00 [2]
sde 0.00 4318.00 0.00 62.00 0.00 8052.00 259.74 153.62 4581.65 16.13 100.00 [3]
sde 0.00 0.00 0.00 120.00 0.00 960.00 16.00 165.00 43.52 8.33 100.00
sde 0.00 0.00 0.00 113.00 0.00 904.00 16.00 165.00 43.56 8.85 100.00
sde 0.00 0.00 0.00 101.00 0.00 808.00 16.00 165.00 47.34 9.90 100.00
sde 0.00 17145.00 0.00 172.00 0.00 76092.00 884.79 157.53 12375.68 5.81 100.00
sde 0.00 0.00 0.00 80.00 0.00 2656.00 66.40 147.50 805.12 12.50 100.00
sde 0.00 5327.00 0.00 121.00 0.00 9440.00 156.03 163.03 322.04 8.27 100.10
- [1] The await time is 5.4s, which means most of the I/O completing here was queued up about 5 samples earlier. This is the footprint of storage having issues, perhaps with I/O prioritization, in that it was still completing SOME I/O even while other I/O was stalled.
- [2] This is the prototypical footprint of a hardware stall: %util of 100 indicates there was I/O outstanding to the device for the whole sample time (in this case 3 consecutive 1 second samples), yet zero I/O completed.
- [3] The hardware then becomes "unstuck" and I/O completions flood out; the await time shows that the completing I/O had been outstanding to the device for 4.5 seconds on average. That is, the I/O was submitted to the disk 4.5 seconds earlier and is only now completing.
The above pattern repeats every once in a while. The device always recovers, but this firmware revision was a great way of
demonstrating the issue. Here is a longer snippet of the same issue on another device on the same controller, but on a
different RHEL version with an updated driver. The problem is still present; only after a firmware upgrade
within the storage controller did the problem no longer occur.
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 39.00 0.00 312.00 16.00 0.34 8.56 8.67 33.80
sdf 0.00 0.00 0.00 119.00 0.00 952.00 16.00 0.99 8.34 8.36 99.50
sdf 0.00 0.00 0.00 120.00 0.00 960.00 16.00 0.99 8.29 8.27 99.20
sdf 0.00 0.00 0.00 120.00 0.00 960.00 16.00 1.00 8.32 8.30 99.60
sdf 0.00 0.00 0.00 190.00 0.00 1520.00 16.00 1.59 8.31 5.25 99.70
sdf 0.00 0.00 0.00 304.00 0.00 2432.00 16.00 3.13 10.27 3.29 100.00
sdf 0.00 0.00 0.00 378.00 0.00 3024.00 16.00 4.01 10.52 2.65 100.00
sdf 0.00 0.00 0.00 420.00 0.00 3360.00 16.00 4.99 11.98 2.38 100.00
sdf 0.00 0.00 0.00 397.00 0.00 3176.00 16.00 4.99 12.32 2.52 100.00
sdf 0.00 0.00 0.00 424.00 0.00 3392.00 16.00 4.98 11.97 2.36 100.00
sdf 0.00 0.00 0.00 281.00 0.00 2248.00 16.00 5.00 14.68 3.56 100.00
sdf 0.00 0.00 0.00 481.00 0.00 3848.00 16.00 4.99 12.17 2.08 100.10
sdf 0.00 0.00 0.00 435.00 0.00 3480.00 16.00 4.99 11.39 2.30 100.00
sdf 0.00 0.00 0.00 301.00 0.00 2408.00 16.00 4.99 16.61 3.32 100.00
sdf 0.00 0.00 0.00 301.00 0.00 2408.00 16.00 4.99 16.52 3.32 100.00
sdf 0.00 0.00 0.00 286.00 0.00 2288.00 16.00 4.99 17.52 3.50 100.00
sdf 0.00 0.00 0.00 359.00 0.00 2872.00 16.00 4.99 13.45 2.79 100.00
sdf 0.00 0.00 0.00 318.00 0.00 2544.00 16.00 4.99 16.14 3.14 100.00
sdf 0.00 0.00 0.00 322.00 0.00 2576.00 16.00 4.99 14.65 3.11 100.00
sdf 0.00 0.00 0.00 262.00 0.00 2096.00 16.00 5.00 20.29 3.82 100.10
sdf 0.00 0.00 0.00 318.00 0.00 2544.00 16.00 4.99 15.00 3.14 100.00
sdf 0.00 0.00 0.00 302.00 0.00 2416.00 16.00 4.99 17.24 3.31 100.00
sdf 0.00 0.00 0.00 359.00 0.00 2872.00 16.00 4.99 13.89 2.79 100.00
sdf 0.00 0.00 0.00 351.00 0.00 2808.00 16.00 4.99 14.17 2.85 100.00
sdf 0.00 0.00 0.00 313.00 0.00 2504.00 16.00 4.99 16.04 3.19 100.00
sdf 0.00 20091.00 0.00 325.00 0.00 18648.00 114.76 41.81 28.15 3.08 100.00
sdf 0.00 20629.00 0.00 223.00 0.00 78676.00 705.61 153.65 580.18 4.48 100.00
sdf 0.00 11176.00 0.00 156.00 0.00 39040.00 500.51 152.13 638.92 6.42 100.10
sdf 0.00 16637.00 0.00 217.00 0.00 67456.00 621.71 150.17 761.00 4.61 100.00
sdf 0.00 5207.00 0.00 152.00 0.00 31956.00 420.47 153.47 650.00 6.58 100.00
sdf 0.00 11174.00 0.00 156.00 0.00 46596.00 597.38 157.97 1106.33 6.41 100.00
sdf 0.00 16891.00 0.00 208.00 0.00 65656.00 631.31 154.96 955.19 4.81 100.00
sdf 0.00 16781.00 0.00 185.00 0.00 56912.00 615.26 154.23 696.17 5.41 100.00
sdf 0.00 10922.00 0.00 210.00 0.00 60704.00 578.13 151.28 833.13 4.76 100.00
sdf 0.00 19931.00 0.00 243.00 0.00 77516.00 637.99 151.31 763.74 4.12 100.10
sdf 0.00 29245.00 0.00 305.00 0.00 115280.00 755.93 148.39 529.52 3.28 100.00
sdf 0.00 15205.00 0.00 130.00 0.00 52944.00 814.52 151.84 547.58 7.69 100.00 << I/O submitted
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 163.00 0.00 0.00 100.00 << stall #1
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 163.00 0.00 0.00 100.00 << stall #2
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 163.00 0.00 0.00 100.00 << stall #3
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 163.00 0.00 0.00 100.00 << stall #4
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 163.00 0.00 0.00 100.00 << stall #5
sdf 0.00 5715.00 0.00 70.00 0.00 32300.00 922.86 161.52 6893.67 14.30 100.10 << I/O completes
sdf 0.00 12191.00 0.00 99.00 0.00 44636.00 901.74 153.33 6043.07 10.10 100.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << stalls again
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.15 0.00 0.00 100.10 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 153.00 0.00 0.00 100.00 << "