Is the svctm output of iostat unreliable, as described in the man page?
Environment
- Red Hat Enterprise Linux 9
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 6
- Exceptions:
- RHEL 9
- Due to its generally unreliable nature, the svctm column has been removed from iostat output starting with RHEL 9. Per the sysstat git repository changelog:
  2018/12/14: Version 12.1.2 - Sebastien Godard (sysstat orange.fr)
  * iostat/sar: Remove service time (svctm) metric. Service time value (displayed as svctm by sar and iostat) could no longer be trusted and has been made obsolete for several years. So remove it now from sar (sar -d) and iostat output.
Issue
- The following is a portion of iostat's man page. It provides a warning and indicates the svctm field will be removed in a future sysstat version.
  # man iostat
  svctm
      The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
- If the output of svctm from iostat cannot be trusted, are there any alternative fields or tools to use instead of the svctm field?
- The iostat output shows await much larger than svctm. Other Linux documentation on the web states that since await is OS-level waiting time and svctm is disk response time, await - svctm must be Linux kernel overhead, and the numbers therefore indicate a kernel bug or a need for tuning. Is this correct?
Resolution
- The values within the svctm column are accurate -- but they do not represent what many think they measure. The column does represent "the average service time (in milliseconds) for I/O requests that were issued to the device", but from a host perspective rather than a storage perspective, which is the main issue. As such, it no longer has much usefulness. See "Root Cause" below for more information.
- The await field, with caveats, provides a better view of disk latency in almost all cases.
- For finer-granularity information, blktrace can be used to track and determine IO dispatch (to driver/storage) to completion time, which for most storage types will closely track "storage" latency (service) time.
  - "Storage" latency time includes driver time, host storage controller time, transport transfer time, storage controller time (for example, an FC storage box), and finally disk latency time.
Await vs Svctm
- If await is much larger than svctm, then there is a bug or tuning needed within the kernel. Is that correct?
  - No, that is incorrect.
  - The svctm is simply a calculation of the amount of time IO was present during the sample time divided by the total number of IO that completed within that sample time. So in essence, svctm is wall-clock based.
- The await is total IO waiting time within the kernel and svctm is disk latency. Is that correct?
  - No, that is incorrect.
  - The await time is the measured average of the total IO time spent in the lower kernel IO stack (scheduler, driver, and hardware, including disk-to-IO-complete time). In almost all cases, the await time reflects measured hardware latency.
  - The difference await - svctm is not related to any kind of kernel overhead -- the two measurements are not contextually aligned in a manner that gives such arithmetic any significance.
- In cases with very old storage disks that do not support NCQ (native command queueing) or TCQ (tagged command queueing), are not serviced by a backplane controller, are not SAN based, and are not any other current disk technology that supports parallel IO submission to the disk (lun queue depth > 1) -- in that case, and that case alone, svctm is an approximation of total IO latency: time in Linux plus disk latency. And at that point await will be the same, or nearly the same, value as svctm. Both will be measuring the same thing, although in different manners.
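Whether a disk accepts parallel IO submission can be checked from sysfs. Below is a minimal sketch (not part of the original article); the device name sda is an example, and the queue_depth attribute is only exposed for some transports (e.g. SCSI-attached devices):

```python
# Sketch: read the LUN queue depth for a block device from sysfs.
# A depth > 1 means the device accepts parallel IO submission, in which
# case svctm does NOT approximate per-IO latency. "sda" is an example name.
from pathlib import Path

def queue_depth(dev: str):
    """Return the queue depth for block device `dev`, or None if unknown."""
    path = Path("/sys/block") / dev / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    depth = queue_depth("sda")
    if depth is None:
        print("queue depth not exposed for this device")
    elif depth > 1:
        print(f"queue depth {depth}: parallel IO possible; do not trust svctm")
    else:
        print("queue depth 1: svctm roughly tracks per-IO latency")
```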
Root Cause
- First and foremost, all disk IO statistics use the same data, namely /proc/diskstats -- sar, iostat, and dstat all use this data or the equivalent individual disk statistics from /sys/block/sda/stat.
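As an illustration of that shared data source, a /proc/diskstats line can be parsed into the fields iostat uses. This is a sketch with made-up sample counters, with field positions as documented in Documentation/iostats.txt:

```python
# Sketch: parse one /proc/diskstats line into the named fields that
# iostat derives its statistics from. The sample line is illustrative
# data, not real kernel output.
def parse_diskstats_line(line: str) -> dict:
    parts = line.split()
    # parts[0:3] are major, minor, and device name; statistics follow.
    stats = [int(x) for x in parts[3:]]
    return {
        "device": parts[2],
        "reads_completed": stats[0],   # Field 1
        "read_ms": stats[3],           # Field 4
        "writes_completed": stats[4],  # Field 5
        "write_ms": stats[7],          # Field 8
        "io_ticks_ms": stats[9],       # Field 10: time IOs were in flight
    }

sample = "8 0 sda 1200 30 96000 4800 800 10 64000 3200 0 2500 8000"
print(parse_diskstats_line(sample))
```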
The common view is that svctm is a measurement of disk IO service time, but that is not what it represents -- especially with backplane RAID controllers like Smart Array, or enterprise FC storage like EMC, which support simultaneous, parallel IO servicing.
- The await and svctm columns are completely different from each other in how they are derived:
  - Await is measured, per IO, from the point the IO is submitted to the device until the IO completes.
  - Svctm is calculated as (time-period * %utilization) / number-of-IO-completed-in-time-period -- that is, it is a calculated average "effective" IO rate for storage.
  - For example, 10 IO are simultaneously started. All 10 IO are done within storage in parallel and take 100ms. All 10 IO complete simultaneously after 100ms. Await will report an average of 100ms; svctm will report 10ms. A svctm of 10ms does not accurately represent storage latency, while await does.
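The 10-IO example above can be worked through numerically. This is a sketch of the arithmetic only, not the output of any tool:

```python
# Sketch: 10 IOs issued together, each taking 100 ms on storage,
# all completing simultaneously, within a 1000 ms sample period.
n_ios = 10
per_io_ms = 100.0

# await is measured per IO (submit -> complete) and then averaged.
await_ms = sum([per_io_ms] * n_ios) / n_ios

# svctm is (time-period * %utilization) / IOs-completed. With full
# overlap the device is busy for only 100 ms of wall clock time.
period_ms = 1000.0
utilization = 100.0 / period_ms       # busy 100 ms of 1000 ms = 10%
svctm_ms = (period_ms * utilization) / n_ios

print(f"await = {await_ms} ms, svctm = {svctm_ms} ms")
```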
- Since await time includes any time the IO spends queued and waiting before the IO is dispatched through the driver and onto storage, you have to apply careful interpretation of await time as to how closely it models storage latency time. Again, see "What exactly is the meaning of value "await" reported by iostat?" for more details on this.
- See the file /usr/share/doc/kernel-doc-*/Documentation/iostats.txt for a complete description of all the fields within /proc/diskstats.
  - The svctm is based upon Field #10 (wall clock time IO is present in the system) along with Fields #1 and #5.
  - The await is based upon Fields #4 and #8 along with Fields #1 and #5.
  - Field 1 -- # of reads completed
    - This is the total number of reads completed successfully.
  - Field 5 -- # of writes completed
    - This is the total number of writes completed successfully.
  - Field 4 -- # of milliseconds spent reading
    - This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()).
  - Field 8 -- # of milliseconds spent writing
    - This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()).
  - Field 10 -- # of milliseconds spent doing I/Os
    - This field increases so long as field 9 is nonzero.
- So in the case of 10 read IO being performed with each having 100ms of disk latency, Field #4 will increase by 1000ms for that sample period (Linux kernel overhead is typically much smaller than 1ms per IO). Regardless of whether those 10 IO are all performed in a synchronously parallel nature, or one-at-a-time serially, the calculated r_await time will be 100ms. Whereas the svctm value will be somewhere between 10ms (if all 10 IO are performed in parallel (full overlap)) and 100ms (all IO performed serially (no overlap)). Such a svctm calculation is accurate -- it correctly reflects the data within the fields, but it is not very useful.
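Putting those fields together, both metrics can be recomputed from the deltas between two diskstats samples. This is an illustrative sketch using the made-up counter deltas from the 10-read example above:

```python
# Sketch: recompute await and svctm from deltas of the diskstats fields
# described above: (Field 1, Field 4, Field 5, Field 8, Field 10).
def await_and_svctm(prev, curr):
    reads, read_ms, writes, write_ms, io_ticks_ms = (
        c - p for p, c in zip(prev, curr)
    )
    n = reads + writes
    if n == 0:
        return 0.0, 0.0
    await_ms = (read_ms + write_ms) / n   # per-IO time, averaged
    svctm_ms = io_ticks_ms / n            # wall-clock busy time per IO
    return await_ms, svctm_ms

# 10 reads of 100 ms each, fully overlapped: device busy only 100 ms,
# so svctm comes out 10x smaller than the per-IO await.
prev = (0, 0, 0, 0, 0)
curr = (10, 1000, 0, 0, 100)
print(await_and_svctm(prev, curr))   # await 100 ms, svctm 10 ms
```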
- Here is a typical result from using blktrace. Note that Q2C is the time that will be averaged and reported by iostat as await, while the Q2D time is the sub-portion of await that is the Linux kernel overhead for that IO. The kernel overhead averages less than 100µs across this set of IO (0.008% "overhead").
  Total kernel overhead ----------+
          hardware latency -------|----------+
         iostat await time -------|----------|------------------+
                                  |          |                  |
                                  V          V                  V
  Prg :Flt Queue(sec)   Q2I(ms) Q2D(ms)   D2C(ms)           Q2C(ms)
  230:185  0.172119422  0.0197  0.031533  1318.699790 % 99  1318.731323
  230:184  0.171646951  0.2031  0.217260  1318.991101 % 99  1319.208361
  229:184  0.171698708  0.1517  0.173349  1318.984705 % 99  1319.158054
  228:183  0.171704514  0.1461  0.172201  1318.985747 % 99  1319.157948
  227:183  0.171663361  0.1869  0.204671  1318.999750 % 99  1319.204421
  226:182  0.502734133  0.0412  0.050131   988.084559 % 99   988.134690
  227:181  0.705809615  0.0016  0.002200   785.079221 % 99   785.081421 1
  226:181  0.503055838  0.0086  0.009667   987.832437 % 99   987.842104
  225:180  0.102185481  0.0225  0.031573  1388.683495 % 99  1388.715068
  224:180  0.502285415  0.0280  0.064619   988.553117 % 99   988.617736
  224:180  0.502272890  0.0400  0.071461   988.570517 % 99   988.641978
  223:179  0.502278077  0.0352  0.069340   988.568834 % 99   988.638174
  222:178  0.502599906  0.0849  0.090902   988.227568 % 99   988.318470
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.