Is the svctm's output of iostat unreliable as described within the man page?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 9
  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6

Issue

  • The following is a portion of iostat's man page. It provides a warning and indicates the svctm field will be removed in a future sysstat version.

      # man iostat 
      :
      svctm     The average service time(in milliseconds) for I/O requests that were issued to the device.
                Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
    
  • If the output of svctm from iostat cannot be trusted, are there alternative fields or tools that can be used instead?

  • The iostat output shows await much larger than svctm. Other Linux documentation on the web states that since await is OS-level waiting time and svctm is disk response time, await - svctm represents Linux kernel overhead, and large values indicate a kernel bug or a need for tuning. Is this correct?

Resolution

  • The values within the svctm column are accurate -- but they do not represent what many think they measure. The column does represent "the average service time (in milliseconds) for I/O requests that were issued to the device", but from a host perspective rather than a storage perspective, which is the main issue. As such, it no longer has much usefulness. See "Root Cause" below for more information.
  • The await field, with caveats, provides a better view of disk latency in almost all cases.
  • For finer-granularity information, blktrace can be used to track IO from dispatch (to driver/storage) to completion; for most storage types this dispatch-to-completion time closely tracks "storage" latency (service) time.
    • "Storage" latency time includes driver time, host storage controller time, transport transfer time, storage controller time (for example, FC storage box), and finally disk latency time.

Await vs Svctm

  • If await is much larger than svctm, then there is a bug or tuning needed within the kernel. Is that correct?
    • No, that is incorrect.
    • The svctm is simply a calculation of the amount of time IO was present during the sample interval divided by the total number of IO that completed within that interval. In essence, svctm is wall-clock based.
  • The await is total IO waiting time within the kernel and svctm is disk latency. Is that correct?
    • No, that is incorrect.
    • The await time is the measured average of the total time IO spends in the lower kernel IO stack (scheduler, driver, and hardware, including disk time through IO completion). In almost all cases, the await time reflects measured hardware latency.
      • The difference await - svctm does not represent any kind of kernel overhead -- the two measurements are not contextually aligned in a manner that gives such arithmetic any significance.

Only in the case of very old storage disks that do not support NCQ (native command queueing) or TCQ (tagged command queueing), are not serviced by a backplane controller, are not SAN-based, and do not use any other current disk technology that supports parallel IO submission to the disk (lun queue depth > 1) is svctm an approximation of total IO time in Linux plus disk latency. In that case, and that case alone, await will be the same, or nearly the same, value as svctm: both will be measuring the same thing, although in different manners.
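As a sanity check on the arithmetic above, the following sketch (the numbers are hypothetical, not from a real device) computes await and svctm the way iostat derives them, for 10 IO of 100ms each submitted in parallel (queue depth > 1) versus serially (queue depth 1):

```python
# Illustrative sketch: why await and svctm diverge with parallel IO.
# Numbers are invented; the formulas mirror how iostat derives the fields.

def await_ms(total_io_time_ms, ios_completed):
    # await = sum of per-IO times (diskstats Fields #4/#8) / IO completed
    return total_io_time_ms / ios_completed

def svctm_ms(busy_time_ms, ios_completed):
    # svctm = wall-clock device-busy time (diskstats Field #10) / IO completed
    return busy_time_ms / ios_completed

ios = 10
per_io_latency_ms = 100

# Parallel: all 10 IO fully overlap, so the device is busy for only 100ms.
parallel_await = await_ms(ios * per_io_latency_ms, ios)   # 100.0
parallel_svctm = svctm_ms(per_io_latency_ms, ios)         # 10.0

# Serial (queue depth 1): the device is busy for the full 1000ms.
serial_await = await_ms(ios * per_io_latency_ms, ios)     # 100.0
serial_svctm = svctm_ms(ios * per_io_latency_ms, ios)     # 100.0

print(parallel_await, parallel_svctm)   # 100.0 10.0
print(serial_await, serial_svctm)       # 100.0 100.0
```

Note that only in the serial (queue depth 1) case do the two values agree, matching the paragraph above.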

Root Cause

  • First and foremost, all disk IO statistics draw from the same data, namely /proc/diskstats -- sar, iostat, and dstat all use this data or the equivalent per-disk statistics from /sys/block/sda/stat.

  • The common view is that svctm is a measurement of disk IO service time, but that is not what it represents -- especially with backplane RAID controllers such as Smart Array, or enterprise FC storage such as EMC arrays, which support simultaneous, parallel IO servicing.

  • The await and svctm columns are completely different from each other in how they are derived:

    • Await is measured, per IO, from the point the IO is submitted to the device until the IO completes.
    • Svctm is calculated as (time-period * %utilization) / number-of-IO-completed-in-time-period -- that is, it is a calculated average "effective" IO rate for the storage.
      • For example, 10 IO are started simultaneously. All 10 IO are serviced by storage in parallel and take 100ms each. All 10 IO complete simultaneously after 100ms. Await will report an average of 100ms; svctm will report 10ms. A svctm of 10ms does not accurately represent storage latency, while await does.
  • Since await includes any time an IO spends queued and waiting before it is dispatched through the driver and onto storage, careful interpretation is needed as to how closely await models storage latency. Again, see "What exactly is the meaning of value "await" reported by iostat?" for more details on this.

  • See the file /usr/share/doc/kernel-doc-*/Documentation/iostats.txt for a complete description of all the fields within /proc/diskstats.

    • The svctm is based upon Field #10 (wall-clock time IO is present in the system) along with Fields #1 and #5.
    • The await is based upon Fields #4 and #8 along with Fields #1 and #5.
    • Field 1 -- # of reads completed
      • This is the total number of reads completed successfully.
    • Field 5 -- # of writes completed
      • This is the total number of writes completed successfully.
    • Field 4 -- # of milliseconds spent reading
      • This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()).
    • Field 8 -- # of milliseconds spent writing
      • This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()).
    • Field 10 -- # of milliseconds spent doing I/Os
      • This field increases so long as field 9 is nonzero.
  • So in the case of 10 read IO each having 100ms of disk latency, Field #4 will increase by 1000ms for that sample period (Linux kernel overhead is typically much smaller than 1ms per IO). Regardless of whether those 10 IO are all performed in parallel or one at a time serially, the calculated r_await time will be 100ms. The svctm value, however, will be somewhere between 10ms (all 10 IO performed in parallel, full overlap) and 100ms (all IO performed serially, no overlap). Such a svctm calculation is accurate -- it correctly reflects the data within the fields -- but it is not very useful.

  • Here is a typical result from using blktrace. Note that Q2C is the time that will be averaged and reported by iostat as await, while Q2D is the sub-portion of await that is the Linux kernel overhead for that IO. The kernel overhead averages roughly 0.1ms (100µs) across this set of IO (about 0.008% "overhead").


Total kernel overhead --------------+            + hardware latency
                                    |            |                  +----- iostat await time
                                    |            |                  |
                                    V            V                  V
Prg :Flt  Queue(sec)   Q2I(ms)     Q2D(ms)       D2C(ms)            Q2C(ms)     
230:185   0.172119422   0.0197      0.031533   1318.699790 % 99   1318.731323   
230:184   0.171646951   0.2031      0.217260   1318.991101 % 99   1319.208361   
229:184   0.171698708   0.1517      0.173349   1318.984705 % 99   1319.158054   
228:183   0.171704514   0.1461      0.172201   1318.985747 % 99   1319.157948   
227:183   0.171663361   0.1869      0.204671   1318.999750 % 99   1319.204421   
226:182   0.502734133   0.0412      0.050131    988.084559 % 99    988.134690   
227:181   0.705809615   0.0016      0.002200    785.079221 % 99    785.081421  1
226:181   0.503055838   0.0086      0.009667    987.832437 % 99    987.842104   
225:180   0.102185481   0.0225      0.031573   1388.683495 % 99   1388.715068   
224:180   0.502285415   0.0280      0.064619    988.553117 % 99    988.617736   
224:180   0.502272890   0.0400      0.071461    988.570517 % 99    988.641978   
223:179   0.502278077   0.0352      0.069340    988.568834 % 99    988.638174   
222:178   0.502599906   0.0849      0.090902    988.227568 % 99    988.318470   
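The field arithmetic above can be made concrete with a short sketch. The two /proc/diskstats-style samples below are invented for illustration (on a live system you would read /proc/diskstats twice, one sample interval apart); the calculation of await and svctm from the deltas of Fields #1, #4, #5, #8, and #10 mirrors what iostat does:

```python
# Illustrative sketch: deriving await and svctm from /proc/diskstats deltas.
# The two sample lines are invented for illustration.
#
# Per-device fields (after major, minor, name), 1-indexed as in iostats.txt:
#   #1 reads completed,  #4 ms spent reading,
#   #5 writes completed, #8 ms spent writing,
#   #10 ms spent doing I/O (wall-clock device-busy time)

def parse(line):
    fields = line.split()
    stats = list(map(int, fields[3:]))  # fields[0:3] = major, minor, name
    return {"reads": stats[0], "read_ms": stats[3],
            "writes": stats[4], "write_ms": stats[7],
            "busy_ms": stats[9]}

sample1 = "8 0 sda 1000 0 0 5000 2000 0 0 8000 0 3000 0"
sample2 = "8 0 sda 1100 0 0 15000 2100 0 0 18000 0 4000 0"

s1, s2 = parse(sample1), parse(sample2)
ios = (s2["reads"] - s1["reads"]) + (s2["writes"] - s1["writes"])            # 200
io_ms = (s2["read_ms"] - s1["read_ms"]) + (s2["write_ms"] - s1["write_ms"])  # 20000

await_ms = io_ms / ios                              # measured per-IO latency
svctm_ms = (s2["busy_ms"] - s1["busy_ms"]) / ios    # "effective" rate only

print(await_ms, svctm_ms)   # 100.0 5.0
```

Here await reports 100ms of measured per-IO latency, while svctm reports 5ms because the device was busy (Field #10) for only a fraction of the summed per-IO time -- the IO overlapped in flight, exactly the parallel-servicing case described above.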

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.