What exactly is the meaning of value "await" reported by iostat?

Issue

  • What exactly is the meaning of value "await" reported by iostat?
  • How is "await" time measured and calculated?
  • How is "await" different from "svctm"?
  • What would cause high values within "await"?

Environment

  • Red Hat Enterprise Linux 4, 5, 6, 7, 8, 9

Resolution

The await time as displayed by iostat is a measure of io latency in milliseconds. The latency is measured from the front of the io scheduler until io completion time. This code path includes the scheduler and possibly queueing there, the HBA driver, transport (bus) time, switch routing time in the case of FC or FCoE, storage controller queueing and processing time, plus actual storage device latency. If the iostat is gathered from a virtual machine guest for a virtual device, then that code path also includes hypervisor scheduler and possibly multipathing and/or filesystem layers plus HBA driver on the hypervisor.

The iostat man page provides a general description of the await field:

    # man iostat
    
     await
     The average time (in milliseconds) for I/O requests issued to the device to be served.
     This includes the time spent by the requests in queue and the time spent servicing them.

How is "await" time measured and calculated?

The await time is measured on a per io basis from the front of the io scheduler until io done time. It covers the time that is taken through the scheduler, driver, controller, transport (for example fibre san), and storage needed to complete each io. Await is the average time, in milliseconds, for I/O requests completed by storage and includes the time spent by the requests in the scheduler queue and time spent by storage servicing them.
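As a sketch of that averaging (this is illustrative code, not sysstat source; the field offsets follow the kernel's documented /proc/diskstats layout), iostat effectively derives await for an interval as the total milliseconds spent by completed io divided by the number of io completed:

```python
# Sketch: derive an iostat-style "await" from two /proc/diskstats samples.
# Field offsets follow the kernel's iostats documentation:
#   field 4 = reads completed,  field 7  = ms spent reading,
#   field 8 = writes completed, field 11 = ms spent writing.
import time

def sample(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fld = line.split()
            if fld[2] == dev:
                return int(fld[3]), int(fld[6]), int(fld[7]), int(fld[10])
    raise ValueError(f"device {dev!r} not found")

def await_ms(d_ios, d_ticks):
    # await = (queue time + service time, summed over all completed io)
    #         / number of io completed in the interval
    return d_ticks / d_ios if d_ios else 0.0

def measure(dev, interval=1.0):
    r1, rt1, w1, wt1 = sample(dev)
    time.sleep(interval)
    r2, rt2, w2, wt2 = sample(dev)
    return await_ms((r2 - r1) + (w2 - w1), (rt2 - rt1) + (wt2 - wt1))
```

Note that because the per-io times are summed before dividing, overlapping (parallel) io each contribute their full elapsed time, which is why await does not shrink when io complete concurrently.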

There are two queues within this code path: one in the scheduler and one somewhere out on the hardware side.

The depth of the one within the scheduler is controlled by nr_requests and is essentially the maximum number of io allowed within the sorted queue. If a process attempts to schedule an io when the io scheduler is already full and that io cannot be merged into an existing io already in the scheduler queue then the thread blocks.

NOTE! The 'nr_requests' parameter is applied to reads and writes separately, so you can have up to 'nr_requests' reads AND 'nr_requests' writes simultaneously within the scheduler. If performing only reads (or only writes), it is possible for 'avgqu-sz' to reach a limit dictated by 'nr_requests' + lun 'queue_depth'. Once an io is passed on to the driver, it is no longer in the scheduler sort queue, so it no longer counts towards the 'nr_requests' limit, but it does still count towards 'avgqu-sz'.

The lun queue is typically within the hardware of the controller. For backplane raid controllers, this is in the host bus adapter itself. Such backplane raid controllers include HP's Smart Array controller series for example. For san or iscsi based storage, the lun queue is within the storage controller. Such storage controllers include EMC's Symmetrix controller series for example. The queue_depth controls the maximum number of io submitted to storage at any given time for a device. The driver itself does not queue io, although it will track which io have been submitted to the storage host bus adapter (HBA) that haven't completed yet.
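Both limits are visible in sysfs. A minimal sketch for reading them on a given device (the device name 'sda' and the SCSI-style 'device/queue_depth' path are illustrative; adjust for your hardware):

```python
# Sketch: read the two queue limits discussed above from sysfs.
# 'sda' is an illustrative device name; the device/queue_depth path
# assumes a SCSI-like disk.
from pathlib import Path

def read_limit(path):
    return int(Path(path).read_text().strip())

def queue_limits(dev="sda"):
    return {
        # scheduler sort queue depth (applied to reads and writes separately)
        "nr_requests": read_limit(f"/sys/block/{dev}/queue/nr_requests"),
        # lun/hardware queue depth enforced toward the storage controller
        "queue_depth": read_limit(f"/sys/block/{dev}/device/queue_depth"),
    }
```

Comparing the reported avgqu-sz against these two numbers is what the "What would cause high values" discussion below relies on.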

How is "await" different from "svctm"?

The difference between await and svctm is that await is directly measured for each io and, in essence, does not take parallel or simultaneous io into account, whereas svctm is indirectly derived and does. The svctm field is simply a calculation of (%util * 1000ms/second) / #io-completed-per-second. This calculation yields an effective io rate for storage. It is unrelated to any representation of disk latency, and its use for any given specific purpose is suspect. The svctm column was removed from the upstream sysstat package, which includes iostat, back in 2018.

For example, if 10 io were simultaneously submitted to the scheduler and then onto storage (queue_depth > 10), and then all 10 io completed 10ms later, the reported await time would be 10ms (each io took 10ms to complete on average), but the svctm would be calculated as 1ms (10 io were completed in 10ms).
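That arithmetic can be written out directly (a toy illustration of the example above, normalized to a 1-second iostat interval; not sysstat code):

```python
# Toy illustration of the example above: 10 io submitted together,
# all completing 10 ms later, observed over a 1-second interval.

def avg_await_ms(per_io_latency_ms):
    # await: measured per io, then averaged; parallelism is invisible
    return sum(per_io_latency_ms) / len(per_io_latency_ms)

def svctm_ms(util_fraction, io_per_second):
    # svctm: derived as (%util * 1000 ms/s) / io completed per second
    return (util_fraction * 1000.0) / io_per_second

latencies = [10.0] * 10              # each of the 10 io took 10 ms
await_val = avg_await_ms(latencies)  # 10.0 ms

# the device was busy 10 ms out of the 1000 ms interval => 1% utilized,
# completing 10 io in that second
svctm_val = svctm_ms(0.01, 10)       # 1.0 ms

print(await_val, svctm_val)
```

The tenfold gap between the two values is entirely due to the 10-way io parallelism, which is exactly why svctm is a poor proxy for device latency.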

What would cause high values within "await"?

The avgqu-sz is the average number of io contained within both the io scheduler queue and storage lun queue. If the reported avgqu-sz in the sample is much less than the allowed lun queue_depth (and the sample time is small, so averaging isn't hiding a much larger peak queue size), then there is little time spent within the scheduler queue. The scheduler will continue passing io to the driver while the number of io still outstanding to the driver (io currently being worked on by storage) remains under the lun's queue_depth limit. In such cases, the await time is dominated by storage servicing time alone. Under these circumstances (low avgqu-sz), high await time is due to storage side issues.

Once avgqu-sz gets near the lun queue_depth or exceeds it, some io is retained within the scheduler sort queue until the amount of io in progress against the storage hardware drops back under the queue_depth limit. Under these circumstances, it is difficult to determine whether the await time is due to time the io spends queued within the scheduler or time the io spends being serviced by storage. You can either perform some io tests that keep the number of simultaneous io under the lun queue_depth limit, or try increasing the lun queue_depth limit so that more simultaneous io is allowed to be in progress against the hardware.

It is not necessarily a bad thing to have avgqu-sz exceed the lun queue_depth periodically; however, if this is a sustained condition it will impact perceived system performance. This is a case of the system generating more io than storage is able to service in a sustained manner. Some causes may be mitigated via tuning to reduce the occurrence of such incidents. For example, one common cause can be accumulated dirty pages within large physical memory configurations. Tuning large memory configurations such that the system doesn't accumulate large quantities of dirty pages can be beneficial. Large amounts of dirty pages are often a cause of spiky write io loads (think of a 64GB or larger system with 40% dirty pages then trying to flush all 20+GB of dirty data to a disk capable of 100MB/s write rates... it can take a long time to flush all that data to disk!).
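A back-of-envelope sketch of that flush-time arithmetic, using the illustrative numbers from the example above:

```python
# Back-of-envelope: how long flushing accumulated dirty pages takes
# at a sustained write rate (illustrative numbers from the text above).

def flush_seconds(mem_gib, dirty_ratio_pct, write_mib_per_s):
    dirty_mib = mem_gib * 1024 * (dirty_ratio_pct / 100.0)
    return dirty_mib / write_mib_per_s

# 64 GiB RAM, 40% dirty, disk sustaining 100 MiB/s writes:
print(round(flush_seconds(64, 40, 100)))   # roughly 262 seconds
```

Several minutes of sustained flushing is exactly the kind of write spike that drives avgqu-sz above queue_depth and inflates await for everything else sharing the device.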

When incidents of high await time occur, often you need to review iostat and vmstat output along with top data to determine the likely cause.

Diagnostics

As indicated, await time is measured from front of io scheduler to io done time and includes time spent by the requests within the io scheduler sort queue and driver/adapter/transport/storage. This is 'the time a request has to wait until being served'.

You can see a breakdown of some of this time by using blktrace. Just be aware that blktrace adds overhead, so at best you can get relative values for the different parts and relate them to the system when it is not running blktrace.

So what does blktrace provide? It provides timestamps for io requests along key points within the lower io stack. For example:

    Maj,min cpu  seq#           Time    PID Act RWBS Sector + L           Description
       8,0   18     3     0.000001721 30916  Q  WS 173237576 + 8 [mysqld] ; __make_request()
       8,0   18     6     0.000007879 30916  I  WS 173237576 + 8 [mysqld] ; insert request into sorted/merged queue
       8,0   18     8     0.000031626 30916  D  WS 173237576 + 8 [mysqld] ; dispatch io to driver (remove from sched queue)
       8,0    6     1     0.000115301     0  C  WS 173237576 + 8 [0]      ; io done time

Await time measures Q->C time: __make_request() to io done. It has three main sub-components: Q->I, I->D, and D->C. The Q->I and I->D times are usually lumped together as "scheduler" related time; this includes sort time and time awaiting dispatch to the underlying driver. The I->D time can include a lot of apparent extra time due to plug and unplug events (not shown above), which are used to improve merging of io within the scheduler sort queue. But for most purposes, we're interested in D->C time as a ratio of the total Q->C time. D->C time covers driver time, adapter time, transport time, and storage service time (and back). Within the above io, D->C time is 0.083675ms of the total Q->C time of 0.11358ms; relationally, D->C was roughly 74% of the total await time. For high D->C times, the underlying transport structure of storage needs to be examined, such as switch counters or maintenance interfaces for the storage boxes themselves.
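The phase arithmetic can be reproduced from the four timestamps (in seconds) in the blktrace output; a small sketch:

```python
# Sketch: break the Q->C await into sub-phases using the timestamps
# (in seconds) from the blkparse sample above.
ts = {"Q": 0.000001721, "I": 0.000007879, "D": 0.000031626, "C": 0.000115301}

q_to_c = (ts["C"] - ts["Q"]) * 1000.0   # total await, ms
sched  = (ts["D"] - ts["Q"]) * 1000.0   # scheduler side: Q->I plus I->D, ms
d_to_c = (ts["C"] - ts["D"]) * 1000.0   # driver/transport/storage, ms

print(f"Q->C={q_to_c:.5f}ms D->C={d_to_c:.5f}ms "
      f"({100.0 * d_to_c / q_to_c:.0f}% of await)")
```

This reproduces the roughly 74% figure quoted above; in a real investigation you would aggregate these deltas over many io rather than inspecting one request.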

So other diagnostic tools that can be used include:

  • blktrace,
  • iotop (RHEL5.7 or later), and
  • ps (or similar).

As indicated above, one diagnostic tool that can be used is blktrace. But because blktrace changes io timing, you cannot directly relate its numbers to numbers in iostat output from previous runs taken without blktrace running. That means blktrace and iostat need to be run at the same time to help correlate numbers. Also, blktrace is a microscopic view of io flow and most times won't necessarily shed additional light on the issue, especially if ~75% of the await time is driver/controller/transport/storage related. The earlier explanation still holds: low average queue depth plus high await time = storage side issue.

iotop (from the iotop* package) can be used to see which processes are contributing what portion of the overall io load. For example:

    # iotop -tbo -d 1

Other diagnostics that can be used are the various commands that report the process table (ps, top), which let you check for processes in D state. D state means uninterruptible sleep and is usually caused by processes waiting for io.
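A minimal sketch of that D-state check, scanning /proc directly (equivalent in spirit to filtering `ps -eo pid,stat,comm` on D):

```python
# Sketch: list processes in uninterruptible sleep (state "D") by
# scanning /proc; equivalent in spirit to filtering `ps` output.
from pathlib import Path

def state_from_stat(stat_text):
    # /proc/<pid>/stat is "pid (comm) state ..."; comm may itself contain
    # spaces or parentheses, so split on the LAST ')'
    return stat_text.rsplit(")", 1)[1].split()[0]

def d_state_processes():
    hits = []
    for p in Path("/proc").iterdir():
        if not p.name.isdigit():
            continue
        try:
            if state_from_stat((p / "stat").read_text()) == "D":
                hits.append((int(p.name), (p / "comm").read_text().strip()))
        except OSError:
            continue  # process exited while we were scanning
    return hits

for pid, comm in d_state_processes():
    print(pid, comm)
```

A handful of transient D-state processes is normal; a persistent population of them alongside high await is a strong sign the system is io bound.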

These last two, iotop and ps, are useful for identifying which processes are contributing to the io load of the system. The blktrace can be used to examine more closely the parts of the io flow within and through the scheduler.
