What exactly is the meaning of the "%util" value reported by iostat?
- ⤵ What is the meaning of the %util value as reported by iostat?
- ⤵ What is the meaning of the %util field in 'iostat -x' output?
- ⤵ How does iostat calculate the %util?
- ⤵ If it reaches 100%, does it mean my disk is fully utilized? Completely saturated?
- ⤵ How to determine how much I/O load is present using the value of %util as shown when running 'iostat -x' in RHEL
- ⤵ How to determine how much device capacity is being used and provide warning when capacity is being exceeded using 'iostat -x'
What exactly is the meaning of the "%util" value reported by iostat?
The best answer is to think of %util as %busy -- the percentage of time within a sample period that the device was busy processing I/O. That is what is actually being measured by this statistic; it is the only thing being measured. Anything else is just an attempt to interpret this %busy measurement as meaning something else.

```
# man iostat
%util: Percentage of CPU time during which I/O requests were issued to the
       device (bandwidth utilization for the device). Device saturation
       occurs when this value is close to 100%.
```

The parts of the manpage description that interpret %busy as relating to 100% utilization come from a specific technology context and are not necessarily valid for all configurations. They are especially much less valid with any modern storage device that is capable of queueing multiple IO (NCQ/TCQ) within the device, or capable of processing multiple I/O simultaneously (SSD, NVMe, RAID, etc.).
The %util is the percentage of the sample time interval during which there was at least 1 outstanding I/O request within the io scheduler, driver, storage combo. Or more simply, %util is an activity percentage -- the percentage of time there is I/O activity present on the disk; the percentage of time the device was busy servicing I/O. So if the interval time was 5 seconds and %util was 50%, then during 2.5 seconds of that sample there was at least one I/O outstanding somewhere within the "device", where device = io scheduler + driver + storage disk combo.
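The arithmetic in that 5-second example can be sketched in a few lines (a hypothetical helper, not part of iostat):

```python
# Sketch: %util is just (time with >=1 outstanding io) / (interval length).
# busy_seconds is a hypothetical helper name, not an iostat function.

def busy_seconds(interval_s: float, util_pct: float) -> float:
    """Seconds within the sample during which at least one I/O was outstanding."""
    return interval_s * util_pct / 100.0

print(busy_seconds(5.0, 50.0))   # 2.5 -- half of the 5-second sample had io outstanding
```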
How is "%util" time measured and calculated?
The %util is calculated by tracking when io is submitted into the scheduler and when io completes. An internal accumulator is used to track the number of jiffies (timer ticks) during which io, any io, was present within the lower portion of the kernel's io stack. As long as the count of outstanding io there is greater than 0, time is accumulated towards %util. For example:

```
 ^ #io
 |    +----+
 |    |    |              +--+
 |    |    +----+         |  |
 |    |         |         |  +---+
 |    |         |         |      |
 +----+---------+---------+------+------------> time

 ^ util
 |    a         b         c      d
 |    +---------+         +------+
 |    |         |         |      |
 +----+---------+---------+------+------------> time
       dTime(1)            dTime(2)
```

In the above case, the time from a->b and from c->d is accumulated into the utilization counter. The key piece of kernel code is below. If there is any outstanding io 'in_flight', then the io_ticks accumulator has ('now' - 'then') added to it. The 'then', disk->stamp, is just the last time this calculation was performed. The disk structure referenced in the following code is a struct gendisk.

```c
if (disk->in_flight) {
        :
        __disk_stat_add(disk, io_ticks, (now - disk->stamp));
}
disk->stamp = now;
```

When iostat reads /proc/diskstats, the kernel routine converts the utilization counter from internal ticks or jiffies into milliseconds. This value shows up in field #10 of /proc/diskstats -- see /usr/share/doc/kernel-doc-*/Documentation/iostats.txt. The iostat command subtracts the value read at the beginning of the interval from the value read at the end of the interval. The result is the number of milliseconds within the sample time during which there was any io outstanding in the lower portion of the kernel io stack (scheduler/driver/storage).
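The accounting described above can be mimicked in a small simulation (a sketch only, not kernel code): replay submit/complete events and accumulate time only while the in-flight count is greater than zero. Note that overlapping io is counted once, not once per io.

```python
# Sketch of the io_ticks accounting: time counts toward %util only while
# the number of in-flight io is greater than zero. Event data is hypothetical:
# (timestamp, +1 for submit, -1 for complete).

def io_ticks(events):
    in_flight = 0
    stamp = None          # plays the role of disk->stamp: time of last update
    ticks = 0
    for now, delta in sorted(events):
        if in_flight > 0:                 # if (disk->in_flight) ...
            ticks += now - stamp          #   io_ticks += now - stamp
        stamp = now                       # disk->stamp = now
        in_flight += delta
    return ticks

# Two overlapping io (0..4 and 2..6) plus one io from 8..9:
# busy intervals are 0..6 and 8..9, so 7 ticks -- not 4+4+1=9.
events = [(0, +1), (2, +1), (4, -1), (6, -1), (8, +1), (9, -1)]
print(io_ticks(events))   # 7
```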
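The subtraction iostat performs can be sketched as follows, assuming two readings of field #10 of /proc/diskstats (milliseconds spent doing io) taken at the start and end of a sample interval; the function name is illustrative:

```python
# Sketch: deriving %util from two samples of /proc/diskstats field #10
# (cumulative milliseconds during which io was outstanding).

def util_pct(io_ms_start: int, io_ms_end: int, interval_ms: int) -> float:
    """Percentage of the interval during which any io was outstanding."""
    return 100.0 * (io_ms_end - io_ms_start) / interval_ms

# e.g. field #10 advanced by 2500 ms across a 5000 ms sample -> 50% utilization
print(util_pct(120_000, 122_500, 5_000))   # 50.0
```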
Does 100% utilization mean the device is saturated?
No. While device saturation can occur when this value is close to 100%, it only does so for older non-NCQ-capable direct-attach HDD devices, and only at the current load point as shown by the other data within the sample. But typically 100% does not have any usable correlation with disk capacity used (aka "utilization") -- it only reflects the percentage of sample time "used", that is, the percentage of sample time during which an io was present -- device busy time. For example, here are two different iostat samples for a locally attached SATA disk. Both samples show 100% utilization:
```
Device:  rrqm/s  wrqm/s       r/s    w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    7.00  12654.00  16.00   6327.00   84.00     1.01     1.99   0.16   0.08 100.00
  :
sda        0.00    7.00    183.00  22.00  93696.00  116.00   915.24    17.86  85.22   4.88 100.00
```

Although both samples have 100% utilization, the throughput reported is 6MB/s in one case and 94MB/s in the second sample.
The difference is that these are two different I/O load points. In the first case, the average io size is 1 sector per io request and in the second the average io size is 915 sectors per io request. In both cases, the single disk head associated with the rotating HDD device is always busy doing io so isn't free to handle any more io. The difference in throughput is because the disk is more efficient at handling larger io than smaller io. Both types of io have similar overhead within storage, including moving and setting up the head to be over the data on the disk platter. In the first case, that overhead is likely a major percentage of the average storage service time. Whereas in the second case that overhead is spread over 900+ times more data transfer.
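The arithmetic can be checked against the two samples above: throughput is roughly (r/s + w/s) × avgrq-sz × 512 bytes, since avgrq-sz is in 512-byte sectors. A small sketch (the function name is illustrative):

```python
# Sketch: reconstructing throughput from the iostat columns above.
# avgrq-sz averages over both reads and writes, in 512-byte sectors.

def throughput_kb_s(r_per_s: float, w_per_s: float, avgrq_sectors: float) -> float:
    return (r_per_s + w_per_s) * avgrq_sectors * 512 / 1024

# First sample: ~1-sector io -> ~6 MB/s despite 100% util
print(round(throughput_kb_s(12654, 16, 1.01)))    # 6398 kB/s
# Second sample: ~915-sector io -> ~94 MB/s at the same 100% util
print(round(throughput_kb_s(183, 22, 915.24)))    # 93812 kB/s
```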
The I/O load point consists of the number of read/write I/Os, the amount of merging of same, the average read and write IO request size, and how random or sequential the work load is, the amount of parallel (simultaneous) IO submitted, as well as several other factors.
With things like backplane RAID or SAN storage, the %util is even more divorced from device saturation because the logical lun presented to the host is typically backed by a number of physical drives rather than just one drive. Moreover, those physical drives may be shared with other logical luns and even other, different, hosts. In this case, having exactly 1 io always outstanding within the sample means you have 100% (sample time) utilization -- the device is 100% busy servicing IO from a time perspective -- but it also likely means that only 1 physical disk behind the logical (host) lun was actually doing any work at any given time; any others were idle. So, with SAN luns having 10, 20, or even 50 or more physical disks behind them within a RAID configuration, device saturation can be a long way away from just achieving 100% (sample time) utilization.

For example, here is a backplane RAID logical volume that the kernel has discovered as /dev/sdc, which we put under a single and then a doubled dd read command io load:

```
Device:  rrqm/s  wrqm/s       r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await  r_await  w_await  svctm  %util
sdc        0.00    0.00  15797.00  0.00  252752.00   0.00    32.00     0.85   0.05     0.05     0.00   0.05  84.70
sdc        0.00    0.00  15907.00  0.00  254512.00   0.00    32.00     0.85   0.05     0.05     0.00   0.05  84.80

sdc        0.00    0.00  16420.00  0.00  262720.00   0.00    32.00     1.83   0.11     0.11     0.00   0.06  99.80
sdc        0.00    0.00  16003.00  0.00  255976.00   0.00    31.99     2.06   0.13     0.13     0.00   0.06 100.00
```

What we see is that one dd command is not able to keep the device 100% busy, but two IO in parallel can -- the device shows 100% utilization. So if utilization were really device utilization, then no further performance could be squeezed from this device. But as we see below, increasing the parallel io load on the device results in the utilization staying at 100% while the throughput increases by +80%, to almost 500MB/s from 250MB/s. The reason is that the sdc device is actually made up of multiple physical disks behind the backplane RAID controller. This is an example that shows that
%util is not useful for measuring or reflecting what amount of a device's iops or throughput capability is being utilized ... that is not what %util represents or measures.

```
Device:  rrqm/s  wrqm/s       r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await  r_await  w_await  svctm  %util
sdc        0.00    0.00  30964.00  0.00  495440.00   0.00    32.00     7.31   0.24     0.24     0.00   0.03 100.00
sdc        0.00    0.00  28749.00  0.00  459968.00   0.00    32.00     7.68   0.27     0.27     0.00   0.03 100.10
```
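As a cross-check on samples like these (this relationship is a standard queuing identity, Little's law, not something iostat computes): the reported average queue size is approximately total IOPS times await expressed in seconds. A sketch:

```python
# Sketch: Little's law ties the iostat columns together:
# avgqu-sz ~= total IOPS x await (await converted from ms to seconds).

def expected_queue(iops: float, await_ms: float) -> float:
    return iops * await_ms / 1000.0

# From the 100%-util sample above: 30964 r/s at 0.24 ms await
print(round(expected_queue(30964, 0.24), 2))   # 7.43, close to the reported 7.31
```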
How to determine how much I/O load is present using the value of %util as shown when running 'iostat -x' in RHEL
That question is variously phrased, but the underlying request is the same: how can we use %util to determine when we are maxing out storage io load capability, and if we cannot use %util, then how can we use iostat or some other standard linux tool for the same purpose? First, iostat is a reporting tool only. Second, %util is just device %busy (time), not reflective of capacity utilization. And finally, interpretation of its data is up to the individual sysadmin, as they know their application and hardware configuration as well as their expectations (e.g. for one sysadmin a 10ms/io latency is fine, whereas for another anything >1ms/io indicates storage capacity overload).

The short answer is: monitoring iostat -x as a simplistic means of determining the amount of storage performance capacity used is not possible. Determining such data requires several things including, but not exclusively:
- The application suite's IO load profile to be tracked (mix of read vs write, io size, number of io in parallel, synchronous vs asynchronous, random vs sequential, etc.)
- The amount of the above IO load profile needed as a baseline.
- The scale-out of the baseline that is expected over time.
- The acceptance criteria in terms of iops, throughput and/or latency for the baseline and scale-out profiles.
Such information is not only application suite specific, but also customer configuration instance specific. A database environment may be dealing with 100s of megabytes or 100s of petabytes of data -- those two specific instances are radically different in terms of baseline, scale-out and performance acceptance values even if the io load profiles end up being the same.
Often third party monitoring solutions fixate on %util as an indicator of storage performance capacity issues. This is based upon %util being interpreted as a device capacity "utilization" percentage. That basis is invalid and non-factual. As detailed above, %util is sample time %busy, and it is generally decoupled from and unrelated to any type of storage performance capacity measurement. So if a monitoring solution is triggering alerts on a high %util basis only, find a better monitoring solution and/or disable that monitoring rule.

So is there a way to determine how much device capacity is being used and provide warning when said maximum capacity is being approached or exceeded? The short answer is that there isn't an easy way of measuring such details with storage -- at least from the linux side of things. The best solution for that type of thing is to engage the storage vendor directly. They understand their specific storage architecture, how it is configured within a specific instance, and how to properly measure whether the device's maximum (performance: iops, throughput, latency, etc.) capacity is being approached or saturated.
Doing so from a linux perspective requires a deep understanding of the application suite being run on any given system, and then performing capacity acceptance testing. This typically involves some type of benchmarking whereby either a predefined application test is run -- for example, an Oracle test set of data and certain query scripting -- or a benchmark stand-in has been crafted by a user to reflect the io load expected -- for example, a fio script with specific read vs write, io size, queue depth, random vs sequential, synchronous vs asynchronous specifications that mimic a customer's expected application io load. With the latter, a go/no-go result occurs based upon reaching some combination of iops, throughput, and/or latency. From there, typical scale-out testing is performed: essentially multiplying the data size/queries for things like databases, or increasing the number or variety of fio scripts running on the system, to ascertain how the storage system responds to expected larger io loads as the amount of data or processes scales out over time.
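As a sketch of such a benchmark stand-in, a fio job mimicking one hypothetical application io profile might look like the following. All values here (mix, sizes, depths, target device) are illustrative assumptions, not recommendations:

```ini
; Hypothetical fio job sketching one application io profile:
; 70/30 random read/write, 8 KiB io, 16 io in flight, asynchronous.
[app-profile]
ioengine=libaio
direct=1
rw=randrw
rwmixread=70
bs=8k
iodepth=16
numjobs=4
runtime=300
time_based=1
filename=/dev/sdc        ; illustrative target device only
```

Pass/fail then comes from comparing fio's reported iops, bandwidth, and completion latency against the agreed acceptance criteria for the baseline and scale-out profiles.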
All of that is beyond the typical break/fix Red Hat support functions, but may be available from Red Hat Consulting Services.
As a further example of how simple monitoring can lead to erroneous conclusions, let's look at io latency (await time). The latency time shown by iostat -x is, in the vast majority of cases, hardware-based latency. This can be directly in terms of storage latency itself for individual io, or indirectly due to:
- elevated storage hardware latency combined with a high queue (avgqu-sz) value, which creates queuing 'backpressure'. For example, if there are 100 io queued waiting to be dispatched to storage and 32 io already dispatched (with lun queue depth=32), then that 100th io will need to wait for 99+32 prior io to be dispatched and completed before it will be dispatched and completed (yes, that simplifies out-of-order execution and similar features in storage, but the general concept applies even in those cases). If normal await times are <1ms but right now, due to storage controller congestion, the hardware latency as seen from linux is 2ms, then that 100th io is going to have an await time of ~262ms. If incoming io continues at the same rate, then the average await time as reported by iostat will float up to 250+ms per io even though at the storage level each io is taking only 2ms. The long queue is adversely affected by storage latency 'backpressure' from not processing and clearing io fast enough.
- local HBA resource issues forcing what are called 'requeues'. For example, an FC hardware link-level buffer credit leak, or slow return of credits, will reduce the throughput performance of data frames on the link. This, and other similar issues at the HBA level, can result in io that is dispatched to a lun queue that should be able to accept and transmit io to storage being temporarily unable to do so. This results in io submitted to the HBA being rejected back to the io scheduler layer for requeue. The io is requeued to the top of the list and, upon the next io completion returned from the HBA, or after a short delay, the io is again submitted to the driver/HBA for transmission to storage. These requeues insert a forced delay in io flow due to some underlying, but temporary, local HBA congestion issue. But again, this induces a type of 'backpressure' within the flow of io that artificially extends the latency of the io.
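The queuing arithmetic from the backpressure example above can be sketched as follows (a deliberate simplification, as the text notes, ignoring parallel and out-of-order completion; the function name is illustrative):

```python
# Sketch of the backpressure arithmetic: the Nth queued io must wait for
# everything ahead of it (queued + already dispatched) to complete first.

def worst_case_await_ms(queued_ahead: int, in_flight: int, per_io_ms: float) -> float:
    return (queued_ahead + in_flight) * per_io_ms

# 100th io: 99 queued ahead + 32 already dispatched, at 2 ms each
print(worst_case_await_ms(99, 32, 2.0))   # 262.0 ms
```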
Add into this the fact that not all slow io reflects a serious issue, and simple monitoring of await time alone is again not useful unless the monitoring software takes other factors into account.
- There is a concept of 'low census' issues. A low census happens when there are few io present from a device and yet an outsized storage latency is present. First and foremost, a handful of io taking, say, 100ms each is seldom going to impact a system's or application's performance, simply because there are so few io present. Secondly, a low census issue (a low number of io present, e.g. 1-5, but a much larger await/latency time, such that it triggers a simple monitor rule based on latency alone) is typically due to hardware congestion outside of linux. If a storage controller is handling 1,000s of io for the current host to multiple luns, as well as possibly io for other hosts to other luns, and a couple of io are issued -- queuing theory says that mixing 1-5 io into the 1000s the controller is working on will introduce detectable processing delays within the storage controller for those few io. And while the average storage-level io completion will still show no problems within storage, and even storage controller level stats won't show an issue, those handful of io still get hit with queuing delays that otherwise wouldn't be seen. And again, with so few io present, actual detectable performance impacts are not very likely.
- There are a few outlier latency samples. Often a Tukey analysis is run on iostat data. And while it's not plotted, the resulting mathematical analysis still serves the intent of Tukey, namely determining whether a sample is an outlier or not. Outliers again typically do not represent systemic performance issues but rather temporary hardware congestion issues. An analysis as outlined by the box and whisker plot provides a mathematical definition of what constitutes an outlier value (in our case, in await/latency time). These are hardware related issues. And while linux tuning can sometimes help reduce periodic or peaky io loads -- it cannot directly address the underlying storage hardware issues themselves. But if they are outliers, then again they only represent a periodic vs systemic issue.
So just triggering on high await is not necessarily indicative of an issue, especially if it's only periodic and outlier in nature.
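A minimal sketch of the Tukey-style (box and whisker) outlier test described above, applied to hypothetical await samples:

```python
# Sketch: Tukey fences on await samples. A value beyond Q3 + 1.5*IQR
# (or below Q1 - 1.5*IQR) is flagged as an outlier -- a periodic spike,
# not a systemic latency problem. The sample data is hypothetical.

import statistics

def tukey_outliers(samples):
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s in samples if s < lo or s > hi]

awaits_ms = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 0.8, 0.9, 45.0]
print(tukey_outliers(awaits_ms))   # [45.0] -- one spike flagged, the rest normal
```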
All of the above feeds into not only determining the capacity available for a specific hardware configuration, or its baseline and scale-out response, but also how to detect when device capacity is actually used up vs just a temporary hardware congestion outlier. None of that is easily quantified by simply looking at the output of iostat -x.