Understanding the Deadline IO Scheduler
Issue
- How do I turn on deadline scheduler for a device?
- What are the tunables for deadline scheduler and what do they do?
- How does the logic within the scheduler work in choosing which IO to dispatch next?
Environment
- Red Hat Enterprise Linux 4, 5, 6, 7, 8, 9
Resources
- " Using the Deadline IO Scheduler"
- " How to make the deadline I/O scheduler more write-friendly"
- "Are there any adverse implications when lowering the default values of either write_expire and/or read_expire of deadline IO scheduler?"
- "Question regarding tuned profiles and elevator deadline"
- " What tuning parameters are available for the mq-deadline I/O scheduler? " mostly duplicate of information below
- Related Documentation Topics
Resolution
Enable deadline io scheduler
In RHEL 8 and 9: via the /sys filesystem, to change on a per-disk basis -- although most devices will already default to mq-deadline:
# echo 'mq-deadline' > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
In RHEL 5, 6, and 7: via the /sys filesystem, to change on a per-disk basis:
# echo 'deadline' > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
RHEL4, RHEL5, RHEL6: by adding "elevator=deadline" to the end of the kernel line in the /etc/grub.conf file to change the default scheduler from cfq to deadline for all devices (note: this boot parameter is still parsed in later releases but is otherwise ignored):
title Red Hat Enterprise Linux Server (2.6.9-67.EL)
root (hd0,0)
kernel /vmlinuz-2.6.9-67.EL ro root=/dev/vg0/lv0 elevator=deadline
initrd /initrd-2.6.9-67.EL.img
NOTE: in RHEL4 the io scheduler selection cannot be applied on a per-disk basis, but only as the global default.
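The active scheduler is the entry shown in brackets in the sysfs output above. As a quick sanity check, the bracketed name can be parsed out of that line; the helper below is a hypothetical illustration, not part of any Red Hat tooling:

```python
import re

def active_scheduler(sysfs_line):
    """Return the bracketed (active) scheduler name from a queue/scheduler line."""
    m = re.search(r"\[([^\]]+)\]", sysfs_line)
    return m.group(1) if m else None

# e.g.: active_scheduler(open("/sys/block/sda/queue/scheduler").read())
```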
Deadline tunables
$ ls /sys/block/sda/queue/iosched
fifo_batch         default: 16
front_merges       default: 1 (enabled)
read_expire        default: 500 ms
write_expire       default: 5000 ms
writes_starved     default: 2
async_depth        default: 192       # added in RHEL 8 and later
prio_aging_expire  default: 10000 ms  # added in RHEL 9
Default Settings
- Default settings in RHEL 9:
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
$ grep -Hv "zzz" /sys/block/sda/queue/iosched/*
/sys/block/sda/queue/iosched/fifo_batch:16
/sys/block/sda/queue/iosched/front_merges:1
/sys/block/sda/queue/iosched/read_expire:500
/sys/block/sda/queue/iosched/write_expire:5000
/sys/block/sda/queue/iosched/writes_starved:2
/sys/block/sda/queue/iosched/async_depth:192
/sys/block/sda/queue/iosched/prio_aging_expire:10000
- Default settings in RHEL 8:
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
$ grep -Hv "zzz" /sys/block/sda/queue/iosched/*
/sys/block/sda/queue/iosched/fifo_batch:16
/sys/block/sda/queue/iosched/front_merges:1
/sys/block/sda/queue/iosched/read_expire:500
/sys/block/sda/queue/iosched/write_expire:5000
/sys/block/sda/queue/iosched/writes_starved:2
/sys/block/sda/queue/iosched/async_depth:192
- Default settings in RHEL 7, 6, 5:
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
# grep -Hv "zzz" /sys/block/sda/queue/iosched/*
/sys/block/sda/queue/iosched/fifo_batch:16
/sys/block/sda/queue/iosched/front_merges:1
/sys/block/sda/queue/iosched/read_expire:500
/sys/block/sda/queue/iosched/write_expire:5000
/sys/block/sda/queue/iosched/writes_starved:2
Overview
See /usr/share/doc/kernel-doc-*/Documentation/block/deadline-iosched.txt file for overview information. Note: adjusting the scheduler per-disk is a feature of kernels 2.6.10 and above.
'Deadline' is a bit of a misnomer, but close enough: it does not dispatch every io before a deadline -- but it does give additional consideration to io that are past their expiration date.
The deadline io scheduler separates incoming read and write requests into their own queues. It maintains two sets of links for each IO type: a FIFO (time) ordered queue, and a starting-sector (elevator) ordered queue. Each IO has a deadline-scheduler-specific wrapper around it that carries the links to these two queues plus a third, hashed queue. The hashed queue, keyed by ending sector, is used to find back-merge candidates quickly.
The deadline scheduler favors reads over writes. The writes_starved tunable parameter (see below) specifies the minimum ratio of read to write favor. The deadline scheduler orders reads and writes into their respective sorted queues and favors pulling io to be dispatched from those sorted queues as opposed to the time ordered FIFO queue. At specific points in its logic, it will also check the companion FIFO (time) ordered queue to see if the IO at the head of the queue, the oldest, has exceeded its expiration date. If it has, then that io is dispatched instead (and then the deadline scheduler uses that point for continuing on within the sorted order queue). For example, if the read sorted queue had the following starting sectors for read io:
Read: Sorted Q -> 1 8 10 36 99 203
FIFO Q -> 99 8 10 1 36 203 << the order they arrived in, w/99 "oldest" io.
For now assume there are no writes. The reads would be dispatched in sorted order 1, 8, 10, etc. Let's say the scheduler checks the expired queue after dispatching the io starting at block 8 and finds that the io starting at block 99 has expired -- it is past its due date. The scheduler then selects that io to dispatch next, and from that point continues on within the sorted-order queue. So, in this example the io would be dispatched in 1, 8, 99, 203, 10, 36 order. Note that after the expired IO is dispatched, the sort-order links for the io are followed. When the scheduler hits the end of the sorted list, it cycles back around and starts again.
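The walk-through above can be sketched as a toy simulation (illustrative only -- it models a single expired io and ignores writes; the function and its arguments are invented for this example):

```python
def dispatch_order(sectors, expire_after, expired_sector):
    """Dispatch in sorted order; after dispatching `expire_after`, jump to the
    expired io and continue in sorted order from there, wrapping at the end."""
    remaining = sorted(sectors)
    out = []
    jumped = False
    while remaining:
        if not jumped and out and out[-1] == expire_after:
            nxt = expired_sector          # expired io becomes the new nexus
            jumped = True
        else:
            later = [s for s in remaining if not out or s > out[-1]]
            nxt = later[0] if later else remaining[0]   # wrap around
        out.append(nxt)
        remaining.remove(nxt)
    return out
```

With the queues above, `dispatch_order([1, 8, 10, 36, 99, 203], 8, 99)` reproduces the 1, 8, 99, 203, 10, 36 order described.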
Deadline Uses
Since deadline favors reads over writes, it typically has better throughput for reads than CFQ. This is especially useful for most database environments. Write latency suffers under deadline so if there are significant synchronous writes or issues requiring more predictable write latency, then deadline probably should not be used. However, normal writes are write-behinds. That is, the application sends a write to the kernel where the write data is buffered in cache and write success is returned to the application. The dirty buffer containing the data will be written at a later time. In this case, any additional write latency associated with flushing the write buffer out to disk is hidden from the application.
A side benefit of deadline's additional write latency is that there is typically a greater opportunity for write merging, by virtue of the write hanging around in the scheduler queue longer.
As with any tuning, the full application suite should be run. A careful, repeatable benchmark should be used and re-run after tuning one parameter at a time. Changes in behavior are not always what one might expect. For example: running four dd read streams and one dd write stream under deadline with the attached test script and the default deadline parameters results in the write stream ending first, with high write merging noted in iostat output. After increasing the writes_starved count to 8 and write_expire to 8000ms, one might expect a slower write stream; in fact, on my test box it ran slightly faster. My guess is that merging and streaming in fifo_batch chunks allowed the write stream to be more efficient than with the default parameter values. This was the opposite of what I expected and is an example of why a predictable, repeatable benchmark is needed while tuning. By examining the await field within iostat command output we can see that write latency did increase significantly, in excess of 5 seconds -- but the total wall clock run time of the write dd command decreased by about 10%.

Since the scheduler is chosen block device by block device (RHEL 5 and later only), the disks that the database is located on could be set up to use deadline while, for example, transaction logging going to different block devices could use CFQ.
Deadline Tunables
- fifo_batch number-of-requests
- writes_starved number-of-requests
- front_merges 0|1
- read_expire milliseconds
- write_expire milliseconds
- async_depth
- prio_aging_expire
fifo_batch number-of-requests
/usr/share/doc/kernel-doc*/Documentation/block/deadline-iosched.txt

RHEL5: When a read request expires its deadline, we must move some requests from the sorted io scheduler list to the block device dispatch queue. fifo_batch controls how many requests we move.

RHEL6 and later: Requests are grouped into batches of a particular data direction (read or write) which are serviced in increasing sector order. To limit extra seeking, deadline expiries are only checked between batches. fifo_batch controls the maximum number of requests per batch. This parameter tunes the balance between per-request latency and aggregate throughput. When low latency is the primary concern, smaller is better (where a value of 1 yields first-come first-served behaviour). Increasing fifo_batch generally improves throughput, at the cost of latency variation.
'fifo_batch' does not control how many time-ordered (fifo) IO are dispatched; rather, it controls the maximum number of sorted requests (sorted by LBA) that are dispatched in a row before queue selection/re-evaluation takes place. It is the maximum number of dispatched IO between such evaluations. In RHEL5, these batches of IO are streaming IO requests -- that is, there is no seek distance between the IO; any non-zero seek amount would be considered the end of the batch. That limitation was removed in RHEL6 and later kernels -- IO requests are dispatched from the currently selected sorted queue (read or write) until the 'fifo_batch' count is reached or the end of the sorted queue is reached.
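A toy model of those batch-termination rules (not the kernel code; the names are invented, and requests are (start, length) pairs in sectors):

```python
def take_batch(requests, fifo_batch, require_streaming=False):
    """Pull one batch from a sector-sorted queue of (start, length) requests.
    require_streaming=True models the RHEL5 rule; False models RHEL6+."""
    batch = []
    last_end = None
    for start, length in requests:
        if len(batch) >= fifo_batch:          # [3] fifo_batch limit reached
            break
        if require_streaming and last_end is not None and start != last_end:
            break                             # [2] RHEL5 only: a seek ends the batch
        batch.append((start, length))
        last_end = start + length
    return batch                              # [1] end of queue also ends the batch
```

With requests at sectors 0, 8, 100, 108 (8 sectors each), the RHEL5 rule stops the batch at the 8->100 seek, while the RHEL6+ rule runs until the fifo_batch count or the end of the queue.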
Once a specific queue is selected/in-use by deadline, either read or write, then the scheduler will stick with that queue until any of the following conditions occur:
- hit the end of the sorted queue, which causes the cached 'next' pointer to get set to null[1], or
- RHEL5 only: the next io is not "streaming", that is the io just completed ended at block 'n' and the next io doesn't start at 'n+1' breaking streaming[2], or
- RHEL6 and later: this check on "streaming", zero seek distance between dispatched IO within a batch, was removed.
- the number of io dispatched hits the 'fifo_batch' limit[3].
Notes:
- Each time an IO request is dispatched, the next-request pointer is updated to the following entry in the red/black sector-sorted list for the current IO type -- that is, the next read or write io with the shortest forward seek. If no such io exists, the next pointer is set to null to force the end of the current batch of dispatches. This next pointer is used on the next call to dispatch a request, and is bypassed only when the current batch of IO exceeds the limits in [2] (RHEL5, seek to next request is non-zero) or [3] (fifo_batch count).
- Compare current RHEL 5.11 code with 7.8. In RHEL5, a "streaming" batch of IO is considered at an end if a break in contiguous LBAs across IO is detected. That check was removed in RHEL 6 and later, as can be seen in the 7.8 source. Now the streaming of io in batches is controlled only by the absolute count value and does not depend on tracking/detecting seeks between IO requests as in prior versions.
RHEL5:

static int deadline_dispatch_requests(request_queue_t *q, int force)
{
        :
        if (dd->last_sector != drq->request->sector)
                /* end the batch on a non sequential request */
                dd->batching += dd->fifo_batch;
                        <<== break "batch" on non-contiguous IO detection
                             (and below fifo_batch count)

RHEL7:

static int deadline_dispatch_requests(struct request_queue *q, int force)
{
        :
        if (rq && dd->batching < dd->fifo_batch)
                /* we have a next request are still entitled to batch */
                goto dispatch_request;
                        <<== break "batch" only when the fifo_batch limit is reached
- Each time an io of the current data direction is dispatched, the next-request pointer for that type of IO is advanced. When the end of the sorted queue for that IO type has been reached, the next-request pointer is set to NULL. This effectively ends the current batch of requests and, on the next iteration of `deadline_dispatch_requests', queue selection is re-evaluated. Deadline first checks the read queue and then the write queue for pending requests. The read queue is chosen first, but only if either there are no queued writes present or the internal 'starved' counter is below the 'writes_starved' count. This internal counter is incremented each time queue re-evaluation selects the read queue to source the next set of 'fifo_batch' requests while writes are pending. That is, the starved counter is incremented once per set of batched reads, and only if there were queued writes present -- all requests dispatched as part of one 'fifo_batch' set are treated as just one IO. The 'starved' counter is reset to 0 only when the write queue is selected to source the next batch of IO requests.

Only at the end of each 'fifo_batch' set of dispatches, when queue selection is re-evaluated, are the time-ordered fifo queues looked at. This queue re-evaluation chooses only an IO type -- read or write -- not a specific request. Once the read or write queue is chosen, only then is the top (oldest) entry of the time-ordered queue of that type examined. If that IO has waited >= that io type's '*_expire' value, then that IO is selected to start the next batch of dispatched requests. That means expired IO is chosen only once at each queue re-evaluation point within the code. From that point forward, follow-on requests dispatched within the 'fifo_batch' batch are pulled from the sorted queue.

So the queue re-evaluation only chooses the next seed request to be dispatched. This is done to help minimize total seek delay across a set of dispatched IO. Any additional expired requests within the fifo queues must wait until the next queue re-evaluation for consideration.
Why would streaming IO not be merged? The nominal answer is because the 512k maximum size for a single io has been reached and no further merging can be performed. One of the easiest places to see this is to run four read dd commands and one write dd command. The write dd command will get starved because of the writes_starved ratio. This dramatically increases write latency, but what you typically see is huge io merge numbers in the write stream (use iostat -x and look at wrqm/s, write requests merged per second). You'd expect in this environment that even with all the merging, writes would get starved off and therefore the write dd command would take a long time. In fact, it's the first to complete. This is a combination of factors, but the biggest two are that write starvation allows much more merging of requests and streaming of io once the write queue is selected.
NOTES:
[1] a batch of streamed io only count as a single io for the purposes of write starvation, therefore a maximum of writes_starved * fifo_batch reads could be issued before the write queue gets selected
[2] the expired read or write queue is not checked while streaming is occurring
front_merges 0|1
/usr/share/doc/kernel-doc*/Documentation/block/deadline-iosched.txt
Sometimes it happens that a request enters the io scheduler that is contiguous with a request that is already on the queue. Either it fits in the back of that request, or it fits at the front. That is called either a back merge candidate or a front merge candidate. Due to the way files are typically laid out, back merges are much more common than front merges. For some work loads, you may even know that it is a waste of time to spend any time attempting to front merge requests. Setting front_merges to 0 disables this functionality. Front merges may still occur due to the cached last_merge hint, but since that comes at basically 0 cost we leave that on. We simply disable the rbtree front sector lookup when the io scheduler merge function is called.
Nominally IO will tend to be issued such that back merges are the norm. A back merge simply means the current io is at 'n+1' while the previous IO ended at block 'n'... therefore the two IO can be merged. A front merge is just the opposite, the current IO is "in front of" the previous IO and is ending at 'b-1' where the previous IO started at block 'b'.
+-------------------+ +---------------+
| last IO | | next io | << back merge case
+-------------------+ +---------------+
b . . . . . . . . . n n+1 . . . .
+---------------+ +-------------------+
| next io | | last IO | << front merge case
+---------------+ +-------------------+
. . . . b-1 b . . . . . . . . . n
The deadline scheduler maintains multiple queues. The sorted queue is a red-black tree ordered by starting block, which permits cheap, predictable insertion and allows walking the queue in sorted order front to back. Back merges cost very little in terms of time. To back merge efficiently, however, the code needs to find a queued io by its ending block, and for this purpose a hashed queue is utilized. The hashed queue has 32 buckets and is hashed by ending block+1: the starting block of the incoming io selects a bucket, and each io on that bucket chain is compared against the incoming IO's starting block. Front merges, by contrast, are found by looking up the incoming io's ending block+1 in the sector-sorted rbtree -- the lookup that the 'front_merges' tunable disables.
Since front merge costs are higher and they nominally don't occur that often, turning off 'front_merges' can improve performance in many environments. As with everything else, however, tuning should be done by running the application mix using a repeatable benchmark and measuring/comparing performance numbers with 'front_merges' on and off.
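A toy sketch of the two lookup structures described above (a dict stands in for the rbtree and a plain list of buckets for the hash; the class and method names are invented for illustration):

```python
class MergeIndex:
    """Toy model of merge-candidate lookup: hash keyed by ending sector for
    back merges, a start-sector index (standing in for the rbtree) for front merges."""
    def __init__(self, nbuckets=32):
        self.buckets = [[] for _ in range(nbuckets)]  # hashed by end sector
        self.by_start = {}                            # stands in for the rbtree

    def add(self, start, nsectors):
        end = start + nsectors
        self.buckets[end % len(self.buckets)].append((start, nsectors))
        self.by_start[start] = nsectors

    def back_merge_candidate(self, start):
        # a new io starting at `start` can extend a queued io ending there
        for s, n in self.buckets[start % len(self.buckets)]:
            if s + n == start:
                return (s, n)
        return None

    def front_merge_candidate(self, start, nsectors):
        # a new io ending just before a queued io's start (rbtree lookup in-kernel)
        n = self.by_start.get(start + nsectors)
        return (start + nsectors, n) if n is not None else None
```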
read_expire milliseconds
/usr/share/doc/kernel-doc*/Documentation/block/deadline-iosched.txt
The goal of the deadline io scheduler is to attempt to guarantee a start service time for a request. As we focus mainly on read latencies, this is tunable. When a read request first enters the io scheduler, it is assigned a deadline that is the current time + the read_expire value in units of milliseconds.
The 'read_expire' time is the maximum time in milliseconds after which the read is considered 'expired'. Think of this more like the expiration date on a milk carton. The milk is best used before the expiration date. The same with the deadline scheduler. It will NOT attempt to make sure all IO is issued before its expiration date. However, if the IO is past expiration, then it gets a bump in priority.... with caveats.
The read expiration queue is ONLY checked when the deadline scheduler re-evaluates the read queue. For reads this means every time a sorted read is dispatched EXCEPT in the case of streaming io: while the scheduler is streaming io from the read queue, read expiry is not evaluated. When re-evaluating the read queue, the logic is:
- check for expired reads (look at head of FIFO [time ordered] queue)
- check to see if cached read pointer valid (so even if not streaming, the cached pointer still takes precedence so the sorted queue is traversed tip to tail in a sweep)
- pick up the first read from the sorted queue (start at the tip again for another sweep)
If there are expired reads, then the first one is pulled from the FIFO. Note that this expired read then becomes the new nexus for read sort ordering: the cached next pointer is set to point to the next io from the sort queue after this expired one. The thing to note is that the algorithm doesn't just execute ALL expired io once they are past their expiration date. This allows some reasonable performance to be maintained by batching up 'fifo_batch' sorted reads together before checking the expired read queue again.
So, the maximum number of io that can be performed between expired read io is 2 * 'fifo_batch' * 'writes_starved': sets of 'fifo_batch' streaming reads after the first expired read io and, if a stream triggers the write-starved condition, another 'fifo_batch' of streaming writes. This is the worst case, after which the expired read queue would be re-evaluated. At best, the expired read queue will be evaluated 'writes_starved' times in a row before being skipped because the write queue would be used.
write_expire milliseconds
/usr/share/doc/kernel-doc*/Documentation/block/deadline-iosched.txt
Similar to read_expire mentioned above, but for writes.
The write_expire time is the maximum time in milliseconds after which the write is considered 'expired'. Again, this is like the expiration date on a milk carton. The deadline scheduler will NOT attempt to make sure all IO is issued before its expiration date; however, if the IO is past expiration, then it gets a bump in priority... with caveats.

The write expiration queue is ONLY checked when the deadline scheduler re-evaluates the write queue. This means after every 'writes_starved' reads at best. Increasing 'writes_starved' will tend to increase the backlog of writes and the likelihood of an expired write. The good news is this increased latency means the io is hanging around in the queue much longer and has a better opportunity to be merged with other writes. Since writes tend to be accumulated in kernel buffer cache, the write latency to disk is often not reflected into the application: normally an application issues a write and, as soon as the write data is transferred into kernel buffer cache, a write complete is handed back to the application. This won't happen for direct io, however, so if your application is using synchronous write direct io then it should be carefully benchmarked before using the deadline scheduler.
The worst case is 3 * 'fifo_batch' * 'writes_starved' io being performed between each evaluation of the expired write queue -- a 'fifo_batch' stream of writes, followed by two 'fifo_batch' streams of reads before another write-starved condition is met.
For writes, the expiration queue will be checked on average every 'writes_starved' io. For the default, this means every 3rd io. This assumes that some merging of reads is taking place, but because read latency should be fairly low under deadline there won't be enough reads accumulating for long sequences of streamed reads to take place. This may be an incorrect assumption for certain application workloads, so careful benchmarking and tuning should be done in any critical application environment.
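Plugging the default tunable values into the worst-case bounds described above (simple arithmetic, shown for illustration):

```python
# deadline defaults
fifo_batch = 16
writes_starved = 2

# worst-case io dispatched between checks of the expired READ queue
read_gap = 2 * fifo_batch * writes_starved    # 64 io with defaults

# worst-case io dispatched between checks of the expired WRITE queue
write_gap = 3 * fifo_batch * writes_starved   # 96 io with defaults
```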
Running with defaults for deadline and running four dd read streams and one dd write stream showed write latency (iostat's await time) on the order of 2 seconds.
writes_starved number-of-requests (count)
/usr/share/doc/kernel-doc*/Documentation/block/deadline-iosched.txt
When we have to move requests from the io scheduler queue to the block device dispatch queue, we always give a preference to reads. However, we don't want to starve writes indefinitely either. So writes_starved controls how many times we give preference to reads over writes. When that has been done writes_starved number of times, we dispatch some writes based on the same criteria as reads.
The 'writes_starved' tunable is a counter that defines a nominal read-to-write ratio. The default value is 2. With the default, at least 2 batches of reads will be dispatched before the write queue is evaluated. At most 'fifo_batch' * 'writes_starved' reads can occur before a write is dispatched: while streaming io, up to a 'fifo_batch' number of io can be dispatched, and that batch of io is counted only once towards the write-starved counter.
Increasing writes_starved to 4 in the four-dd-reads/one-dd-write test actually improved the time to complete for the write stream. This can seem counterintuitive, but in this case the additional write latency induced by the larger read:write ratio gave the write queue more time to accumulate merged and streaming io. {at least that is my guess in looking at the iostat data}
So, again careful tuning of this parameter in conjunction with the others is important.
async_depth number-of-async-requests allowed
This tunable was added as part of adding the mq-deadline io scheduler to RHEL 8. This description applies to both RHEL 8 and 9.
The async_depth defines a limit as to the maximum number of async requests allowed. This number should always be less than the nr_requests value for the device. Thus, async_depth provides a throttle mechanism for asynchronous io requests and writes such that these requests do not block the allocation of synchronous requests.
Typically asynchronous requests are decoupled from applications awaiting the completion of the io, whereas synchronous io typically has an application or kernel process awaiting its completion. This gives synchronous io an inherently higher priority than asynchronous io requests. Given the blocking nature of synchronous io, there should be far fewer synchronous io requests present in the lower io stack within the kernel than asynchronous requests. Still, this provides a mechanism to reserve a reasonable number of io request slots for synchronous io.
As a practical matter, this prevents things like kernel dirty cache page flushes from consuming all available nr_requests slots within the device.
The default nr_requests value is 256, and the async_depth default is 192. 192 is currently the maximum default value for this parameter. For example, changing nr_requests to 512 still leaves async_depth at 192, but setting nr_requests to just 64 results in async_depth being automatically lowered to just 40.
prio_aging_expire milliseconds
This tunable was added to the mq-deadline io scheduler as part of the RHEL 9 kernel release.
Originally the internal queues of the deadline scheduler were just the fifo (time ordered) queue and elevator sort queues. The elevator sort queue included separate read and write queues.
However, all io also have a priority class of either none, real-time (RT), best effort (BE), or idle. When prio_aging_expire was added, the single fifo (time ordered) queue was split into two separate fifo queues. All idle-priority io are placed into the second fifo queue, while none, real-time, and best-effort priority io remain in the first fifo queue. This in essence boosts non-idle priority class io to get processed sooner by deferring any idle-class priority io to a later time.
Much like read_expire and write_expire, the prio_aging_expire is used to prevent the idle class io within the new second fifo queue from being continually bypassed. If the idle class io at the top of the second fifo queue is older than prio_aging_expire then those idle io are then processed before any other io.
This allows giving additional priority to non-idle priority io, but also prevents complete starvation of any idle io that is present.
The default value is 10 seconds (10,000 milliseconds).
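The two-FIFO selection can be sketched as follows (a toy model with invented names; each fifo is a list of enqueue timestamps in milliseconds, oldest first):

```python
def pick_fifo(now_ms, nonidle_fifo, idle_fifo, prio_aging_expire=10000):
    """Choose which fifo head to serve next: aged idle io jumps the line,
    otherwise non-idle io always goes first."""
    if idle_fifo and now_ms - idle_fifo[0] >= prio_aging_expire:
        return "idle"       # oldest idle io is past prio_aging_expire
    if nonidle_fifo:
        return "nonidle"
    return "idle" if idle_fifo else None
```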
Common Q/A
Q. In my database driven application environment, the db read io traffic needs to be completed in 500ms or less. Can I use deadline to guarantee that?
A. No. There are no guarantees as to completion time. What the deadline scheduler will do is improve read performance at the expense of write latency, which is very often a big win in database environments.
Q. If the read_expire is set to 500ms within the deadline scheduler, will all read IO be dispatched in 500ms or less?
A. No. What the deadline scheduler will do is increase the priority of expired reads. So if a read io hasn't been dispatched to the device driver within 'read_expire' time, then it is given a higher priority within the scheduler. The recommendation in this case would be to set read_expire to 350-450ms to increase the likelihood that the io will be dispatched in under 500ms. However, there are still no guarantees on dispatch time.
Q. If there are four expired IO within the queue, will they be dispatched before any other IO?
A. No. It depends on a number of factors exactly when the IO will be dispatched. However, expired IO take precedence only when the queue is re-evaluated. For reads, this is after every batch/stream of io. For writes, this is after every 'writes_starved' number of read io or batch/streams of read io, whichever comes last.
Q. Is the following true?
The deadline scheduler reorders I/O to optimize disk head movement and caps maximum latency per request to prevent resource starvation for I/O intensive processes. The deadline scheduler is recommended for single instance databases with dedicated storage.
A. No. As per the documentation, "The goal of the deadline io scheduler is to attempt to guarantee a start service time for a request." 'Attempt' is the keyword. The latency maximum is a "soft" cap in the sense that IO past the cap is labeled expired and will have improved priority for future scheduling. For example, deadline does not send any/all expired io to the dispatch queue immediately upon the io becoming expired. Rather, deadline sends the first expired io upon re-evaluating the queue. For reads, this is followed by at least 'writes_starved'-1 other io from the sorted order following the expired io; these sorted-order io may not be expired. At most 2 * 'fifo_batch' additional io can be dispatched before the next expired read is dispatched. Another distinction lost in the statement above: the scheduler only tracks latency within the scheduler. Once an io is dispatched to the device driver, there are transport latencies, controller latencies, and finally target disk latencies that are outside the scheduler's accounting.
Deadline Scheduler I/O Logic Flow
The following is a logic outline of the deadline_dispatch_requests() function. This function chooses the next io to be dispatched from within the scheduler. There are two flow diagrams attached: the first tries to represent the code logic, and the second the data flow logic at a higher level.
deadline_dispatch_requests() processing
Notes:
- there are two 'cached' entries, one for read and one for write within the deadline scheduler state for a device. These entries contain the next 'type' request from the sorted queue after the last dispatched one. Only one cached['type'] entry will be valid at any given time.
- there is a 'batched' count. As long as we are pulling from the same queue 'type' this will continue to be incremented. If it is over the 'fifo_batch' limit then a complete queue selection logic is invoked rather than blindly using the cached['type'] entry. But note that if we select the same queue and there are no expired requests then this cached entry will still be used. If we didn't, the logic would dictate that we restart at the beginning of the sorted list again. See below for more details.
- there is a 'starved' write count. This is the maximum number of reads that can be sent in a row while writes are waiting before writes are given a chance to be scheduled. But note: if we are streaming read ios, then each stream of ios up to 'fifo_batch' will count only once against the 'starved' count.
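Those three pieces of state drive the overall read/write selection. A toy model of just the queue-selection loop (expiry, sorting, and merging are ignored; the names are invented for illustration):

```python
def simulate(reads, writes, fifo_batch=16, writes_starved=2):
    """Return the dispatch order by direction only ('R'/'W')."""
    order, starved = [], 0
    while reads or writes:
        if reads and (not writes or starved < writes_starved):
            if writes:
                starved += 1              # one whole read batch bypasses writes once
            n = min(fifo_batch, len(reads))
            order += ["R"] * n
            del reads[:n]
        else:
            starved = 0                   # reset when the write queue is selected
            n = min(fifo_batch, len(writes))
            order += ["W"] * n
            del writes[:n]
    return order
```

With 64 queued reads, 32 queued writes, and the defaults, the dispatch pattern is 32 reads, 16 writes, 32 reads, 16 writes -- each batch of up to 'fifo_batch' reads counts only once against 'writes_starved'.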
1. Streaming Logic
If the cached next-request pointer for the current direction is valid and the 'batched' count is below the 'fifo_batch' limit, dispatch that cached request immediately and increment the batch count, bypassing the queue selection logic entirely.
2A. Read Q Select Logic
If selecting a new queue, check reads first... always. If there are waiting writes, then bump starved count. If the awaiting writes have been bypassed at least 'writes_starved' times, then ignore the read queue. So, if there are no writes, then starved count is not bumped. Select from the 'read' queue (See generic 'select' logic below as it is the same for both queues).
2B. Write Q Select Logic
At this point either there were no reads in the sort queue or too many previous reads from the sorted queue have bypassed pending writes. Clear the starved count and go to the generic selection logic (selecting from the write queue).
3. Request Select Logic (Generic: either read/write)
The queue type is set based upon how we got here: 'read' if there are reads in the queue and we're below the write starved limit, or 'write' otherwise. First check whether any io within this type's fifo queue has expired. If so, select the top request from the time-ordered fifo to be dispatched. Otherwise, did we cache the next sorted io of this type in the state? We *can* get here because we've exceeded the 'fifo_batch' limit but then ended up selecting the same queue to dispatch from. If so, select the cached request to be dispatched. Note: in this case the batching count is NOT reset, but it doesn't matter too much. This essentially allows us to continue 'batching' io from one queue until an io arrives in the other queue. It also means we don't reset back to the beginning of the sorted queue. If the cached entry isn't set, either we've reached the end of the sorted queue OR we've switched queues; in this case, select the first request off the sorted queue.
Note this has a somewhat peculiar result, assuming there isn't any streaming and there are always waiting reads and writes to be dispatched:
read
read (cached)
write (select from front of sort queue, due to starvation)
read (select from front of sort queue, different queue than last time)
read (cached)
write (select from front of sort queue, due to starvation)
read (select from front of sort queue, different queue than last time)
read (cached)
We tend to fixate on the front of the sorted queue. Only when an expired io occurs on the selected queue do we change our selection point. For reads without streaming this means we'll take the expired request and the next sorted one after it before reverting to the front of the sort queue: first for writes and then back to reads. With high read loading, it may be better to change the ratio to 3:1 or 4:1 because even if we aren't streaming, we may get better locality of reference out on the disk/storage controller. This also means that on a mixed io load, as writes become a larger percentage of the mix, we are more likely to select write io because of an expiration time.
4. Dispatch Request
Bump the dispatch count, remove the request from all the internal queues within the deadline scheduler (there are three for each request), and call the elevator.c:elv_dispatch_add_tail() routine.
Some examples:
. cached read set? y. streaming? n. :: reselect which queue to use
. select read queue (no writes)
. use state cached pointer to next read (unless there are expired reads)
. batching++
. dispatch (resets cached next for this queue (read))... repeat
--
. cached read set? y. streaming? y. batching over limit? y.
. select read queue (no writes)
. use state cached pointer to next read (unless there are expired reads)
. batching++
. dispatch (resets cached next for this queue (read))... repeat
... when a write shows up, the read queue will still be selected until the batching count exceeds the 'fifo_batch' limit; then the write is dispatched, the starved count is cleared, the cached next write is cleared (assuming no more writes), and the batching count is reset.
The same logic applies to writes with no reads...
. expired reads? y.
. clear batching count
. get 1st off of fifo
. dispatch(resets cached next for this queue(read))...
Note: what this means is that once an io expires we switch localities: the expired request becomes the new sort point
(sort..sort..sort..expire->next=new sort point, sort...)
Starved count kicks in...
. read
. read
. write (choose first off of sort list unless there are expired io)
. next write streaming? y. ok write... repeat until 'fifo_batch' count hit
. next write streaming? n. ok queue select
. read (choose first off of sort list)
. read (next)
. write (choose first off of sort list unless there are expired io)
Kernel Internal Data Structure(s)
While the io scheduler is under kabi control so that it doesn't change across minor versions of a release, the scheduler personalities (elevator types) such as deadline, noop, and cfq are not under kabi control and can have their data structures change between minor releases. While this happens often with cfq, the deadline data structures are fairly stable.
Each /dev/sdN device has a gendisk structure. Each gendisk points to an associated request_queue structure:
genhd.h::
RHEL6
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, mode_t *mode);
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
struct disk_part_tbl *part_tbl;
struct hd_struct part0;
const struct block_device_operations *fops;
struct request_queue *queue;
:
The request_queue is the main io scheduler data structure and points to an elevator_queue structure that contains elevator type information, like pointers to elevator functions for the type in use with this request_queue.
blkdev.h::
RHEL6
struct request_queue
{
/*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;
struct request *last_merge;
struct elevator_queue *elevator;
:
The elevator_queue structure contains a pointer to a list of required function pointers and other information. The list of functions is common across all elevator types, but the code behind those functions is specific to the scheduler personality currently in use by the request queue (cfq, deadline, noop, and others). Additional information is included within the elevator_type structure, including the elevator name (the string "deadline", for example). The elevator_queue also carries a pointer to an elevator private data structure called elevator_data.
elevator.h::
RHEL6
struct elevator_queue
{
struct elevator_ops *ops;
void *elevator_data;
struct kobject kobj;
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
struct hlist_head *hash;
:
};
/*
* identifies an elevator type, such as AS or deadline
*/
struct elevator_type
{
struct list_head list;
struct elevator_ops ops;
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX]; /* e.g. "deadline" */
struct module *elevator_owner;
};
Up until this point, all the data structures referenced are under kabi control: they cannot change definition between minor versions, only across major releases. However, note that elevator_data is a void pointer! It points to a private data structure that is passed to the elevator functions but is private (non-kabi) to the elevator type (scheduler personality). Because it is private to the code of the elevator type, it is free to change definition between minor releases.
We can see the elevator_queue structure (or its equivalent in RHEL 4 and 5) receive the private elevator_data pointer within the appropriate elevator type initialization routine. For deadline:
RHEL4
deadline-iosched.c::
/*
* initialize elevator private data (deadline_data), and alloc a drq for
* each request on the free lists
*/
static int deadline_init(request_queue_t *q, elevator_t *e)
{
struct deadline_data *dd;
:
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
e->elevator_data = dd;
return 0;
}
RHEL5
/*
* initialize elevator private data (deadline_data).
*/
static void *deadline_init_queue(struct request_queue *q)
{
struct deadline_data *dd;
:
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
return dd;
}
RHEL6
/*
* initialize elevator private data (deadline_data).
*/
static void *deadline_init_queue(struct request_queue *q)
{
struct deadline_data *dd;
:
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
return dd;
}
...so deadline_init() sets up the private deadline_data structure with its defaults. In RHEL 4 it then updates the elevator_t structure with the pointer to this instance of deadline scheduler data. In RHEL 5 and 6, the caller of deadline_init_queue() receives the deadline_data structure address upon return and updates the elevator_queue structure.
RHEL6
int elevator_init(struct request_queue *q, char *name)
{
struct elevator_queue *eq;
:
data = elevator_init_queue(q, eq);   /* calls eq->ops->elevator_init_fn(q) */
:
elevator_attach(q, eq, data);        /* sets eq->elevator_data = data */
return 0;
}
The good news for deadline is that the elevator private deadline_data structure is fairly simple and as such is fairly stable, having changed little between RHEL 4, 5, and 6. The differences between the major releases can be seen by comparing the structures below:
deadline-iosched.c::
RHEL4
struct deadline_data {
/*
* run time data
*/
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
struct rb_root sort_list[2];
struct list_head fifo_list[2];
/*
* next in sort order. read, write or both are NULL
*/
struct deadline_rq *next_drq[2];
struct list_head *dispatch; /* driver dispatch queue */
struct list_head *hash; /* request hash */
unsigned int batching; /* number of sequential requests made */
sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
/*
* settings that change how the i/o scheduler behaves
*/
int fifo_expire[2];
int fifo_batch;
int writes_starved;
int front_merges;
mempool_t *drq_pool;
};
RHEL5
struct deadline_data {
/*
* run time data
*/
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
struct rb_root sort_list[2];
struct list_head fifo_list[2];
/*
* next in sort order. read, write or both are NULL
*/
struct deadline_rq *next_drq[2];
struct hlist_head *hash; /* request hash */
unsigned int batching; /* number of sequential requests made */
sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
/*
* settings that change how the i/o scheduler behaves
*/
int fifo_expire[2];
int fifo_batch;
int writes_starved;
int front_merges;
mempool_t *drq_pool;
};
RHEL6
struct deadline_data {
/*
* run time data
*/
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
struct rb_root sort_list[2];
struct list_head fifo_list[2];
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
/*
* settings that change how the i/o scheduler behaves
*/
int fifo_expire[2];
int fifo_batch;
int writes_starved;
int front_merges;
};
Diagnostic Steps
Sample test scripts.
Setup:
mkdir /tmp/test1
mkdir /tmp/test2
mkdir /tmp/test3
mkdir /tmp/test4
mkdir /tmp/test5
echo make test files...
time dd if=/dev/zero of=/tmp/test1/file.t bs=1024k count=2000
time dd if=/dev/zero of=/tmp/test2/file.t bs=1024k count=2000
time dd if=/dev/zero of=/tmp/test3/file.t bs=1024k count=2000
time dd if=/dev/zero of=/tmp/test4/file.t bs=1024k count=2000
sync
Test (four dd read streams, one dd write stream)
cat /sys/block/sda/queue/iosched/fifo_batch
cat /sys/block/sda/queue/iosched/writes_starved
cat /sys/block/sda/queue/iosched/read_expire
cat /sys/block/sda/queue/iosched/write_expire
echo 3 > /proc/sys/vm/drop_caches
sync
sync
sleep 5
iostat 10 30 -t -k -x >& ./iostat-test.log &
top -b -d 10 -n 30 >& ./top-test.log &
vmstat 10 30 >& ./vmstat-test.log &
time dd if=/tmp/test1/file.t of=/dev/null bs=64k &
time dd if=/tmp/test2/file.t of=/dev/null bs=64k &
time dd if=/tmp/test3/file.t of=/dev/null bs=64k &
time dd if=/tmp/test4/file.t of=/dev/null bs=64k &
time dd if=/dev/zero of=/tmp/test5/file.t bs=64k count=32002
time sync