How does the kernel determine minimum_io_size and optimal_io_size?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6

Issue

  • How does the kernel derive the values for minimum_io_size and optimal_io_size?
  • Can we change the minimum_io_size and optimal_io_size?
    • How do we change optimal_io_size?
  • Enabling storage vendor host mode 68 ("Support Page Reclamation for Linux") on our HDS VSP G200 array changes the reported I/O size from "I/O size (minimum/optimal): 512 bytes / 512 bytes" to "I/O size (minimum/optimal): 65536 bytes / 44040192 bytes". Why does the storage report 65536/44040192 when host mode 68 is turned on?

Resolution

  • The minimum_io_size and optimal_io_size values for /dev/sdX devices are retrieved by the kernel by querying the I/O Limits that the storage reports in its Vital Product Data (VPD) pages.
  • These values cannot be changed by the user.

Root Cause

These two parameters do not specify I/O sizes directly, but rather indicate I/O Limits obtained from the storage device and exposed through the kernel's sysfs tree. These limits are used by tools such as parted to calculate proper partition alignment, and are propagated up the stack to upper-layer tools such as LVM and mkfs to derive values such as chunk size. The optimal granularity, for example, is used for aligning partitions and by the filesystem for aligning internal data structures. Issuing I/O at a non-optimal size can therefore degrade performance. RHEL may not always be able to achieve the requested optimal granularity on a per-I/O basis, but aligning data structures to that value at least gives I/O the best chance of optimal performance.
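As a rough illustration of how alignment tools use these limits, the check below (a sketch with made-up values, not parted's actual code) verifies that a partition start offset is an exact multiple of the device's optimal I/O size:

```shell
#!/bin/sh
# Hypothetical values for illustration only: a partition starting at 1 MiB
# on a device whose optimal_io_size is 65536 bytes (64 KiB).
part_start=1048576      # partition start offset in bytes (1 MiB)
opt_io=65536            # value read from /sys/block/<dev>/queue/optimal_io_size

# A partition is considered aligned when its start offset is an exact
# multiple of the optimal I/O size (a zero optimal_io_size means "no hint").
if [ "$opt_io" -gt 0 ] && [ $((part_start % opt_io)) -eq 0 ]; then
    echo "aligned"
else
    echo "not aligned"
fi
```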

These two parameters are read-only sysfs parameters - users cannot change them.

$ ls -l /sys/block/sda/queue/*io_size
-r--r--r-- 1 root root 4096 Feb  4 13:14 /sys/block/sda/queue/minimum_io_size
-r--r--r-- 1 root root 4096 Feb  4 13:14 /sys/block/sda/queue/optimal_io_size

If you want to change the maximum (issued) I/O size, which is different from the above, then changing the parameter /sys/block/sdX/queue/max_sectors_kb will do that. This specifies the maximum I/O size that the host will issue to the target storage. Typically it is 512KiB (1024 sectors), although USB devices default to a smaller value. Ultimately, however, it is up to the application to specify the I/O size -- for example, if the application is performing 4KiB random I/O, then 4KiB I/O will mostly be issued from the host to storage, as this I/O pattern offers little opportunity to merge requests into larger I/O.
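For example (sdX below is a placeholder for the real device name; writes require root and do not persist across reboots):

```shell
# Inspect and lower the maximum issued I/O size (run as root):
#
#   cat /sys/block/sdX/queue/max_sectors_kb        # typically 512
#   echo 128 > /sys/block/sdX/queue/max_sectors_kb # lower it to 128 KiB
#
# The default 512 KiB corresponds to 1024 sectors of 512 bytes:
echo $((512 * 1024 / 512))    # prints the sector count, 1024
```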

NOTE:
If the sd devices are beneath a dm(multipath) device, the LCM (least common multiple) of the associated optimal I/O sizes will be assigned to the dm(multipath) device as its optimal_io_size. To change max_sectors_kb on a dm(multipath) device, add the max_sectors_kb parameter to the /etc/multipath.conf file for the specific storage array (and change the underlying sd devices using a udev rule).

  • max_sectors_kb is capped by the lesser of /sys/block/$i/queue/max_hw_sectors_kb and the device's reported maximum transfer size (the "Maximum transfer length" field from sg_inq -p 0xb0 multiplied by the "Logical block length" from sg_readcap).
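A sketch of what the multipath.conf entry and udev rule could look like. The vendor/product strings, the 128 KiB value, and the rule filename are placeholders -- match them to your array and environment:

```
# /etc/multipath.conf -- per-array setting (vendor/product are placeholders)
devices {
    device {
        vendor          "HITACHI"
        product         "OPEN-.*"
        max_sectors_kb  128
    }
}

# /etc/udev/rules.d/99-max-sectors.rules (placeholder filename) --
# sets the underlying sd devices; adjust the match criteria as needed
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", \
    ATTR{queue/max_sectors_kb}="128"
```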

Additional References:
What are the kernel parameters related to the maximum size of physical I/O requests?
How to set custom 'max_sectors_kb' option for devices without causing any disruption to ongoing IO operations?

Diagnostic Steps

For a general discussion of I/O hints, see I/O limits notes.

If not installed, install sg3_utils package and run the following commands:

# sg_readcap -16 /dev/sdX
# sg_inq         /dev/sdX
# sg_inq -p 0x00 /dev/sdX    << To establish whether VPD B0h page is supported for next cmd
# sg_inq -p 0xb0 /dev/sdX    << Block limits (sbc2) page

Change sdX in the above commands to the device in question and review the output. The minimum_io_size and optimal_io_size settings are presented by the storage and retrieved by the kernel through the same SCSI queries that the sg_ commands above issue. We will use that output to answer the question:

"I/O size (minimum/optimal): 65536 bytes / 44040192 bytes". The question is why the I/O size becomes 65536/44040192 when host mode 68 is turned on within the storage box?"

/sys/block/<disk>/queue/minimum_io_size    << from OPTIMAL TRANSFER LENGTH GRANULARITY, scsi INQUIRY page 0xb0 and scsi READ CAPACITY(16)
/sys/block/<disk>/queue/optimal_io_size    << from OPTIMAL TRANSFER LENGTH            , scsi INQUIRY page 0xb0 and scsi READ CAPACITY(16)
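Applying that mapping to the values from the original question, and assuming a 512-byte logical block (which the arithmetic below implies), with host mode 68 enabled the array presumably reports an optimal transfer length granularity of 128 blocks and an optimal transfer length of 86016 blocks. A quick shell check:

```shell
# minimum_io_size = OPTIMAL TRANSFER LENGTH GRANULARITY x logical block length
echo $((128 * 512))       # prints 65536

# optimal_io_size = OPTIMAL TRANSFER LENGTH x logical block length
echo $((86016 * 512))     # prints 44040192
```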

For example,

# sg_inq -p 0xb0 /dev/sdacn
VPD INQUIRY: Block limits page (SBC)
  Optimal transfer length granularity: 1 blocks   << [1] block-size x this field = minimum_io_size, block-size defined in READCAP command below.
  Maximum transfer length: 8192 blocks
  Optimal transfer length: 8192 blocks            << [2] block-size x this field = optimal_io_size
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks


# sg_readcap -16 /dev/sdacn
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Thin provisioning: tpe=0, tprz=0
   Last logical block address=4194303 (0x3fffff), Number of logical blocks=4194304
   Logical block length=512 bytes                << [3] this is block size (length), multiplier for above fields.
   Logical blocks per physical block exponent=0
   Lowest aligned logical block address=0
Hence:
   Device size: 2147483648 bytes, 2048.0 MiB, 2.15 GB

...and we indeed do see those numbers here:

# grep -v "zz" /sys/block/sdacn/queue/*io_size
/sys/block/sdacn/queue/minimum_io_size:512       << [1] 1 block     x [3] 512-bytes/block =     512 bytes.
/sys/block/sdacn/queue/optimal_io_size:4194304   << [2] 8192 blocks x [3] 512-bytes/block = 4194304 bytes.

In short, these values come from the storage itself: the array reports them to the host via the SCSI commands shown above, and the kernel calculates the sysfs values from them. Changing a configuration mode within the storage changes the reported numbers, and therefore the values seen.

This also implies that it is possible to see different optimal_io_size values for the same LUN across different access paths. This can happen due to storage controller configuration issues, manual changes (echo), or a trigger of udev rules, and can result in an exceptionally high optimal_io_size on the dm(multipath) device, or in paths that will not function. If this is seen, and the sg_readcap and sg_inq output lines up with what is in /sys, it is likely a storage controller configuration issue, and the system's SAN vendor should be contacted for analysis. Below is an example of what this would look like; note that 4278190080 is the LCM (least common multiple) of 4177920 and 16777216:

# multipath -ll disk1
disk1 (wwid_entry_omitted) dm-108 3PARdata,VV
size=80G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:0:152 sdcg 69:64   active ready running
  |- 1:0:0:152 sdjc 8:352   active ready running
  |- 0:0:1:152 sdfr 130:208 active ready running
  `- 1:0:1:152 sdmn 69:496  active ready running

# for i in dm-108 sdcg sdjc sdfr sdmn ; do cat /sys/block/$i/queue/optimal_io_size ; done
4278190080     << LCM of 16777216 and 4177920
16777216       << Optimal transfer length: 32768 blocks x Logical block length=512 bytes
16777216       << Optimal transfer length: 32768 blocks x Logical block length=512 bytes
4177920        << Optimal transfer length: 8160 blocks x Logical block length=512 bytes
4177920        << Optimal transfer length: 8160 blocks x Logical block length=512 bytes
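The LCM the kernel arrives at for the multipath device can be reproduced with a small script (a sketch using shell arithmetic and Euclid's algorithm; the kernel performs its own equivalent calculation when stacking block limits):

```shell
#!/bin/sh
# Compute lcm(a, b) = a / gcd(a, b) * b.
gcd() {
    a=$1; b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

a=16777216   # optimal_io_size of the sdcg/sdjc paths
b=4177920    # optimal_io_size of the sdfr/sdmn paths
g=$(gcd "$a" "$b")
echo $((a / g * b))    # prints 4278190080
```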


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.