How to check and measure FC HBA utilization?
Environment
- Red Hat Enterprise Linux (RHEL)
- FC HBA, such as QLogic or Emulex.
Issue
- How do I check and measure FC HBA utilization?
- How do I determine how much I/O is going through an HBA?
- How do I determine if an HBA is being overloaded with I/O?
- How do I determine if an HBA has reached its available capacity limit?
- We want to monitor the data transferred per second through each Fibre Channel adapter.
Resolution
- Red Hat does not provide any tool for measuring HBA utilization or throughput. Individual vendors may provide such tools for their HBAs, but they are not provided nor supported by Red Hat.
Root Cause
- Linux tools such as iostat and sar are block-device based; they are not aware of SCSI addresses (H:C:T:L), storage type, or storage topology.
Diagnostic Steps
- One way of achieving this is to write a simple program that creates a pseudo hostN device and, within each iostat sample, aggregates all sdN devices that share the same host adapter into that hostN device.
NOTE: the lsscsi command used below comes from the lsscsi package, which may need to be installed before the command is available.
- For example, lsscsi output shows which devices share the same host adapter.
$ lsscsi
[0:0:0:0]    disk  HITACHI  OPEN-V  7005  /dev/sda
[0:0:0:1]    disk  HITACHI  OPEN-V  7005  /dev/sdb
[0:0:0:2]    disk  HITACHI  OPEN-V  7005  /dev/sdc
[0:0:0:3]    disk  HITACHI  OPEN-V  7005  /dev/sdd
  :
[0:0:0:331]  disk  HITACHI  OPEN-V  7005  /dev/sdyk
[0:0:0:332]  disk  HITACHI  OPEN-V  7005  /dev/sdyl
[2:0:0:0]    disk  HITACHI  OPEN-V  7005  /dev/sdln
[2:0:0:1]    disk  HITACHI  OPEN-V  7005  /dev/sdlo
[2:0:0:2]    disk  HITACHI  OPEN-V  7005  /dev/sdlp
  :
- All sdN devices with a SCSI address of 0:*:*:* go through the same adapter. All sdN devices with 2:*:*:* go through a second HBA. This type of information can be used to aggregate sdN iostat information into a hostN pseudo-device sample line.
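The sdN-to-host grouping above can be derived mechanically from lsscsi output. The following is a minimal sketch of that parsing step, for illustration only (the function name and regular expression are assumptions, not part of any supported tool):

```python
import re

def host_map(lsscsi_output):
    """Map each /dev/sdN device to its SCSI host number (the H in H:C:T:L)."""
    mapping = {}
    for line in lsscsi_output.splitlines():
        # Lines look like: [0:0:0:1]  disk  HITACHI  OPEN-V  7005  /dev/sdb
        m = re.match(r'\[(\d+):\d+:\d+:\d+\].*(/dev/sd\w+)\s*$', line)
        if m:
            mapping[m.group(2)] = int(m.group(1))
    return mapping

sample = """\
[0:0:0:0]  disk  HITACHI  OPEN-V  7005  /dev/sda
[0:0:0:1]  disk  HITACHI  OPEN-V  7005  /dev/sdb
[2:0:0:0]  disk  HITACHI  OPEN-V  7005  /dev/sdln
"""
print(host_map(sample))
# {'/dev/sda': 0, '/dev/sdb': 0, '/dev/sdln': 2}
```

In practice the input would be read from a saved `lsscsi.out` file rather than an inline string.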
- We can see and illustrate this type of issue with sar using a simple test system. Essentially, sar double counts things because it does not understand the underlying storage topology. First, which devices are present on this test system:
$ iostat -tkx 1 1
Device:  rrqm/s  wrqm/s   r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sda2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sda4       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd        0.00    9.00  0.00   2.00    0.00   44.00     44.00      0.01   3.00     0.00     3.00   3.00   0.60
sdd1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd3       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd4       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd5       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd6       0.00    9.00  0.00   2.00    0.00   44.00     44.00      0.01   3.00     0.00     3.00   3.00   0.60
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-1       0.00    0.00  0.00  10.00    0.00   40.00      8.00      0.03   3.00     0.00     3.00   0.30   0.30
dm-2       0.00    0.00  0.00   1.00    0.00    4.00      8.00      0.00   3.00     0.00     3.00   3.00   0.30
- So there are two disks, each with multiple partitions, plus a couple of device-mapper devices (LVM in this case). And from sar output for this test system:
00:00:01        tps      rtps      wtps   bread/s   bwrtn/s
01:29:01      95.03      0.13     94.90      1.07   1495.85
  :
- Now looking at the data for the individual devices that were summed into the above 01:29:01 sample:
00:00:01   DEV         tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
01:29:01   dev8-0     4.20      0.53    701.58    167.14      0.06  14.33   0.92   0.39  (sda)
01:29:01   dev8-1     0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-2     0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-3     0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-4     0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-5     4.20      0.53    701.58    167.14      0.06  14.33   0.92   0.39
01:29:01   dev8-6     0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-16    3.07      0.00     92.68     30.22      0.13  42.78   1.07   0.33  (sdb)
01:29:01   dev8-17    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-18    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-19    3.07      0.00     92.68     30.22      0.13  42.78   1.07   0.33
01:29:01   dev8-20    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev8-21    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev253-0   0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00  (dm-0)
01:29:01   dev253-1   0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
01:29:01   dev253-2   3.20      0.00     25.60      8.00      0.00   0.22   0.07   0.02
01:29:01   dev253-3  84.56      0.53    675.98      8.00      5.42  64.15   0.04   0.36
- You will note that dev253-* are dm-* devices. The dm-* I/O statistics are covered (duplicated) in the sda+sdb output; specifically, the dm-* devices are mapped on top of sda and its partitions. The sum of tps for the dev253-* (dm-*) devices totals 87.76, but this represents I/O requests before being merged within the I/O scheduler and passed on to the underlying device. These 87.76 tps are equal to the 4.20 tps on sda. We can confirm this by looking at the totals of rd_sec/s and wr_sec/s: dev253 rd = 0.53, same as sda; dev253 wr = 701.58, again the same as sda. So although the I/Os within dev253-* were smaller and more numerous, the total amount of data transferred is equal, as expected. But let's refocus on the sar total. It indicates that 95.03 tps was measured. Adding up sda + sdb[1] on this system gives 7.27 tps -- so where did the additional tps come from? The tps total of sda + sdb + dm-* = 7.27 + 3.20 + 84.56 = 95.03. Essentially, sar is double counting. The same goes for data transferred: the total blocks read per second is 0.53 on sda, and sdb was idle; but sar just totals all numbers, adding in dev253-3's 0.53 and coming up with 1.07 -- double the actual value transferred to disk. It is a common problem with sar and many other third-party tools: they do not understand that dev253 devices are mapped on top of other devices/disks. You could argue that the totals are correct -- they are a sum of all I/O activity -- and that the real problem only comes into play when that total is interpreted as the total amount of I/O to disks (sdN).
[1] sar knows dev8-1 through dev8-6 are partitions under dev8-0 due to the minor numbers, just as it knows dev8-17 through dev8-21 are 5 partitions under dev8-16 (sdb). It won't double count partition data into the totals.
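The double-counting arithmetic above can be verified directly from the sample's numbers:

```python
# tps values taken from the sar sample above
sda, sdb = 4.20, 3.07
dm = [0.00, 0.00, 3.20, 84.56]   # dev253-0 .. dev253-3

disks_only = sda + sdb           # transactions actually hitting the disks
sar_total = disks_only + sum(dm) # sar adds the dm-* layer on top again

print(round(disks_only, 2))  # 7.27
print(round(sar_total, 2))   # 95.03 -- matches sar's reported tps total
```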
- Also note that dev8 and dev253 aren't the only major numbers in use; for example:
04:00:02 PM   DEV         tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  :
04:10:01 PM   dev104-0   0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
04:10:01 PM   dev104-1   0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
04:10:01 PM   dev104-16  0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
04:10:01 PM   dev104-32  0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
04:10:01 PM   dev104-48  0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
04:10:01 PM   dev105-0   0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00
  :
- Note that in the above case there are major device numbers 104 and 105; on this system, 104 is cciss block devices. Add in third-party drivers, which get their own major device numbers, and it becomes difficult to say which devices should be included within a total I/O byte count moved to storage and which shouldn't. And that is exactly the problem facing sar and iostat: they don't know about storage topology or what's connected to what, so they just total up everything.
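A tool that wanted to avoid the double counting could discover the stacking itself: the kernel lists each stacked block device's underlying devices in the /sys/block/&lt;dev&gt;/slaves directory. A small sketch (output depends entirely on the local system's devices):

```python
import os

def slaves(dev):
    """Return the block devices a stacked device (dm-*, md*) is built on,
    using the /sys/block/<dev>/slaves directory maintained by the kernel."""
    path = '/sys/block/%s/slaves' % dev
    return sorted(os.listdir(path)) if os.path.isdir(path) else []

# Print the stacking relationships on the local system; e.g. an LVM volume
# carved out of sda2 would show something like:  dm-1 ['sda2']
if os.path.isdir('/sys/block'):
    for dev in sorted(os.listdir('/sys/block')):
        print(dev, slaves(dev))
```

Devices whose slaves list is non-empty are duplicating I/O already counted on the devices beneath them.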
The following information has been provided by Red Hat, but is outside the scope of coverage of the posted Service Level Agreements and support procedures. The information is provided as-is and any configuration settings or installed applications made from the information in this article could make the Operating System unsupported by Red Hat Global Support Services. The intent of this article is to provide information to accomplish the system's needs. Use of the information in this article is at the user's own risk.
- Example source for the 'iopseudo' program is available at the 'iopseudo' source link, but it is for illustrative purposes only and, per the above, is provided as-is and is unsupported.
- The following is an example of aggregating sdN devices within iostat into hostN pseudo-devices using an internal test box.
$ lsscsi > lsscsi.out
$ iostat -tkx 1 | ./iopseudo -m=lsscsi.out -Hv > sample.1
- Let's look at one sample of output from the above, but first we'll remove all non-sdN lines, partitions, and idle devices. The highlighted disks all have SCSI addresses of 3:*:*:*, so they are on Host3.
$ cat sample.1 | egrep -v "VxVM|dm\-|cciss" | grep -v 'sd[a-z]\+[!0-9]' | grep -v " 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"

Note: you may need to adjust the above grep depending on which RHEL version is being run and which third-party storage packages are in use.
12/13/2019 11:24:54 AM
avg-cpu: %user %nice %system %iowait %steal %idle
          0.06    0.00      0.98     20.58    0.00     78.38

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdi 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 1.00 0.00 4.00 0.00 8.00 1.00 1.00 1.00 0.00 1000.00 100.00
sdq 0.00 0.00 18.00 0.00 1092.00 0.00 121.33 1.00 87.72 87.72 0.00 55.50 99.90
sds 0.00 0.00 2987.00 0.00 17922.00 0.00 12.00 1.97 0.85 0.85 0.00 0.33 100.00
sdr 0.00 0.00 2986.00 0.00 17916.00 0.00 12.00 0.97 0.33 0.33 0.00 0.33 97.30
sdt 0.00 0.00 5036.00 0.00 20144.00 0.00 8.00 0.95 0.17 0.17 0.00 0.19 95.40
sdv 0.00 0.00 19.00 0.00 1026.00 0.00 108.00 1.00 85.16 85.16 0.00 52.68 100.10
sdu 0.00 0.00 8960.00 0.00 35840.00 0.00 8.00 1.92 0.21 0.21 0.00 0.11 100.00
sdy 0.00 0.00 598.00 0.00 17940.00 0.00 60.00 2.00 4.26 4.26 0.00 1.67 100.10
sdz 0.00 0.00 1794.00 0.00 17934.00 0.00 19.99 1.96 1.37 1.37 0.00 0.56 100.10
sdab 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdcn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 5.10
sdcv 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 4.00 4.00 0.00 4.00 0.40
sdfs 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdie 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdjr 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 1.00 1.00 0.00 1.00 0.10
sdjv 0.00 0.00 1.00 0.00 6.00 0.00 12.00 1.00 1528.00 1528.00 0.00 1001.00 100.10
sdka 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 100.10
sdla 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
Host3 0.00 0.00 8025.00 0.00 38072.00 0.00 4.74 2.97 0.23 0.23 0.00 0.37 100.00
Host2 0.00 0.00 14382.00 0.00 91780.00 0.00 6.38 11.85 0.98 0.98 0.00 0.49 100.10
- The lsscsi.out shows that devices sdp, sdr, sdt, sdcv, and sdie are all on SCSI Host3, and the rest of the devices are on Host2:
[3:0:1:2]    disk  LIO-ORG  bud-2    4.0  /dev/sdp
[3:0:1:3]    disk  LIO-ORG  bud-3    4.0  /dev/sdr
[3:0:1:4]    disk  LIO-ORG  bud-4    4.0  /dev/sdt
  :
[3:0:1:49]   disk  LIO-ORG  test-14  4.0  /dev/sdcv
  :
[3:0:1:120]  disk  LIO-ORG  test-85  4.0  /dev/sdie
- The simple iopseudo program looks up each sdN within lsscsi.out to get its associated H:C:T:L SCSI address, and on a match adds the sample's values to the corresponding hostN pseudo-device. At the end of the sample, it flushes out the aggregated hostN data. The rrqm/s, wrqm/s, r/s, w/s, rkB/s, and wkB/s values are simply added together; the rest of the data in the line, avgrq-sz through %util, can be ignored for now. What we're most interested in is the rkB/s and wkB/s fields within the hostN lines, as these show how much data per second is being transferred through each HBA. Within the sample above, that is 38072 kB/s and 0 kB/s respectively for Host3.
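The aggregation step just described can be sketched as follows. This is not the actual iopseudo source; the function name, column handling, and input format are assumptions for illustration:

```python
from collections import defaultdict

ADDITIVE = 6  # rrqm/s wrqm/s r/s w/s rkB/s wkB/s -- the fields that simply sum

def aggregate(iostat_lines, dev_to_host):
    """Sum the additive iostat fields of each sdN device into its hostN pseudo-device."""
    totals = defaultdict(lambda: [0.0] * ADDITIVE)
    for line in iostat_lines:
        fields = line.split()
        host = dev_to_host.get(fields[0])   # sdN -> host number, from lsscsi.out
        if host is not None:
            for i in range(ADDITIVE):
                totals[host][i] += float(fields[1 + i])
    return {'host%d' % h: vals for h, vals in totals.items()}

devmap = {'sds': 3, 'sdr': 3}  # mapping built from lsscsi.out as shown earlier
lines = ['sds 0.00 0.00 2987.00 0.00 17922.00 0.00',
         'sdr 0.00 0.00 2986.00 0.00 17916.00 0.00']
print(aggregate(lines, devmap))
# {'host3': [0.0, 0.0, 5973.0, 0.0, 35838.0, 0.0]}
```

Latency and utilization fields (await, svctm, %util) cannot be summed this way, which is why the sketch restricts itself to the additive counters.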
- The amount of bandwidth available to an FC HBA is based upon the link speed. On RHEL 7:
$ grep -Hv "zz" /sys/class/fc_host/host*/speed
/sys/class/fc_host/host0/speed: 4 Gbit
/sys/class/fc_host/host1/speed: unknown
/sys/class/fc_host/host2/speed: 4 Gbit
/sys/class/fc_host/host3/speed: 4 Gbit
- With a 4 Gbit link, the absolute limit is 400 MB/s read on the incoming link plus 400 MB/s written on the outgoing link. (Fibre Channel uses 8b/10b encoding: 10 line bits carry 8 data bits, one byte.) But this is raw link speed. With protocol and data framing overhead, etc., it's more typical to see a nominal limit of somewhere between 250 MB/s and 300 MB/s under reasonable application loads. If the hostN lines are above 300 MB/s on a 4 Gbit link, then you are likely approaching the link transport limit.
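The 400 MB/s figure follows directly from the link speed and the encoding overhead:

```python
# Back-of-the-envelope check of the 4 Gbit FC link limit quoted above.
link_bits_per_s = 4e9                       # nominal 4 Gbit/s link
data_bits_per_s = link_bits_per_s * 8 / 10  # 8b/10b encoding: 10 line bits per 8 data bits
mb_per_s = data_bits_per_s / 8 / 1e6        # bits -> bytes -> MB

print(mb_per_s)   # 400.0 MB/s per direction, the raw limit
```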
- For example, the following is from a system which also has 4 Gbit links:
Device:  rrqm/s  wrqm/s     r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz    await  svctm  %util
Host3      0.00    0.00  781.00  0.00  399872.00   0.00   1024.00    256.10   329.72  20.50  35.58
Host3      0.00    0.00  781.00  0.00  399872.00   0.00   1024.00   1024.82  1310.99  20.51  35.59
Host3      0.00    0.00  781.00  0.00  399872.00   0.00   1024.00    128.19   164.79  41.06  71.26
Host3      0.00    0.00  781.00  0.00  399872.00   0.00   1024.00    128.13   165.10  41.04  71.23
Host3      0.00    0.00  781.00  0.00  399872.00   0.00   1024.00    127.99   164.49  40.99  71.15
Host3      0.00    0.00  781.00  3.00  399872.00  12.00   1020.11    256.87   329.35  41.61  72.49
- The above is an extreme case, but we can see that the system has reached the maximum available link bandwidth capacity: rkB/s is pinned at roughly 400 MB/s.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.