Server with many LUNs/paths crashes while booting/rescanning for LUNs
Environment
- Red Hat Enterprise Linux (RHEL), all versions
- Storage Area Network (SAN)
Issue
- Server with many LUNs/paths crashes while booting/rescanning for LUNs
- I have hundreds of LUNs, and with multiple paths I get thousands of /dev/sd* devices. While rescanning, I see a load of 200 or 300 on the system.
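To check whether a system matches this description, the number of sd devices and the current load can be read from sysfs and procfs. The following is a minimal sketch; the function names `count_sd_devices` and `show_load` are hypothetical, and the optional directory argument exists only so the counter can be tried against a test directory.

```shell
#!/bin/bash
# Count block devices whose names start with "sd" in a sysfs-style directory.
# Defaults to /sys/block when no directory is given.
count_sd_devices() {
    local dir="${1:-/sys/block}"
    local count=0
    local entry
    for entry in "$dir"/sd*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        if [ -e "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print the 1-, 5- and 15-minute load averages.
show_load() {
    cut -d' ' -f1-3 /proc/loadavg
}

echo "sd devices: $(count_sd_devices)"
echo "load average: $(show_load)"
```

Running this before and during a rescan gives a rough picture of how the device count and the load develop.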
Resolution
The following points should be considered to prevent the issue:
- Ensure that udev is updated to the versions listed in "udevd worker unexpectedly returned with status 0x0100".
- If a serial console is configured, use this setting: console=ttyS0,115200n8. Lower speeds can lead to soft lockups.
- If the system is not used as a desktop, disable hal. Details can be found in "What is hald service used for?".
- Keep the number of LUNs and paths within reasonable limits. If the requirements allow it, reduce the number of paths per LU to 4 and also reduce the number of LUs per host, since scan duration grows with the product of paths per LU and the number of LUs. With 8 paths per LU, for example, every device has to be identified via all 8 paths using scsi_id / blkid, which causes a very high load.
- Do the logs show SCSI errors originating from the fiber network?
- If automatic CPU onlining is not required, comment out the CPU onlining rule in /lib/udev/rules.d/40-redhat.rules so that it reads:
  #ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'"
  Details are available in "udevd worker unexpectedly returned with status 0x0100".
- Is the SAN storage or fiber network having performance issues that slow down all accesses?
- Use the kernel options loglevel=4 log_buf_len=8M. This sets a sane loglevel and increases the log buffer, preventing cases where the kernel fills up the log with messages, e.g. about found LUNs. Reducing the loglevel helps here too.
- The following grub options reduce the number of udev children and increase the udev timeout for workers: udevchilds=25 udevtimeout=600
- When rescanning, do not use the "echo - - - > /sys/..." approach since this is the "hard" approach. Either use scsi-rescan or add LUs with known parameters as described in the Storage Administration Guide.
- Check the third-party udev rules (e.g. by running udevadm monitor while scanning) for possible improvements.
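Adding a single LU with known parameters can be sketched as follows. This assumes the SCSI host number, channel, target ID and LUN of the new LU are already known, and that the host's sysfs scan file accepts a "channel target lun" triple. The function name `scan_request` and the example values are hypothetical. The function only prints the command so that it can be reviewed before being run as root.

```shell
#!/bin/bash
# Build the sysfs scan request for one known LU instead of a wildcard scan.
# Arguments: SCSI host number, channel, target ID, LUN.
scan_request() {
    local host="$1" channel="$2" target="$3" lun="$4"
    echo "echo \"$channel $target $lun\" > /sys/class/scsi_host/host$host/scan"
}

# Example values only; substitute the parameters of the LU that was added.
scan_request 3 0 1 12
```

Unlike the wildcard form, a request like this asks the kernel to probe only one address, so it triggers udev events for a single new device rather than for every LU on the host.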
Root Cause
Multiple factors can contribute to such issues:
- number of LUNs and number of paths per LUN
- speed of the HBA, fiber network and SAN target
- number of udev children in use and the udev timeout
- deployed third-party udev rules. Any software package can install rules in /etc/udev/rules.d, and these rules add to the load when scanning.
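Two of these factors can be estimated quickly. The sketch below assumes each path to each LU appears as one /dev/sd* device, and that rule files in /etc/udev/rules.d end in .rules. The function names `estimate_sd_devices` and `count_rule_files` and the figure of 300 LUs are hypothetical examples, not values from a real system.

```shell
#!/bin/bash
# Expected number of /dev/sd* devices: one per path per LU.
estimate_sd_devices() {
    local lus="$1" paths_per_lu="$2"
    echo $((lus * paths_per_lu))
}

# Count udev rule files in a directory (default: /etc/udev/rules.d).
count_rule_files() {
    local dir="${1:-/etc/udev/rules.d}"
    find "$dir" -maxdepth 1 -type f -name '*.rules' 2>/dev/null | wc -l | tr -d ' '
}

echo "300 LUs x 8 paths: $(estimate_sd_devices 300 8) devices"
echo "300 LUs x 4 paths: $(estimate_sd_devices 300 4) devices"
echo "rule files in /etc/udev/rules.d: $(count_rule_files)"
```

With these example figures, going from 8 to 4 paths halves the device count from 2400 to 1200, and with it the number of devices that udev has to process during a scan.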
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.