Server with many LUNs/paths crashes while booting/rescanning for LUNs

Solution Verified - Updated 5 Aug 2024

Environment

Red Hat Enterprise Linux (RHEL), all versions
Storage Area Network (SAN)

Issue

Server with many LUNs/paths crashes while booting/rescanning for LUNs
I have hundreds of LUNs, and with multiple paths I end up with thousands of /dev/sd* devices. While rescanning, I see a load average of 200 to 300 on the system.
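
As a rough illustration of how quickly the device count grows, the arithmetic can be sketched as follows. This is a simplified model that assumes every LUN is visible over the same number of paths and that each path shows up as one /dev/sd* device. The function name and the sample figures (300 LUNs, 8 or 4 paths) are illustrative assumptions, not measurements from an affected system.

```python
def sd_device_count(luns: int, paths_per_lun: int) -> int:
    """Return the number of /dev/sd* path devices under the simple model:
    one sd device per (LUN, path) pair."""
    if luns < 0 or paths_per_lun < 0:
        raise ValueError("counts must be non-negative")
    return luns * paths_per_lun


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show the scaling.
    for paths in (8, 4):
        total = sd_device_count(300, paths)
        print(f"300 LUNs x {paths} paths = {total} sd devices")
```

Under this model, halving the number of paths per LUN halves the number of devices that have to be discovered and identified during a scan.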

Resolution

The following points should be considered to prevent the issue:

Ensure that udev is updated to the versions mentioned in "udevd worker unexpectedly returned with status 0x0100".
If a serial console is configured, use console=ttyS0,115200n8. Lower speeds can lead to soft lockups.
If the system is not used as a desktop, disable hal. Details can be found in "What is hald service used for?".
Keep the number of LUNs and paths within reasonable limits. If the requirements allow it, reduce the number of paths per LU to 4 and also reduce the number of LUs per host, since the scan duration scales with the product of paths per LU and the number of LUs. In a setup with 8 paths per LU, every device has to be identified via all 8 paths using scsi_id / blkid, which causes a very high load.
Check whether the logs contain SCSI errors originating from the fiber network.
If automatic CPU onlining is not required, comment out the CPU onlining rule in /lib/udev/rules.d/40-redhat.rules so that it reads: #ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'". Details are available in "udevd worker unexpectedly returned with status 0x0100".
Check whether the SAN storage or the fiber network has performance issues that slow down all accesses.
Use the kernel options loglevel=4 log_buf_len=8M. This sets a sane log level and enlarges the kernel log buffer, preventing cases where the kernel fills up the log with messages, e.g. about discovered LUNs.
Use the following boot options in the grub configuration to reduce the number of udev children and to increase the timeout for udev workers: udevchilds=25 udevtimeout=600
When rescanning, do not use the "echo - - - > /sys/..." approach, since this is the "hard" approach: the three wildcards trigger a scan of every channel, target and LUN on the host. Either use scsi-rescan or add LUs with known parameters as described in the Storage Administration Guide.
Review the third-party udev rules for possible improvements, e.g. by running udevadm monitor while scanning.
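
The kernel options suggested above (loglevel=4, log_buf_len=8M, udevchilds=25, udevtimeout=600) can be compared against the running system with a small helper. The following is a minimal sketch, not a supported tool: the function names are hypothetical, the parser splits on whitespace only (quoted values containing spaces are not handled), and the expected values are simply the ones listed in this article, which may not apply to every RHEL release.

```python
from pathlib import Path

# Values taken from the recommendations in this article.
RECOMMENDED = {
    "loglevel": "4",
    "log_buf_len": "8M",
    "udevchilds": "25",
    "udevtimeout": "600",
}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a key -> value mapping.
    Flags without '=' map to None; a repeated key keeps its last value."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_options(cmdline: str, recommended=RECOMMENDED) -> dict:
    """Return the recommended options that are absent or set differently."""
    current = parse_cmdline(cmdline)
    return {k: v for k, v in recommended.items() if current.get(k) != v}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        for key, value in missing_options(proc.read_text()).items():
            print(f"not set as suggested: {key}={value}")
```

An empty result means all four options are already present with the suggested values.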

Root Cause

Multiple factors can contribute to such issues:

Number of LUNs and number of paths per LUN
Speed of the HBA, fiber network and SAN target
Number of udev children in use and the udev timeout
Deployed third-party udev rules. Any software can install rules in /etc/udev/rules.d, and these contribute to the load when scanning.
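
To get an overview of locally deployed udev rules, the rule files can be listed together with the lines that start external programs, since those are the most likely to add load per device event. The sketch below is a heuristic under stated assumptions: the function names are hypothetical, the substring match on the udev keys RUN, PROGRAM and IMPORT{program} can produce false positives, and it does not evaluate whether a rule actually matches SCSI disk events.

```python
from pathlib import Path

# udev rule keys that cause an external program to be executed.
EXTERNAL_KEYS = ("RUN", "PROGRAM", "IMPORT{program}")


def list_rule_files(directories=("/etc/udev/rules.d",)):
    """Return the sorted *.rules files found in the given directories."""
    found = []
    for directory in directories:
        path = Path(directory)
        if not path.is_dir():
            continue
        found.extend(sorted(str(f) for f in path.glob("*.rules")))
    return found


def external_program_lines(rule_file):
    """Return non-comment lines that mention one of EXTERNAL_KEYS."""
    lines = []
    for raw in Path(rule_file).read_text(errors="replace").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if any(key in line for key in EXTERNAL_KEYS):
            lines.append(line)
    return lines


if __name__ == "__main__":
    for rule_file in list_rule_files():
        hits = external_program_lines(rule_file)
        print(f"{rule_file}: {len(hits)} line(s) invoking external programs")
```

Files with many such lines are reasonable first candidates to watch with udevadm monitor during a rescan.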

SBR

Storage

Product(s)

Red Hat Enterprise Linux
