udevd worker unexpectedly returned with status 0x0100

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 6, several minor versions
  • udevd

Issue

  • A host lost paths to storage configured with device mapper multipath, and as the paths came back, many errors like the following were logged:

      udevd[11136]: worker [13191] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-4/target2:0:2/2:0:2:31/scsi_device/2:0:2:31'
      udevd[11136]: worker [13192] unexpectedly returned with status 0x0100
      udevd[11136]: worker [13193] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host2/rport-2:0-5/target2:0:3/2:0:3:32/block/sdvv'
      udevd[11136]: worker [13204] unexpectedly returned with status 0x0100
    
  • When rebooting a system, errors similar to the ones above are logged.

  • After patching and rebooting a RHEL 6 system, the boot hangs with udev errors.

  • A RHEL 6 server sometimes hangs, panics, or reboots with the udev error message: udevd worker unexpectedly returned with status 0x0100.

  • An Oracle RAC server running Red Hat Enterprise Linux 6 hung with the error message "udevd worker unexpectedly returned with status 0x0100" and had to be rebooted manually to recover.
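
To gauge how often the two messages shown above occur, the log can be summarized with a small shell helper. This is a minimal sketch, not a supported tool: the function name summarize_udevd_errors is hypothetical, and /var/log/messages is assumed to be the syslog destination, which may differ on a given host.

```shell
#!/bin/bash
# Sketch: count udevd worker errors in syslog text read from stdin.
# summarize_udevd_errors is a hypothetical helper name, not part of udev.
summarize_udevd_errors() {
    awk '
        /udevd\[[0-9]+\]: worker \[[0-9]+\] failed while handling/ { failed++ }
        /udevd\[[0-9]+\]: worker \[[0-9]+\] unexpectedly returned with status/ { status++ }
        END { printf "failed_while_handling=%d unexpected_status=%d\n", failed, status }
    '
}

# Usage (assumes the default syslog file; adjust the path as needed):
#   summarize_udevd_errors < /var/log/messages
```

Comparing the counts before and after a change (for example a package update) gives a rough indication of whether the change helped.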

Resolution

  • Ensure the kernel option log_buf_len=4M (or a larger value) is used. This increases the kernel log buffer and prevents cases where the kernel fills up the log with messages, e.g. about discovered LUNs.

  • Update the udev packages udev, libudev and libgudev1 to 147-2.63.el6_7.1 (released with RHBA-2015-2654) or later, which includes fixes for the known issues. After the package update, rebuild the initramfs image and reboot:

    • yum update udev libudev libgudev1
    • dracut -f
    • Perform a cold boot after updating the packages:
      • shutdown -h now
      • Wait a few minutes, then boot the system
  • Disable hal if possible: if the system is not used as a desktop, disable hal. Details can be found in "What is hald service used for?".

  • In addition, if EMC PowerPath is installed and the issue is observed, update the PowerPath software to an appropriate version after updating the RHEL OS. Contact EMC for assistance with this.

  • Issues have been seen where starting the system with a single CPU restored normal operation. If this situation is hit, two things should be done:

    • Test whether the system also starts without kernel options restricting the initial CPUs to 1 once the following rule in /lib/udev/rules.d/40-redhat.rules is commented out, i.e. changed from ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'" to #ACTION=="add", KERNEL=="cpu[0-9]*", RUN+="/bin/bash -c 'echo 1 > /sys/devices/system/cpu/%k/online'". This modification can stay in place; the only downside is that CPUs added on the fly (without a reboot) will not be onlined automatically. This can affect real hardware as well as virtual guests, e.g. KVM guests getting CPUs added.
    • If this change improves the situation, please contact Red Hat Support and ask for a comment to be left in (private) bz1310159.
  • If you believe you are still hitting this issue, contact Red Hat Support to open a case and reference this article.
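
Two of the steps above lend themselves to small shell helpers: checking whether log_buf_len is already on the kernel command line, and commenting out the CPU auto-online rule. The following is a minimal sketch under stated assumptions. The function names has_log_buf_len and disable_cpu_online_rule are hypothetical, the rule line is assumed to appear exactly as quoted in this article, GNU sed is assumed, and grubby is assumed to be available for editing the boot configuration.

```shell
#!/bin/bash
# Sketch of two helpers for the resolution steps. Both function names are
# hypothetical; they are not part of udev or any RHEL package.

# Succeeds if the given kernel command line carries a log_buf_len= option.
# With no argument, the running kernel's /proc/cmdline is inspected.
has_log_buf_len() {
    local cmdline="${1:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" log_buf_len="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Comments out the CPU auto-online rule in the given rules file and keeps a
# backup with a .bak suffix. An already commented line is left untouched.
# Assumes the rule starts exactly as quoted in this article.
disable_cpu_online_rule() {
    local rules_file="$1"
    sed -i.bak '/^ACTION=="add", KERNEL=="cpu\[0-9]\*"/ s/^/#/' "$rules_file"
}

if has_log_buf_len; then
    echo "log_buf_len is set on the running kernel"
else
    echo "log_buf_len is not set; one way to add it for all installed kernels (as root):"
    echo "  grubby --update-kernel=ALL --args=log_buf_len=4M"
fi

# Usage for the rule change (as root, on an affected system only):
#   disable_cpu_online_rule /lib/udev/rules.d/40-redhat.rules
```

The helper only prints the grubby command rather than running it, so the boot configuration is never changed without an explicit decision. Files under /lib/udev/rules.d belong to packages, so a later udev update may restore the original rule.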

Root Cause

  • The number of spawned udevd workers depends only on the amount of RAM. As a result, on machines with a relatively large amount of RAM and many disks, many udevd workers run in parallel and saturate CPU and I/O. These hardware bottlenecks can cause udev events to time out.

  • The update also includes a fix that governs multiple parallel driver loads triggered via modprobe. It prevents unnecessary driver loads, which contributed to high system resource use during device discovery and could also cause udev events to time out.

  • This and related issues were fixed in udev-147-2.63.el6_7.1 via (private) bugzillas 1281469 and 1281467. Additional fixes from (private) bugzillas 1170313, 885978 and 816724 address other related issues that contribute to 0x0100 messages being displayed.

Diagnostic Steps

  • The kernel option maxcpus=1 can be used as a workaround in some cases (this option degrades performance and is meant for debugging).

  • If the issue is occurring post-boot, enable additional udev logging.

    # udevadm control --log-priority=info
    
  • Try increasing the event timeout by adding this line to /lib/udev/rules.d/10-dm.rules:

    OPTIONS+="event_timeout=600"
    

    For example:

    ...
    ENV{DM_UDEV_RULES_VSN}="2"
    OPTIONS+="event_timeout=600" 
    ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="(DM_DIR)/$env{DM_NAME}"
    ...
    
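
The timeout change above can also be applied with a short script instead of a manual edit. This is a minimal sketch, assuming GNU sed and assuming the rules file contains the ENV{DM_UDEV_RULES_VSN}="2" line shown in the excerpt; the function name add_event_timeout is hypothetical.

```shell
#!/bin/bash
# Sketch: insert OPTIONS+="event_timeout=N" after the DM_UDEV_RULES_VSN line.
# add_event_timeout is a hypothetical helper name, not part of udev or lvm2.
add_event_timeout() {
    local rules_file="$1"
    local timeout="${2:-600}"

    # Leave the file alone if an event_timeout option is already present.
    if grep -q 'event_timeout=' "$rules_file"; then
        return 0
    fi

    # GNU sed one-line "a" command; a backup is kept with a .bak suffix.
    sed -i.bak "/^ENV{DM_UDEV_RULES_VSN}=\"2\"/a OPTIONS+=\"event_timeout=${timeout}\"" "$rules_file"

    # Report failure if the marker line was not found and nothing was added.
    grep -q "event_timeout=${timeout}" "$rules_file"
}

# Usage (as root):
#   add_event_timeout /lib/udev/rules.d/10-dm.rules 600
```

The function returns non-zero when the marker line is missing, so a silent no-op is not mistaken for success. As with any file under /lib/udev/rules.d, a package update may overwrite the change.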

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.