Virtual machine reports a "BUG: soft lockup" (or multiple at the same time)


Environment

  • Red Hat Enterprise Linux (RHEL)
    • Any version
  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat Enterprise Linux CoreOS (RHCOS)
    • 4
  • Red Hat Satellite
    • 6
    • Discovery provisioning
  • Virtual machine

Issue

  • Virtual machine guest suffers multiple soft lockups at the same time.

  • We are experiencing kernel panics on virtual machines due to soft lockups.

  • OpenShift node flapping in NotReady status.

  • Logs show messages like (examples from different sources):

    BUG: soft lockup - CPU#6 stuck for 73s! [flush-253:0:1207]
    BUG: soft lockup - CPU#7 stuck for 74s! [processname:15706]
    BUG: soft lockup - CPU#5 stuck for 63s! [processname:25582]
    BUG: soft lockup - CPU#0 stuck for 64s! [processname:15789]
    
    <time> <hostname> kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 25s! [ksoftirqd/6:38]
    <time> <hostname> kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ksoftirqd/7:43]
    <time> <hostname> kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 24s! [NetworkManager:945]
    <time> <hostname> kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [watchdog/7:41]  
    

    The following line may (or may not) also be logged:

    hrtimer: interrupt took NUMBER ns
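
    To spot these events on a running guest, the kernel log can be searched and the reported CPU number and stuck time extracted. A minimal sketch in shell (the log line below is one of the examples above; on a live system the input would typically come from journalctl -k or dmesg):

```shell
# Search kernel messages for soft lockup reports, e.g.:
#   journalctl -k | grep -i "soft lockup"

# Parse a sample report line to pull out the CPU number and stuck duration.
line='BUG: soft lockup - CPU#6 stuck for 73s! [flush-253:0:1207]'
cpu=$(echo "$line" | sed -n 's/.*CPU#\([0-9]*\).*/\1/p')
secs=$(echo "$line" | sed -n 's/.*stuck for \([0-9]*\)s.*/\1/p')
echo "CPU=$cpu stuck=${secs}s"    # -> CPU=6 stuck=73s
```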
    

Resolution

One of the most probable causes of these messages is an overcommitted or overloaded host/hypervisor. Please engage the hypervisor administrators or vendor for assistance.

Note: OpenShift nodes are especially sensitive to this kind of issue, particularly if processes like the Kubelet, the API server, or etcd are affected. It can cause nodes to flap into NotReady status (automatically recovered once the Kubelet is able to report status again), request timeouts, and in the worst cases even etcd database inconsistencies.

If the host/hypervisor is clearly not overcommitted or overloaded, a vmcore will help troubleshoot other possible causes.

Note: if this issue is encountered while performing discovery provisioning on Red Hat Satellite 6, use a discovery ISO provided by Red Hat on the Satellite server itself, not an upstream foreman-discovery ISO.

Avoid kernel panics due to soft lockups

Ensure that the kernel.softlockup_panic sysctl parameter is set to 0:

  • Check the current value (the default value if not configured is 0/disabled):

    # sysctl kernel.softlockup_panic
    
  • Optionally, adjust the parameter in /etc/sysctl.conf and refresh configuration:

    # echo "kernel.softlockup_panic=0" >> /etc/sysctl.conf
    # sysctl -p
    

    NOTE: since RHEL 7, this parameter should be set to 0 by default in virtual machines.
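
    The current setting can also be read directly from /proc. A small sketch that tolerates environments where the file is not exposed (for example, some containers):

```shell
# softlockup_panic: 0 = only report soft lockups, 1 = panic (and, with
# kdump configured, capture a vmcore). Read the live value from /proc.
f=/proc/sys/kernel/softlockup_panic
if [ -r "$f" ]; then
    val=$(cat "$f")
else
    val="unavailable"   # /proc/sys not exposed in this environment
fi
echo "kernel.softlockup_panic=$val"
```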

Workaround to increase the time threshold for soft lockups

It is possible to increase the time threshold before a soft lockup is reported (the maximum is 60 seconds), which prevents shorter vCPU lags from producing these messages. Note that this configuration does not fix the underlying issue; it only raises a timeout:

# echo "kernel.watchdog_thresh=<new value>" >> /etc/sysctl.conf
# sysctl -p

NOTE: in older RHEL versions, this parameter was named kernel.softlockup_thresh.
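
As a sketch, the threshold can also be persisted via a sysctl drop-in file instead of appending to /etc/sysctl.conf. The value 30 and the file name 99-watchdog.conf are illustrative assumptions; on recent kernels a soft lockup is reported after roughly twice kernel.watchdog_thresh:

```shell
# Illustrative only: write the setting to a local file; on a real host this
# would be /etc/sysctl.d/99-watchdog.conf, applied with: sysctl --system
thresh=30                       # assumed value; the maximum accepted is 60
conf=./99-watchdog.conf         # stand-in for /etc/sysctl.d/99-watchdog.conf
echo "kernel.watchdog_thresh=$thresh" > "$conf"
cat "$conf"                     # -> kernel.watchdog_thresh=30
```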

Root Cause

A soft lockup (ref. what is a soft lockup?) reported on a virtual machine, especially for much longer than the configured threshold, or multiple soft lockups reported at the same time, particularly (but not only) when they occur in different processes and at points of execution where there is no reason for a lockup, is commonly the result of the hypervisor not scheduling virtual CPUs in a timely manner.

In other words, these messages are the result of a large virtualization lag, most likely due to hypervisor overcommitment. Please refer to virtualization lags and hypervisor overcommitment for more details on this topic.

Please note that a soft lockup can occur for reasons other than hypervisor overcommitment, such as kernel bugs or bugs in third-party kernel modules. If the environment of concern exhibits the symptoms noted, please attempt the steps in the Resolution section and lower the workload on the hypervisor. Should the symptoms continue after doing so, please engage those responsible for providing support (either Red Hat directly or a third party, as defined by your support contract). Collecting a vmcore will help to troubleshoot the issue.
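Before a vmcore can be captured, kdump must be active in the guest. A hedged check, assuming a systemd-based RHEL system (the fallback covers environments without systemctl):

```shell
# Check whether the kdump service is active; a vmcore is only captured on
# panic when kdump is properly configured (see the RHEL kdump documentation).
state=$(systemctl is-active kdump 2>/dev/null || true)
state=${state:-unavailable}     # systemctl missing or not answering
echo "kdump service: $state"
```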


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.