How do I use hung task check in RHEL?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5.5 (kernel-2.6.18-194) or later
  • D state (uninterruptible sleep) processes present in system

Issue

  • Can I detect a hung process?

  • What are the following variables? What do these hung task configuration parameters and their values mean and control?

    # sysctl -a --pattern hung
    kernel.hung_task_warnings = 10
    kernel.hung_task_timeout_secs = 120
    kernel.hung_task_check_count = 32768
    kernel.hung_task_panic = 0
    
  • How can I use hung task check?

  • How can I automatically collect a vmcore when "hung_task_timeout_secs" messages are logged?

    INFO: task <process>:<pid> blocked for more than 120 seconds.  
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.  
    
  • How do I set the 'hung_task_panic' parameter?

  • When or under what circumstances should I avoid setting 'hung_task_panic'?

  • How can I reduce or disable the number of "task ... blocked for more than N seconds" events being logged?

Resolution

  • How to enable collection of a vmcore, via a panic, when a blocked task event is encountered:

    • The hung_task_panic parameter controls whether to panic or not upon detection of a task that is blocked in D state for more than hung_task_timeout_secs seconds. When hung_task_panic is enabled, the system will deliberately crash via a panic upon blocked task detection. A vmcore will then be captured if kdump is correctly configured on the system.

    • If you do not want hung_task_panic to persist across reboots, for example for a one-time vmcore collection upon stalled task detection:

      ```
      # echo '1' > /proc/sys/kernel/hung_task_panic 
      ```
      
    • If you want hung_task_panic to persist across reboots and shutdowns, refer to "How to set sysctl variables on Red Hat Enterprise Linux" or run the commands below to make it persistent:

      ```
      # cp /etc/sysctl.conf  /etc/sysctl.conf.orig
      # sed -i '/kernel\.hung_task_panic/d' /etc/sysctl.conf
      # echo 'kernel.hung_task_panic = 1' >> /etc/sysctl.conf ; sysctl -p
      ```
      
  • Additional precursor setup to capture a vmcore:

  • The following sysctl parameters control blocked task checking behaviour:

    • First, to view the currently set values for blocked task configuration parameters, use either of the following methods:

      ```
      # grep -Hv "zz" /proc/sys/kernel/hung*
      /proc/sys/kernel/hung_task_check_count:4194304
      /proc/sys/kernel/hung_task_panic:0
      /proc/sys/kernel/hung_task_timeout_secs:120
      /proc/sys/kernel/hung_task_warnings:10
      
      # sysctl -q kernel | grep hung | sort
      kernel.hung_task_check_count = 4194304
      kernel.hung_task_panic = 0
      kernel.hung_task_timeout_secs = 120
      kernel.hung_task_warnings = 10
      ```
      
    • Note: changing any or all of the below parameters to new values does not require a reboot in order for them to take effect.

  • hung_task_check_count: Maximum count of processes to check

    • When there are more processes than this number on the system, not all processes will be checked for a stalled (blocked) state.
    • The default value is 4,194,304 processes. If the number of processes in the running kernel exceeds this value, only the first 4,194,304 existing processes will be checked.
  • hung_task_panic: Whether to panic on tasks that are blocked for more than the hung_task_timeout_secs value

  • hung_task_timeout_secs: Check interval

    • A warning will be issued when a process is blocked for more than the number of seconds specified by this parameter.
    • The default value is 120 seconds.
    • When set to 0, blocked task checking is disabled.
    • This should be set to a value greater than the IO timeout (/sys/block/<device>/queue/io_timeout), nominally 2x to 3x the IO timeout period; otherwise a premature "task <process>:<pid> blocked for more than 120 seconds" message will result. The nominal setting of 2x the IO timeout allows an initial IO timeout to occur, plus one retry, before the hung task timeout logic triggers. When the hung task timeout is less than the IO timeout value, the hung task logic can trigger before the storage hardware has a fair chance of finishing even one IO attempt.
  • hung_task_warnings: Maximum number of warnings

    • After this many warnings are issued, no more will be shown until the value is set again or the system is rebooted.
    • The default value is 10.
    • The special value -1 (available since kernel-3.10.0-548.el7) results in an unlimited number of warnings being printed.
    • NOTE: This value is decreased by one after each warning is issued. When it reaches 0, warnings are disabled. The value is reset after a reboot.
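
The 2x to 3x guidance for hung_task_timeout_secs can be expressed as a small calculation. The following is a minimal sketch rather than a supplied tool: the function name `recommend_hung_task_timeout` and its arguments are hypothetical, and it assumes every IO timeout is passed in seconds. Confirm the unit of any value read from sysfs before using it, as some timeout files report milliseconds rather than seconds.

```shell
#!/bin/bash
# Hypothetical helper: print a suggested hung_task_timeout_secs value.
# Usage: recommend_hung_task_timeout MULTIPLIER IO_TIMEOUT_SECS [IO_TIMEOUT_SECS...]
# Assumption: all IO timeouts are whole numbers of seconds.
recommend_hung_task_timeout() {
    local multiplier=$1
    shift
    local max=0 t
    for t in "$@"; do
        # Keep the longest IO timeout seen so far.
        if [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    # The suggestion is the multiplier times the longest IO timeout.
    echo $(( multiplier * max ))
}
```

For example, `recommend_hung_task_timeout 2 30 60 300` prints 600. The result could then be applied at runtime with `sysctl -w kernel.hung_task_timeout_secs=600`.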

CAUTIONS!:

  • If hung_task_panic is enabled, false positives can unnecessarily panic your machine.

  • Although a task being blocked for more than 2 minutes is usually a good indication that something has gone wrong inside the kernel, there are circumstances in which a task can legitimately be blocked for this long, for example:

    • flushing large amounts of file system data to a slow device during umount() or sync() can take longer than the default hung_task_timeout_secs
    • slow disk response results in timeouts and retries
      • virtual guests using virtual devices, such as RHEL guests under VMware, often have the default device IO timeout values reset to 800-3600 seconds. With the default or another low value for hung_task_timeout_secs and hung_task_panic enabled, the system can panic under a false trigger ("VM panics with blocked tasks due to slow or stalled I/O") because the system does not honor the IO timeout value before panicking.

      • misconfiguration of hung_task_timeout_secs versus the current IO timeout value: if the hung task timeout is less than the IO timeout, then the hung task logic will be triggered before storage has had sufficient time to complete the IO as allowed by the IO timeout value.

        /sys/block/vda/queue/io_timeout:300
        /proc/sys/kernel/hung_task_timeout_secs:120  << misconfigured, should be minimum of 2x300 or 600 seconds.
        
      • while the default hung_task_timeout_secs is 2x the default IO timeout value, different types of devices and different environments can change the default IO timeout value to a larger number of seconds, causing false blocked task events to trigger before IO is allowed to complete, as well as panics if hung_task_panic is enabled.

      • when changing a device's timeout value to a higher number of seconds, a commensurate increase in the hung_task_timeout_secs setting should also be strongly considered

      • for example, if the disk timeout value is changed from 60s to 240s then hung_task_timeout_secs should be changed from 120s to 480s to match the allowed increased IO timeout value

      • IBM's lin_tape driver, for example, often uses an IO timeout in the range of 18-24 minutes, as some tape and changer IO can take nearly that long to complete successfully. With the default or another low value for hung_task_timeout_secs and hung_task_panic enabled, the system can panic under a false trigger condition ("The IO time out values set in IBM lin_tape driver causes hung task panic") because the system does not honor the IO timeout value before panicking.

  • As a general rule, hung_task_timeout_secs should be set to a value larger than the longest expected IO timeout value to prevent false triggering of the stalled task logic and its event output within messages. This is especially true if hung_task_panic is enabled, as false triggering of the stalled task logic will induce kernel panics.
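
The misconfiguration case described above (an IO timeout of 300 seconds with a hung task timeout of 120 seconds) can be checked with a short script. This is a sketch under stated assumptions: `check_hung_task_timeout` is a hypothetical name, both arguments are taken to be in seconds, and the 2x minimum is the nominal guidance from this article rather than a limit enforced by the kernel.

```shell
#!/bin/bash
# Hypothetical helper: compare a hung task timeout against an IO timeout.
# Usage: check_hung_task_timeout IO_TIMEOUT_SECS HUNG_TASK_TIMEOUT_SECS
# Returns 0 when the hung task timeout is at least twice the IO timeout
# (or when checking is disabled with 0), and 1 otherwise.
check_hung_task_timeout() {
    local io_timeout=$1 hung_timeout=$2
    local minimum=$(( 2 * io_timeout ))
    if [ "$hung_timeout" -eq 0 ]; then
        # A value of 0 disables blocked task checking entirely.
        echo "hung task checking is disabled"
        return 0
    fi
    if [ "$hung_timeout" -lt "$minimum" ]; then
        echo "misconfigured: hung_task_timeout_secs=$hung_timeout, should be at least $minimum"
        return 1
    fi
    echo "ok: hung_task_timeout_secs=$hung_timeout (minimum $minimum)"
    return 0
}
```

Calling `check_hung_task_timeout 300 120` reports the value as misconfigured with a suggested minimum of 600, while `check_hung_task_timeout 240 480` reports ok.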

Root Cause

  • What is a 'hung' task?

    • A 'hung' task is a bit of a misnomer, as the hung task checking really detects tasks that are stalled within D (uninterruptible) state; that is, the task is blocked waiting on the availability of a resource or event. Normally D state is a relatively short transitory state as the task waits for some resource. The most common resource is completion of pending IO, but waiting on other types of resources like locks or file system transactions (which potentially are made up of a large amount of IO associated with flushing dirty cache pages and metadata) can result in tasks being stuck in D state for long periods of time.
    • The difference between a stalled task and a truly hung task is that a stalled task will eventually exit D state and continue processing, whereas a truly hung task stands little chance of leaving D state. From a practical standpoint it is nearly impossible to determine whether the condition is just a long stall or a true hang. To this end, a reasonable time period is used to separate normal stalls from more likely true hung tasks. The setting of hung_task_timeout_secs specifies the time period after which a stalled task is considered a hung task.
      • This raises the question of what a reasonable time period setting should be. Common configurations where the timeout should be increased include virtual guests, configurations with tapes, and systems using NFS or other remote file systems. In general, hung_task_timeout_secs should be set high enough that normal IO and/or other resource stalls can occur without generating blocked task events. What is "normal" will be configuration specific, although the default value is set to 2x the nominal IO timeout value of 60 seconds. This allows at least one IO timeout period and allows the IO to be retried before classifying the task as hung versus merely temporarily stalled.
      • The following are some typical cases where hung_task_timeout_secs should be increased above its default value and/or hung_task_panic should be disabled.
      • Virtual guests: it is common to increase the IO timeout value within virtual guests from the nominal 30-60 seconds to 600-1800 seconds. hung_task_timeout_secs is typically a 2x multiple of that value, its default being 120 seconds. When the IO timeout value is increased, hung_task_timeout_secs should track any such changes, otherwise false positives can be detected. If the configuration indicates it is normal for IO to take up to 1800 seconds but the system starts declaring any task waiting on IO a "hung" task at 120 seconds, that is an unreasonable declaration in light of the expectation that IO may take up to 1800 seconds.
      • Tapes: Tape devices have a very long default IO timeout. Some tape indexing IO calls, for example on tape media changes, can take upwards of 90 minutes, and so their IO timeouts are set high accordingly. When a tape drive is present and actively being used, for example during backups, increasing hung_task_timeout_secs for the duration of tape use may need to be considered if the system generates a large number of hung task messages associated with tasks involved in tape usage.
      • NFS or similar remote file system: With congestion on the remote NFS server or on the network that serves it, the default 120 second value of hung_task_timeout_secs may need to be increased to avoid events being logged unnecessarily.
      • Large physical memory configurations (128GB or larger): The default dirty ratio allows upwards of 40% of available memory to be used to accumulate dirty buffer (delayed write) pages. Increasing the hung_task_timeout_secs value may be required to cover the maximum dirty pages case, allowing that amount of dirty pages to be drained (written out) to disk. For example, 128GB x 40% = approximately 51,000MB. If the storage device is capable of handling a 100MB/s write rate, then draining that amount of dirty pages can take 510 seconds or more. Either reducing the allowed dirty ratio or increasing hung_task_timeout_secs, or a combination of the two, may be required in cases with large amounts of physical memory present.
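
The dirty page arithmetic in the large memory case can be reproduced for other memory sizes and write rates. The sketch below is illustrative only: `estimate_dirty_drain_secs` is a hypothetical name, it uses whole-number shell arithmetic, and it assumes a steady write rate with the entire dirty limit needing to be flushed, which is a worst case rather than a measurement.

```shell
#!/bin/bash
# Hypothetical helper: estimate the seconds needed to write out the
# maximum amount of dirty page cache.
# Usage: estimate_dirty_drain_secs MEM_MB DIRTY_RATIO_PERCENT WRITE_RATE_MB_PER_SEC
estimate_dirty_drain_secs() {
    local mem_mb=$1 dirty_ratio=$2 write_rate=$3
    # Maximum dirty data in MB permitted by the dirty ratio.
    local dirty_mb=$(( mem_mb * dirty_ratio / 100 ))
    # Round up so that a partial second still counts as a full second.
    echo $(( (dirty_mb + write_rate - 1) / write_rate ))
}
```

With 128GB expressed as 131072MB, `estimate_dirty_drain_secs 131072 40 100` prints 525, in line with the rounded figure of 510 seconds above. The current ratio on a system can be read with `sysctl -n vm.dirty_ratio`.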

Diagnostic Steps

When a blocked (stalled) task exceeds the hung_task_timeout_secs time limit and is classified as a hung task, messages similar to the following are output (it is important to capture both the warning header and the related call trace to attach to a support case or debug request):

INFO: task sendmail:14349 blocked for more than 120 seconds.  
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.  
sendmail D ffff810001004420 0 14349 16022 14355 12937 (NOTLB)  
ffff81001b72bdc8 0000000000000086 ffff8101d91be700 ffffffff800099ae  
ffff8101d8acf0c0 0000000000000009 ffff8100172527a0 ffffffff80309b60  
0010f4837aa8f02a 0000000000004646 ffff810017252988 0000000006a5ca98  
Call Trace:  
[ffffffff800099ae] __link_path_walk+0x173/0xf42  
[ffffffff8002cd2c] mntput_no_expire+0x19/0x89  
[ffffffff80064c6f] __mutex_lock_slowpath+0x60/0x9b  
[ffffffff800236d7] __path_lookup_intent_open+0x56/0x97  
[ffffffff80064cb9] .text.lock.mutex+0xf/0x14  
[ffffffff8001afe7] open_namei+0xea/0x6d5  
[ffffffff800274fb] do_filp_open+0x1c/0x38  
[ffffffff80019e1e] do_sys_open+0x44/0xbe  
[ffffffff8005e28d] tracesys+0xd5/0xe0
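
To summarize which tasks were reported, the warning header lines can be extracted from a saved log. This is a minimal sketch: `list_blocked_tasks` is a hypothetical name, and it assumes the header text matches the "INFO: task <process>:<pid> blocked for more than N seconds" form shown above. It does not capture the call trace, which still needs to be collected separately.

```shell
#!/bin/bash
# Hypothetical helper: list tasks reported as blocked in a log file.
# Usage: list_blocked_tasks LOGFILE
# Prints one "name pid seconds" line per matching warning header.
list_blocked_tasks() {
    local logfile=$1
    # The greedy first group keeps task names that contain ':' intact,
    # because the pid is taken from the last ":<digits> blocked" match.
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p' "$logfile"
}
```

For example, `list_blocked_tasks /var/log/messages | sort | uniq -c` gives a count of warnings per task.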

 

