Processes hanging and causing system to crash, due to NFS server not responding.

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux 5, 6, 7, 8, 9

Issue

  • NetApp NFS server not responding leads to process hangs in the client (RHEL) machine.

Resolution

  • The NFS server is a NetApp filer. The applications on the client machine went to D state, since the NFS server was not responding properly in time.

  • Check with the NFS server vendor (NetApp) to find out why the server didn't respond properly.

Root Cause

  • The client dumped a vmcore, since 'hung_task_panic' was set. Dicing up the core revealed the following:
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.18-308.el5/vmlinux
    DUMPFILE: /cores/retrace/tasks/595400495/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Tue Jan 21 05:13:02 2014
      UPTIME: 111 days, 01:12:13
LOAD AVERAGE: 17.15, 10.18, 7.34
       TASKS: 442
    NODENAME: <hostname>
     RELEASE: 2.6.18-308.el5
     VERSION: #1 SMP Fri Jan 27 17:17:51 EST 2012
     MACHINE: x86_64  (2333 Mhz)
      MEMORY: 15.8 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
  • From the log, it seems that the system panicked because a task had been blocked for at least 2 minutes.
INFO: task perl:31376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
perl          D ffff810021036b20     0 31376  21705         31683       (NOTLB)
 ffff8100562a5da8 0000000000000082 0000000000000000 ffffffff8892f9f8
 0000000000000050 0000000000000008 ffff81040ecdc0c0 ffff81042feae080
 00220573c8112272 0000000000000806 ffff81040ecdc2a8 00000006a7d3f978
Call Trace:
 [<ffffffff8892f9f8>] :sunrpc:xprt_reserve+0x14d/0x161
 [<ffffffff8006ecd9>] do_gettimeofday+0x40/0x90
 [<ffffffff88a1e40d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff800637de>] io_schedule+0x3f/0x67
 [<ffffffff88a1e416>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd
 [<ffffffff80063a0a>] __wait_on_bit+0x40/0x6e
 [<ffffffff88a1e40d>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
 [<ffffffff80063aa4>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a3472>] wake_bit_function+0x0/0x23
 [<ffffffff88a23139>] :nfs:nfs_sync_inode_wait+0xe0/0x2d4
 [<ffffffff88a18ecd>] :nfs:nfs_do_fsync+0x22/0x42
 [<ffffffff80023c7b>] filp_close+0x36/0x64
 [<ffffffff8001e103>] sys_close+0x88/0xbd
 [<ffffffff800614b5>] sysenter_do_call+0x1e/0x76

Kernel panic - not syncing: hung_task: blocked tasks
  • This particular task was flushing data to the NFS server and is waiting for all outstanding requests to complete.
crash> set 31376
crash> bt 31376
PID: 31376  TASK: ffff81040ecdc0c0  CPU: 6   COMMAND: "perl"
 #0 [ffff8100562a5cd8] schedule at ffffffff80062fa0
 #1 [ffff8100562a5db0] io_schedule at ffffffff800637de
 #2 [ffff8100562a5dd0] nfs_wait_bit_uninterruptible at ffffffff88a1e416 [nfs]
 #3 [ffff8100562a5de0] __wait_on_bit at ffffffff80063a0a
 #4 [ffff8100562a5e20] out_of_line_wait_on_bit at ffffffff80063aa4
 #5 [ffff8100562a5e90] nfs_sync_inode_wait at ffffffff88a23139 [nfs]
 #6 [ffff8100562a5f20] nfs_do_fsync at ffffffff88a18ecd [nfs]
 #7 [ffff8100562a5f40] filp_close at ffffffff80023c7b
 #8 [ffff8100562a5f60] sys_close at ffffffff8001e103
 #9 [ffff8100562a5f80] sysenter_do_call at ffffffff800614b5
    RIP: 00000000ffffe410  RSP: 00000000ffb78eac  RFLAGS: 00000296
    RAX: ffffffffffffffda  RBX: ffffffff800614b5  RCX: 0000000000000000
    RDX: 0000000000000013  RSI: 000000000b2064b0  RDI: 0000000000000000
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 00000000ffb78ed8
    R13: 0000000000000013  R14: ffffffff8001e103  R15: ffff810122816180
    ORIG_RAX: 0000000000000006  CS: 0023  SS: 002b
  • It seems that the NFS server is not responding properly. This can happen if the NFS server is loaded at the time of the problem. There doesn't seem to be any logs about losing connectivity with the NFS server.

  • It appears the cause of the panic was that the hung task watchdog detected that a task bad been blocked for more than 2 minutes and the hung_task_panic setting was enabled. The task that was reported as blocked was waiting for a response from the NFS server. There are no messages to indicate that the NFS client lost connection with the server so it suggests the server was too busy to respond to the client before the hung task watchdog kicked in.

Diagnostic Steps

  • Checked the vmcore for the blocked process, and where it was waiting.

  • Checked the client (RHEL) logs to understand if there is a network disconnect from the server.

Components
Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.