NFS remove operations taking a long time and tripping hung_task backtraces and/or hung_task panics

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 6, 7, 8, 9 (NFS client)
  • seen with NetApp filer (NFS server) and large file removes

Issue

  • experiencing hung_task panics or, more generally, slow NFS remove operations
  • removing an NFS file is slow with a NetApp NFS server

Resolution

  • Contact NFS server vendor for performance analysis and/or tuning.

  • Workarounds

    • Ensure any NFS client operation is tolerant of NFS REMOVE operations, which may take a very long time.
    • Throttle the application removing files so that a large burst of NFS REMOVE operations is not sent to the NFS server at the same time.
    • If hung_task_panic or hung_task checks are used, configure hung_task_timeout_secs to a value larger than the default of 120 seconds. Choose a value believed to be: a) greater than the worst-case time for a single NFS REMOVE operation, and b) greater than the sum of times for simultaneous removes within a single NFS directory (processes removing files at the same time in the same directory wait in sequence for the directory inode mutex, so they may trip the hung task panic).
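
    As a minimal sketch of the throttling workaround (the batch size, sleep interval, and target directory are all placeholder values, not recommendations):

    ```shell
    # Illustrative throttle: remove files in batches of 100 with a 1-second
    # pause between batches, instead of many parallel rm processes.
    # TARGET is a placeholder; point it at the NFS directory being cleaned up.
    TARGET=${TARGET:-/mnt/nfs/scratch}
    find "$TARGET" -maxdepth 1 -type f -print0 \
      | xargs -0 -r -n 100 sh -c 'rm -f -- "$@"; sleep 1' _
    ```

    Because the batches run sequentially from a single process, only one REMOVE is outstanding per batch member at a time, which also avoids the directory inode mutex convoy described above.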

Root Cause

  • File remove operations may require a significant amount of background work from an NFS server, so they may take an extended period of time.
  • This may be seen with multi-threaded cleanup jobs, which remove a lot of files in parallel in a short period of time.

Diagnostic Steps

  • Gather a tcpdump while NFS REMOVE operations are taking place. Process the capture with tshark or another tool that shows the latency of NFS operations, looking for either: a) a large number of NFS REMOVE operations, or b) NFS REMOVE operations with a large maximum or average latency (with tshark these are the "Max RTT" / "Max SRT" and "Avg RTT" / "Avg SRT" columns). For example, using How to use tshark to generate statistics about RPC calls on a Red Hat Enterprise Linux 5 system:
$ tshark -q -z rpc,rtt,100003,3 -r tcpdump.cap > /tmp/tcpdump-rpc-stats.txt
$ egrep '(^REMOVE|Procedure)' /tmp/tcpdump-rpc-stats.txt
Procedure        Calls   Min RTT   Max RTT   Avg RTT
REMOVE             197   3.00012   129.059   35.0018
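As a sketch, the stats file produced by the tshark command above can be filtered for slow REMOVE operations; the 1-second threshold and the file path are illustrative only:

```shell
# Flag NFS REMOVE rows whose average RTT exceeds 1 second
# (threshold is an arbitrary example; tune to your environment).
STATS=${STATS:-/tmp/tcpdump-rpc-stats.txt}
awk '$1 == "REMOVE" && $5 + 0 > 1.0 {
    print "slow REMOVE: avg RTT " $5 " s over " $2 " calls"
}' "$STATS"
```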
  • On the NFS client, configure hung_task_panic to capture a vmcore when any task is blocked for a significant amount of time. For more information on configuring hung_task_panic, see How do I use hung task check?.
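
The sysctl settings involved might look like the following (values are illustrative only; hung_task_panic assumes kdump is already configured so a vmcore is actually captured):

```shell
# Illustrative values only: panic on hung tasks and raise the timeout.
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.hung_task_timeout_secs=300

# Persist across reboots (conventional sysctl drop-in location).
cat > /etc/sysctl.d/90-hung-task.conf <<'EOF'
kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
EOF
```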

Detailed analysis (from vmcore)

  • Analyze the vmcore for evidence that the hung_task timer was tripped due to the underlying reason of a slow NFS REMOVE operation. Note that the task which trips the hung_task timeout may not be the one directly waiting on the NFS REMOVE operation; it may instead be "waiting in line" for a resource (such as a directory inode mutex) held by another process which is ultimately waiting for the NFS REMOVE to complete.
crash> sys | grep PANIC
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
crash> sys | grep RELEASE
     RELEASE: 2.6.32-358.12.1.el6.x86_64
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $4 = 300
crash> log | tail -40
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver02 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver04 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver01  returned a bad sequence-id error!
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver05 does not accept raw uid/gids. Reenabling the idmapper.
nfs: server nfsserver03 not responding, still trying
INFO: task pmserver:8188 blocked for more than 300 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pmserver      D 0000000000000015     0  8188   7181 0x00000080
 ffff880f1ca77d38 0000000000000082 0000000000000000 0000000000000001
 ffff880f1ca77cc8 ffffffffa039e4c2 ffff880b77e56026 ffff880f1ca77e28
 ffff880cda521058 ffff880f1ca77fd8 000000000000fb88 ffff880cda521058
Call Trace:
 [<ffffffffa039e4c2>] ? nfs_permission+0xb2/0x1e0 [nfs]
 [<ffffffff8150f2ae>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f14b>] mutex_lock+0x2b/0x50
 [<ffffffff81192b77>] do_filp_open+0x2d7/0xdc0
 [<ffffffff8113deb0>] ? sys_madvise+0x350/0x7b0
 [<ffffffff8119f622>] ? alloc_fd+0x92/0x160
 [<ffffffff8117dfb9>] do_sys_open+0x69/0x140
 [<ffffffff8100c535>] ? math_state_restore+0x45/0x60
 [<ffffffff8117e0d0>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 257, comm: khungtaskd Not tainted 2.6.32-358.12.1.el6.x86_64 #1
Call Trace:
 [<ffffffff8150d538>] ? panic+0xa7/0x16f
 [<ffffffff810e0ae7>] ? watchdog+0x217/0x220
 [<ffffffff810e08d0>] ? watchdog+0x0/0x220
 [<ffffffff81096936>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

crash> set 8188
    PID: 8188
COMMAND: "pmserver"
   TASK: ffff880cda520aa0  [THREAD_INFO: ffff880f1ca76000]
    CPU: 21
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 8188   TASK: ffff880cda520aa0  CPU: 21  COMMAND: "pmserver"
 #0 [ffff880f1ca77c78] schedule at ffffffff8150dc02
 #1 [ffff880f1ca77d40] __mutex_lock_slowpath at ffffffff8150f2ae
 #2 [ffff880f1ca77db0] mutex_lock at ffffffff8150f14b
 #3 [ffff880f1ca77dd0] do_filp_open at ffffffff81192b77
 #4 [ffff880f1ca77f20] do_sys_open at ffffffff8117dfb9
 #5 [ffff880f1ca77f70] sys_open at ffffffff8117e0d0
 #6 [ffff880f1ca77f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007fe3273e3d8d  RSP: 00007fe274df9c68  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b072  RCX: 0000000000000000
    RDX: 00000000000001b6  RSI: 0000000000000442  RDI: 00007fe258006130
    RBP: 0000000000000000   R8: 00007fe258000088   R9: 0000000000000001
    R10: 000000000000006f  R11: 0000000000000293  R12: ffffffff8117e0d0
    R13: ffff880f1ca77f78  R14: 00007fe274df9d30  R15: 00007fe274df9c90
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

Identify the mutex being waited on and its owner.

crash> dis -lr ffffffff8150f14b | tail
0xffffffff8150f136 <mutex_lock+22>:     mov    %rdi,%rbx
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/kernel/mutex.c: 94
0xffffffff8150f139 <mutex_lock+25>:     callq  0xffffffff8150e5e0 <_cond_resched>
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/kernel/mutex.c: 99
0xffffffff8150f13e <mutex_lock+30>:     mov    %rbx,%rdi  <---- mutex *
0xffffffff8150f141 <mutex_lock+33>:     lock decl (%rdi)
0xffffffff8150f144 <mutex_lock+36>:     jns    0xffffffff8150f14b <mutex_lock+43>
0xffffffff8150f146 <mutex_lock+38>:     callq  0xffffffff8150f170 <__mutex_lock_slowpath>
/usr/src/debug/kernel-2.6.32-358.12.1.el6/linux-2.6.32-358.12.1.el6.x86_64/arch/x86/include/asm/thread_info.h: 216
0xffffffff8150f14b <mutex_lock+43>:     mov    %gs:0xcbc8,%rax

kernel/mutex.c
92 void __sched mutex_lock(struct mutex *lock)
93 {
94 	might_sleep();
95 	/*
96 	 * The locking fastpath is the 1->0 transition from
97 	 * 'unlocked' into 'locked' state.
98 	 */
99-->	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
100 	mutex_set_owner(lock);
101 }
crash> struct -o mutex | grep count
   [0] atomic_t count;

crash> dis __mutex_lock_slowpath | head 
0xffffffff8150f170 <__mutex_lock_slowpath>:     push   %rbp
0xffffffff8150f171 <__mutex_lock_slowpath+1>:   mov    %rsp,%rbp
0xffffffff8150f174 <__mutex_lock_slowpath+4>:   push   %r15
0xffffffff8150f176 <__mutex_lock_slowpath+6>:   push   %r14
0xffffffff8150f178 <__mutex_lock_slowpath+8>:   push   %r13
0xffffffff8150f17a <__mutex_lock_slowpath+10>:  push   %r12
0xffffffff8150f17c <__mutex_lock_slowpath+12>:  push   %rbx  <------ mutex * 6th push after RA
crash> bt | grep mutex_lock
 #1 [ffff880f1ca77d40] __mutex_lock_slowpath at ffffffff8150f2ae
 #2 [ffff880f1ca77db0] mutex_lock at ffffffff8150f14b
crash> px (0xffff880f1ca77db0 - 6*8)
$2 = 0xffff880f1ca77d80
crash> rd 0xffff880f1ca77d80
ffff880f1ca77d80:  ffff88000f4949d8                    .II.....
crash> mutex.owner ffff88000f4949d8
  owner = 0xffff88099c3de000
crash> thread_info.task 0xffff88099c3de000
  task = 0xffff88091c238040
crash> bt 0xffff88091c238040
PID: 7217   TASK: ffff88091c238040  CPU: 9   COMMAND: "pmserver"
 #0 [ffff88099c3dfa68] schedule at ffffffff8150dc02
 #1 [ffff88099c3dfb30] rpc_wait_bit_killable at ffffffffa031d4e2 [sunrpc]
 #2 [ffff88099c3dfb40] __wait_on_bit at ffffffff8150ed9f
 #3 [ffff88099c3dfb90] out_of_line_wait_on_bit at ffffffff8150ee48
 #4 [ffff88099c3dfc00] __rpc_execute at ffffffffa031da25 [sunrpc]
 #5 [ffff88099c3dfc70] rpc_execute at ffffffffa031dce1 [sunrpc]
 #6 [ffff88099c3dfca0] rpc_run_task at ffffffffa03143a5 [sunrpc]
 #7 [ffff88099c3dfcc0] rpc_call_sync at ffffffffa03144c2 [sunrpc]
 #8 [ffff88099c3dfd20] nfs3_rpc_wrapper.clone.0 at ffffffffa03b5f7d [nfs]
 #9 [ffff88099c3dfd60] nfs3_proc_remove at ffffffffa03b65ab [nfs]
#10 [ffff88099c3dfe20] nfs_unlink at ffffffffa039ea98 [nfs]
#11 [ffff88099c3dfe70] vfs_unlink at ffffffff8118fb5f
#12 [ffff88099c3dfea0] do_unlinkat at ffffffff81191ee5
#13 [ffff88099c3dff70] sys_unlink at ffffffff81191fb6
#14 [ffff88099c3dff80] system_call_fastpath at ffffffff8100b072
    RIP: 00007fe31e4cc7d7  RSP: 00007fe2997fa8d0  RFLAGS: 00000246
    RAX: 0000000000000057  RBX: ffffffff8100b072  RCX: 00007fe31cf2b428
    RDX: 000000000000006c  RSI: 00007fe27883a870  RDI: 00007fe27883a870
    RBP: 00007fe2781f7fb0   R8: 00007fe278000088   R9: 00007fe2997faa01
    R10: 000000000000006d  R11: 0000000000000202  R12: ffffffff81191fb6
    R13: ffff88099c3dff78  R14: 00007fe290cd0ac0  R15: 0000000000000001
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

Get the rpc_task pointer from the owner's stack.
crash> dis -r ffffffffa03193a5 | tail -4
0xffffffffa0319398 <rpc_run_task+104>:  lock incl (%r12)
0xffffffffa031939d <rpc_run_task+109>:  mov    %r12,%rdi  <--- rpc_task *
0xffffffffa03193a0 <rpc_run_task+112>:  callq  0xffffffffa0322c80 <rpc_execute>
0xffffffffa03193a5 <rpc_run_task+117>:  pop    %rbx
crash> dis rpc_execute | head 
0xffffffffa0322c80 <rpc_execute>:       push   %rbp
0xffffffffa0322c81 <rpc_execute+1>:     mov    %rsp,%rbp  <--- rbp from %rsp
0xffffffffa0322c84 <rpc_execute+4>:     sub    $0x20,%rsp
0xffffffffa0322c88 <rpc_execute+8>:     mov    %rbx,-0x18(%rbp)
0xffffffffa0322c8c <rpc_execute+12>:    mov    %r12,-0x10(%rbp)  <---- rpc_task *

crash> set 7217
    PID: 7217
COMMAND: "pmserver"
   TASK: ffff88091c238040  [THREAD_INFO: ffff88099c3de000]
    CPU: 9
  STATE: TASK_UNINTERRUPTIBLE|TASK_WAKEKILL 
crash> bt | grep rpc_run_task
 #6 [ffff88099c3dfca0] rpc_run_task at ffffffffa03143a5 [sunrpc]
crash> px (0xffff88099c3dfca0 - 1*8 - 0x10)
$3 = 0xffff88099c3dfc88
crash> rd 0xffff88099c3dfc88
ffff88099c3dfc88:  ffff88100bc6de80                    ........
crash> source ~/crash/nfs/rhel6-rpc-show-tasks
crash> rhel-rpc-task-summary 0xffff88100bc6de80
  (struct rpc_task *) 0xffff88100bc6de80
  rpc_task.tk_rqstp = (struct rpc_rqst *)0xffff881011b76e00 rpc_rqst.rq_xprt = (struct rpc_xprt *)0xffff880810874800
  rpc_rqst: rq_xid = 0xb4c69fa9
  rpc_rqst: rq_retries = 1, rq_ntrans = 6, rq_bytes_sent = 0, (last_jiffies_update - rq_xtime)/10^9 = 13 seconds
    (last_jiffies_update - tk_start)/10^9 = 433 seconds, tk_start = 1201721962889157, tk_timeouts = 2
    tk_owner = 7182, tk_flags = 0x8a0, tk_status = 0, tk_runstate = 6, tk_timeout = 120000
    tk_msg.rpc_proc->pname = REMOVE 
      (struct nfs_removeargs *)0xffff88099c3dfdb8 (struct nfs_removeres *)0xffff88099c3dfd68
    tk_action = 0xffffffffa0313410, &u.tk_work = 0xffff88100bc6df08, tk_workqueue = (nil)
    tk_priority = 1, tk_waitqueue = 0xffff880810874a58  tk_waitqueue.name = xprt_pending
    rpc_task.tk_client = 0xffff880811fad800 rpc_task.tk_client->cl_server = nfsserver03

The rq_ntrans value of 6 indicates the request has been transmitted 6 times;
the last transmission was 13 seconds ago.  The total time this NFS REMOVE has
been outstanding is 433 seconds, which exceeds sysctl_hung_task_timeout_secs
(300 seconds).  The tk_timeouts value of 2 indicates the task has encountered
2 major timeouts, hence the "not responding, still trying" messages.
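
On a live client (without a vmcore), similar per-operation transmission counts can be read from /proc/self/mountstats. This sketch assumes the standard per-op line format (op name, op count, transmission count, ...), so a transmission count above the op count indicates retransmits:

```shell
# Report REMOVE ops vs. transmissions from NFS mountstats; a positive
# retrans value means REMOVE requests were retransmitted.
MOUNTSTATS=${MOUNTSTATS:-/proc/self/mountstats}
awk '$1 == "REMOVE:" {
    print "REMOVE ops=" $2 " trans=" $3 " retrans=" ($3 - $2)
}' "$MOUNTSTATS"
```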

crash> log | grep nfsserver03
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
NFS: v4 server nfsserver03 does not accept raw uid/gids. Reenabling the idmapper.
nfs: server nfsserver03 not responding, still trying

crash> source ~/crash/nfs/rhel6-rpc-xprt-summary
crash> rhel6-rpc-xprt-summary 0xffff880810874800

RPC Transport State Summary for (struct rpc_xprt *)0xffff880810874800
=====================================================================
rpc_xprt.state = 0x12 (For details, use 'rhel-rpc-xprt-states')
- OK: rpc_xprt.state bit 2 set (XPRT_CONNECTED)
rpc_xprt.prot = 6 (TCP)
- sock.__sk_common.skc_state = 1 (See 'rhel-tcp-states' for TCP state list)
- OK: skc_state == 1 (TCP_ESTABLISHED)

RPC Transport other structures (struct rpc_xprt *)0xffff880810874800
=====================================================================
xprt.inet = (struct sock *)0xffff880812af4740
xprt.sock = (struct socket *)0xffff8808111a26c0
xprt.sock = (struct tcp_sock *)0xffff880812af4740
rpc_xprt.xprt_net = (struct net *)0xffffffff81b16580
Slot table free list: (struct list_head *) 0xffff880810874be8
Slot table entries: tcp = 2, udp = 16
snd_task = (struct rpc_task *)(nil)
RPC wait queues
- binding (struct rpc_wait_queue *) 0xffff8808108748c8: qlen = 0, owner = 0
- sending (struct rpc_wait_queue *) 0xffff880810874990: qlen = 0, owner = 0
- backlog (struct rpc_wait_queue *) 0xffff880810874b20: qlen = 0, owner = 0
- pending (struct rpc_wait_queue *) 0xffff880810874a58: qlen = 1, owner = 0
non-xprt queues
- delay_queue (struct rpc_wait_queue *)0xffffffffa033db00 qlen = 0

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.