Upgrading to RHEL 6.6, 7.0, or 7.1 may result in an application using futexes appearing to stall in futex_wait()
Environment
- Red Hat Enterprise Linux (RHEL) 6.6, 7.0, and 7.1
- Application uses the FUTEX syscall with private userspace futex locking
  - Does not affect shared or inode futex locking
- Other affected packages may include:
  - IBM JDK 6 32-bit
  - Oracle JDK 7 64-bit
Issue
- Soft lockup with pthreads mutexes on Haswell CPUs and PowerPC CPUs (but may not be limited to just these)
- Upgrading to Red Hat Enterprise Linux 6.6 (specifically kernel 2.6.32-504 up to and including 2.6.32-504.12.2) may result in an application hang.
- Cannot get a thread dump using kill -3; running kill -3 gets no response.
- Inspecting /proc/ shows all threads are stuck waiting on a futex. For example (where nnnn is the PID of the parent task in question and mmmm is the PID of the thread task):
Note: under the parent /proc/nnnn/task structure you will see the PIDs of all of the threads, including the parent's number repeated.
cat /proc/nnnn/task/mmmm/stack
[<ffffffff810b226a>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810b33a0>] futex_wait+0x1c0/0x310
[<ffffffff810b4c91>] do_futex+0x121/0xae0
[<ffffffff810b56cb>] sys_futex+0x7b/0x170
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
- Attaching gdb or strace causes the application to wake up and continue processing.
Resolution
Red Hat Enterprise Linux 6
This issue was originally tracked in a private Red Hat bugzilla and has subsequently been addressed. To fix this issue, update to the following kernel version (or later) within 6.6: kernel-2.6.32-504.16.2.el6, released with RHSA-2015-0864. RHEL 6.7 GA and later already include the fix for this issue.
Red Hat Enterprise Linux 7
This issue has been fixed in a RHEL7.1.z errata. Update the kernel to 3.10.0-229.7.2.el7 (released with RHSA-2015-1137) or later.
Root Cause
For a private futex, none of the cases in the switch statement of get_futex_key_refs() is hit, and the function completes without the memory barrier required before checking the waiters in futex_wake() -> hb_waiters_pending(). The consequence is a race with a thread waiting on a futex on another CPU, allowing the waker thread to read waiters == 0 while the waiter thread reads futex_val == locked.
Diagnostic Steps
Analysis of this problem ideally requires a vmcore.
With a VMcore
To determine conclusively that the bugzilla fix will resolve this, you need to inspect the futex_q structures as well as their attached futex_hash_bucket. That is because waiting on a futex (syscall FUTEX with the FUTEX_WAIT_PRIVATE op, which resolves to FUTEX_WAIT and FUTEX_PRIVATE_FLAG) is not in itself an illegal or invalid condition.
Specifically, you need to review the futex_hash_bucket of a process that has been deemed to be stalled for an extended period of time (as determined by the customer). That structure holds a count of waiters. To find it you must first locate the futex_q associated with the stalled task/PID. Here is an example.
The address of the futex_q is held in the kernel stack for the task. To locate it, look in the futex_wait_queue_me() stack frame for the address. Alternatively, the futex_q itself is located in the futex_wait() stack frame. The actual offsets depend on the hardware platform; examples from Intel x86_64 and PPC64 are included below. Use 'bt -FFls' to obtain the stack listing, as this makes it easier to spot the pointers and the queue. For example, the futex_q always contains the task_struct address of the owning task; looking for this makes it easier to spot in the stack.
Intel x86_64/AMD64
#1 [ffff882cebca7c00] futex_wait_queue_me+0xba at ffffffff810b226a
/usr/src/debug/kernel-2.6.32-504.8.1.el6/linux-2.6.32-504.8.1.el6.x86_64/arch/x86/include/asm/current.h: 14
ffff882cebca7c08: [ffff882cebca7d08:TCP] [ffff882cebca7c88:TCP]
^^^^^^^^^^^^^^^^
ffff882cebca7c18: 0000000000000000 [ffff882315faf540:task_struct]
ffff882cebca7c28: [ffff882cebca7d00:TCP] 00000000023efb54
ffff882cebca7c38: [ffff882cebca7da8:TCP] futex_wait+0x1c0
#2 [ffff882cebca7c40] futex_wait+0x1c0 at ffffffff810b33a0
/usr/src/debug/kernel-2.6.32-504.8.1.el6/linux-2.6.32-504.8.1.el6.x86_64/kernel/futex.c: 1716
ffff882cebca7c48: [ffff883fb3abf540:task_struct] ffffc90028ee9204
ffff882cebca7c58: 00000000ebca7c98 0000000000000000
ffff882cebca7c68: ffffffffebca7c78 [ffff882cebca7cc0:TCP]
ffff882cebca7c78: 0000021b20e95928 0000000000000000
futex_q
ffff882cebca7c88: 0000000000000064 ffffc90028ef0fc8
ffff882cebca7c98: ffffc90028ef0fc8 ffffc90028ef0fd8
ffff882cebca7ca8: ffffc90028ef0fd8 [ffff882315faf540:task_struct]
ffff882cebca7cb8: ffffc90028ef0fc4 00000000023ef000
ffff882cebca7cc8: [ffff88204c9ad8c0:mm_struct] 0000000000000b54
ffff882cebca7cd8: 0000000000000000 0000000000000000
ffff882cebca7ce8: 0000000000000000 ffff8820ffffffff
ffff882cebca7cf8: [ffff882cebca7d08:TCP] ffffc90028ef0fc0
ffff882cebca7d08: ffff883f71b93d09 00000000ffffffff
ffff882cebca7d18: [ffff882cebca7d58:TCP] 0000000000000001
ffff882cebca7d28: 00000000ffffffea 0000000000000000
ffff882cebca7d38: [ffff882cebca7da8:TCP] futex_wake+0x93
ffff882cebca7d48: 00000000000002fb hrtimer_start_range_ns+0x14
ffff882cebca7d58: 00007f47e482f000 [ffff88204c9ad8c0:mm_struct]
ffff882cebca7d68: 0000000000000828 000000005c13213f
ffff882cebca7d78: [ffff882cebca7d88:TCP] 0000000000000000
ffff882cebca7d88: 000000000000021b 00000000023efb00
ffff882cebca7d98: 00000000023efb54 0000000000000000
ffff882cebca7da8: [ffff882cebca7ee8:TCP] do_futex+0x121
PPC64
#1 [c000000e75ad3950] .futex_wait_queue_me+0xf0 at c0000000000d8370
c000000e75ad3950: [c000000e75ad39f0:thread_info] 00000000a4160b08
c000000e75ad3960: .futex_wait_queue_me+0xf0 [c000000e75ad3a68:thread_info]
c000000e75ad3970: [c000000e75ad39f0:thread_info] [c000000feeb0e180:mm_struct]
c000000e75ad3980: 0001d935000c71a0 [c000000d0f484f60:task_struct]
c000000e75ad3990: 0000000000000000 [c000000e75ad3b08:thread_info]
c000000e75ad39a0: 0000000000000000 [c000000e75ad3a60:thread_info]
c000000e75ad39b0: [c000000e75ad3a68:thread_info] [c000000e75ad3ad0:thread_info]
^^^^^^^^^^^^^^^^
c000000e75ad39c0: 00000000ffffffff [c000000e75ad3dc0:thread_info]
c000000e75ad39d0: 0000000000000000 00000000a4160b08
c000000e75ad39e0: mv88e6131_switch_driver+0xe8b8 000000000001d935
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.ppc64/kernel/futex.c: 1984
#2 [c000000e75ad39f0] .futex_wait+0x1a8 at c0000000000d8618
c000000e75ad39f0: [c000000e75ad3bd0:thread_info] 2428808200000000
c000000e75ad3a00: .futex_wait+0x1a8 init_thread_union+0x118
c000000e75ad3a10: c000000089704b10 0000000000000000
c000000e75ad3a20: c000000089704b10 [c000000e75ad3b40:thread_info]
c000000e75ad3a30: 7fffffffffffffff [c000000e75ad3c60:thread_info]
c000000e75ad3a40: [c000000e75ad3b00:thread_info] 2480402800000003
c000000e75ad3a50: .lock_hrtimer_base+0x34 0000000000000008
c000000e75ad3a60: d000000006eda700 [c000000ec4c6fa68:thread_info]
c000000e75ad3a70: c000000089704df0 [c000000bba6e7c60:thread_info]
c000000e75ad3a80: 00097618e1b7b906 00097618e1b6f5b6
c000000e75ad3a90: hrtimer_wakeup c000000089704b10
c000000e75ad3aa0: 0000000000000001 0000b31a00000000
c000000e75ad3ab0: .futex_wait_queue_me+0xc0 6a61766100000000
c000000e75ad3ac0: 0000000000000000 [c000000d0f484f60:task_struct]
futex_q
c000000e75ad3ad0: 0000006400000001 [c000000e75ad3ad8:thread_info]
c000000e75ad3ae0: [c000000e75ad3ad8:thread_info] d000000006eda718
c000000e75ad3af0: [c000000d34e6fae8:thread_info] [c000000d0f484f60:task_struct]
c000000e75ad3b00: d000000006eda704 00000000a4160000
c000000e75ad3b10: [c000000feeb0e180:mm_struct] 00000b0800000000
c000000e75ad3b20: 0000000000000000 0000000000000000
c000000e75ad3b30: 0000000000000000 ffffffff00000100
c000000e75ad3b40: 00097618e1b7b906 000976087a15ead2
c000000e75ad3b50: 0000000000000000 0000000000000000
c000000e75ad3b60: 0000000092c7e200 0000000000000000
c000000e75ad3b70: 000000000000ec9a 00000000a3c5abe0
c000000e75ad3b80: 0000000000000000 00000000a4160b08
c000000e75ad3b90: 000000000ffa317c 0000000000000080
c000000e75ad3ba0: 0000000000000000 [c000000e75ad3de0:thread_info]
c000000e75ad3bb0: 0000000000000000 000000003b9aca00
c000000e75ad3bc0: mv88e6131_switch_driver+0xe8b8 00000000a4160b08
crash> futex_q ffff882cebca7c88
struct futex_q {
list = {
prio = 0x64,
plist = {
prio_list = {
next = 0xffffc90028ef0fc8,
prev = 0xffffc90028ef0fc8
},
node_list = {
next = 0xffffc90028ef0fd8,
prev = 0xffffc90028ef0fd8
}
}
},
task = 0xffff882315faf540,
lock_ptr = 0xffffc90028ef0fc4,
key = {
shared = {
pgoff = 0x23ef000,
inode = 0xffff88204c9ad8c0,
offset = 0xb54
},
private = {
address = 0x23ef000,
mm = 0xffff88204c9ad8c0,
offset = 0xb54 <<<<<<<<<<<<<<<
},
both = {
word = 0x23ef000,
ptr = 0xffff88204c9ad8c0,
offset = 0xb54
}
},
pi_state = 0x0,
rt_waiter = 0x0,
requeue_pi_key = 0x0,
bitset = 0xffffffff
}
The offset in futex_q.key.private.offset is critical. The low order 2 bits identify the "type" of key. These bits can be 00, 01 or 10. The problem only concerns type 00. The types are as follows:
00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
(no reference on an inode or mm)
01 : Shared futex (PTHREAD_PROCESS_SHARED)
mapped on a file (reference on the underlying inode)
10 : Shared futex (PTHREAD_PROCESS_SHARED)
(but private mapping on an mm, and reference taken on it)
The address of the futex_hash_bucket is determined by subtracting 0x4 from the lock_ptr address:
crash> futex_hash_bucket ffffc90028ef0fc0
struct futex_hash_bucket {
waiters = {
counter = 0x1 <<<<<<<<<<<<<<<<<
},
lock = {
raw_lock = {
slock = 0x48e948e9
}
},
chain = {
prio_list = {
next = 0xffff882cebca7c90,
prev = 0xffff882cebca7c90
},
node_list = {
next = 0xffff882cebca7ca0,
prev = 0xffff882cebca7ca0
}
}
}
A waiters count of 1 or 2 makes it more likely that this condition is attributable to the known bugzilla problem; it is less likely if the futex_hash_bucket->waiters count is well in excess of 2. At least one of the hashed entries must be PTHREAD_PROCESS_PRIVATE. Most likely you will see a waiters count of 1.
Reviewing a running system
It is very difficult to accurately diagnose this condition in a running environment. futex_wait() is a normal condition, and accessing the kernel futex_q and futex_hash_bucket structs on a running system to verify the problem is very complex. You have to first identify a process that you believe is stalled. Running strace on the stalled process should show the futex_wait(), but this may also bring it out of the stalled condition: it has been observed that attaching strace, or any other debugger that attaches to the thread, frees it from its locked state. This recovery cannot be taken as a guarantee that you have hit the identified bugzilla issue, but it is a strong indication.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.