qemu-kvm guests panic due to hung task time-out caused by a missing memory barrier in QEMU's AIO code
Environment
- Red Hat Enterprise Linux.
- Red Hat OpenStack Platform.
- Red Hat Enterprise Virtualization.
qemu-kvm-rhev-2.1.2-23.el7_1.9.x86_64
Issue
- qemu-kvm guests panic due to hung task time-out.
- Guests do not restart after the panic.
- virsh dump initiated on the hypervisor after the panic does not complete.
Resolution
This issue is resolved in the RHEL 7.2 qemu-kvm-rhev package, which rebased QEMU from 2.1.2 to 2.3.0 as part of that minor release. The memory barriers are fixed in QEMU 2.3.0. With the fix in place, QEMU correctly notes the completion of all I/O requests.
Root Cause
The virtio block driver receives an I/O request from a higher layer in the guest kernel's I/O stack. The driver passes this request to the virtio block emulation in qemu-kvm via a command ring buffer. The emulator code picks up the request from the command ring and hands off processing of the request to an 'AIO worker thread' (the qemu-kvm process maintains a pool of such worker threads). The worker thread eventually executes a system call such as pread(), pwrite() or fdatasync() to perform I/O on the virtual disk image. After return from the system call, the worker thread sets the processing state of the request to THREAD_DONE and notifies the virtio block emulation in qemu-kvm about the completion of the request.
Symptoms seen in guest memory dumps and in core dumps of qemu-kvm processes have led to the conclusion that this notification is probably lost during a race between the worker thread and the main thread of qemu-kvm, so the request remains in a semi-complete state. The consequence is that other I/O requests for the same virtual disk can be blocked by this request, in particular if the virtual disk image format is qcow2 (which may require reading and writing image metadata in order to process an I/O request). From the guest kernel's point of view, I/O to the device appears stalled. The processes in the guest kernel that are waiting on I/O requests to this virtual disk are eventually detected and reported by khungtaskd.
The suspected root cause of the lost completion notification is an issue with memory barriers in QEMU's AIO code, as outlined in the change log of upstream commit e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36.
Diagnostic Steps
Use the gcore command to capture a core dump of the qemu-kvm process that contains the hung/crashed guest. Use the gdb debugger to examine the core dump as shown in the following example.
- A typical symptom of this issue is that the request list of the AIO worker thread pool contains only one element, and the processing state of that element is THREAD_DONE. Usually this element represents a 'flush' type request; however, the request type can vary depending on the I/O activity in the guest prior to the hang/crash.
(gdb) print qemu_aio_context.thread_pool.head.lh_first
$1 = (struct ThreadPoolElement *) 0x7f78ecc4e000 ----.
v
(gdb) set $req = (struct ThreadPoolElement *)0x7f78ecc4e000
(gdb) while $req
>print $req
>print *(struct ThreadPoolElement *)$req
>set $req = ((struct ThreadPoolElement *)$req)->all.le_next
>end
$2 = (struct ThreadPoolElement *) 0x7f78ecc4e000 <---------- note: there is only one element in the list
$3 = {
common = {
aiocb_info = 0x7f78ebd0fa60 <thread_pool_aiocb_info>,
bs = 0x0,
cb = 0x7f78eb8e3a10 <bdrv_co_io_em_complete>,
opaque = 0x7f78eb605eb0 -----------------.
}, |
pool = 0x7f78ec9f5bf0, |
func = 0x7f78eb91f270 <aio_worker>, |
arg = 0x7f78eccd2600, |
state = THREAD_DONE, <---------------------)-------------- note: the processing state is THREAD_DONE
ret = 0x0, |
reqs = { |
tqe_next = 0x0, |
tqe_prev = 0x7f78ec9f5cc0 |
}, |
all = { |
le_next = 0x0, |
le_prev = 0x7f78ec9f5cb8 |
} |
} |
v
(gdb) print *(CoroutineIOCompletion *)0x7f78eb605eb0
$4 = {
coroutine = 0x7f78ec9f59b0, ---.
ret = 0x0 |
} |
v
(gdb) print *(Coroutine *)0x7f78ec9f59b0
$5 = {
entry = 0x7f78eb8e8cb0 <bdrv_aio_flush_co_entry>,
entry_arg = 0x7f78f291bf10, ^^^^^-------------------- note: this example shows a 'flush' request,
caller = 0x0, but the request type could be different
pool_next = {
sle_next = 0x7f78eccb0760
},
co_queue_wakeup = {
tqh_first = 0x0,
tqh_last = 0x7f78ec9f59d0
},
co_queue_next = {
tqe_next = 0x0,
tqe_prev = 0x7f78eccb08a0
}
}
- Another typical symptom of this issue is that the main thread of the qemu-kvm process is stuck in the bdrv_drain_all() function. This symptom is usually present if an attempt was previously made to capture a memory dump of the guest using the virsh dump command. It can also be present if the guest's kdump kernel failed to boot after the guest panicked due to hung task time-out. Use the thread apply all bt command in gdb to display the stack traces of all threads and look for snippets similar to this example.
(gdb) thread apply all bt
[...]
#0 0x00007f78e40b5c3f in __GI_ppoll (fds=0x7f78ec9f61e0, nfds=0x1, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
at ../sysdeps/unix/sysv/linux/ppoll.c:56
#1 0x00007f78eb8edf6b in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu-timer.c:314
#3 0x00007f78eb8eecc0 in aio_poll (ctx=ctx@entry=0x7f78ec9cb000, blocking=blocking@entry=0x1) at aio-posix.c:250
#4 0x00007f78eb8e53bf in bdrv_drain_all () at block.c:1925
...
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4606
[...]
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.