GFS2 unmount blocks in dlm_release_lockspace when issuing a reboot of RHEL 6 cluster node

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 6 Update 4 with the Resilient Storage Add-On
    • kernel-2.6.32-358.el6
  • GFS2

Issue

  • Occasionally, when one node is shut down, rgmanager does not exit, and hung-task messages appear in /var/log/messages for an unmount command waiting in dlm_release_lockspace.
  • gfs2 filesystems do not unmount.
  • A hard power-off of one node leaves the other node unable to access its gfs2 filesystems.
kernel: INFO: task umount:59023 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: umount        D 0000000000000006     0 59023  57463 0x00000000
kernel: ffff882026f6dce8 0000000000000082 0000000000000000 0000000000000000
kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
kernel: ffff881291ead058 ffff882026f6dfd8 000000000000fb88 ffff881291ead058
kernel: Call Trace:
kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
kernel: [<ffffffffa047bea8>] dlm_release_lockspace+0x48/0x480 [dlm]
kernel: [<ffffffffa062b757>] gdlm_unmount+0x27/0x40 [gfs2]
kernel: [<ffffffffa0619821>] gfs2_lm_unmount+0x21/0x30 [gfs2]
kernel: [<ffffffffa0628284>] gfs2_put_super+0x184/0x220 [gfs2]
kernel: [<ffffffff8118326b>] generic_shutdown_super+0x5b/0xe0
kernel: [<ffffffff81183321>] kill_block_super+0x31/0x50
kernel: [<ffffffffa06198a3>] gfs2_kill_sb+0x73/0x80 [gfs2]
kernel: [<ffffffff81183af7>] deactivate_super+0x57/0x80
kernel: [<ffffffff811a1b6f>] mntput_no_expire+0xbf/0x110
kernel: [<ffffffff811a25db>] sys_umount+0x7b/0x3a0
kernel: [<ffffffff81086391>] ? sigprocmask+0x71/0x110
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Resolution

  • It appears that something went wrong in dlm_controld, which caused rgmanager to get stuck trying to release its lockspace. Because lockspace release is serialized, the first gfs2 umount got stuck behind rgmanager, and a second gfs2 umount then got stuck behind rgmanager and the first umount.
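A quick way to see whether the rgmanager lockspace is still registered with the DLM is to list the lockspaces. This is a hedged sketch, assuming the dlm/cman packages are installed on the cluster node; a lingering rgmanager entry after rgmanager has exited is consistent with the stuck dlm_release_lockspace call above:

```shell
# List the DLM lockspaces known to this node; save a copy for the case.
# (Guarded so the sketch also runs on hosts without dlm_tool.)
if command -v dlm_tool >/dev/null 2>&1; then
    dlm_tool ls > dlm-lockspaces.$(hostname).txt
    grep -q rgmanager dlm-lockspaces.$(hostname).txt \
        && echo "rgmanager lockspace still present"
else
    echo "dlm_tool not installed; skipping lockspace check"
fi
```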

Diagnostic Steps

  • When the issue occurs, capture the lock state of the rgmanager lockspace:
# dlm_tool lockdebug -v -s -w rgmanager > dlm_tool-lockdebug-rgmanager.$(hostname)
  • When the issue occurs, capture a sosreport. Make sure the sosreport is captured before the cluster node is rebooted and that sysrq -t is run to dump the thread state of all processes to /var/log/messages. In addition, disable the filesys and devicemapper sosreport plugins, or they could cause sosreport to hang:
# sosreport -n devicemapper,filesys
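The sysrq -t dump mentioned above can be triggered from the command line via /proc. A minimal sketch, assuming you are root on the node (the guard makes it degrade gracefully elsewhere):

```shell
# Dump every task's state and kernel backtrace to the kernel log,
# which syslog then writes to /var/log/messages.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions if restricted
    echo t > /proc/sysrq-trigger      # "t": dump task states and stacks
else
    echo "need root to write /proc/sysrq-trigger"
fi
```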

NOTE: If sysrq -t cannot be run, then capture /proc/<pid> for any cluster-related process so that the backtrace can be reviewed.
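The fallback above can be scripted roughly as follows. This is a sketch, not the official collection procedure: the process names in the list are assumptions for a typical RHEL 6 cluster node, and reading /proc/<pid>/stack requires root.

```shell
# Save the kernel stack of each cluster-related process to one file.
out=proc-stacks.$(hostname).txt
: > "$out"   # create/truncate the output file
for name in dlm_controld rgmanager corosync fenced umount; do
    for pid in $(pgrep -x "$name"); do
        echo "=== $name pid $pid ===" >> "$out"
        cat /proc/$pid/stack >> "$out" 2>/dev/null \
            || echo "(unreadable; run as root)" >> "$out"
    done
done
```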


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.