A pacemaker gfs2 filesystem resource failed to stop and umount appears to fail
Environment
- Red Hat Enterprise Linux Server 7 (with the High Availability and Resilient Storage Add-Ons)
- A Global File System 2 (gfs2) filesystem
Issue
A pacemaker gfs2 filesystem resource failed to stop and the umount appears to fail. The umount process hangs for a long period of time after the Filesystem resource stop operation times out.
May 2 12:44:07 node42 Filesystem(data)[22988]: INFO: Running stop for /dev/clustervg/data on /data
May 2 12:44:07 node42 Filesystem(data)[22988]: INFO: Trying to unmount /data
May 2 12:45:07 node42 lrmd[5311]: warning: data_stop_0 process (PID 22988) timed out
May 2 12:45:07 node42 lrmd[5311]: warning: data_stop_0:22988 - timed out after 60000ms
May 2 12:45:07 node42 crmd[5314]: error: Result of stop operation for data on node42: Timed Out
[....]
May 2 12:53:22 node42 kernel: INFO: task umount:23282 blocked for more than 120 seconds.
May 2 12:53:22 node42 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 2 12:53:22 node42 kernel: umount D ffff8c00ffd1ac80 0 23282 1 0x10000086
May 2 12:53:22 node42 kernel: Call Trace:
May 2 12:53:22 node42 kernel: [<ffffffffacb80ddd>] ? wait_for_completion+0x11d/0x140
May 2 12:53:22 node42 kernel: [<ffffffffacb80a09>] schedule+0x29/0x70
May 2 12:53:22 node42 kernel: [<ffffffffacb7e458>] schedule_timeout+0x168/0x2d0
May 2 12:53:22 node42 kernel: [<ffffffffac4adce0>] ? __internal_add_timer+0x130/0x130
May 2 12:53:22 node42 kernel: [<ffffffffac4c6dd6>] ? prepare_to_wait+0x56/0x90
May 2 12:53:22 node42 kernel: [<ffffffffc0d0d408>] gfs2_gl_hash_clear+0xa8/0x120 [gfs2]
May 2 12:53:22 node42 kernel: [<ffffffffac4c72e0>] ? wake_up_atomic_t+0x30/0x30
May 2 12:53:22 node42 kernel: [<ffffffffc0d2b792>] gfs2_put_super+0x132/0x1c0 [gfs2]
May 2 12:53:22 node42 kernel: [<ffffffffac64d37d>] generic_shutdown_super+0x6d/0x100
May 2 12:53:22 node42 kernel: [<ffffffffac64d7f7>] kill_block_super+0x27/0x70
May 2 12:53:22 node42 kernel: [<ffffffffc0d193a2>] gfs2_kill_sb+0x72/0x80 [gfs2]
May 2 12:53:22 node42 kernel: [<ffffffffac64db5e>] deactivate_locked_super+0x4e/0x70
May 2 12:53:22 node42 kernel: [<ffffffffac64e2e6>] deactivate_super+0x46/0x60
May 2 12:53:22 node42 kernel: [<ffffffffac66ce5f>] cleanup_mnt+0x3f/0x80
May 2 12:53:22 node42 kernel: [<ffffffffac66cef2>] __cleanup_mnt+0x12/0x20
May 2 12:53:22 node42 kernel: [<ffffffffac4c2d2b>] task_work_run+0xbb/0xe0
May 2 12:53:22 node42 kernel: [<ffffffffac42cc65>] do_notify_resume+0xa5/0xc0
May 2 12:53:22 node42 kernel: [<ffffffffacb8e23b>] int_signal+0x12/0x17
May 2 12:54:11 node42 kernel: G: s:UN n:2/ca0a05d f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
May 2 12:54:11 node42 kernel: G: s:SH n:5/c8649b4 f:DIqob t:SH d:UN/600386000 a:0 v:0 r:3 m:200
May 2 12:54:11 node42 kernel: H: s:SH f:EH e:0 p:1179 [(ended)] gfs2_inode_lookup+0x228/0x440 [gfs2]
May 2 12:54:11 node42 kernel: G: s:UN n:2/c8649b4 f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
May 2 12:54:11 node42 kernel: G: s:UN n:2/cffa835 f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
May 2 12:54:11 node42 kernel: G: s:SH n:5/ca0a05d f:DIqob t:SH d:UN/600366000 a:0 v:0 r:3 m:200
May 2 12:54:11 node42 kernel: H: s:SH f:EH e:0 p:30019 [(ended)] gfs2_inode_lookup+0x228/0x440 [gfs2]
May 2 12:54:11 node42 kernel: G: s:SH n:5/cffa835 f:DIqob t:SH d:UN/600365000 a:0 v:0 r:3 m:200
May 2 12:54:11 node42 kernel: H: s:SH f:EH e:0 p:2472 [(ended)] gfs2_inode_lookup+0x228/0x440 [gfs2]
May 2 12:54:11 node42 kernel: dlm: datalock: group leave failed -512 0
May 2 12:54:11 node42 kernel: VFS: Busy inodes after unmount of dm-32. Self-destruct in 5 seconds. Have a nice day...
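While the umount is stuck, the remaining glock references can also be inspected directly on the node. The following is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug (the directory under gfs2/ is named after the lock table, <clustername>:<fsname>):
# mount -t debugfs none /sys/kernel/debug 2>/dev/null
# cat /sys/kernel/debug/gfs2/*/glocks
The G: and H: lines in this file use the same format as the glock dump in the kernel log above.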
Resolution
Red Hat Enterprise Linux 7
- The issue is tracked in Bugzilla 1832393: Bug 1832393 - A pacemaker gfs2 filesystem resource failed to stop and umount appears to fail because there were existing references to a glock (RHEL 7.9.0). As of Wednesday, November 11, 2020, the status of Bugzilla 1832393 is CLOSED. The bug was closed as WONTFIX because the problem described will not be fixed. An explanation of why the resolution is set to WONTFIX should be in the Bugzilla; if you cannot access the bug or want further information, contact Red Hat support.
Workaround
If the gfs2 filesystem will not unmount, then the cluster node will need to be hard rebooted.
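For example, the node can be hard rebooted by fencing it from a surviving cluster member. This is a sketch assuming a working fence device is configured for the cluster and that node42 (from the logs above) is the affected node:
# pcs stonith fence node42
If fencing from another node is not an option, an immediate reboot can be forced from the affected node itself, assuming the magic sysrq key is enabled:
# echo b > /proc/sysrq-trigger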
Root Cause
The gfs2 filesystem could not be unmounted because there were still outstanding references to one or more of its glocks.
May 2 12:53:22 node42 kernel: [<ffffffffacb8e23b>] int_signal+0x12/0x17
May 2 12:54:11 node42 kernel: G: s:UN n:2/ca0a05d f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
May 2 12:54:11 node42 kernel: G: s:SH n:5/c8649b4 f:DIqob t:SH d:UN/600386000 a:0 v:0 r:3 m:200
May 2 12:54:11 node42 kernel: H: s:SH f:EH e:0 p:1179 [(ended)] gfs2_inode_lookup+0x228/0x440 [gfs2]
May 2 12:54:11 node42 kernel: G: s:UN n:2/c8649b4 f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
The glock lock table data in the logs shows that some of the glocks still held references. The umount requires that the reference count of each glock drops to zero before unmounting can complete. For more information, see the following article: What happens when a gfs2 filesystem is unmounted?. In the following output, the glock has a reference count of 2, shown by the r: field.
May 2 12:54:11 node42 kernel: G: s:UN n:2/ca0a05d f:Iqo t:UN d:EX/0 a:0 v:0 r:2 m:200
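As a rough illustration, the live glock dump can be filtered for glocks that still hold references. This is an assumption-laden one-liner, not a Red Hat-supplied tool; it simply prints G: lines whose r: (reference count) field is not 0:
# awk '/^G:/ && !/ r:0 /' /sys/kernel/debug/gfs2/*/glocks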
Eventually the umount gave up, which meant the filesystem was left mounted and the DLM lockspace was not cleaned up, because there were glocks with a reference count greater than 0. (The -512 in the dlm message below corresponds to -ERESTARTSYS, indicating the lockspace leave was interrupted.)
May 2 12:54:11 node42 kernel: dlm: datalock: group leave failed -512 0
May 2 12:54:11 node42 kernel: VFS: Busy inodes after unmount of dm-32. Self-destruct in 5 seconds. Have a nice day...
The cluster node will have to be rebooted since it left a dangling lockspace.
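The dangling lockspace can be confirmed before rebooting. As a sketch, dlm_tool ls lists the DLM lockspaces the node is still joined to; in this situation the filesystem's lockspace (datalock in the logs above) remains listed even though the Filesystem resource stop was attempted:
# dlm_tool ls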