Why do lvm commands hang, GFS/GFS2 file systems become unresponsive, and cluster operations hang when fencing is failing in a RHEL High Availability cluster?
Environment
- Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8, 9 with High Availability Add-On
Issue
- GFS2 filesystem hang; how can I recover from this situation? What's the solution?
- With `clvmd`, running `pvscan` and other lvm commands completely hangs:

  ```
  # pvscan -vv
      Setting global/locking_type to 3
      Setting global/wait_for_locks to 1
      Cluster locking selected.
  ```

- Why does access to GFS or GFS2 file systems hang when fencing is failing?
- Why is my cluster inoperable after a node has crashed or become unresponsive?
- After a node failed to be fenced by the cluster, we could not start `clvmd` on any nodes:

  ```
  Oct 25 08:25:51 node1 fenced[4671]: fence node3.example.com failed
  Oct 25 08:25:54 node1 fenced[4671]: fencing node node3.example.com
  Oct 25 08:26:11 node1 fenced[4671]: fence node3.example.com dev 0.0 agent fence_ipmilan result: error from agent
  ```

  ```
  # service clvmd start
  Starting clvmd: clvmd startup timed out
  ```

- Services and processes like `rgmanager`, `clvmd`, `gfs2_quotad`, or other processes accessing or using the cluster infrastructure become blocked after fencing fails in a cluster:

  ```
  Apr 7 22:55:03 node1 kernel: INFO: task rgmanager:6739 blocked for more than 120 seconds.
  Apr 7 22:55:03 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Apr 7 22:55:03 node1 kernel: rgmanager     D ffff88071fc28400     0  6739   6738 0x10000080
  Apr 7 22:55:03 node1 kernel: ffff880d85b59d48 0000000000000082 0000000000000000 ffff880d8cf19918
  Apr 7 22:55:03 node1 kernel: ffff880d85b02d80 0000000000000001 ffff880d85b01ff8 00007fffbffb00e0
  Apr 7 22:55:03 node1 kernel: ffff880d85b21a78 ffff880d85b59fd8 000000000000f4e8 ffff880d85b21a78
  Apr 7 22:55:03 node1 kernel: Call Trace:
  Apr 7 22:55:03 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
  Apr 7 22:55:03 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
  Apr 7 22:55:03 node1 kernel: [<ffffffffa076c92c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
  Apr 7 22:55:03 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
  Apr 7 22:55:03 node1 kernel: [<ffffffffa077594d>] device_write+0x30d/0x7d0 [dlm]
  Apr 7 22:55:03 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
  Apr 7 22:55:03 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
  Apr 7 22:55:03 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
  Apr 7 22:55:03 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
  Apr 7 22:55:03 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
  Apr 7 22:55:03 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
  [...]
  Apr 7 23:19:30 node1 kernel: INFO: task clvmd:5603 blocked for more than 120 seconds.
  Apr 7 23:19:30 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Apr 7 23:19:30 node1 kernel: clvmd         D ffff88071fc28400     0  5603      1 0x10000080
  Apr 7 23:19:30 node1 kernel: ffff88070a983d48 0000000000000086 0000000000000000 ffff8806f6149700
  Apr 7 23:19:30 node1 kernel: ffff880d8d5842a8 0000000000000000 ffff88070a742cd8 00007f49b3655d50
  Apr 7 23:19:30 node1 kernel: ffff8806f629fa78 ffff88070a983fd8 000000000000f4e8 ffff8806f629fa78
  Apr 7 23:19:30 node1 kernel: Call Trace:
  Apr 7 23:19:30 node1 kernel: [<ffffffff814ee0ae>] __mutex_lock_slowpath+0x13e/0x180
  Apr 7 23:19:30 node1 kernel: [<ffffffff814edf4b>] mutex_lock+0x2b/0x50
  Apr 7 23:19:30 node1 kernel: [<ffffffffa077192c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
  Apr 7 23:19:30 node1 kernel: [<ffffffff8115f40c>] ? __kmalloc+0x20c/0x220
  Apr 7 23:19:30 node1 kernel: [<ffffffffa077a94d>] device_write+0x30d/0x7d0 [dlm]
  Apr 7 23:19:30 node1 kernel: [<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160
  Apr 7 23:19:30 node1 kernel: [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
  Apr 7 23:19:30 node1 kernel: [<ffffffff8120c3c6>] ? security_file_permission+0x16/0x20
  Apr 7 23:19:30 node1 kernel: [<ffffffff811765d8>] vfs_write+0xb8/0x1a0
  Apr 7 23:19:30 node1 kernel: [<ffffffff81176fe1>] sys_write+0x51/0x90
  Apr 7 23:19:30 node1 kernel: [<ffffffff8100b308>] tracesys+0xd9/0xde
  ```
- We have a 2-node Red Hat Cluster with GFS file systems. It must be rebooted about once a week due to GFS hangs, and if a member goes down, the whole cluster goes down and must be rebooted. Why?
- GFS2 filesystem goes into a hung state frequently and the cluster nodes need to be rebooted to recover from the situation.
- GFS2 Filesystem getting hung
Resolution
- Configure a working fence/STONITH method for every node in the cluster.
- If there is a fence device configured for a node that is no longer a member of the cluster, determine why fencing is failing and fix it.
- If the missing node is failing to be fenced for some reason, reboot it and have it rejoin the cluster, in order to allow the rest of the cluster to resume activity.
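As an illustration of the first step, on a pacemaker-based cluster (RHEL 7 and later) fence devices can be configured and tested with `pcs`; on RHEL 5/6, fence devices are defined in `/etc/cluster/cluster.conf` instead. This is a hedged sketch only: the device name, host name, IP address, credentials, and agent parameters below are placeholders and depend on your fence hardware.

```shell
# Hypothetical sketch: create one IPMI-based STONITH device per node.
# fence-node1, node1.example.com, 10.0.0.101, and the credentials are
# placeholders -- substitute values for your environment and fence agent.
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list="node1.example.com" ip="10.0.0.101" \
    username="admin" password="secret" lanplus=1

# List the configured devices, then test one. Note that this really
# power-cycles the target, so only run it when the node can safely reboot.
pcs stonith
pcs stonith fence node1.example.com
```

Repeat for each node in the cluster so that every member can be fenced by its peers.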
Root Cause
- When a node has become unresponsive, has missed sending its token, or has been removed from the cluster for any reason, it must be fenced. The cluster will not proceed with any operations, such as allowing access to clustered file systems, failing over cluster services, or allowing LVM commands to be run (when using `clvmd`), until that node's fence method has completed successfully.
- When no fence device is configured for a node, or the configured fence method is unsuccessful, and that node must be removed from the cluster, the cluster will repeatedly attempt to fence the node until it succeeds or until the node rejoins after a complete restart (i.e., a reboot). This will cause cluster services such as `clvmd`, `rgmanager`, `cmirror`, and GFS/GFS2 file systems to become inoperable, and may manifest as hangs in commands that rely on the cluster infrastructure.
- Fencing is a fundamental part of the Red Hat Cluster infrastructure, so it is important to validate or test that fencing is working properly. Without a fence device configured, data integrity cannot be guaranteed and the cluster configuration will be unsupported. Refer to "What is fencing and why is it important?" for more details and to correct the configuration.
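Because a single failed fence blocks all of the services above, it is worth exercising each node's fence method before relying on it. A minimal sketch for cman-based clusters (RHEL 5/6), assuming an IPMI fence agent; the node name, IP address, and credentials are placeholders for your own values.

```shell
# Ask the cluster to fence a node through its configured method.
# This reboots or powers off the target -- use a maintenance window.
fence_node -v node3.example.com

# Or call the fence agent directly to verify reachability and
# credentials without power-cycling anything (placeholders below).
fence_ipmilan -a 10.0.0.103 -l admin -p secret -o status
```

If the agent's `status` call fails here, fencing will also fail when the cluster needs it, and the whole cluster will hang as described above.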
Diagnostic Steps
- Check `/var/log/messages` or `/var/log/cluster/fenced.log` (RHEL 6) for signs that fencing is failing.
- Check `cman_tool status` to see whether the cluster still has enough votes for quorum and is quorate.
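The checks above can be run together; the log paths and quorum tools differ by release, so this is a sketch rather than an exact recipe.

```shell
# Look for fence failures (RHEL 6 keeps a dedicated fenced.log;
# RHEL 5 logs fenced messages to /var/log/messages).
grep -E 'fence .*(failed|error)' /var/log/messages /var/log/cluster/fenced.log 2>/dev/null

# Membership and quorum on cman-based clusters (RHEL 5/6).
cman_tool status | grep -E 'Nodes|Expected votes|Total votes|Quorum'

# Equivalent quorum check on corosync/pacemaker clusters (RHEL 7 and later).
corosync-quorumtool -s
```

A repeating "fence ... failed" line for the same node, combined with a cluster that is still quorate, matches the hang described in this article: the remaining members are waiting for the fence to succeed before resuming DLM and GFS/GFS2 activity.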
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.