Why does a system hang on shutdown when a gfs2 filesystem has a withdrawal?
Environment
- Red Hat Enterprise Linux Server 5, 6, 7 (with the High Availability and Resilient Storage Add Ons)
- A Global File System 2 (GFS2) file system
Issue
- Why does a system hang on shutdown when a gfs2 filesystem has a withdrawal?
Resolution
A hard reboot (such as pushing the power button, or powering off via a systems management card) is required to reboot the cluster node that had the gfs2 withdrawal.
Root Cause
The GFS2 withdraw function is a data integrity feature of GFS2 file systems in a cluster. If the GFS2 kernel module detects an inconsistency in a GFS2 file system following an I/O operation, the file system becomes unavailable to that cluster node (this does not affect the other cluster nodes' access to the GFS2 file system). For the GFS2 file system to withdraw properly, it must use a clustered LVM device managed by clvmd.
In some instances of a gfs2 withdrawal, commands that try to access the filesystem or its block device can hang. For example, an I/O withdrawal will cause commands such as umount and lvm to hang. An I/O withdrawal happens when the filesystem can no longer access the storage device (for example, fence_scsi was issued against the node or a fibre switch went down). Below is an example of an I/O withdrawal after a cluster node was fenced with fence_scsi.
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7978
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7979
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7980
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7981
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7982
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7983
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: Buffer I/O error on device dm-3, logical block 7984
Feb 20 10:52:03 rhel6-1 kernel: lost page write due to I/O error on dm-3
Feb 20 10:52:03 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: fatal: I/O error
Feb 20 10:52:03 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: block = 7984
Feb 20 10:52:03 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: function = log_write_header, file = fs/gfs2/log.c, line = 521
Feb 20 10:52:03 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: about to withdraw this file system
Feb 20 10:52:04 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: telling LM to unmount
Feb 20 10:52:04 rhel6-1 kernel: GFS2: fsid=rh6cluster:gfs2-1.0: withdrawn
Feb 20 10:52:04 rhel6-1 kernel: Pid: 24758, comm: gfs2_logd Not tainted 2.6.32-642.6.2.el6.x86_64 #1
Feb 20 10:52:04 rhel6-1 kernel: Call Trace:
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04c3f18>] ? gfs2_lm_withdraw+0x128/0x160 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff81549d58>] ? out_of_line_wait_on_bit+0x78/0x90
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff810a6920>] ? wake_bit_function+0x0/0x50
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04c3f90>] ? gfs2_io_error_bh_i+0x40/0x50 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff811d10d6>] ? __wait_on_buffer+0x26/0x30
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04ab4e1>] ? log_write_header+0x2e1/0x470 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04abb9b>] ? gfs2_log_flush+0x2cb/0x600 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04abfa9>] ? gfs2_logd+0xd9/0x140 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffffa04abed0>] ? gfs2_logd+0x0/0x140 [gfs2]
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff810a640e>] ? kthread+0x9e/0xc0
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff810a6370>] ? kthread+0x0/0xc0
Feb 20 10:52:04 rhel6-1 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Some gfs2 filesystem withdrawals can result in the following:
- The init service script /etc/init.d/gfs2 will hang when issuing stop because the umount command will hang.
- A hard reboot is required because a soft reboot (ex. the reboot command) will hang: commands run while powering down the system, such as lvm and umount, will hang.
Please note, not all gfs2 withdrawals will cause the umount command to hang or require a hard reboot.
Diagnostic Steps
Review the /var/log/messages file for a withdraw on a gfs2 filesystem.
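The review above can be partly automated. The following is a minimal sketch in Python, assuming the kernel messages look like the example log in this article ("GFS2: fsid=...: about to withdraw this file system", "withdrawn", "fatal: ..."); other kernel versions may word these messages differently. The function names find_gfs2_withdrawals and scan_log are hypothetical helpers written for this illustration and are not part of any Red Hat tool.

```python
import re

# Matches kernel lines such as:
#   "... kernel: GFS2: fsid=rh6cluster:gfs2-1.0: withdrawn"
# Assumption: the "GFS2: fsid=<cluster>:<fsname>.<journal>: <message>" layout
# seen in the example log of this article.
GFS2_LINE = re.compile(r"GFS2: fsid=(?P<fsid>\S+): (?P<msg>.*)$")


def find_gfs2_withdrawals(lines):
    """Map each fsid to the fatal/withdraw messages logged for it."""
    events = {}
    for line in lines:
        match = GFS2_LINE.search(line.rstrip("\n"))
        if not match:
            continue
        msg = match.group("msg")
        # Keep only the messages that indicate a fatal error or a withdrawal.
        if "withdraw" in msg or msg.startswith("fatal:"):
            events.setdefault(match.group("fsid"), []).append(msg)
    return events


def scan_log(path="/var/log/messages"):
    """Scan a syslog file (reading it usually requires root privileges)."""
    with open(path, errors="replace") as handle:
        return find_gfs2_withdrawals(handle)
```

Calling scan_log() on the affected node returns a dictionary keyed by fsid. An empty dictionary means no matching messages were found in that file, in which case rotated logs such as /var/log/messages-* may also need to be checked.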