A GFS2 file-system withdraw on fatal: I/O error on RHEL 5, 6, 7, 8
Environment
- Red Hat Enterprise Linux Server 5, 6, 7, 8 (with the High Availability and Resilient Storage Add-Ons)
- A GFS2 filesystem.
Issue
- A GFS2 file-system withdraw on fatal: I/O error occurred:
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: fatal: I/O error
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: block = 57047
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: function = log_write_header, file = fs/gfs2/log.c, line = 521
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: about to withdraw this file system
Nov 18 01:18:54 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: telling LM to unmount
Nov 18 01:18:54 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: withdrawn
Resolution
A GFS2 withdraw occurred on the cluster node, and the node must be rebooted before the GFS2 file-system can be remounted. In addition, the underlying storage issue must be resolved; otherwise the GFS2 file-system either will not mount or will eventually withdraw again after the cluster node is rebooted.
For more information on GFS2 and withdraws, see the following articles:
- The GFS2 Withdraw Function
- Why does Red Hat not support GFS or GFS2 filesystems directly on disks or partitions and requires LVM?
- Why does a system hang on shutdown when a gfs2 filesystem has a withdrawal?
- GFS or GFS2 file system withdraws after temporary failure of all paths in multipath map in RHEL
- An IO storage error occurs while writing to GFS2 filesystem journal and a withdraw is not triggered on RHEL 6, 7
Root Cause
The withdrawal occurred because an I/O error was received from the storage. Such an error is treated as fatal if the filesystem is accessed while the error condition persists. To protect the data, the gfs2 filesystem is withdrawn, making it inaccessible for further use. After a withdrawal, tasks (or processes) accessing the gfs2 file-system may block or return errors; this is expected behavior.
Here is an example of a fatal error that would cause a gfs2 file-system to withdraw:
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: [sdbo] killing request
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: [sdbo] Unhandled error code
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: [sdbo] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: [sdbo] CDB: Write(10): 2a 00 00 08 00 92 00 00 01 00
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 68:32.
Nov 18 01:10:20 node42 kernel: rport-1:0-3: blocked FC remote port time out: removing target and saving binding
Nov 18 01:10:20 node42 kernel: sd 2:0:0:11: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 65:224.
Nov 18 01:10:20 node42 kernel: sd 2:0:0:11: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 65:224.
Nov 18 01:10:20 node42 kernel: sd 1:0:1:1: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: sd 1:0:1:1: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 8:176.
Nov 18 01:10:20 node42 kernel: sd 1:0:1:5: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: sd 1:0:1:11: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 65:64.
Nov 18 01:10:20 node42 kernel: lpfc 0000:11:00.0: 0:(0):0203 Devloss timeout on WWPN 20:35:00:80:e5:36:87:70 NPort x022900 Data: x0 x8 x0
Nov 18 01:10:20 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: gfs2_quotad: statfs error -5
[.....]
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: fatal: I/O error
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: block = 57047
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: function = log_write_header, file = fs/gfs2/log.c, line = 521
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: about to withdraw this file system
Nov 18 01:18:54 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: telling LM to unmount
Nov 18 01:18:54 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: withdrawn
A fatal error in the storage layer will not instantly trigger a gfs2 filesystem withdraw. The withdraw occurs only when the gfs2 file-system is accessed again after a fatal error that has not recovered. It is therefore possible that the fatal error happened minutes, hours, or even days before the withdraw was triggered.
The errors below usually occur after a gfs2 filesystem has withdrawn, although in some instances (like the one above) they can occur before the IO withdrawal. On updated kernels, these errors will trigger an IO withdrawal when the storage cannot be accessed.
Nov 18 01:10:20 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: gfs2_quotad: statfs error -5
Nov 18 01:10:20 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: Error -5 writing to log
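The -5 in these messages is the kernel's negative errno value for EIO, the generic I/O error. As a quick sanity check, the mapping can be confirmed from Python's errno module:

```python
import errno
import os

# Kernel messages such as "statfs error -5" report a negated errno value;
# strip the sign and look up the symbolic name and description.
code = abs(-5)
print(errno.errorcode[code])  # symbolic name: EIO
print(os.strerror(code))      # human-readable description
```

Any negative error number seen in a GFS2 log line can be decoded the same way.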
Another example is a storage event that occurs without an IO withdraw being thrown. In this instance corruption can occur, which then causes a different type of withdrawal:
Sep 11 09:39:30 node42 kernel: Buffer I/O error on device dm-7, logical block 437133518
Sep 11 09:39:30 node42 kernel: lost page write due to I/O error on dm-7
Sep 11 09:39:30 node42 kernel: Buffer I/O error on device dm-7, logical block 437133519
Sep 11 09:39:30 node42 kernel: lost page write due to I/O error on dm-7
Sep 11 09:39:30 node42 kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 11 09:39:30 node42 kernel: sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 d0 71 17 e8 00 00 10 00
Sep 11 09:39:30 node42 kernel: end_request: I/O error, dev sdc, sector 3497072616
Sep 11 09:39:30 node42 kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 11 09:39:30 node42 kernel: sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 d0 71 17 00 00 00 10 00
Sep 11 09:39:30 node42 kernel: end_request: I/O error, dev sdc, sector 3497072384
Sep 11 09:39:30 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: fatal: invalid metadata block
Sep 11 09:39:30 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: bh = 41294920 (magic number)
Sep 11 09:39:30 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: function = get_leaf, file = fs/gfs2/dir.c, line = 819
Sep 11 09:39:30 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: about to withdraw this file system
Sep 11 09:39:31 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: jid=0: Trying to acquire journal lock...
Sep 11 09:39:31 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: telling LM to unmount
Sep 11 09:39:36 node42 kernel: GFS2: fsid=mycluster:mygfs2.2: withdrawn
The other cluster node detected the storage event, and an IO withdrawal was thrown:
Sep 11 09:39:31 node43 kernel: sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 11 09:39:31 node43 kernel: sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 99 f4 62 b0 00 00 10 00
Sep 11 09:39:31 node43 kernel: end_request: I/O error, dev sdc, sector 2582930096
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: fatal: I/O error
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: block = 118669
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: function = log_write_header, file = fs/gfs2/log.c, line = 521
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: about to withdraw this file system
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: telling LM to unmount
Sep 11 09:39:31 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: withdrawn
Different withdraws can be thrown if the IO withdrawal is not thrown first. In the following example, a storage event again occurs without an IO withdraw; the resulting corruption then causes another type of withdrawal:
Sep 12 20:56:25 node43 kernel: Buffer I/O error on device dm-7, logical block 48299740
Sep 12 20:56:25 node43 kernel: lost page write due to I/O error on dm-7
Sep 12 20:56:25 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: fatal: invalid metadata block
Sep 12 20:56:25 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: bh = 374507701 (magic number)
Sep 12 20:56:25 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 365
Sep 12 20:56:25 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: about to withdraw this file system
Sep 12 20:56:25 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: gfs2_delete_inode: -5
Sep 12 20:56:27 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: telling LM to unmount
Sep 12 20:56:30 node43 kernel: GFS2: fsid=mycluster:mygfs2.3: withdrawn
The other cluster node detected the storage event, and an IO withdrawal was thrown:
Sep 12 21:02:03 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: fatal: I/O error
Sep 12 21:02:03 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: block = 318702547
Sep 12 21:02:03 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: function = gfs2_ail1_start_one, file = fs/gfs2/log.c, line = 110
Sep 12 21:02:03 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: about to withdraw this file system
Sep 12 21:02:04 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: telling LM to unmount
Sep 12 21:02:12 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: withdrawn
Diagnostic Steps
Review the /var/log/messages file for the IO withdrawal and for any errors on the storage devices (including scsi, multipath, and lvm devices) used by the gfs2 filesystem around the time of the withdrawal.
In addition, check whether storage fencing is configured (fence_scsi or fence_mpath). If a cluster node is fenced with storage fencing, it loses access to the storage devices, and this can cause an IO withdraw on a gfs2 filesystem.
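The log review above can be scripted. This is a minimal sketch (the scan_gfs2_log helper name and the sample file are illustrative, not part of any Red Hat tool); on a real cluster node you would point it at /var/log/messages:

```shell
# Hypothetical helper: pull GFS2 withdraw messages and common storage-layer
# errors (scsi, multipath, FC) out of a log file in one pass.
scan_gfs2_log() {
  local log="$1"
  grep -E 'GFS2: .*(withdraw|fatal|error)|rejecting I/O|Failing path|I/O error|Devloss timeout' "$log"
}

# Demonstrate on a small sample log; on a node, run: scan_gfs2_log /var/log/messages
cat > /tmp/sample_messages <<'EOF'
Nov 18 01:10:20 node42 kernel: sd 7:0:0:11: rejecting I/O to offline device
Nov 18 01:10:20 node42 kernel: device-mapper: multipath: Failing path 68:32.
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: fatal: I/O error
Nov 18 01:18:53 node42 kernel: GFS2: fsid=mycluster:mygfs2.1: about to withdraw this file system
EOF
scan_gfs2_log /tmp/sample_messages
```

Correlate the timestamps of the matched storage errors with the withdraw messages to determine which storage event preceded the withdrawal.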
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.