What data should I gather when access to a GFS2 filesystem appears to be hung or unresponsive on RHEL 5, 6, 7, 8, or 9?
Environment
- Red Hat Enterprise Linux Server 5, 6, 7, 8, 9 (with the High Availability and Resilient Storage Add-Ons)
- Global File System 2 (GFS2)
Issue
- Processes accessing GFS2 file-systems are hung in state D (uninterruptible sleep)
- GFS2 is hung
- Access to GFS2 hangs
- Where do I find gfs2_lockcapture?
Resolution
The attached Python scripts capture GFS2 and DLM lockdump data in order to troubleshoot GFS2 and DLM performance and hang issues. These scripts are provided as-is, so use at your own risk.
There are two attachments in this article: one for running in a Python 2.7 environment (RHEL 5, 6, 7) and one for running in a Python 3.X environment (RHEL 8, 9+).
- RHEL 5, 6, 7
  - Download and extract the file gfs2_lockcapture.tar.bz2, which contains a file called gfs2_lockcapture. This script will only run in Python 2.7 environments.
- RHEL 8, 9+
  - Download and extract the file gfs2_lockcapture-python3.tar.bz2, which contains a file called gfs2_lockcapture-python3. This script will only run in Python 3.X environments.
See the Diagnostic Steps below for information on how to collect the data required to troubleshoot these types of issues.
Related Articles
- How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging
- For RHEL 4 and GFS1, review the article "What data do I gather if processes accessing GFS1 are hung on RHEL 4?"
- For RHEL 5 and GFS1, review the article "What data should I gather when access to a GFS1 filesystem appears to be hung or unresponsive on RHEL 5?"
Root Cause
When processes hang in uninterruptible sleep with GFS2 functions on the stack, it usually indicates the process in question is I/O starved, waiting for GFS2 to complete an operation. If this condition persists for more than a few seconds without correcting itself, it can result in hung_task_timeout messages in the system logs and may indicate a deadlock.
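A quick way to check for the symptom described above is to look for D-state processes and for hung_task warnings. The commands below are a minimal sketch; the `<pid>` placeholder is illustrative and must be replaced with a real process ID:

```shell
# List processes in uninterruptible sleep (state D), showing the
# kernel function each one is blocked in (wchan):
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# For a suspect PID, the kernel stack shows whether GFS2/DLM
# functions are involved (replace <pid> with an actual PID):
# cat /proc/<pid>/stack

# hung_task warnings, if any, appear in the system log:
# grep -i 'hung_task\|blocked for more than' /var/log/messages
```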
Diagnostic Steps
To diagnose why a GFS2 file-system appears to be hung, unresponsive, or blocked, Red Hat requires specialized data. A script called gfs2_lockcapture, located in the gfs2-utils git repository on pagure.io, collects all the information required to analyze why a GFS2 file-system appears to be hung or why performance is slow.
If using the gfs2_lockcapture-python3.tar.bz2 version for RHEL 8 or RHEL 9+, substitute the filename gfs2_lockcapture-python3 in the examples below.
- Save the file gfs2_lockcapture.tar.bz2 that is attached to this article, then extract it:
# cd ~/
# tar jxvf ~/gfs2_lockcapture.tar.bz2
- While the system appears to be hung, unresponsive, or blocked, run the following command on all the cluster nodes simultaneously. This command assumes that the script is located in the /tmp directory. The command will gather 3 iterations of the lockdump data every 60 seconds. A .tar.bz2 file will be created in the /tmp directory containing the data captured for the cluster node the command was run on. Before running the command, review the help/usage output, which describes what each option does (see the bottom of this article):
# python /tmp/gfs2_lockcapture -r 3 -s 60 -o /tmp/ -y
The script gathers process information, which can sometimes cause performance problems on systems under high load. Gathering of process data can be disabled with the -P option.
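One way to start the capture on all nodes at roughly the same time is a loop over ssh. This is a sketch only: the node names are hypothetical, and it assumes passwordless ssh between cluster nodes (the actual ssh invocation is left commented out):

```shell
# Hypothetical node names; adjust to match your cluster.
NODES="node1 node2 node3"
# The capture command to run on every node:
CMD="python /tmp/gfs2_lockcapture -r 3 -s 60 -o /tmp/ -y"

for node in $NODES; do
    echo "starting capture on $node"
    # ssh "$node" "$CMD" &    # uncomment to run the capture remotely
done
# wait                        # wait for all background captures to finish
```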
- After the command has completed on all cluster nodes, scp the .tar.bz2 lockdump files that were created to one cluster node, then tar them together into a single archive file so that all the data collected for this incident is kept together.
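The collection step above might look like the following on the node chosen to gather the results. The node names, directory, and archive name are illustrative, and the scp lines are commented out because they depend on your cluster:

```shell
# Create a staging directory for the per-node capture archives:
mkdir -p /tmp/lockdumps
cd /tmp/lockdumps

# Copy each node's capture archive here, one scp per node, e.g.:
# scp node2:/tmp/*gfs2_lockcapture*.tar.bz2 .
# scp node3:/tmp/*gfs2_lockcapture*.tar.bz2 .

# Bundle everything into a single archive so the data for this
# incident stays together:
tar cjf /tmp/cluster-lockdumps.tar.bz2 -C /tmp lockdumps
```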
For more information about the gfs2_lockcapture options, run the following command or consult the man page. The man page describes the files that are captured and the commands that are run on the host:
# python ~/gfs2_lockcapture -h
# man gfs2_lockcapture
In addition to capturing this information, it is often useful to run resource utilization monitoring utilities over a longer period of time so that trends in system usage and load can be observed. This should usually include commands like top, vmstat, iostat, mpstat, ps, cat /proc/slabinfo, cat /proc/meminfo, and similar.
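A simple monitoring loop along those lines is sketched below. The log path, iteration count, and interval are illustrative; for real monitoring the sleep would be uncommented and the count raised:

```shell
# Append a lightweight resource snapshot to a log file on each pass.
OUT=/tmp/perf-monitor.log
for i in 1 2 3; do
    date              >> "$OUT"
    cat /proc/meminfo >> "$OUT"
    vmstat            >> "$OUT" 2>/dev/null || true  # skip if not installed
    # sleep 60        # uncomment to sample once per minute
done
```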
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.