How can I view glock contention on a GFS2 filesystem in real-time in a RHEL 5, 6, 7, or 8 Resilient Storage cluster?
Introduction
`GFS2` can have performance problems due to excessive `glock` contention, so analyzing `glock` usage is one strategy for solving many `GFS2` performance problems. A `glock` is an inter-node lock used by the `GFS2` file system to coordinate changes to file system metadata (for example, files). This analysis can be done with a program called `glocktop`, which displays or prints a list of `glocks` that have processes waiting (`waiters`) to lock (`hold`) them.
The glocktop tool is used to debug locking issues on a GFS2 filesystem and needs to be run while the performance issue is occurring. GFS2 is a shared filesystem, so the glocktop utility needs to be run on every cluster node that has the GFS2 filesystems mounted. Ideally, glocktop is started on all the cluster nodes simultaneously as the root user, so you can determine how the locks are being used across all the cluster nodes for each GFS2 filesystem.
The glocktop program reports several things:
- A list of each mounted `GFS2` file system
- For each mounted file system, it displays any `glocks` that have waiters (and the type of `glock`)
- For directory `glocks` it displays the directory path before the `glock`
- A list of holder records, showing all the processes on that node that are waiting to hold the `glock`
NOTE: The utility glocktop has been known to cause problems if the GFS2 filesystem is unmounted while it is running. Make sure glocktop is not running when any GFS2 filesystem is unmounted.
RHEL supported
- RHEL 6: The `glocktop` binary was added to `gfs2-utils` in the following erratum: RHBA-2016-0729, package `gfs2-utils-3.0.12.1-78.el6` or higher, for the channel(s) RHEL High Availability (v.6), RHEL Resilient Storage (v.6).
- RHEL 7: The `glocktop` binary was added to `gfs2-utils` in the following erratum: RHBA-2016-2438, package `gfs2-utils-3.1.9-3.el7` or later, for the channel(s) RHEL Desktop (v.7), RHEL HPC Node (v.7), RHEL Server (v.7), RHEL Workstation (v.7).
- RHEL 8: The `glocktop` binary is included in `gfs2-utils`.
Usage
*Please note that the output in the examples might appear a little different from what `glocktop` currently outputs, as new features have been added.*
The usage for glocktop is shown below. For more information on the options, consult the man page:
# glocktop [-i] [-d <delay sec>] [-n <iter>] [-sX] [-c] [-D] [-H] [-r] [-t]
When capturing the data that glocktop generates, glocktop needs to be run on all cluster nodes at the same time as the root user. There are two ways to capture the information that glocktop gathers:
- Interactive mode, which is enabled with -i.
- Standard output mode, which is the default. If the output needs to be saved, redirect `stdout` to a file, as in the example below.
# glocktop -r 2>&1 | tee /tmp/glocktop.output.$(hostname)
Example output of interactive mode
Here is an example of `glocktop` output in interactive mode. The first line is the header line, which is only displayed in interactive mode (enabled with the `-i` option).
glocktop - GFS2 glock monitor
work Thu Jan 9 11:29:07 2014 dlm: 16384/16384/16384 [* ]
data Thu Jan 9 11:29:07 2014
/sasdata/bulked/model_data_calib_ps
G: s:UN n:2/1b1d3985 f:lIqob t:SH d:EX/0 a:0 v:0 r:3 m:10 (directory inode)
H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]
G: s:UN n:3/1becce1c f:lqo t:EX d:EX/0 a:0 v:0 r:3 m:10 (108222 free rgrp)
H: s:EX f:tW e:0 p:15733 [sas] gfs2_inplace_reserve+0x35c/0x980 [gfs2]
The next two lines show the GFS2 filesystems work and data, each followed by a time stamp. Please note that these are the names of the GFS2 filesystems, not the mount points.
work Thu Jan 9 11:29:07 2014 dlm: 16384/16384/16384 [* ]
data Thu Jan 9 11:29:07 2014
These two lines indicate that two GFS2 file systems are currently mounted, and the time the glock measurement was taken. After the timestamp, one of the entries shows dlm: followed by some values. This indicates the distributed lock manager (DLM) is busy passing traffic, which may mean there is a lot of network traffic. The values that follow (shown here as 16384/16384/16384) indicate the sizes of the DLM hash tables. Setting the DLM hash tables to a large size (like 16384) may increase performance. The "[* ]" indicates how busy DLM is while waiting for locks to be granted from another node in the cluster: the more asterisks printed, the busier DLM is. A value of "[**********]" means that DLM is swamped with lock requests.
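The following is a minimal sketch of how the dlm: portion of a saved glocktop line could be read programmatically. The function name `parse_dlm_status` is hypothetical, and the sketch assumes the bracketed indicator contains only asterisks and spaces, as in the example above.

```python
import re

def parse_dlm_status(line):
    """Return DLM hash table sizes and busy level from a file system line.

    Returns None when the line carries no 'dlm:' entry.
    """
    m = re.search(r"dlm:\s*(\d+)/(\d+)/(\d+)\s*\[([* ]*)\]", line)
    if m is None:
        return None
    sizes = tuple(int(m.group(i)) for i in (1, 2, 3))
    # The busy level is simply the number of asterisks inside the brackets.
    return {"hash_table_sizes": sizes, "busy_level": m.group(4).count("*")}
```

A line without a dlm: entry, such as the data line in the example, returns `None`.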
The next line is /sasdata/bulked/model_data_calib_ps. This line indicates that some process is waiting to lock a directory with that path and name, within the file system called "data" (if the directory was for work it would have been listed directly after work instead of after data).
/sasdata/bulked/model_data_calib_ps
The next line, which starts with G: contains the GFS2 glock details for directory /sasdata/bulked/model_data_calib_ps. This is the same format as the GFS2 debugfs file (e.g. /sys/kernel/debug/gfs2/afcEast\:data/glocks), but in addition, if the inode type is available, it will tell you what kind of inode. In this example, it's a directory. Sometimes this value is cached in kernel memory and not available, so glocktop may not report it correctly.
G: s:UN n:2/1b1d3985 f:lIqob t:SH d:EX/0 a:0 v:0 r:3 m:10 (directory inode)
The first glock field is the glock state (in this example, s:UN), which is the current state of the glock. The states are:
| Glock Mode | Description |
|---|---|
| UN | The glock is unlocked on this node (possibly locked on another) |
| SH | The glock is locked on this node in SHARED READ mode. |
| EX | The glock is locked on this node in EXCLUSIVE mode. |
| DF | The glock is locked on this node in Concurrent Write mode. |
The second glock field is the glock type and glock number (in this example, 2/1b1d3985). The first value (2) indicates the type of glock. The valid glock types are listed in the table below:
| Type number | Glock type | Use |
|---|---|---|
| 1 | Trans | Transaction Lock |
| 2 | Inode | Inode metadata and data |
| 3 | Resource group | Resource group metadata |
| 4 | Meta | The superblock |
| 5 | Iopen | Inode last closer detection |
| 6 | Flock | flock(2) syscall |
| 8 | Quota | Quota operations |
| 9 | Journal | Journal mutex |
The value after the "/" usually indicates a block address (for disk inodes): the location of that file, directory, etc., on disk (in hexadecimal). In this case, the block address of that directory is: 0x1b1d3985.
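As an illustration of the fields described so far, here is a hedged Python sketch that splits a G: line into its state, glock type, and block address. The function name `parse_glock_line` and the returned dictionary keys are hypothetical; the lookup tables are copied from the tables above.

```python
import re

# Glock states and types as listed in the tables above.
GLOCK_STATES = {"UN": "unlocked", "SH": "shared read",
                "EX": "exclusive", "DF": "concurrent write"}
GLOCK_TYPES = {1: "Trans", 2: "Inode", 3: "Resource group", 4: "Meta",
               5: "Iopen", 6: "Flock", 8: "Quota", 9: "Journal"}

def parse_glock_line(line):
    """Decode the state, type, block address, and flags of a 'G:' line."""
    body = line.strip()
    if not body.startswith("G:"):
        raise ValueError("not a glock line")
    # The trailing annotation, such as "(directory inode)", is optional.
    note = None
    m = re.search(r"\((.*)\)\s*$", body)
    if m:
        note = m.group(1)
        body = body[:m.start()]
    # Remaining tokens have the form key:value, for example "s:UN".
    fields = dict(tok.split(":", 1) for tok in body[2:].split())
    gtype, number = fields["n"].split("/")
    return {
        "state": fields["s"],
        "state_desc": GLOCK_STATES.get(fields["s"], "unknown"),
        "type": GLOCK_TYPES.get(int(gtype), "unknown"),
        "block": int(number, 16),  # the glock number is hexadecimal
        "flags": fields["f"],
        "note": note,
    }
```

For the example line above, this yields the state UN, the type Inode, and the block address 0x1b1d3985.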
The next value f:lIqob is the glock flags, followed by other miscellaneous values. For more information, see the following article.
| Flag | Name | Meaning |
|---|---|---|
| l | Locked | The glock is in the process of changing state. |
| D | Demote | A demote request (local or remote). |
| d | Demote pending | A deferred (remote) demote request. |
| p | Demote in progress | The glock is in the process of responding to a demote request. |
| y | Dirty | Data needs flushing to disk before releasing this glock. |
| f | Log flush | The log needs to be committed before releasing this glock. |
| i | Invalidate in progress | In the process of invalidating pages under this glock. |
| r | Reply pending | Reply received from remote node is awaiting processing. |
| I | Initial | Set when DLM lock is associated with this glock. |
| F | Frozen | Replies from remote nodes ignored - recovery is in progress. |
| q | Queued | The glock has a holder queued (which will always be set). |
| o | Object attached | An object attached to the glock (for example, an inode). |
| b | Blocking request | Request is blocking request. |
| L | LRU | A new LRU flag. |
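A flag string such as lIqob can be expanded into flag names with a small lookup. The sketch below is illustrative only. The name `decode_glock_flags` is hypothetical, and it assumes the Frozen flag is reported as an uppercase `F`, as in the Linux kernel's GFS2 glock documentation.

```python
# Flag letters and names taken from the table above.
# Frozen is assumed to be the uppercase "F" (see the lead-in).
GLOCK_FLAGS = {
    "l": "Locked", "D": "Demote", "d": "Demote pending",
    "p": "Demote in progress", "y": "Dirty", "f": "Log flush",
    "i": "Invalidate in progress", "r": "Reply pending", "I": "Initial",
    "F": "Frozen", "q": "Queued", "o": "Object attached",
    "b": "Blocking request", "L": "LRU",
}

def decode_glock_flags(flags):
    """Return the flag names for each letter, in the order they appear."""
    return [GLOCK_FLAGS.get(ch, "unknown (%s)" % ch) for ch in flags]
```

For the example glock, `decode_glock_flags("lIqob")` gives Locked, Initial, Queued, Object attached, and Blocking request.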
The next line in the output, which starts with H:, indicates a process that is either holding that glock, or waiting for the glock. The "s:SH" indicates the process wants the lock in SHARED READ mode. The "f:W" is a flag indicating whether the process is waiting for the lock ("W") or actually holding the lock ("H"). The "p:23391" indicates that process number 23391 is the process that is waiting. It also shows the name of the process, and what GFS2 function is doing the waiting.
H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]
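The holder fields described above can be extracted with a regular expression. This is a sketch under the assumption that H: lines always follow the layout shown in this article. The name `parse_holder_line` and the dictionary keys are hypothetical.

```python
import re

# Matches lines such as:
# H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]
HOLDER_RE = re.compile(
    r"H:\s+s:(?P<state>\S+)\s+f:(?P<flags>\S*)\s+e:(?P<e>\S+)"
    r"\s+p:(?P<pid>\d+)\s+\[(?P<command>[^\]]*)\]\s+(?P<function>\S+)"
)

def parse_holder_line(line):
    """Return the holder fields as a dict, or None if the line does not match."""
    m = HOLDER_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["pid"] = int(info["pid"])
    # "W" means the process is waiting; "H" means it holds the lock.
    info["waiting"] = "W" in info["flags"]
    info["holding"] = "H" in info["flags"]
    return info
```

Applied to the line above, the result identifies process 23391 (sas) as waiting in gfs2_readdir for the lock in SH mode.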
In this example, there is a second glock, 3/1becce1c for which a process is waiting. Since the type value (before the "/") is "3" we know that it's a "resource group" which is a slice of the GFS2 file system. This type of glock is used for block allocations and freeing blocks. For resource group glocks, glocktop also tells you the number of free blocks in that particular resource group. (That can give you an idea of how full or fragmented your file system is).
G: s:UN n:3/1becce1c f:lqo t:EX d:EX/0 a:0 v:0 r:3 m:10 (108222 free rgrp)
If you see a glock with lots of waiters, that indicates contention. For example:
/sasdata/bulked/attr_reg
G: s:SH n:2/1b1d39ac f:ldrIqob t:EX d:UN/0 a:0 v:0 r:10 m:2 (directory inode)
H: s:EX f:W e:0 p:29270 [sas] gfs2_glock_nq_init+0x16/0x40 [gfs2]
H: s:SH f:AW e:0 p:29138 [sas] gfs2_permission+0xe4/0x100 [gfs2]
H: s:SH f:AW e:0 p:29140 [sas] gfs2_permission+0xe4/0x100 [gfs2]
H: s:SH f:AW e:0 p:29139 [sas] gfs2_permission+0xe4/0x100 [gfs2]
H: s:SH f:AW e:0 p:29143 [sas] gfs2_permission+0xe4/0x100 [gfs2]
H: s:SH f:AW e:0 p:29142 [sas] gfs2_permission+0xe4/0x100 [gfs2]
I: n:1826926/454900140 t:4 f:0x00 d:0x00000003 s:2048
This indicates the directory "/sasdata/bulked/attr_reg" has six processes all waiting to lock it. The first one is waiting to lock it in Exclusive mode (to make a change; for example, to create a file in the directory). The other five are waiting for it in Shared Read mode (for example, to read the directory).
In this case, the line starting with I: indicates the disk inode is currently being read from or written to the media.
Identifying points of file system contention (such as the directory shown above) is the first step. If these points of contention are reduced or eliminated, your application will run much faster on GFS2.
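When glocktop output has been saved to a file, contended glocks can be ranked by counting waiters. The following is a rough sketch, not part of glocktop itself. The name `count_waiters` is hypothetical, and the counts accumulate across every sample in the input, so a glock that appears in many samples scores higher.

```python
def count_waiters(lines):
    """Map each glock's type/number (the 'n:' value) to its waiting holders."""
    counts = {}
    current = None
    for line in lines:
        text = line.strip()
        if text.startswith("G:"):
            # Remember which glock the following H: lines belong to.
            current = next(
                (tok[2:] for tok in text.split() if tok.startswith("n:")), None
            )
            if current is not None:
                counts.setdefault(current, 0)
        elif text.startswith("H:") and current is not None:
            flags = next(
                (tok[2:] for tok in text.split() if tok.startswith("f:")), ""
            )
            if "W" in flags:  # "W" marks a holder that is still waiting
                counts[current] += 1
    return counts
```

For example, `count_waiters(open("/tmp/glocktop.output.node1"))` (the file name is only an example) returns a dictionary that can be sorted by value to find the most contended glocks.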
Frequently Asked Questions
- Why does the `glocktop` output file contain filesystems that had duplicate samples taken at the same time? Why is `glocktop` generating very large output files?
The reason is that the command was run with `nohup`. Do not run `glocktop` in the following manner, or the output file will contain duplicate data.
# nohup glocktop -D -r > /tmp/glocktop.output.$(hostname) &