How can I view glock contention on a GFS2 filesystem in real-time in a RHEL 5, 6, 7, or 8 Resilient Storage cluster?


Introduction


`GFS2` can have performance problems due to excessive `glock` contention, so analyzing `glock` usage is one strategy for solving many `GFS2` performance problems. A `glock` is an inter-node lock that the `GFS2` file system uses to coordinate changes to file system metadata (for example, changes to a file). This analysis is done with a program called `glocktop`, which displays or prints a list of `glocks` that have processes waiting (`waiters`) to lock (`hold`) them.

The glocktop tool is used to debug locking issues on a GFS2 filesystem and needs to be run while the performance issue is occurring. Because GFS2 is a shared filesystem, the glocktop utility needs to be run on every cluster node that has the GFS2 filesystems mounted. Ideally, glocktop is started on all the cluster nodes simultaneously as the root user, so that you can determine how the locks are being used across the cluster nodes for each GFS2 filesystem.

The glocktop program reports several things:

  • A list of each mounted GFS2 file system
  • For each mounted file system, it displays any glocks that have waiters (and the type of glock)
  • For directory glocks it displays the directory path before the glock
  • A list of holder records, showing all the processes on that node that are waiting to hold the glock

NOTE: The utility glocktop has been known to cause problems if the GFS2 filesystem is unmounted while it is running. Make sure glocktop is not running when any GFS2 filesystem is unmounted.

RHEL supported

  • RHEL 6: The glocktop binary was added to gfs2-utils in the following errata: RHBA-2016-0729 package gfs2-utils-3.0.12.1-78.el6 or higher for the channel(s) RHEL High Availability (v.6), RHEL Resilient Storage (v.6).
  • RHEL 7: The glocktop binary was added to gfs2-utils in the following errata: RHBA-2016-2438 for the package(s) gfs2-utils-3.1.9-3.el7 or later for the channel(s) RHEL Desktop (v.7), RHEL HPC Node (v.7), RHEL Server (v.7), RHEL Workstation (v.7).
  • RHEL 8: The glocktop binary is included in gfs2-utils.

Usage


*Please note that the output in the examples might look a little different from what `glocktop` currently prints, because new features have been added.*

The usage for glocktop is shown below. For more information on the options, consult the man page:

# glocktop [-i] [-d <delay sec>] [-n <iter>] [-sX] [-c] [-D] [-H] [-r] [-t]

When capturing the data that glocktop generates, glocktop needs to be run on all cluster nodes at the same time as the root user. There are two ways to capture the information that glocktop gathers:

  • Interactive mode, which is enabled with -i.
  • Standard output, which is the default. If the output needs to be saved, redirect stdout to a file. In the example below, the output is displayed and also written to a per-host file under /tmp:
# glocktop -r 2>&1 | tee /tmp/glocktop.output.$(hostname)

Example output of interactive mode


Here is an example of `glocktop` output in interactive mode. The first line is the header line, which is only displayed in interactive mode (enabled with the `-i` option).
glocktop - GFS2 glock monitor 
work       Thu Jan  9 11:29:07 2014  dlm: 16384/16384/16384 [*         ]
data       Thu Jan  9 11:29:07 2014  
/sasdata/bulked/model_data_calib_ps
G:  s:UN n:2/1b1d3985 f:lIqob t:SH d:EX/0 a:0 v:0 r:3 m:10   (directory inode)
 H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]
G:  s:UN n:3/1becce1c f:lqo t:EX d:EX/0 a:0 v:0 r:3 m:10     (108222 free rgrp)
 H: s:EX f:tW e:0 p:15733 [sas] gfs2_inplace_reserve+0x35c/0x980 [gfs2]

The next two lines show the GFS2 filesystems work and data, each followed by a timestamp. Please note that these are the names of the GFS2 filesystems, not the mount points.

work       Thu Jan  9 11:29:07 2014  dlm: 16384/16384/16384 [*         ]
data       Thu Jan  9 11:29:07 2014  

These two lines indicate that two GFS2 file systems are currently mounted, and show the time the glock measurement was taken. After the timestamp, one of the entries shows dlm: followed by some values. This indicates the distributed lock manager (DLM) is busy passing traffic, which may mean there is a lot of network traffic. The values that follow (shown here as 16384/16384/16384) indicate the sizes of the DLM hash tables. Setting the DLM hash tables to a large size (like 16384) may increase performance. The "[*         ]" indicates how busy DLM is while waiting for locks to be granted from another node in the cluster: the more asterisks printed, the busier DLM is. A value of "[**********]" means that DLM is swamped with lock requests.
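The header line described above can be picked apart mechanically when post-processing a saved capture. The following is a minimal sketch, not part of glocktop itself: the function name `parse_fs_header` and the returned dictionary keys are hypothetical, and the regular expression assumes the line layout shown in the example output (English day and month abbreviations, an optional `dlm:` section).

```python
import re
from datetime import datetime

# Assumed layout: "<fsname>  <timestamp>  [dlm: a/b/c [***       ]]"
HEADER_RE = re.compile(
    r"^(?P<fs>\S+)\s+"
    r"(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
    r"(?:\s+dlm: (?P<sizes>\d+/\d+/\d+) \[(?P<bar>[* ]*)\])?\s*$"
)


def parse_fs_header(line):
    """Return a dict for a file system header line, or None for any other line."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    # Collapse the double space that pads single-digit days ("Jan  9").
    ts_text = " ".join(m.group("ts").split())
    # %a and %b are locale dependent; this assumes an English/C locale.
    timestamp = datetime.strptime(ts_text, "%a %b %d %H:%M:%S %Y")
    sizes = m.group("sizes")
    bar = m.group("bar")
    return {
        "fs": m.group("fs"),
        "timestamp": timestamp,
        # Sizes of the DLM hash tables, when the dlm: section is present.
        "dlm_table_sizes": [int(x) for x in sizes.split("/")] if sizes else None,
        # Number of asterisks in the busy indicator.
        "dlm_busy": bar.count("*") if bar is not None else None,
    }


if __name__ == "__main__":
    line = "work       Thu Jan  9 11:29:07 2014  dlm: 16384/16384/16384 [*         ]"
    print(parse_fs_header(line))
```

Counting asterisks gives a rough busy level per sample, which can be compared across nodes that were sampled at the same time.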

The next line is /sasdata/bulked/model_data_calib_ps. This line indicates that some process is waiting to lock a directory with that path and name, within the file system called "data" (if the directory was for work it would have been listed directly after work instead of after data).

/sasdata/bulked/model_data_calib_ps

The next line, which starts with G: contains the GFS2 glock details for directory /sasdata/bulked/model_data_calib_ps. This is the same format as the GFS2 debugfs file (e.g. /sys/kernel/debug/gfs2/afcEast\:data/glocks), but in addition, if the inode type is available, it will tell you what kind of inode. In this example, it's a directory. Sometimes this value is cached in kernel memory and not available, so glocktop may not report it correctly.

G:  s:UN n:2/1b1d3985 f:lIqob t:SH d:EX/0 a:0 v:0 r:3 m:10   (directory inode)

The first glock field is the glock "state" (in this example, s:UN), which is the current state of the glock. The states are:

| Glock Mode | Description |
|---|---|
| UN | The glock is unlocked on this node (possibly locked on another) |
| SH | The glock is locked on this node in SHARED READ mode. |
| EX | The glock is locked on this node in EXCLUSIVE mode. |
| DF | The glock is locked on this node in Concurrent Write mode. |

The second glock field is the glock type and glock number (in this example, 2/1b1d3985). The first value (2) indicates the type of glock. The valid glock types are listed in the table below:

| Type number | Glock type | Use |
|---|---|---|
| 1 | Trans | Transaction Lock |
| 2 | Inode | Inode metadata and data |
| 3 | Resource group | Resource group metadata |
| 4 | Meta | The superblock |
| 5 | Iopen | Inode last closer detection |
| 6 | Flock | flock(2) syscall |
| 8 | Quota | Quota operations |
| 9 | Journal | Journal mutex |

The value after the "/" usually indicates a block address (for disk inodes): the location of that file, directory, etc., on disk (in hexadecimal). In this case, the block address of that directory is: 0x1b1d3985.
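The state, type, and block address fields described above can be extracted from a `G:` line with a short script. This is a minimal sketch under stated assumptions: `parse_glock` and its dictionary keys are hypothetical names, the type mapping is copied from the table above, and only the `s:`, `n:`, and `f:` fields are interpreted.

```python
import re

# Glock type numbers, as listed in the glock type table.
GLOCK_TYPES = {
    1: "Trans",
    2: "Inode",
    3: "Resource group",
    4: "Meta",
    5: "Iopen",
    6: "Flock",
    8: "Quota",
    9: "Journal",
}

G_RE = re.compile(
    r"^G:\s+s:(?P<state>\w+)\s+"
    r"n:(?P<type>\d+)/(?P<number>[0-9a-fA-F]+)\s+"
    r"f:(?P<flags>\S*)"
)


def parse_glock(line):
    """Return state, type, and block address for a G: line, or None otherwise."""
    m = G_RE.match(line.strip())
    if m is None:
        return None
    gtype = int(m.group("type"))
    return {
        "state": m.group("state"),
        "type": gtype,
        "type_name": GLOCK_TYPES.get(gtype, "unknown"),
        # The glock number is printed in hexadecimal.
        "block": int(m.group("number"), 16),
        "flags": m.group("flags"),
    }


if __name__ == "__main__":
    g = parse_glock("G:  s:UN n:2/1b1d3985 f:lIqob t:SH d:EX/0 a:0 v:0 r:3 m:10   (directory inode)")
    print(g["type_name"], hex(g["block"]))
```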

The next value, f:lIqob, holds the glock flags and is followed by other miscellaneous values. For more information, see the following article.

| Flag | Name | Meaning |
|---|---|---|
| l | Locked | The glock is in the process of changing state. |
| D | Demote | A demote request (local or remote). |
| d | Demote pending | A deferred (remote) demote request. |
| p | Demote in progress | The glock is in the process of responding to a demote request. |
| y | Dirty | Data needs flushing to disk before releasing this glock. |
| f | Log flush | The log needs to be committed before releasing this glock. |
| i | Invalidate in progress | In the process of invalidating pages under this glock. |
| r | Reply pending | Reply received from remote node is awaiting processing. |
| I | Initial | Set when DLM lock is associated with this glock. |
| F | Frozen | Replies from remote nodes ignored - recovery is in progress. |
| q | Queued | The glock has a holder queued (which will always be set). |
| o | Object attached | An object attached to the glock (for example, an inode). |
| b | Blocking request | The request is a blocking request. |
| L | LRU | The glock is on the LRU list. |
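A flags string such as `lIqob` can be expanded into flag names with a simple lookup. The sketch below is illustrative only: `decode_flags` is a hypothetical helper, and the mapping assumes the upper-case `F` for Frozen and lower-case `f` for Log flush, as used in the kernel's glock documentation.

```python
# Flag characters and names; "F" (Frozen) versus "f" (Log flush) is an
# assumption based on the kernel's glock documentation.
GLOCK_FLAGS = {
    "l": "Locked",
    "D": "Demote",
    "d": "Demote pending",
    "p": "Demote in progress",
    "y": "Dirty",
    "f": "Log flush",
    "i": "Invalidate in progress",
    "r": "Reply pending",
    "I": "Initial",
    "F": "Frozen",
    "q": "Queued",
    "o": "Object attached",
    "b": "Blocking request",
    "L": "LRU",
}


def decode_flags(flags):
    """Translate each flag character into its name, keeping the original order."""
    return [GLOCK_FLAGS.get(ch, "unknown flag " + repr(ch)) for ch in flags]


if __name__ == "__main__":
    for name in decode_flags("lIqob"):
        print(name)
```

Unknown characters are reported rather than dropped, since newer glocktop or kernel versions may print flags that are not listed here.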

The next line in the output, which starts with H:, indicates a process that is either holding that glock, or waiting for the glock. The "s:SH" indicates the process wants the lock in SHARED READ mode. The "f:W" is a flag indicating whether the process is waiting for the lock ("W") or actually holding the lock ("H"). The "p:23391" indicates that process number 23391 is the process that is waiting. It also shows the name of the process, and what GFS2 function is doing the waiting.

 H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]
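The holder fields described above can be pulled out of an `H:` line in the same way. This is a minimal sketch: `parse_holder` and its dictionary keys are hypothetical, the `e:` field is matched but not interpreted, and process names containing `]` are not handled.

```python
import re

H_RE = re.compile(
    r"^\s*H:\s+s:(?P<state>\w+)\s+f:(?P<flags>\S*)\s+e:\S+\s+"
    r"p:(?P<pid>\d+)\s+\[(?P<comm>[^\]]*)\]\s*(?P<func>.*)$"
)


def parse_holder(line):
    """Return the holder fields of an H: line, or None for any other line."""
    m = H_RE.match(line)
    if m is None:
        return None
    flags = m.group("flags")
    return {
        "state": m.group("state"),      # requested mode, for example SH or EX
        "flags": flags,
        "waiting": "W" in flags,        # process is waiting for the lock
        "holding": "H" in flags,        # process actually holds the lock
        "pid": int(m.group("pid")),
        "command": m.group("comm"),
        "function": m.group("func").strip(),
    }


if __name__ == "__main__":
    h = parse_holder(" H: s:SH f:W e:0 p:23391 [sas] gfs2_readdir+0x5a/0xd0 [gfs2]")
    print(h["pid"], h["command"], "waiting" if h["waiting"] else "holding")
```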

In this example, there is a second glock, 3/1becce1c, for which a process is waiting. Since the type value (before the "/") is "3", we know that it is a "resource group", which is a slice of the GFS2 file system. This type of glock is used for allocating and freeing blocks. For resource group glocks, glocktop also tells you the number of free blocks in that particular resource group. (That can give you an idea of how full or fragmented your file system is.)

G:  s:UN n:3/1becce1c f:lqo t:EX d:EX/0 a:0 v:0 r:3 m:10     (108222 free rgrp)
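The free-block count that glocktop appends to a resource group glock can be read from the trailing annotation. A small sketch, assuming the annotation keeps the `(<count> free rgrp)` form shown in the line above; `rgrp_free_blocks` is a hypothetical helper name.

```python
import re

# Assumed annotation format: "(<number> free rgrp)" at the end of the G: line.
RGRP_RE = re.compile(r"\((\d+) free rgrp\)\s*$")


def rgrp_free_blocks(line):
    """Return the free block count from a resource group G: line, or None."""
    m = RGRP_RE.search(line)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    line = "G:  s:UN n:3/1becce1c f:lqo t:EX d:EX/0 a:0 v:0 r:3 m:10     (108222 free rgrp)"
    print(rgrp_free_blocks(line))
```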

If you see a glock with lots of waiters, that indicates contention. For example:

/sasdata/bulked/attr_reg
G:  s:SH n:2/1b1d39ac f:ldrIqob t:EX d:UN/0 a:0 v:0 r:10 m:2 (directory inode)
 H: s:EX f:W e:0 p:29270 [sas] gfs2_glock_nq_init+0x16/0x40 [gfs2]
 H: s:SH f:AW e:0 p:29138 [sas] gfs2_permission+0xe4/0x100 [gfs2]
 H: s:SH f:AW e:0 p:29140 [sas] gfs2_permission+0xe4/0x100 [gfs2]
 H: s:SH f:AW e:0 p:29139 [sas] gfs2_permission+0xe4/0x100 [gfs2]
 H: s:SH f:AW e:0 p:29143 [sas] gfs2_permission+0xe4/0x100 [gfs2]
 H: s:SH f:AW e:0 p:29142 [sas] gfs2_permission+0xe4/0x100 [gfs2]
 I: n:1826926/454900140 t:4 f:0x00 d:0x00000003 s:2048

This indicates the directory "/sasdata/bulked/attr_reg" has six processes all waiting to lock it. The first one is waiting to lock it in Exclusive mode (to make a change; for example, to create a file in the directory). The other five are waiting for it in Shared Read mode (for example, to read the directory).

In this case, the line starting with I: indicates the disk inode is currently being read from or written to the media.

Identifying points of file system contention (such as the directory shown above) is the first step. If these points of contention are reduced or eliminated, your application will run much faster on GFS2.
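When a capture file is large, counting waiters per glock is one way to find the most contended glocks. The sketch below is an assumption-laden illustration rather than a supported tool: `rank_contention` is a hypothetical function, it relies on the line layout shown in the examples (an optional path line, then `G:`, then `H:` lines), it does not separate file systems, and it sums waiters across all samples in the input.

```python
from collections import Counter


def rank_contention(lines):
    """Return (glock label, waiter count) pairs, most contended first."""
    waiters = Counter()
    current = None   # label of the glock whose H: lines are being read
    path = None      # directory path printed just before a directory glock
    for raw in lines:
        stripped = raw.strip()
        if stripped.startswith("/"):
            path = stripped
        elif stripped.startswith("G:"):
            fields = stripped.split()
            number = next((f[2:] for f in fields if f.startswith("n:")), "?")
            current = f"{number} {path}" if path else number
            path = None
        elif stripped.startswith("H:") and current is not None:
            flags = next((f[2:] for f in stripped.split() if f.startswith("f:")), "")
            if "W" in flags:
                waiters[current] += 1
    return waiters.most_common()


if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python rank_glocks.py /tmp/glocktop.output.<host>
    with open(sys.argv[1]) as capture:
        for label, count in rank_contention(capture)[:10]:
            print(count, label)
```

Running the same script against the capture from each node makes it easier to see whether the same glock number is contended cluster-wide.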

Frequently Asked Questions

  • Why does the glocktop output file contain filesystems that had duplicate samples taken at the same time? Why is glocktop generating very large output files?
    The reason is that the command was run with nohup. Do not run glocktop in the following manner, or the output file will contain duplicate data.

    # nohup glocktop -D -r  > /tmp/glocktop.output.$(hostname) &
    