How can I use the GFS2 tracepoints and debugfs glocks file in RHEL6?
- Introduction
- Types of GFS2 tracepoint
- Tracepoints
- Glocks
- Glock tracepoints
- Bmap tracepoints
- Log tracepoints
- References
Introduction
This article covers both the glock debugfs interface and the GFS2 tracepoints. It is aimed at those who are already familiar with filesystem internals in a generic way but would like to learn more about the design of GFS2 and how to debug GFS2 specific issues.
Types of GFS2 tracepoint
There are currently three types of GFS2 tracepoint: glock tracepoints, bmap tracepoints and log tracepoints. These can be used to monitor a running GFS2 filesystem and give additional information to that which can be obtained with the previously (i.e. RHEL5) debugging options. They are particularly useful when a problem, such as a hang or performance issue, is reproducible and thus the tracepoint output can be obtained during the problematic operation. In GFS2, glocks (pronounced gee-locks) are the primary cache control mechanism and they are the key to understanding the performance of the core of GFS2. The bmap (block map) tracepoints can be used to monitor block allocations and block mapping (lookup of already allocated blocks in the on-disk metadata tree) as they happen and check for any issues relating to locality of access. The log tracepoints keep track of the data being written to, and released from, the journal and again can provide useful information on that part of GFS2.
The tracepoints are designed to be as generic as possible. This should mean that it will not be necessary to change the API during the course of RHEL6. On the other hand, users of this interface should be aware that this is a debugging interface and not part of the normal RHEL6 API set, and as such Red Hat makes no guarantees that changes in the GFS2 tracepoints interface will not occur.
Tracepoints are a generic feature of RHEL6 and their scope goes well beyond GFS2. In particular they are used to implement the blktrace infrastructure and the blktrace tracepoints can be used in combination with those of GFS2 to gain a fuller picture of the system performance. Due to the level at which the tracepoints operate, they can produce large volumes of data in a very short period of time. They are designed to put a minimum load on the system when they are enabled, but it is inevitable that they will have some effect. Filtering events via a variety of means can help reduce the volume of data and help focus on obtaining just the information which is useful for understanding any particular situation.
Tracepoints
The tracepoints can be found under /sys/kernel/debug/tracing/ assuming that debugfs is mounted in the usual place at /sys/kernel/debug. The subdirectory events, contains all the tracing events which may be specified, and provided the gfs2 module is loaded, there will be a gfs2 subdirectory containing further subdirectories, one for each GFS2 event. It should look roughly like this:
[root@chywoon gfs2]# ls
enable gfs2_bmap gfs2_glock_queue gfs2_log_flush
filter gfs2_demote_rq gfs2_glock_state_change gfs2_pin
gfs2_block_alloc gfs2_glock_put gfs2_log_blocks gfs2_promote
To enable all the GFS2 tracepoints, run echo -n 1 >/sys/kernel/debug/tracing/events/gfs2/enable. To enable a specific tracepoint, there is an enable file in each of the individual event subdirectories. The same is true of the filter file which can be used to set an event filter for each event or set of events. The meaning of the individual events is explained in more detail below.
The output from the tracepoints is available in ascii or binary. This document does not currently cover the binary interface. The ascii interface is available in two ways. To list the current content of the ring buffer, cat /sys/kernel/debug/tracing/trace. This interface is useful in cases where you are using a long-running process for a certain period of time and, after some event, want to look back at the latest captured information in the buffer. An alternative interface, /sys/kernel/debug/tracing/trace_pipe, can be used when all the output is required. Events are read from this file as they occur; there is no historical information available via this interface. The format of the output is the same from both interfaces and is described for each of the GFS2 events in the later sections of this document.``
A userland tool called trace-cmd is available for reading tracepoint data. For more information on this tool, follow the link in the references section at the end of this article. The trace-cmd tool can be used in a similar way to strace, for example, to run a command while gathering trace data from various sources.
Glocks
To understand GFS2, the most important concept to understand, and the one which sets it aside from other filesystems is the concept of glocks. In terms of the source code, a glock is a data structure which brings together the DLM and caching into a single state machine. Each glock has a 1:1 relationship with a single DLM lock, and provides caching for that lock state so that repetitive operations carried out from a single node of the filesystem do not have to repeatedly call the DLM, and thus they help avoid unnecessary network traffic. There are two broad categories of glocks, those which cache metadata and those which do not. The inode glocks and the resource group glocks both cache metadata, other types of glock do not cache metadata. The inode glock is also involved in the caching of data in addition to metadata and has the most complex logic of all glocks.
| Glock mode | DLM lock mode | Notes |
|---|---|---|
| UN | IV/NL | Unlocked (no DLM lock associated with glock or NL lock depending on I flag) |
| SH | PR | Shared (protected read) lock |
| EX | EX | Exclusive lock |
| DF | CW | Deferred (concurrent write) used for Direct I/O and filesystem freeze |
Note: this particular aspect has changed since RHEL5, which does sometimes unlock the DLM locks attached to glocks completely, and thus RHEL5 has a different mechanism to ensure that LVBs (lock value blocks) are preserved where required. The new scheme that RHEL6 uses was made possible due to the merging of the lock_dlm lock module (not to be confused with the DLM itself) into GFS2.
Each glock can have a number of "holders" associated with it, each of which represents one lock request from the higher layers. System calls relating to GFS2 queue and dequeue holders from the glock to protect the critical section of code.
The glock state machine is based on a workqueue. For performance reasons, we would prefer to use tasklets, however in the current implementation we need to submit I/O from that context which prohibits their use.
Note: workqueues have their own tracepoints which can be used in combination with the GFS2 tracepoints if desired
| Glock mode | Cache Data | Cache Metadata | Dirty Data | Dirty Metadata |
|---|---|---|---|---|
| UN | No | No | No | No |
| SH | Yes | Yes | No | No |
| DF | No | Yes | No | No |
| EX | Yes | Yes | Yes | Yes |
The glock debugfs interface
The glock debugfs interface allows the visualisation of the internal state of the glocks, the holders and it also includes some summary details of the objects being locked in some cases. Each line of the file either begins G: with no indentation (which refers to the glock itself) or it begins with a different letter, indented with a single space, and refers to the structures (H: is a holder, I: an inode, and R: a resource group) associated with the glock immediately above it in the file. Here is an example of what the content of this file might look like:
G: s:SH n:5/75320 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G: s:EX n:3/258028 f:yI t:EX d:EX/0 a:3 r:4
H: s:EX f:tH e:0 p:4466 [postmark] gfs2_inplace_reserve_i+0x177/0x780 [gfs2]
R: n:258028 f:05 b:22256/22256 i:16800
G: s:EX n:2/219916 f:yfI t:EX d:EX/0 a:0 r:3
I: n:75661/219916 t:8 f:0x10 d:0x00000000 s:7522/7522
G: s:SH n:5/127205 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G: s:EX n:2/50382 f:yfI t:EX d:EX/0 a:0 r:2
G: s:SH n:5/302519 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G: s:SH n:5/313874 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G: s:SH n:5/271916 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
G: s:SH n:5/312732 f:I t:SH d:EX/0 a:0 r:3
H: s:SH f:EH e:0 p:4466 [postmark] gfs2_inode_lookup+0x14e/0x260 [gfs2]
The above example is a series of excerpts (from an approximately 18MB file) generated by cat /sys/kernel/debug/gfs2/unity:myfs/glocks >my.lock} during a run of the postmark benchmark on a single node GFS2 filesystem. The glocks in the figure have been selected in order to show some of the more interesting features of the glock dumps.
The glock states are either EX (exclusive), DF (deferred), SH (shared) or UN (unlocked). These states correspond directly with DLM lock modes except for UN which may represent either the DLM null lock state, or that GFS2 doesn't hold a DLM lock (depending on the I flag as explained above). The s: field of the glock indicates the current state of the lock and the same field in the holder indicates the requested mode. If the lock is granted, the holder will have the H bit set in its flags (f: field). Otherwise, it will have the W wait bit set.
The n: field (number) indicates the number associated with each item. For glocks, that is the type number followed by the glock number so that in the above example, the first glock is n:5/75320, i.e. an iopen glock which relates to inode 75320. In the case of inode and iopen glocks, the glock number is always identical to the inode's disk block number.
Note: The glock numbers (n: field) in the debugfs glocks file are in hexadecimal, whereas the tracepoints output lists them in decimal. This is for historical reasons: glock numbers were always written in hex, but decimal was chosen for the tracepoints so that the numbers could easily be compared with the other tracepoint output (from blktrace for example) and with output from stat(1).
The full listing of all the flags for both the holder and the glock are set out in the two tables below. The content of lock value blocks is not currently available via the glock debugfs interface, although Red Hat may add this in the future.
Glock Types
| Type number | Glock type | Use |
| 1 | Trans | Transaction Lock |
| 2 | Inode | Inode metadata and data |
| 3 | Resource group | Resource group metadata |
| 4 | Meta | The superblock |
| 5 | Iopen | Inode last closer detection |
| 6 | Flock | flock(2) syscall |
| 8 | Quota | Quota operations |
| 9 | Journal | Journal mutex |
| Flag | Name | Meaning |
|---|---|---|
l |
Locked | The glock is in the process of changing state |
D |
Demote | A demote request (local or remote) |
d |
Demote pending | A deferred (remote) demote request |
p |
Demote in progress | The glock is in the process of responding to a demote request |
y |
Dirty | Data needs flushing to disk before releasing this glock |
f |
Log flush | The log needs to be committed before releasing this glock |
i |
Invalidate in progress | In the process of invalidating pages under this glock |
r |
Reply pending | Reply received from remote node is awaiting processing |
I |
Initial | Set when DLM lock is associated with this glock |
F |
Frozen | Replies from remote nodes ignored - recovery is in progress |
If the time interval has expired, then the D (demote) flag will be set and the state required will be recorded. In that case the next time there are no granted locks on the holders queue, the lock will be demoted. If the time interval has not expired, then the d (demote pending) flag is set instead. This also schedules the state machine to clear d (demote pending) and set D (demote) when the minimum hold time has expired.
The I (initial) flag is set when the glock has been assigned a DLM lock. This happens when the glock is first used and the I flag will then remain set until the glock is finally freed (which the DLM lock is unlocked).
Glock holders
| Flag | Name | Meaning |
|---|---|---|
t |
Try | A "try" lock |
T |
Try 1CB | A "try" lock which sends a single callback |
e |
No expire | Ignore subsequent lock cancel requests |
A |
Any | Any compatible lock mode is acceptable |
p |
Priority | Enqueue lock request at head of queue |
a |
Async | Don't wait for glock result (process will poll later) |
E |
Exact | Must have exact lock mode |
c |
No cache | When unlocked, demote DLM lock immediately |
H |
Holder | Indicates that request is granted |
W |
Wait | Set while waiting for request to complete |
F |
First | Set when holder is first to be granted for this glock |
If there are no granted holders, then the first holder in the list will be the one which triggers the next state change. Since demote requests are always considered higher priority than requests from the filesystem, that might not always directly result in a change to the state requested.
The glock subsystem supports two kinds of "try" lock. These are useful both because they allow the taking of locks out of the normal order (with suitable back-off and retry) and because they can be used to help avoid resources in use by other nodes. The normal t (try) lock is basically just that. The so-called T (try 1CB) is identical except that the DLM will send a single callback to current incompatible lock holders. One use of the T try 1CB is with the iopen locks, which are used to arbitrate among the nodes when an inode's i_nlink count is zero, and determine which of the nodes will be responsible for deallocating the inode. The iopen glock is normally held in the shared state, but when the i_nlink count becomes zero and ->delete_inode() is called, it will request an exclusive lock with T (try 1CB) set. It will continue to deallocate the inode if the lock is granted. If the lock is not granted it will result in the node(s) which were preventing the grant of the lock marking their glock(s) with the D (demote) flag, which is checked at ->drop_inode() time in order to ensure that the deallocation is not forgotten.
This means that inodes that have zero link count, but are still open will be deallocated by the node on which the final close() occurs. Also, at the same time as the inode's link count is decremented to zero, the inode is marked as being in the special state of having zero link count, but still in use in the resource group bitmap. This functions like ext3's orphan list in that it allows any subsequent reader of the bitmap to know that there is potentially space that might be reclaimed, and to attempt to reclaim it.
Glock tracepoints
The tracepoints are also designed to be able to confirm the correctness of the cache control by combining them with the blktrace output and with knowledge of the on-disk layout. It is then possible to check that any given I/O has been issued and completed under the correct lock, and that no races are present.
The gfs2_glock_state_change tracepoint is the most important one to understand. It tracks every state change of the glock from initial creation right through to the final demotion which ends with gfs2_glock_put and the final NL to unlocked transistion. The l (locked) glock flag is always set before a state change occurs and will not be cleared until after it has finished. There are never any granted holders (the H glock holder flag) during a state change. If there are any queued holders, they will always be in the W (waiting) state. When the state change is complete then the holders may be granted which is the final operation before the l glock flag is cleared.
The gfs2_demote_rq tracepoint keeps track of demote requests, both local and remote. Assuming that there is enough memory on the node, the local demote requests will rarely be seen, and most often they will be created by umount or by occasional memory reclaim. The number of remote demote requests is a measure of the contention between nodes for a particular inode or resource group.
When a holder is granted a lock,gfs2_promoteis called, this occurs as the final stages of a state change or when a lock is requested which can be granted immediately due to the glock state already caching a lock of a suitable mode.If the holder is the first one to be granted for this glock, then the f (first) flag is set on that holder. This is currently used only by resource groups and the flag may eventually be removed altogether.``
Bmap tracepoints
Block mapping is a task central to any filesystem. GFS2 uses a traditional bitmap-based system with two bits per block. The main purpose of the tracepoints in this subsystem is to allow monitoring of the time taken to allocate and map blocks.
The gfs2_bmap tracepoint is called twice for each bmap operation: once at the start to display the bmap request, and once at the end to display the result. This makes it easy to match the requests and results together and measure the time taken to map blocks n different parts of the filesystem, different file offsets, or even of different files. It is also possible to see what the average extent sizes being returned are in comparison to those being requested.
To keep track of allocated blocks, gfs2_block_alloc is called not only on allocations, but also on freeing of blocks. Since the allocations are all referenced according to the inode for which the block is intended, this can be used to track which physical blocks belong to which files in a live filesystem. This is particularly useful when combined with blktrace which will show problematic I/O patterns which may then be referred back to the relavent inodes using the mapping gained via this tracepoint.
Log tracepoints
The tracepoints in this subsystem track blocks being added to and removed from the journal (gfs2_pin), as well as the time taken to commit the transactions to the log (gfs2_log_flush). This can be very useful when trying to debug journalling performance issues.
The gfs2_log_blocks tracepoint keeps track of the reserved blocks in the log, which can help show if the log is too small for the workload, for example.
References
- This document has been partially adapted from a paper delivered at Linux Synposium 2009: Content from kernel.org is not included.Content from kernel.org is not included.http://kernel.org/doc/ols/2009/ols2009-pages-311-318.pdf
- Content from git.kernel.org is not included.Content from git.kernel.org is not included.http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/gfs2-glocks.txt;h=0494f78d87e40c225eb1dc1a1489acd891210761;hb=HEAD
- Content from git.kernel.org is not included.Content from git.kernel.org is not included.http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/trace/events.txt;h=09bd8e9029892e4e1d48078de4d076e24eff3dd2;hb=HEAD
- Details on
trace-cmdcan be found here: Content from lwn.net is not included.Content from lwn.net is not included.http://lwn.net/Articles/341902/