What does gfs_controld do?

Introduction

The gfs_controld program is part of the cman package in Red Hat Enterprise Linux (RHEL) Server 5 and 6, and is shared by GFS and GFS2. It provides the userland support required for certain filesystem functions, including mounting, unmounting and recovery. During normal filesystem operation, gfs_controld is mostly idle. Earlier versions also provided the POSIX fcntl locking function, which was later moved to dlm_controld; this is explained in more detail in the POSIX fcntl locking section below.

Normally, users of GFS or GFS2 do not need to interact directly with gfs_controld, since it is started automatically by the initscripts. The information presented here is mostly useful for debugging, or for those who need to understand the internals of GFS/GFS2 in greater detail.

The attached diagram (see the references for the .fig format original) describes the basic architecture of gfs_controld and shows how it fits into the cluster suite. The boxes are color-coded according to function: green is for admin tools, blue is for the kernel, yellow is for daemons (including gfs_controld) and red is for libraries which are used as part of the communication process. The various communication paths are shown by arrows which are annotated according to the method of communication used. The paths labelled CPG: are corosync closed process groups. Most communication between the kernel and userspace is based around uevents and sysfs, with the exception of a misc device which is used for the POSIX fcntl locks. The diagram was created after the point when the POSIX fcntl locks moved to dlm_controld. For more information see the POSIX fcntl locking section, below.

Role during filesystem mount

gfs_controld is responsible for joining the cluster at mount time. The mount.gfs and mount.gfs2 helper programs call gfs_controld via a UNIX socket in order to initiate the joining process. Just like other filesystem mount helpers, mount.gfs2 should only be called from mount and not used directly by users.

In upstream (Fedora) GFS2, there is no longer any requirement for mount.gfs2, since the kernel's uevents are sufficient to allow gfs_controld to join the cluster without external help. gfs_controld can tell whether mount.gfs2 is in use by checking whether it receives a uevent from the kernel before it has been contacted by mount.gfs2. Thus gfs_controld is backward compatible. The long-term plan is to remove mount.gfs2 entirely and operate only using uevents and sysfs.

udevadm

The udevadm program can be used to watch what happens during mount, umount or recovery. This can be very useful when debugging problems related to those particular events. The command below will display all the uevents generated, alongside their properties (environment variables), as they occur. For further details, please see the udevadm man page:

$ udevadm monitor --kernel --property

An example of what you should see on a successful mount is as follows:

KERNEL[1291651244.481347] add /module/gfs2 (module)
UDEV_LOG=3
ACTION=add
DEVPATH=/module/gfs2
SUBSYSTEM=module
SEQNUM=1490

KERNEL[1291651244.514006] add /fs/gfs2/unity:myfs (gfs2)
UDEV_LOG=3
ACTION=add
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
RDONLY=0
SPECTATOR=0
LOCKTABLE=unity:myfs
LOCKPROTO=lock_dlm
UUID=CE23E582-D0CC-078C-2434-3A30B4997D7A
SEQNUM=1491

KERNEL[1291651244.515863] add /kernel/dlm/myfs (dlm)
UDEV_LOG=3
ACTION=add
DEVPATH=/kernel/dlm/myfs
SUBSYSTEM=dlm
LOCKSPACE=myfs
SEQNUM=1492

KERNEL[1291651244.516058] online /kernel/dlm/myfs (dlm)
UDEV_LOG=3
ACTION=online
DEVPATH=/kernel/dlm/myfs
SUBSYSTEM=dlm
LOCKSPACE=myfs
SEQNUM=1493

KERNEL[1291651245.436739] change /fs/gfs2/unity:myfs (gfs2)
UDEV_LOG=3
ACTION=change
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
JID=0
RECOVERY=Done
LOCKTABLE=unity:myfs
LOCKPROTO=lock_dlm
JOURNALID=0
UUID=CE23E582-D0CC-078C-2434-3A30B4997D7A
SEQNUM=1494

KERNEL[1291651245.440182] change /fs/gfs2/unity:myfs (gfs2)
UDEV_LOG=3
ACTION=change
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
FIRSTMOUNT=Done
LOCKTABLE=unity:myfs
LOCKPROTO=lock_dlm
JOURNALID=0
UUID=CE23E582-D0CC-078C-2434-3A30B4997D7A
SEQNUM=1495

KERNEL[1291651246.168587] online /fs/gfs2/unity:myfs (gfs2)
UDEV_LOG=3
ACTION=online
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
RDONLY=0
SPECTATOR=0
LOCKTABLE=unity:myfs
LOCKPROTO=lock_dlm
JOURNALID=0
UUID=CE23E582-D0CC-078C-2434-3A30B4997D7A
SEQNUM=1496

The first uevent in the list is the gfs2 module loading. The SUBSYSTEM variable tells you which module produced the event in question, and there are also DLM uevents (used by dlm_controld) mixed in with the GFS2 uevents. If the mount had not been successful, then instead of an ACTION=online uevent there would have been an ACTION=remove uevent, or possibly no uevents at all, depending on the point at which the mount failed.
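The check described above can be automated when reading a captured log. The following is a minimal sketch, assuming the output of udevadm monitor --kernel --property has been saved as text in the layout shown above. The function names parse_uevents and mount_outcome are hypothetical and are not part of udevadm or gfs_controld.

```python
def parse_uevents(text):
    """Split saved `udevadm monitor --kernel --property` output into dicts.

    Each block starting with a "KERNEL[" header line becomes one dict of
    its KEY=VALUE property lines. Anything before the first header (such
    as the banner udevadm prints at startup) is ignored.
    """
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("KERNEL["):
            current = {}
            events.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key] = value
    return events


def mount_outcome(events, locktable):
    """Return "online", "remove", or None (no deciding uevent seen)."""
    outcome = None
    for ev in events:
        if ev.get("SUBSYSTEM") != "gfs2" or ev.get("LOCKTABLE") != locktable:
            continue
        if ev.get("ACTION") in ("online", "remove"):
            outcome = ev["ACTION"]
    return outcome


# Two of the uevents from the successful mount above, abbreviated.
sample = """\
KERNEL[1291651244.514006] add /fs/gfs2/unity:myfs (gfs2)
ACTION=add
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
LOCKTABLE=unity:myfs

KERNEL[1291651246.168587] online /fs/gfs2/unity:myfs (gfs2)
ACTION=online
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
LOCKTABLE=unity:myfs
"""

events = parse_uevents(sample)
print(mount_outcome(events, "unity:myfs"))
```

A result of None corresponds to the case where the mount failed before any deciding uevent was generated, or where the log does not cover that filesystem.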

The ACTION=change uevent is issued for two reasons:

  • When a journal has been replayed (RECOVERY=Done or RECOVERY=Failed).
  • When the first mount of the filesystem has succeeded (FIRSTMOUNT=Done).

Later mounts of the filesystem only check their assigned journals, whereas the first mounter checks all the journals in the filesystem. An example of each of those uevents is shown above.
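The two cases can be told apart from the properties alone. The sketch below is illustrative only: classify_change is a hypothetical helper, and it assumes that the JID property names the journal that was replayed, which is how the example uevent above reads.

```python
def classify_change(event):
    """Classify an ACTION=change uevent from the gfs2 subsystem.

    Returns ("recovery", jid, succeeded) for a journal replay result,
    ("firstmount", None, True) for a completed first mount, or None if
    the event is neither of the two cases.
    """
    if event.get("SUBSYSTEM") != "gfs2" or event.get("ACTION") != "change":
        return None
    if "RECOVERY" in event:
        # Assumption: JID identifies the journal that was replayed.
        jid = int(event["JID"]) if "JID" in event else None
        return ("recovery", jid, event["RECOVERY"] == "Done")
    if event.get("FIRSTMOUNT") == "Done":
        return ("firstmount", None, True)
    return None


# Properties taken from the two change uevents in the mount example.
recovery_event = {
    "ACTION": "change", "SUBSYSTEM": "gfs2",
    "JID": "0", "RECOVERY": "Done", "LOCKTABLE": "unity:myfs",
}
firstmount_event = {
    "ACTION": "change", "SUBSYSTEM": "gfs2",
    "FIRSTMOUNT": "Done", "LOCKTABLE": "unity:myfs",
}

print(classify_change(recovery_event))
print(classify_change(firstmount_event))
```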

Role during filesystem umount

In RHEL 5, gfs_controld is contacted by umount.gfs or umount.gfs2 when a filesystem is unmounted. This means that great care has to be taken with bind mounts on GFS and GFS2 on RHEL 5, since the mount helper is unaware of any other mounts that have occurred. In order to avoid problems when unmounting a GFS or GFS2 filesystem which has bind mounts, the bind mounts should be unmounted before the initial filesystem unmount.

In RHEL 6, gfs_controld uses the uevents to determine when a filesystem is being unmounted. This is a far better solution since it takes account of all possible holders of references on the filesystem and thus unmount order with bind mounts is no longer important.

Here is an example of uevents occurring during umount of a GFS2 filesystem:

KERNEL[1291652100.778260] offline /kernel/dlm/myfs (dlm)
UDEV_LOG=3
ACTION=offline
DEVPATH=/kernel/dlm/myfs
SUBSYSTEM=dlm
LOCKSPACE=myfs
SEQNUM=1497

KERNEL[1291652100.778951] remove /kernel/dlm/myfs (dlm)
UDEV_LOG=3
ACTION=remove
DEVPATH=/kernel/dlm/myfs
SUBSYSTEM=dlm
LOCKSPACE=myfs
SEQNUM=1498

KERNEL[1291652100.779215] remove /fs/gfs2/unity:myfs (gfs2)
UDEV_LOG=3
ACTION=remove
DEVPATH=/fs/gfs2/unity:myfs
SUBSYSTEM=gfs2
LOCKTABLE=unity:myfs
LOCKPROTO=lock_dlm
JOURNALID=0
UUID=CE23E582-D0CC-078C-2434-3A30B4997D7A
SEQNUM=1499
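To illustrate why the uevent-based approach does not depend on unmount order, here is a rough sketch of a listener that tracks mounted filesystems purely from gfs2 uevents. It is not gfs_controld's actual code; track_mounts is a hypothetical name, and events are assumed to be dicts of the properties shown above.

```python
def track_mounts(events):
    """Return the set of lock tables still mounted after the given uevents.

    Only gfs2 uevents are considered: ACTION=online marks a filesystem
    as mounted and ACTION=remove marks it as gone. DLM uevents for the
    matching lockspace are ignored here.
    """
    mounted = set()
    for ev in events:
        if ev.get("SUBSYSTEM") != "gfs2":
            continue
        table = ev.get("LOCKTABLE")
        if ev.get("ACTION") == "online":
            mounted.add(table)
        elif ev.get("ACTION") == "remove":
            mounted.discard(table)
    return mounted


mount_events = [
    {"ACTION": "online", "SUBSYSTEM": "gfs2", "LOCKTABLE": "unity:myfs"},
]
umount_events = [
    {"ACTION": "offline", "SUBSYSTEM": "dlm", "LOCKSPACE": "myfs"},
    {"ACTION": "remove", "SUBSYSTEM": "dlm", "LOCKSPACE": "myfs"},
    {"ACTION": "remove", "SUBSYSTEM": "gfs2", "LOCKTABLE": "unity:myfs"},
]

print(track_mounts(mount_events))
print(track_mounts(mount_events + umount_events))
```

Because the kernel only emits the gfs2 remove uevent when the filesystem itself goes away, a listener of this kind sees a single event however many bind mounts existed.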

Recovery

gfs_controld initiates recovery operations on the filesystem via the sysfs interface. It writes the ID of the journal to be recovered into the recover file in the lock_module sysfs subdirectory for the filesystem in question. The kernel will then attempt to recover the journal and send a response via an ACTION=change uevent with RECOVERY=Done or RECOVERY=Failed, exactly as shown in the example in Role during filesystem mount, above. The kernel may refuse to recover a journal in certain cases: for example, if the filesystem is mounted with the spectator option, if the journal belongs to the cluster node itself, or if the journal ID doesn't match any known journal.
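As an illustration of the sysfs side of this exchange, the sketch below builds the path of the recover file and writes a journal ID to it. It assumes the filesystem's sysfs directory is /sys/fs/gfs2/<locktable>, which matches the DEVPATH in the uevents shown earlier. The helper names are hypothetical, and the demonstration writes into a temporary directory rather than the real /sys, since on a live cluster only gfs_controld should write to this file.

```python
import os
import tempfile


def recover_path(locktable, sysfs_root="/sys"):
    """Path of the recover file for a filesystem (assumed sysfs layout)."""
    return os.path.join(sysfs_root, "fs", "gfs2", locktable,
                        "lock_module", "recover")


def request_recovery(locktable, jid, sysfs_root="/sys"):
    """Ask the kernel to replay journal `jid` by writing it to sysfs.

    The outcome is not returned here; it arrives later as an
    ACTION=change uevent carrying RECOVERY=Done or RECOVERY=Failed.
    """
    with open(recover_path(locktable, sysfs_root), "w") as f:
        f.write(str(jid))


# Demonstration against a throwaway directory tree, not the real sysfs.
with tempfile.TemporaryDirectory() as fake_sysfs:
    os.makedirs(os.path.dirname(recover_path("unity:myfs", fake_sysfs)))
    request_recovery("unity:myfs", 1, sysfs_root=fake_sysfs)
    with open(recover_path("unity:myfs", fake_sysfs)) as f:
        written = f.read()

print(written)
```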

gfs_controld also uses the "lock_module/block" sysfs file during recovery to block the actions of most of the glocks in the cluster so that, when the DLM recovery is initiated, glocks relating to inodes which need to be recovered will not be granted until recovery is complete. The block state is set on all nodes in the cluster while recovery is in progress and cleared once recovery is complete. Cluster nodes may continue to use glocks which were already granted during this period of time, but they may (in general) not get new locks. Locks for which replies from the DLM have arrived during this blocked period will show up in the glock dumps as having the 'F' (frozen) flag set.
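When looking through a glock dump for frozen locks, a small filter can help. The sketch below assumes the usual dump layout in which each glock is a line beginning with "G:" followed by space-separated key:value fields, with the flags in the f: field and the glock type/number in the n: field. That layout is an assumption not stated in this article, and the sample lines are made up for illustration rather than taken from a real system. The name frozen_glocks is hypothetical.

```python
def frozen_glocks(dump_text):
    """Return the n: field of every glock whose flags include 'F'."""
    frozen = []
    for line in dump_text.splitlines():
        fields = line.split()
        # Holder lines ("H:") and any other line types are skipped.
        if not fields or fields[0] != "G:":
            continue
        attrs = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "F" in attrs.get("f", ""):
            frozen.append(attrs.get("n"))
    return frozen


# Made-up sample in the assumed format; the second glock is frozen.
sample_dump = """\
G:  s:SH n:2/805b f:q t:SH d:EX/0 a:0 r:3
 H: s:SH f:H e:0 p:100 [example]
G:  s:UN n:2/8060 f:lF t:EX d:EX/0 a:0 r:3
"""

print(frozen_glocks(sample_dump))
```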

The recover_done and recover_status files in the sysfs lock_module subdirectory contain the status of the most recent recovery to have taken place. These files should not be used by applications since there is a potential race condition if multiple journals are recovered on the same node at once. Instead the status should be obtained via the uevents which are generated. The recover_done and recover_status files are considered obsolete and will be removed in future kernels for this reason.

Withdraw

GFS and GFS2 both support the withdraw function as a method of error handling when certain critical events occur. The sequence is initiated with an ACTION=offline uevent from the kernel which is read by gfs_controld. When gfs_controld receives this event it runs dmsetup to insert an error target under the filesystem to ensure that no further I/O can take place. gfs_controld then withdraws from the cluster and signals success to the filesystem via sysfs. This is why all GFS and GFS2 filesystems are required to run on device mapper devices: otherwise the withdraw function will fail to work correctly.
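The sequence above can be summarized in a short sketch. It assumes the triggering uevent comes from the gfs2 subsystem, which distinguishes it from the DLM offline uevent seen during an ordinary umount. The names below are hypothetical, and the steps are descriptive strings only; the sketch does not run dmsetup or touch sysfs.

```python
# Ordered steps taken after a withdraw request, as described in the text.
WITHDRAW_STEPS = (
    "insert a device mapper error target under the filesystem (dmsetup)",
    "withdraw from the cluster",
    "signal success to the filesystem via sysfs",
)


def is_withdraw_request(event):
    """True for an ACTION=offline uevent from gfs2 (assumed subsystem)."""
    return (event.get("SUBSYSTEM") == "gfs2"
            and event.get("ACTION") == "offline")


def plan_for(event):
    """Return the ordered withdraw steps, or an empty list otherwise."""
    return list(WITHDRAW_STEPS) if is_withdraw_request(event) else []


withdraw_event = {"ACTION": "offline", "SUBSYSTEM": "gfs2",
                  "LOCKTABLE": "unity:myfs"}
dlm_event = {"ACTION": "offline", "SUBSYSTEM": "dlm", "LOCKSPACE": "myfs"}

for step in plan_for(withdraw_event):
    print(step)
print(plan_for(dlm_event))
```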

The withdraw function has historically been rather problematic, in that critical events in the filesystem often have side effects which leave the userspace gfs_controld unable to complete the withdraw sequence. Nevertheless, it does mean that graceful withdrawal from the cluster is possible in many cases.

POSIX fcntl locking

In earlier versions, the POSIX locking function was handled by gfs_controld. More recently, dlm_controld has taken over this role so that the POSIX locking code can be shared between GFS2 and OCFS2 (and any other cluster filesystem which may require it in future). For more information about GFS2 and POSIX fcntl locking, review the following article.

References
