How does GFS2 know when to deallocate a file?
Introduction
When an inode is unlinked on Linux, it does not usually get deallocated immediately. Instead, the directory entry is removed, preventing any further access to the inode via that name. When all the names are gone, the inode's link count drops to zero, and the only processes able to access the inode are those which opened it before the last name was removed from the filesystem. The last process to close the inode then sets in motion the deallocation of the disk blocks belonging to the file. This is a relatively straightforward process for "local" filesystems such as ext3 or ext4; however, it becomes more complicated with a clustered filesystem like GFS2. One reason inode deallocation is more complicated in GFS2 is that the processes holding the inode open may be distributed around the cluster. This article explains how GFS2 manages this process and what the implications are.
The iopen glock
Every inode on a GFS2 filesystem is assigned two glocks. The inode glock is used to manage access to the inode's data and metadata and is involved in almost all operations relating to the inode. The second glock, known as the iopen glock, has only one function: to provide the notification mechanism which ensures that an inode is deallocated at the end of its life, no matter where in the cluster it is open.
During most filesystem activity, the iopen glock is held in the SH (shared) state whenever the inode is in the VFS's inode cache. The lock is obtained when the inode is either created or looked up, and it remains until the inode is evicted from the inode cache. If the inode is evicted due to umount or due to memory pressure, then the iopen glock is simply demoted to the UN (unlocked) state. If, on the other hand, the inode is evicted because it has been closed and its link count is zero, then the node tries to promote the iopen glock to the EX (exclusive) state. If that is successful, then there cannot be any other nodes which have this inode open (if there were, their SH holds would prevent the EX lock from succeeding), and the deallocation proceeds as for a local filesystem, safe in the knowledge that no other process still holds a reference to the inode in question.
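The decision taken at eviction time can be sketched as follows. This is a minimal user-space model of the logic described above, not the kernel code: the function and constant names are invented for illustration.

```c
#include <assert.h>

/* What the evicting node does with the inode's iopen glock.
 * Hypothetical model of the decision described in the text. */
enum evict_action {
    EVICT_UNLOCK,       /* umount/memory pressure, or nlink > 0: just drop SH */
    EVICT_DEALLOCATE,   /* SH -> EX succeeded: no other node has it open */
    EVICT_DEFER,        /* EX failed: some other node will deallocate later */
};

/* other_sh_holders: number of other nodes still holding the iopen
 * glock in SH, which is what blocks our promotion to EX. */
static enum evict_action evict_decision(unsigned nlink, int other_sh_holders)
{
    if (nlink > 0)
        return EVICT_UNLOCK;       /* inode still linked somewhere */
    if (other_sh_holders == 0)
        return EVICT_DEALLOCATE;   /* SH -> EX succeeds: we are the last */
    return EVICT_DEFER;            /* callback tells the others nlink == 0 */
}
```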
If, on the other hand, the request to promote to EX (exclusive) fails, then the callback which is sent to all the other nodes at that point results in them marking the inode as having a zero link count. This is important since those other nodes may not have the up-to-date (zero) link count in their in-core inodes unless they have reread the inode from disk recently. It ensures that if one of those processes later closes the inode without having performed any action which would have refreshed the inode content from disk, the zero link count is nevertheless known, and that process will in turn try to upgrade its SH (shared) iopen glock to EX (exclusive), performing the deallocation if it is the final process to have the inode open.
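The handoff across the cluster can be modelled as below. This is a toy simulation under stated assumptions: the `struct node` fields and the `close_unlinked` helper are invented; only the shape of the protocol (failed EX request, callback to remaining SH holders, last closer deallocates) follows the text.

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 3

struct node {
    bool holds_sh;        /* holds the iopen glock in SH (inode cached) */
    bool knows_unlinked;  /* learned nlink == 0 via the EX-request callback */
};

static int sh_holders(struct node n[NODES])
{
    int count = 0;
    for (int i = 0; i < NODES; i++)
        if (n[i].holds_sh)
            count++;
    return count;
}

/* Node i closes its last reference to an unlinked inode.
 * Returns true if this node performed the deallocation. */
static bool close_unlinked(struct node n[NODES], int i)
{
    n[i].holds_sh = false;              /* drop our SH hold */
    if (sh_holders(n) == 0)
        return true;                    /* SH -> EX succeeds: deallocate here */
    for (int j = 0; j < NODES; j++)     /* EX request fails: a callback goes */
        if (n[j].holds_sh)              /* to every remaining SH holder */
            n[j].knows_unlinked = true;
    return false;
}
```

Closing on the first two nodes only propagates the zero link count; the third, final closer ends up performing the deallocation.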
So in GFS2, just as in the single-node filesystem case, the process which ends up deallocating the inode is the last one to close it. The iopen glock is used to pass on responsibility for deallocating the inode when an inode is closed on one node while processes on other nodes still hold it open. This means, for example, that we can unlink an inode and then umount the filesystem on that node, despite another node holding the inode open; in that case the other node will eventually deallocate the inode in question. One implication of this is that the free space will not reappear until the final node has deallocated the inode, so if you delete a large file and wonder why the space has not reappeared, it may be because a process is holding the file open elsewhere in the cluster.
Recovery
By now you are probably wondering what happens when the node holding the final reference to the inode is rebooted or fenced before it has a chance to deallocate the inode. The answer is again very similar to ext3's approach of using an orphan list. A GFS2 filesystem has a set of resource groups, similar to ext3's block groups, each of which contains a bitmap with two bits per block, recording the allocation state of each block in the filesystem. GFS2 uses one of the four bitmap states to mark inodes which have a zero link count and are no longer linked into any directory. To make this atomic, the removal from the directory, the update of the link count to zero, and the marking of the block as an unlinked inode are all done in the same transaction.
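The four two-bit states, and how a block's state is read and updated in such a bitmap, can be sketched like this. The `GFS2_BLKST_*` values match the on-disk format constants; the helper functions themselves are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* The four two-bit per-block states in a resource group bitmap. */
#define GFS2_BLKST_FREE     0
#define GFS2_BLKST_USED     1
#define GFS2_BLKST_UNLINKED 2  /* inode with zero link count, not yet freed */
#define GFS2_BLKST_DINODE   3

/* Four blocks are packed per byte, two bits each. */
static unsigned get_blk_state(const uint8_t *bitmap, unsigned block)
{
    return (bitmap[block / 4] >> ((block % 4) * 2)) & 3;
}

static void set_blk_state(uint8_t *bitmap, unsigned block, unsigned state)
{
    unsigned shift = (block % 4) * 2;
    bitmap[block / 4] &= (uint8_t)~(3u << shift);  /* clear old state */
    bitmap[block / 4] |= (uint8_t)(state << shift);
}
```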
If the node is unable to complete the deallocation, then the unlinked inode entry in the bitmap serves as a marker that there are blocks here which are potentially available for reuse. The recovery algorithm does not search for these blocks during journal recovery, since that would take a long time on a large filesystem. The fsck.gfs2 utility will free any inodes it finds marked as unlinked. GFS2 will also eventually find and deallocate such inodes automatically, which it does when it searches for free blocks in a resource group during a subsequent allocation request. The reason these two operations are combined is that at the point of making an allocation request, the resource group is already locked exclusively and the correct bitmap information is already in memory. In other words, we can search the bitmap for unlinked inodes almost for free in terms of the disk I/O and CPU time required.
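The idea of reaping unlinked inodes as a side effect of the free-block search can be sketched as follows. This is an assumed, simplified model (the state values mirror the two-bit on-disk bitmap, but the function names and the reap-list interface are invented): one pass over the bitmap finds a free block and, in passing, records unlinked entries for deallocation under the same exclusive lock.

```c
#include <assert.h>
#include <stdint.h>

#define BLKST_FREE     0
#define BLKST_UNLINKED 2

/* Two bits per block, four blocks per byte. */
static unsigned blk_state(const uint8_t *bm, unsigned b)
{
    return (bm[b / 4] >> ((b % 4) * 2)) & 3;
}

/* Returns the first free block (or nblocks if none). Records up to
 * max_reap unlinked blocks encountered on the way in reap[], with the
 * count in *nreap, so they can be deallocated while the resource
 * group is already locked exclusively. */
static unsigned find_free_and_reap(const uint8_t *bm, unsigned nblocks,
                                   unsigned *reap, unsigned max_reap,
                                   unsigned *nreap)
{
    *nreap = 0;
    for (unsigned b = 0; b < nblocks; b++) {
        unsigned st = blk_state(bm, b);
        if (st == BLKST_UNLINKED && *nreap < max_reap)
            reap[(*nreap)++] = b;   /* spotted almost for free in passing */
        else if (st == BLKST_FREE)
            return b;
    }
    return nblocks;
}
```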
The downside of this algorithm is that the inode will appear to still be allocated (as far as df is concerned, for example) until a subsequent allocation triggers the deallocation of the unlinked inode. This means that a filesystem can sometimes have slightly more free disk space than it immediately appears to.
Read-only mounts
Another case the code has to deal with is when there is a mix of read-only and read-write nodes in the cluster and the last node to close a particular (unlinked) inode is a read-only node. In this case the read-only cluster node simply ignores the inode, and it will be deallocated the next time one of the read-write cluster nodes tries to allocate blocks from that resource group, just as in the recovery case above.
References
- For more information on debugging GFS2 performance issues, see this article.
- For more information on the internals of GFS2, review this article.
- Why is my GFS2 filesystem performance slow when doing a rm -rf * operation?