What are some best practices when running a backup of a GFS2 filesystem in a RHEL Resilient Storage cluster?
Environment
- Red Hat Enterprise Linux Server 5, 6, or 7 with the Resilient Storage Add On
- A Global File System (GFS) or Global File System 2 (GFS2)
Issue
- What are some best practices when running a backup of a GFS2 filesystem?
- Backups of a GFS2 filesystem take a long time to complete.
- My backup software runs more slowly against GFS2 than it does against ext3 or ext4.
Resolution
It is possible to achieve reasonable performance while backing up a GFS2 filesystem as long as you follow the recommendations listed below.
General best practices
- Try to perform the backups when the load on the GFS2 filesystem is the lowest.
- The backup procedure for the GFS2 filesystem should be tested at normal and peak loads to verify that performance is not severely impacted before putting a cluster in production.
- When a GFS2 filesystem is going to be backed up, at some point in the procedure the filesystem must be frozen (see the freeze commands below). Freezing the filesystem guarantees that it is not in the middle of any transaction, which helps guarantee its integrity. If the filesystem is not frozen, the hardware snapshot or software backup could be corrupted; for this reason, we do not support taking a backup of a GFS2 filesystem without first freezing it.
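The freeze guarantee above can be sketched as a small wrapper. This is a minimal sketch, not a supported tool: the RHEL 7+ `fsfreeze` command is shown, and the mountpoint and backup command in the usage line are placeholders. The trap ensures the filesystem is unfrozen even if the backup command fails, so an error cannot leave GFS2 frozen cluster-wide.

```shell
#!/bin/sh
# Run a backup command while the GFS2 filesystem is frozen, and make
# sure the filesystem is unfrozen again no matter how the backup ends.
frozen_backup() {
    mp=$1; shift
    fsfreeze --freeze "$mp" || return 1
    # Unfreeze on exit even if the backup command fails.
    trap 'fsfreeze --unfreeze "$mp"' EXIT
    "$@"    # run the backup while no transactions are in flight
}

# Usage (hypothetical backup command):
#   frozen_backup /mnt/gfs2 tar -czf /backup/gfs2.tar.gz /mnt/gfs2
```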
Hardware based backups
- Use hardware snapshotting whenever possible to back up the GFS2 filesystem. This is a feature of many high-end SAN storage arrays and will yield the best performance. We recommend block-based backups of some kind (storage array snapshots or clones, whole-disk backups, etc.).
- If possible, the filesystem should be unmounted on all cluster nodes; if unmounting is not possible, you will need to freeze the filesystem (see the freeze commands below).
If hardware based backup is not possible
- Smartly configuring your backup software is the first line of defense against lock contention and other performance problems that can occur when backing up a GFS2 filesystem. The following article describes why there can be performance issues with a software backup solution on a GFS2 filesystem: Why does rsync cause performance problems and process hangs with GFS2?
- Try to use backup software that distributes the file opens and stats between the nodes so that the locks don't all accumulate in one place, such as having each node back up the sections of the GFS2 filesystem that it uses the most.
- Stagger the backups so that all nodes are not running backups at the same time. If it is impossible to stagger or distribute the backup of a GFS2 filesystem among the cluster nodes, at least unmount the GFS2 filesystem from the machine that runs the backup after the backup is complete, thus freeing the locks. You can remount the filesystem immediately if you want; the idea is to free the locks and go back to only opening and caching locks on demand.
- If you are running RHEL 5, make sure the host is running the kernel detailed in this article, which will help with backup performance. That article details the errata to update to and a workaround that can be used on RHEL 5 or RHEL 6 for flushing the cache. For more information on flushing the cache, see the following article: Should the cache be dropped on a GFS2 file-system when a performance issue is occurring?
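One simple way to stagger backups, as suggested above, is to start each node's backup a fixed window after the previous node's. This is a sketch only; how `node_index` is obtained on a real cluster (for example from corosync or a static per-node config) is an assumption left to the reader.

```shell
#!/bin/sh
# Compute a per-node start delay so that cluster nodes do not all
# begin their GFS2 backup at the same time.
stagger_offset_minutes() {
    node_index=$1       # 0-based index of this node in the cluster
    window_minutes=$2   # gap between consecutive nodes' backup windows
    echo $((node_index * window_minutes))
}

# Usage: with a 30-minute window, node 2 would wait 60 minutes before
# starting its backup:
#   sleep $(( $(stagger_offset_minutes 2 30) * 60 )) && start_backup
```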
Example procedure for hardware replication of a frozen GFS2 filesystem
- Suspend/stop the application (where possible) so that it won't update the filesystem anymore.
- Freeze the filesystem (see the freeze commands below). Freezing the GFS2 filesystem prevents write operations from occurring on the filesystem, but does not prevent read operations.
- Then use SAN replication.
- Snapshot the LUN via LVM snapshot: lvm snapshots of clustered logical volumes are only possible via exclusively activating the logical volume on one node (in such a case freezing of gfs2 filesystem prior to snapshot is not sufficient and gfs2 needs to be unmounted on all nodes). For more information see the article: Can I take snapshots of clustered logical volumes in RHEL 6, 7?
- Unfreeze the filesystem.
- Enable the application again.
- Mount the snapshot volume with lockproto=lock_nolock to a different directory.
- Now run your backup software against the snapshot files. It will be fast because there is no cluster locking overhead when mounting with lock_nolock.
- Unmount the snapshot from the system.
- Delete the snapshot.
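The steps above can be sketched as a script. This is a sketch under assumptions, not a supported procedure: the RHEL 7+ `fsfreeze` command is used, the application unit, mountpoint, and snapshot device names are placeholders, and the SAN snapshot step is array-specific so it is left as a comment. With `DRY_RUN=1` the commands are printed instead of executed, for review.

```shell
#!/bin/sh
# Print commands when DRY_RUN=1, otherwise execute them.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

snapshot_backup() {
    mp=${1:-/mnt/gfs2}
    snapdev=${2:-/dev/mapper/gfs2_snap}   # placeholder snapshot device
    snapdir=${3:-/mnt/gfs2_snap}

    run systemctl stop myapp.service      # 1. stop the application (placeholder unit)
    run fsfreeze --freeze "$mp"           # 2. freeze: writes blocked, reads still allowed
    # 3. trigger the SAN replication / array snapshot here (array-specific)
    run fsfreeze --unfreeze "$mp"         # 4. unfreeze the filesystem
    run systemctl start myapp.service     # 5. enable the application again
    run mount -o lockproto=lock_nolock "$snapdev" "$snapdir"  # 6. no cluster locking
    # 7. back up $snapdir with your backup software
    run umount "$snapdir"                 # 8. unmount the snapshot
    # 9. delete the snapshot on the array
}

# Usage: DRY_RUN=1 snapshot_backup /mnt/gfs2   # prints the command sequence
```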
Freezing a gfs2 filesystem
- RHEL 5, 6: # gfs2_tool freeze <mountpoint>
- RHEL 7+: # fsfreeze --freeze <mountpoint>
Unfreezing a gfs2 filesystem
- RHEL 5, 6: # gfs2_tool unfreeze <mountpoint>
- RHEL 7+: # fsfreeze --unfreeze <mountpoint>
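In a mixed-version environment, a small helper can pick the right command for the running release: `gfs2_tool` on RHEL 5/6, `fsfreeze` on RHEL 7+. This is a minimal sketch; detecting the major version (for example via `rpm -E %rhel`) is left to the caller.

```shell
#!/bin/sh
# Return the freeze/unfreeze command appropriate for a RHEL major version.
freeze_cmd() {
    action=$1   # "freeze" or "unfreeze"
    major=$2    # RHEL major version number
    if [ "$major" -ge 7 ]; then
        echo "fsfreeze --$action"
    else
        echo "gfs2_tool $action"
    fi
}

# Usage: $(freeze_cmd freeze 7) /mnt/gfs2   # runs: fsfreeze --freeze /mnt/gfs2
```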
You can read more about known issues, possible solutions, and alternative recommendations for backing up a GFS2 filesystem in the articles below:
- GFS2 Documentation | 2.6. File System Backups
- How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging
- GFS2 Best Practices
- Why does rsync cause performance problems and process hangs with GFS2?
Root Cause
Backups of GFS2 filesystems are tricky, and there is no clustered backup solution known to handle them well. The problem is performance: the backups often take a long time to complete, and GFS2 performance becomes very bad while a backup is running and after it is done.
The problem is that most backup software opens every file (or at least stats it) and then makes copies, all from one single node. That single node often ties up millions of inter-node cluster locks (glocks), which fills up its memory and its hash tables and causes it to slow down horribly.
The backups are often done during down time, (for example, right after a mount but before any other activity is started) which means GFS2 may even make that now-horribly-slow node the DLM lock master for all those millions of glocks. That means that all requests for those locks go through that one slow node, which defeats the point of DLM trying to distribute the locks to the nodes that need them the most.
File-based backup solutions like NetBackup are known to cause significant performance problems on GFS2, but following the best practices above should help alleviate some of those issues.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.