Why does rsync cause performance problems and process hangs with GFS2?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux Server 5, 6, 7 (with the High Availability and Resilient Storage Add Ons)
  • Red Hat Global Filesystem 2 (GFS2)

Issue

  • Using rsync to or from GFS2 resulted in performance problems, processes in D state waiting on the filesystem, and call traces in the logs.
  • Why does rsync cause problems with GFS2?

Resolution

Workloads that "crawl" an entire filesystem or a deep directory structure are not well suited to GFS2, whether they read from it or write to it. Examples of such workloads include rsync, some backup applications, and version control systems such as git or Subversion.

Running such workloads to or from GFS2 is not advised, as performance problems may arise. If such a workload must be used with GFS2 and no alternative is available, it is recommended that the GFS2 filesystem be mounted on only one node and that no other process access the filesystem while the workload is operating on it. Once the workload has finished, unmount the filesystem to release the DLM locks.
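The single-node procedure described above could be scripted roughly as follows. This is only a sketch: the device path, mount point, and source directory are hypothetical placeholders, the function names are invented for illustration, and it assumes the filesystem is mounted manually with mount rather than through cluster resource management. It only builds and optionally runs the command sequence; confirming that no other node has the filesystem mounted is left to the administrator.

```python
import subprocess


def single_node_plan(device, mountpoint, source):
    """Return the commands for a one-node crawl of a GFS2 filesystem.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    return [
        # Mount on this node only; no other node should have it mounted.
        ["mount", "-t", "gfs2", device, mountpoint],
        # The crawling workload itself (rsync in archive mode here).
        ["rsync", "-a", source.rstrip("/") + "/", mountpoint.rstrip("/") + "/"],
        # Unmount afterwards so this node releases its DLM locks.
        ["umount", mountpoint],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Example values are assumptions, not taken from any real cluster.
    run_plan(single_node_plan("/dev/vg_cluster/lv_gfs2", "/mnt/gfs2", "/srv/export"))
```

By default the sketch only prints the commands. If it is run with dry_run=False and a step fails, the sequence stops at that step, so the filesystem may still need to be unmounted by hand.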

For information on backing up GFS2, review the following article: What are some best practices when running a backup of a GFS2 filesystem in a RHEL Resilient Storage cluster?

Root Cause

Workloads that "crawl" the entire filesystem, or the entirety of a deep and often highly branched directory structure, can cause performance issues on a GFS2 filesystem.

  • Crawling typically involves calling stat() on every file and directory in the filesystem. The issue with stat() is that the dinode actually has to be read in, whereas a plain ls can traverse the directory without reading the dinodes. stat() needs the data stored in the dinode itself, while a simple ls only needs the data from the directory entry that points to it.
  • Another potential issue is lock contention, which occurs when any cluster node is actively operating on the same set of files that rsync is traversing. This can also cause glock overhead, as caches may need to be invalidated if other nodes are working in the same directory structure at the same time that the workload is run.
  • When crawling the filesystem in any way, take care that the node doing the crawling is not the first one to access the files. The reason is that GFS2 uses DLM for distributed locking. With DLM, the first cluster node to request a DLM lock becomes the master of that lock until that cluster node unmounts the filesystem. This can leave one cluster node as the master of the locks for most of the files, which in turn causes performance issues on the other cluster nodes, because DLM then behaves like a central authority instead of a distributed one (in theory, every cluster node should be master of an equal number of DLM locks).
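The difference between a plain listing and a stat()-based crawl described in the first point above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes a Linux system where the filesystem reports each entry's type in the directory entry itself, so that os.scandir() can tell directories from files without a separate stat() call. It is an illustration of the access pattern, not a reproduction of what rsync does internally.

```python
import os
import stat


def list_names(root):
    """Walk the tree using directory data only, similar to a plain ls."""
    names = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                names.append(entry.path)
                # Usually answered from the directory entry, without stat().
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return names


def crawl_with_stat(root):
    """Walk the tree and stat every entry, forcing each inode to be read."""
    sizes = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # lstat() needs data held in the inode itself.
                info = os.lstat(entry.path)
                sizes[entry.path] = info.st_size
                if stat.S_ISDIR(info.st_mode):
                    stack.append(entry.path)
    return sizes
```

Both functions visit the same paths, but only the second one issues one stat-family call per entry, which on GFS2 is the pattern that requires each dinode to be read in.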

The workload characteristics of rsync can cause considerable performance problems including: processes accessing GFS2 on multiple nodes in D state for extended periods, hung task timeout call traces in the logs, or the operation in question taking an extremely long amount of time to complete.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.