Red Hat Storage (GlusterFS) rebalance issue

Solution Unverified - Updated

Environment

  • Red Hat Storage 2.1 Update 4

Issue

Problem synopsis:
When a rebalance operation was triggered on a live volume (one with file operations in progress), many files were lost. The lost files could not be recovered.

Resolution

Recommendation to mitigate the risk:

Until Red Hat releases a patch that fixes the issue, do not run the rebalance operation. If you are running out of capacity and need to add storage to the trusted storage pool, add the new bricks without running rebalance. The min-free-disk volume option (cluster.min-free-disk) will start to direct new data to the newly added bricks once the older bricks fall below the configured free-space threshold. The data distribution will not be balanced, i.e. the older bricks will hold more data than the newer ones, but that should not impact performance.
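
The recommendation above can be sketched as a small Python helper that only builds the two gluster CLI command lines (add the bricks, set the free-space threshold) and deliberately never issues a rebalance. The volume name, brick path and the 10% threshold are placeholder assumptions for illustration, not values taken from this article; adjust them to your environment and review the commands before running them.

```python
import shlex


def expansion_commands(volume, new_bricks, min_free_disk="10%"):
    """Build the gluster commands that expand a volume without a rebalance.

    All argument values are supplied by the caller; the defaults here are
    illustrative assumptions only.
    """
    if not new_bricks:
        raise ValueError("at least one new brick is required")
    # Add the new bricks to the existing volume.
    add_brick = ["gluster", "volume", "add-brick", volume, *new_bricks]
    # Set the free-space threshold used when placing new files.
    set_threshold = ["gluster", "volume", "set", volume,
                     "cluster.min-free-disk", min_free_disk]
    # Note: no "gluster volume rebalance ..." command is generated.
    return [add_brick, set_threshold]


if __name__ == "__main__":
    # Hypothetical volume and brick names.
    for argv in expansion_commands("examplevol",
                                   ["server5:/rhs/brick5/examplevol"]):
        print(shlex.join(argv))
```

The helper prints the commands rather than executing them, so they can be checked against the volume's replica or stripe layout first.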

When to expect a fix:

A fix is expected in the RHS 2.1 Update 5 release, which is targeted for mid to late October 2014.

Root Cause

Cause of the problem:

  • A rebalance process recreates directory layouts to accommodate the changed graph and migrates files from their current location to their new hashed subvolumes when required.
  • To migrate a file, the rebalance process creates a linkto file on the new hashed subvolume, copies the data to it, converts it to a non-linkto file and deletes the original cached file.
  • Operations such as lookup delete any linkto file they consider stale, without checking whether it is on the hashed subvolume (which would make it a valid linkto file).
  • When lookup and rebalance act on the same file at the same time, the rebalance process can create the target linkto file on the new hashed subvolume, and the lookup can then treat it as a stale linkto file and unlink it. At the end of the file migration, both the source and the target files have been deleted, causing data loss.
  • The solution is to add checks to the stale linkfile deletion operation to ensure that it does not delete a file that the rebalance process is currently working on.
  • Another cause of data loss during rebalance is a mismatch in layouts between different rebalance processes.
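
The race described above can be illustrated with a toy Python model. This is a minimal sketch under simplifying assumptions, not GlusterFS code: each subvolume is a dictionary from file name to a file object, the interleaving is fixed rather than truly concurrent, and the guarded flag stands in for the additional check (here, simply "do not unlink a linkto file that sits on the hashed subvolume"). All names are hypothetical.

```python
def migrate_with_lookup(guarded):
    """Interleave one file migration with one lookup; return True if the file survives."""
    old_sub, new_sub = {}, {}            # subvolume: file name -> file object
    name = "report.dat"
    old_sub[name] = {"linkto": False, "data": b"payload"}
    hashed = new_sub                     # after the layout change the name hashes here

    # Rebalance step 1: create a linkto file on the new hashed subvolume.
    target = {"linkto": True, "data": b""}
    new_sub[name] = target

    # Lookup running at the same time: it sees a linkto file and treats it as stale.
    for sub in (old_sub, new_sub):
        entry = sub.get(name)
        if entry is not None and entry["linkto"]:
            if guarded and sub is hashed:
                continue                 # valid linkto on the hashed subvolume: keep it
            del sub[name]                # unguarded: unlink the "stale" linkto file

    # Rebalance steps 2-4 continue on the file object it already holds.
    target["data"] = old_sub[name]["data"]   # copy the data
    target["linkto"] = False                 # convert to a non-linkto file
    del old_sub[name]                        # delete the original cached file

    return name in old_sub or name in new_sub


if __name__ == "__main__":
    print("unguarded lookup, file survives:", migrate_with_lookup(False))
    print("guarded lookup, file survives:  ", migrate_with_lookup(True))
```

In the unguarded run the target is unlinked while rebalance still holds it, so the copied data ends up in a file that no subvolume references, and deleting the source removes the last reachable copy.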

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.