Why does the `LVM-activate` resource fail to activate the VG with `ERROR: vg_name: failed to activate`?


Environment

  • Red Hat Enterprise Linux 8
  • Pacemaker Cluster managing both GFS2 as well as XFS or EXT3/4 filesystems

Issue

  • The active/passive LVM-activate resource fails to start when the Pacemaker cluster is configured to manage both GFS2 and XFS filesystems, with the following error:
Apr 23 14:35:26 node1 LVM-activate(web-lv)[1460784]: INFO: Activating cluster_vg
Apr 23 14:35:26 node1 LVM-activate(web-lv)[1460784]: ERROR: cluster_vg: failed to activate.
Apr 23 14:35:26 node1 pacemaker-controld[1459385]: notice: Result of start operation for web-lv on node1: error (cluster_vg: failed to activate.)

Resolution

Red Hat Enterprise Linux 8

It is recommended to configure a Pacemaker cluster managing an active/active GFS2 filesystem along with an XFS (or EXT4) filesystem by referring to the steps outlined in the following articles:

However, if there is a need to use system_id along with use_lvmlockd, there are two possible ways to fix this issue:

  1. If possible, add a dependency (start order) between the GFS2 resource group and the XFS resource group so that the XFS group is started only after the GFS2 group has started.
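Such an ordering constraint could be sketched as follows, assuming hypothetical resource names gfs2_group-clone and xfs_group (substitute the names shown by `pcs status` in the actual cluster):

```shell
# Start the XFS resource group only after the cloned GFS2 resource
# group has started on the node (hypothetical resource names):
pcs constraint order start gfs2_group-clone then xfs_group
```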

  2. Add an ocf:heartbeat:Delay resource to the XFS resource group before the LVM-activate resource, so as to delay the start operation of the LVM-activate resource for the XFS filesystem. This delay allows the LVM-activate resource associated with the GFS2 device to start first, which in turn starts the lock manager:

# pcs resource create delay ocf:heartbeat:Delay startdelay=10 stopdelay=0 mondelay=0 --group xfs_group --before xfs-fs op monitor timeout=120 start timeout=120 stop timeout=120
  • Note: The above ocf:heartbeat:Delay resource adds a startdelay of 10 seconds. This value should be tested in the setup to ensure that the lock manager starts within those 10 seconds.

Root Cause

When the setup is configured to manage an active/active GFS2 filesystem as well as an active/passive XFS or EXT3/4 filesystem, and there is no start order between the GFS2 and XFS filesystems, the Pacemaker cluster triggers the start operations for both LVM-activate resources in parallel (one managing the activation of the device for GFS2, the other managing the activation of the device associated with the XFS (or EXT3/4) filesystem).

The resolution steps detailed in the following KCS article ensure that the lvmlockd process is started before the start operation of the LVM-activate resource associated with the XFS (or EXT3/4) filesystem is attempted. Hence the WARNING: lvmlockd process is not running error is avoided.

Updating the system_id on an active/passive VG is a global operation, so when lvmlockd is enabled it requires acquiring the global lock from lvmlockd. Acquiring the global lock requires not only that lvmlockd is started, but also that the global lockspace (lvm_global) has been started. The global lockspace is started automatically by vgchange --lockstart, which is run for the active/active shared VG used for the GFS2 filesystem. For vgchange --systemid to succeed in a setup with lvmlockd enabled, it must therefore be run after vgchange --lockstart on the GFS2 VG has finished.
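A minimal sketch of the required ordering, run manually with the hypothetical VG/LV names sharedvg (shared VG for GFS2) and cluster_vg/web_lv (system_id VG for XFS):

```shell
# 1. Start the lockspace for the shared VG; on first use this also
#    starts the lvm_global lockspace in lvmlockd/DLM:
vgchange --lockstart sharedvg

# 2. Only afterwards can a global-lock operation, such as setting the
#    system ID on the local VG, succeed:
vgchange --systemid "$(uname -n)" cluster_vg

# 3. Activate the LV backing the XFS filesystem:
lvchange -ay cluster_vg/web_lv
```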

In other words, activation of the LVM-activate resource for the GFS2 filesystem performs vgchange --lockstart, which triggers the start of the global lockspace. The other LVM-activate resource, managing the device for the XFS filesystem, is triggered for its start operation at the same time, which happens during or before the complete startup of the global lockspace. This eventually results in the failure to activate the VG/LV associated with the XFS (or EXT3/4) device.

The issue is only evident when a location constraint is configured for the XFS group to prefer one of the nodes and, before the cluster restart, the XFS group was previously started on a less preferred node.
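For illustration, a constraint of this kind (with hypothetical group and node names) might look like:

```shell
# xfs_group prefers node2; if the group was last running on node1,
# a full cluster restart relocates it, exposing the activation race:
pcs constraint location xfs_group prefers node2=INFINITY
```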

Diagnostic Steps

  1. Since there is no dependency between the GFS2 filesystem device and the XFS filesystem device, the cluster triggers activation of both LVM-activate resources at the same time:
Apr 23 14:35:25 node1 pacemaker-controld[1459385]: notice: Requesting local execution of start operation for gfs-lv on node1
Apr 23 14:35:25 node1 pacemaker-controld[1459385]: notice: Requesting local execution of start operation for web-lv on node1
Apr 23 14:35:25 node1 LVM-activate(gfs-lv)[1460662]: INFO: Activating sharedvg/sharedlv  <<==--- Device associated with the GFS2 filesystem
Apr 23 14:35:25 node1 kernel: dlm: Using TCP for communications
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: joining the lockspace group...
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: group event done 0 0
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: dlm_recover 1
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: add member 3
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: add member 2
Apr 23 14:35:25 node1 kernel: dlm: connecting to 3
Apr 23 14:35:25 node1 kernel: dlm: connecting to 2
Apr 23 14:35:25 node1 kernel: dlm: got connection from 3
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: add member 1
Apr 23 14:35:25 node1 kernel: dlm: got connection from 2
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: dlm_recover_members 3 nodes
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: join complete
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: generation 3 slots 3 1:2 2:3 3:1
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: dlm_recover_directory
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: dlm_recover_directory 0 in 0 new
Apr 23 14:35:25 node1 kernel: dlm: lvm_global: dlm_recover_directory 0 out 2 messages
Apr 23 14:35:26 node1 kernel: dlm: lvm_global: dlm_recover 1 generation 3 done: 126 ms
  2. While the lock manager is not yet started (or is still in the process of starting), the activation of the device associated with the XFS filesystem is triggered and fails:
Apr 23 14:35:26 node1 LVM-activate(web-lv)[1460784]: INFO: Activating cluster_vg  <<==--- Device associated with the XFS filesystem
Apr 23 14:35:26 node1 LVM-activate(web-lv)[1460784]: ERROR: cluster_vg: failed to activate.
Apr 23 14:35:26 node1 pacemaker-controld[1459385]: notice: Result of start operation for web-lv on node1: error (cluster_vg: failed to activate.)
...
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: joining the lockspace group...
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: dlm_recover 1
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: group event done 0 0
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: add member 3
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: add member 2
Apr 23 14:35:26 node1 kernel: dlm: lvm_sharedvg: add member 1
  3. Later it can be seen that the lock manager has started [*] and the device associated with the GFS2 filesystem is activated successfully:
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: dlm_recover_members 3 nodes
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: join complete
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: generation 3 slots 3 1:2 2:3 3:1
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: dlm_recover_directory
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: dlm_recover_directory 0 in 0 new
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: dlm_recover_directory 0 out 2 messages
Apr 23 14:35:27 node1 kernel: dlm: lvm_sharedvg: dlm_recover 1 generation 3 done: 83 ms
Apr 23 14:35:28 node1 LVM-activate(gfs-lv)[1460662]: INFO:  VG sharedvg starting dlm lockspace Starting locking. Waiting until locks are ready...  <<==-- [*]
Apr 23 14:35:29 node1 LVM-activate(gfs-lv)[1460662]: INFO: sharedvg/sharedlv: activated successfully.
Apr 23 14:35:29 node1 pacemaker-controld[1459385]: notice: Result of start operation for gfs-lv on node1: ok
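While reproducing the issue, the state of the lockspaces can be inspected directly (these commands assume the dlm and lvm2-lockd packages are installed on the node):

```shell
# List active DLM lockspaces; lvm_global must be present before a
# global-lock operation on the XFS VG can succeed:
dlm_tool ls

# Show lvmlockd's view of its lockspaces and held locks:
lvmlockctl --info
```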

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.