How can I activate cmirrord to carry out a pvmove in a RHEL 7 Resilient Storage cluster without having to restart my resources or applications?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux (RHEL) 7 with the Resilient Storage Add-On
  • lvm2-cluster
  • resource-agents
  • An ocf:heartbeat:clvm resource configured in the CIB
    • The configuration attributes of that clvm resource have been, or need to be, updated while it is active and in use

Issue

  • We need to pvmove some clustered volumes onto new storage, and thus need to enable cmirrord. How can we do this without restarting clvm, GFS2 Filesystem resources, and other dependents?
  • If I update my clvm resource to have with_cmirrord, I get errors citing "unimplemented feature" and my resources are stopped.
  • clvm seems to fail if you update its configuration on the fly, reporting error 3 ("unimplemented feature"):
Jul 22 12:05:05 node1 pengine[19033]:  notice: Reload  clvmd:0#011(Started node2)
Jul 22 12:05:05 node1 pengine[19033]:  notice: Reload  clvmd:1#011(Started node1)
Jul 22 12:05:05 node1 crmd[19034]:  notice: Initiating action 10: reload clvmd_reload_0 on node2
Jul 22 12:05:05 node1 crmd[19034]:  notice: Initiating action 11: reload clvmd_reload_0 on node1 (local)
Jul 22 12:05:05 node1 pengine[19033]:  notice: Calculated Transition 3278: /var/lib/pacemaker/pengine/pe-input-628.bz2
Jul 22 12:05:05 node1 crmd[19034]: warning: Action 10 (clvmd_reload_0) on node2 failed (target: 0 vs. rc: 3): Error
Jul 22 12:05:05 node1 crmd[19034]:  notice: Transition aborted by clvmd_start_0 'modify' on node2: Event failed (magic=0:3;10:3278:0:f10202f4-55f3-48bb-8de6-dee0706281f6, cib=0.290.1, source=match_graph_event:381, 0)
Jul 22 12:05:05 node1 crmd[19034]: warning: Action 10 (clvmd_reload_0) on node2 failed (target: 0 vs. rc: 3): Error
Jul 22 12:05:05 node1 crmd[19034]:  notice: Transition aborted by status-2-fail-count-clvmd, fail-count-clvmd=INFINITY: Transient attribute change (create cib=0.290.2, source=abort_unless_down:319, path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2'], 0)
Jul 22 12:05:05 node1 crmd[19034]:  notice: Operation clvmd_reload_0: unimplemented feature (node=node1, call=72, rc=3, cib-update=3351, confirmed=true)
Jul 22 12:05:05 node1 crmd[19034]:  notice: node1-clvmd_reload_0:72 [ usage: /usr/lib/ocf/resource.d/heartbeat/clvm {start|stop|monitor|validate-all|meta-data}\n\nExpects to have a fully populated OCF RA-compliant environment set.\n ]
Jul 22 12:05:05 node1 crmd[19034]: warning: Action 11 (clvmd_reload_0) on node1 failed (target: 0 vs. rc: 3): Error
Jul 22 12:05:05 node1 crmd[19034]: warning: Action 11 (clvmd_reload_0) on node1 failed (target: 0 vs. rc: 3): Error
Jul 22 12:05:05 node1 crmd[19034]:  notice: Transition 3278 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-628.bz2): Complete
Jul 22 12:05:05 node1 pengine[19033]: warning: Processing failed op start for clvmd:0 on node2: unimplemented feature (3)
Jul 22 12:05:05 node1 pengine[19033]:   error: Preventing clvmd-clone from re-starting on node2: operation start failed 'unimplemented feature' (3)
Jul 22 12:05:05 node1 pengine[19033]: warning: Processing failed op start for clvmd:0 on node2: unimplemented feature (3)
Jul 22 12:05:05 node1 pengine[19033]:   error: Preventing clvmd-clone from re-starting on node2: operation start failed 'unimplemented feature' (3)
Jul 22 12:05:05 node1 pengine[19033]: warning: Processing failed op start for clvmd:1 on node1: unimplemented feature (3)
Jul 22 12:05:05 node1 pengine[19033]:   error: Preventing clvmd-clone from re-starting on node1: operation start failed 'unimplemented feature' (3)
Jul 22 12:05:05 node1 pengine[19033]: warning: Processing failed op start for clvmd:1 on node1: unimplemented feature (3)
Jul 22 12:05:05 node1 pengine[19033]:   error: Preventing clvmd-clone from re-starting on node1: operation start failed 'unimplemented feature' (3)
Jul 22 12:05:05 node1 pengine[19033]: warning: Forcing clvmd-clone away from node1 after 1000000 failures (max=1000000)
Jul 22 12:05:05 node1 pengine[19033]: warning: Forcing clvmd-clone away from node1 after 1000000 failures (max=1000000)
Jul 22 12:05:05 node1 pengine[19033]: warning: Forcing clvmd-clone away from node2 after 1000000 failures (max=1000000)
Jul 22 12:05:05 node1 pengine[19033]: warning: Forcing clvmd-clone away from node2 after 1000000 failures (max=1000000)
Jul 22 12:05:05 node1 pengine[19033]:  notice: Stop    clvmd:0#011(node2)
Jul 22 12:05:05 node1 pengine[19033]:  notice: Stop    clvmd:1#011(node1)
# pcs status
[...]
 Clone Set: clvmd-clone [clvmd]
     Stopped: [ node1 node2 ]

Failed Actions:
* clvmd_start_0 on node2 'unimplemented feature' (3): call=72, status=complete, exitreason='none',
    last-rc-change='Fri Jul 22 12:05:06 2016', queued=0ms, exec=11ms
* clvmd_start_0 on node1 'unimplemented feature' (3): call=72, status=complete, exitreason='none',
    last-rc-change='Fri Jul 22 12:05:05 2016', queued=0ms, exec=37ms

Resolution

  • If this issue has occurred and the resource is stopped as a result, clean up the resource to get it started again:
# pcs resource cleanup <resource-name>
  • Workaround: To prevent this issue from occurring, do not update the configuration attributes for a clvm resource while it is in use and is depended on by other resources

  • Workaround: To activate cmirrord, plan an outage for any resources that depend on clvm, update the configuration to include with_cmirrord="true", then clean up the resource as described above. That is:

# pcs resource update clvmd with_cmirrord="true"
# pcs resource cleanup clvmd

NOTE: This will restart clvmd and any dependent resources, so be prepared for an outage.

  • Workaround: To activate cmirrord without causing an outage to any resources, consider simply starting /usr/sbin/cmirrord manually from the command line
    • NOTE: Red Hat does not test configurations in which cmirrord is managed outside of the clvm resource, and cannot state with certainty that issues will not arise. For example, if the clvm resource restarts or moves for unrelated reasons while cmirrord is running outside the resource, unexpected behavior may occur. Other scenarios may cause further problems. Use this workaround at your own risk.
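
The manual workaround above could be sketched as a small shell helper that starts cmirrord only if it is not already running. This is an illustration under assumptions, not a tested or supported procedure: the function name start_cmirrord_if_needed and the CMIRRORD_BIN variable are hypothetical, and only the /usr/sbin/cmirrord path comes from this article. It also assumes the helper is run as root on every node where the clvm clone is active.

```shell
#!/bin/sh
# Sketch only: start cmirrord by hand if it is not already running.
# CMIRRORD_BIN and start_cmirrord_if_needed are hypothetical names
# introduced for this example; the default path is the one cited above.
CMIRRORD_BIN="${CMIRRORD_BIN:-/usr/sbin/cmirrord}"

start_cmirrord_if_needed() {
    # pgrep -x matches the exact process name, so unrelated processes
    # whose names merely contain "cmirrord" are not counted.
    if pgrep -x cmirrord >/dev/null 2>&1; then
        echo "cmirrord already running"
        return 0
    fi
    # Launch the daemon and report failure instead of continuing silently.
    if "$CMIRRORD_BIN"; then
        echo "cmirrord started"
    else
        echo "failed to start cmirrord" >&2
        return 1
    fi
}
```

After sourcing this file and calling start_cmirrord_if_needed on each node, the pvmove itself would be run as usual. The risks described in the NOTE above still apply, since cmirrord is then running outside the control of the clvm resource.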

Root Cause

This issue has been resolved via erratum RHBA-2017:1844 (Bug Fix Advisory).

The ocf:heartbeat:clvm resource agent contained a defect: it advertised support for a reload action that it does not actually implement. When a resource agent advertises a reload action, pacemaker uses that action whenever the configuration of such a resource is updated. Since clvm advertises reload, any change to a clvm resource's attributes causes pacemaker to try to reload it. Because the agent does not implement that action, the reload attempt returns an "unimplemented feature" error (rc 3), which causes the resource to fail on all nodes.
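
One way to see this mismatch on a given node is to compare the actions listed in the agent's metadata with the usage string shown in the logs above. The following is a minimal sketch: advertises_reload is a hypothetical helper name, and it assumes the metadata is standard OCF XML in which each action appears as an `<action name="..."/>` element.

```shell
#!/bin/sh
# Sketch only: report whether OCF agent metadata (XML read on stdin)
# advertises a reload action. advertises_reload is a hypothetical helper
# name introduced for this example.
advertises_reload() {
    # Exit status 0 if any <action ...> element carries name="reload".
    grep -q '<action[^>]*name="reload"'
}

# Example invocation (commented out; needs the resource-agents package,
# and OCF_ROOT must be set so the agent can load its shell library):
#   OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/clvm meta-data \
#       | advertises_reload && echo "reload is advertised"
```

If the helper reports that reload is advertised while the agent's usage message lists only start, stop, monitor, validate-all and meta-data, the node is presumably still running an affected version of the agent.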

As such, on systems affected by this defect there is no way to start cmirrord through the clvm agent without creating an outage for that resource and its dependents. An alternative may be to start cmirrord manually, but as noted above, Red Hat cannot state with certainty that such a configuration will be free of issues. It may therefore be best to plan an outage when the clvm resource needs to be updated and cmirrord needs to be activated. Once cmirrord is activated, it may be wise to leave it that way, so that it is already running for any future pvmove. If there are no mirrored clustered logical volumes or pvmoves in place, a running cmirrord is expected to use negligible resources on the nodes, so there should be little downside to leaving it activated even when it's not needed.

