A node shuts down pacemaker after getting fenced and restarting corosync and pacemaker
Environment
- Red Hat Enterprise Linux 7 (with the High Availability Add-on)
- Red Hat Enterprise Linux 8 (with the High Availability Add-on)
- Red Hat Enterprise Linux 9 (with the High Availability Add-on)
- Google Cloud Platform (optional)
Issue
- In theory, this issue can happen on any platform if the timing is unlucky, though it may be more likely on Google Cloud Platform due to the way the fence_gce fence agent performs a reboot.
- Generic case: A node left the corosync membership due to token loss. After a stonith action against the node was initiated, and before the node was rebooted, the node rejoined the corosync membership. After the node rebooted and started cluster services, it received a "We were allegedly just fenced" message and shut down its pacemaker and corosync services.

# # In this example, token loss occurred at 01:27:57 due to a network issue, after the token timeout expired.
# # A new one-node membership reflecting token loss formed at 01:28:21, after the consensus timeout expired.
# # Node 1 initiated a stonith action against node 2.
# # Node 2 rejoined the corosync membership at 01:28:23, when the network issue was resolved.
# # A new two-node membership formed, with node 2 back in the CPG group.

May 4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
May 4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
May 4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[1]: 1
...
May 4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
May 4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[2]: 1 2
May 4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]: [MAIN  ] Completed service synchronization, ready to provide service.

# # At 01:28:45, node 1 received confirmation that node 2 had been successfully rebooted.

May 4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)

# # In order to fully complete the stonith action, it needed to deliver the confirmation message to
# # all nodes in the CPG group. Node 2 was still in the CPG group from the rejoin at 01:28:23.
# # A new membership without node 2 had not yet been formed, because
# # (token timeout + consensus timeout) had not yet expired since the reboot.
# # So the message was not delivered until node 2 started cluster services after boot.
# # In receiving this message, node 2 received notification that it had been fenced.
# # So it shut itself down in response.

# # Node 1:
May 4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
May 4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
May 4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[2]: 1 2
May 4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
May 4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
...
May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [CFG   ] Node 2 was shut down by sysadmin
May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [TOTEM ] A new membership (1.11700) was formed. Members left: 2
May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [QUORUM] Members[1]: 1
May 4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]: [MAIN  ] Completed service synchronization, ready to provide service.

# # Node 2:
May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice [QUORUM] Members[2]: 1 2
May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced [1319] (remote_op_done) notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld [1323] (tengine_stonith_notify) crit: We were allegedly just fenced by node1 for node1!
- GCP case: A Google Compute Engine (GCE) VM got fenced by the fence_gce agent and rebooted. It rejoined the cluster before the fence action completed. Shortly thereafter, it shut down its pacemaker and corosync services and left the cluster.

# # In this example, node 2 was rebooted at 23:27:15.
# # It rejoined the cluster at 23:27:23.
# # Then at 23:28:12, the fence action was declared complete,
# # and node 2 shut down its cluster services.

# # Node 1
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
...
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
...
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]: notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)

# # Node 2
Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
...
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]: crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
Resolution
Red Hat Enterprise Linux 8
- The issue RHEL-7644 (Pacemaker shuts down if a node is in the CPG membership when a stonith action against it completes) has been closed with the resolution set to Won't Do for RHEL 8. For more information on why this issue was closed, review RHEL-7644.
If you encounter this issue, see the workaround regardless of which release or platform you're using.
Note: Google has improved the API so that it declares a reset operation to be complete more quickly. In our testing thus far, the reset operation is now declared complete before the fenced node boots up. Thus, the issue described in the Root Cause section may no longer occur. However, this is not guaranteed, so you may still choose to add a corosync startup delay as shown in the workaround for added safety.
Workaround
Set a delay before the `corosync` service's startup.
- In the generic case, this should allow time for a new CPG membership to form without the fenced node, completing the stonith reboot operation by allowing the completion message to be delivered to all nodes in the CPG membership.
- In the GCP case, this should allow time for Google Cloud to send the "reset complete" notification to the fence agent, completing the stonith reboot operation.
To do this, create a systemd drop-in file by running systemctl edit corosync.service, adding the options as shown below, and saving and exiting the text editor.
# systemctl edit corosync.service
[Service]
ExecStartPre=/bin/sleep 60
NOTE: If the systemctl edit corosync.service command above reports the error Editing "/etc/systemd/system/corosync.service.d/override.conf" canceled: temporary file is empty., then use the command below with the --full option instead.
# systemctl --full edit corosync.service
[Service]
ExecStartPre=/bin/sleep 60
Then reload the systemd manager configuration.
# systemctl daemon-reload
For more information, see the following solution: How to add customizations to a service unit or override the unit without modifying the original.
Note: The default timeout for a systemd service start operation is 90s. If the sleep value is large enough that corosync.service does not complete its start operation within that 90s window, then corosync.service will itself reach a failed state:
Sep 10 14:13:42 rhel8-10_1 systemd[1]: Starting Corosync Cluster Engine...
Sep 10 14:15:12 rhel8-10_1 systemd[1]: corosync.service: start-pre operation timed out. Terminating.
Sep 10 14:15:12 rhel8-10_1 systemd[1]: corosync.service: Failed with result 'timeout'.
Sep 10 14:15:12 rhel8-10_1 systemd[1]: Failed to start Corosync Cluster Engine.
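If a longer sleep is required, the start timeout can be raised in the same drop-in using the standard systemd TimeoutStartSec option. The values below are illustrative only, not recommendations:

```ini
# /etc/systemd/system/corosync.service.d/override.conf
[Service]
# A 120s sleep would exceed the default 90s start timeout,
# so raise the timeout to leave headroom for corosync itself to start.
ExecStartPre=/bin/sleep 120
TimeoutStartSec=240
```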
- For more information: 1.2. Systemd unit files locations
If the sleep 60 workaround still falls short, an additional delay can be introduced in conjunction by increasing the GRUB menu selection timeout.
If sbd is used for fencing, SBD_DELAY_START can be set to twice the corosync token timeout. This option can be set when the sbd device is enabled or by editing /etc/sysconfig/sbd on all the cluster nodes.
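As an illustration, assuming a corosync token timeout of 10 seconds, the delay would be 20 seconds. The option appears as SBD_DELAY_START in sbd's sysconfig file (see the sbd(8) man page); the value below is an assumption based on that hypothetical token timeout, not a recommendation:

```ini
# /etc/sysconfig/sbd (excerpt)
# With an assumed token timeout of 10s, delay sbd startup
# (and therefore corosync/pacemaker) by 2 * 10 = 20 seconds.
SBD_DELAY_START=20
```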
Root Cause
Generic case
In order to complete a stonith action, pacemaker must deliver a remote_op_done message (indicating completion) to all nodes in the corosync CPG membership.
When node1 stops receiving corosync communication from node2, there are basically two timeouts of interest:
- The totem token timeout determines how long node1 waits to declare that node2's token has been lost.
- The consensus timeout determines how long a node waits to start a new round of membership configuration. This defaults to 1.2 * (token timeout).
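Both timeouts are configured in the totem section of /etc/corosync/corosync.conf. The values below are illustrative only; see the corosync.conf(5) man page for the actual defaults on your release:

```
totem {
    version: 2
    # Declare token loss after 3 seconds without the token (illustrative value).
    token: 3000
    # Time to wait before starting a new round of membership configuration.
    # If unset, this defaults to 1.2 * token (3600 ms here).
    consensus: 3600
}
```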
If the timing of events is interwoven in a particularly unfortunate way, the following temporal sequence is possible:
1. node1 declares that node2's token has been lost (likely due to a network issue).
2. The network issue gets resolved, but corosync takes some time to confirm this.
3. A new CPG membership is formed on node1, with node2 absent.
4. node1 initiates a stonith action against node2 as a result of (3).
5. node2 rejoins the CPG membership (after the network issue was resolved in (2)).
6. node1 successfully reboots node2. (Rebooting node2 causes it to stop sending corosync communication. So there will eventually be another token loss event and node2 will exit the CPG membership. But this can happen only after the timeouts expire, which has not yet occurred.)
7. A remote_op_done confirmation message must be delivered to every node in the CPG membership in order to complete the stonith action. node2 is still part of the CPG membership even though it's offline, because the timeouts have not expired. So node1 repeatedly and unsuccessfully tries to deliver the confirmation message to node2. The stonith action remains incomplete in the meantime.
8. While node1 is still trying to deliver the confirmation message to node2 (before the consensus timeout expires), node2 boots back up and starts cluster services.
9. node2 receives the stonith confirmation message. This message essentially says, "node2 was just fenced".
10. node2's pacemaker daemon, having just received a message stating that node2 was fenced, shuts itself and corosync down for safety.
Google Cloud Platform case
Note: Google has improved the API so that it declares a reset operation to be complete more quickly. In our testing thus far, the reset operation is now declared complete before the fenced node boots up. Thus, the issue described below in this section may no longer occur.
Power fencing agents have an attribute called method that specifies how to perform a reboot operation. The allowed values are onoff and cycle, and the default value set within the fencing library is onoff. However, the fence_gce fence agent overrides this default. It sets its own default value to cycle.
When method=cycle is used, fence_gce sends a reset call to the Google Cloud API. Then once per second, it checks the status of the reset action that it spawned.
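The polling behavior can be sketched as follows. This is an illustrative mock, not the agent's actual code (the real agent drives the Google Cloud API through its Python client); the reset_done function stands in for the once-per-second API status check:

```shell
#!/bin/sh
# Mock of the method=cycle flow: issue a reset, then poll once per second
# until the API reports the spawned reset operation as complete.
polls=0
reset_done() {
    # Stand-in for the API status check. Here the mock reports completion
    # on the 5th poll; the real API may take ~60 polls (about a minute),
    # long after the node has already powered back on.
    polls=$((polls + 1))
    [ "$polls" -ge 5 ]
}
# (reset call mocked out; the node itself resets within ~20 seconds)
until reset_done; do
    :   # the real agent sleeps 1 second between status checks
done
echo "reset declared complete after $polls status checks"
```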
The problem is that the Google Cloud API doesn't declare the reset call to be a success until long after the node has already been powered back on. In our testing, the reset call resets the node within about 20 seconds, allowing it to boot back up and rejoin the cluster (23:27:36 in the example below). But the API takes about 60 seconds to declare the reset action complete (23:28:12 in the example below).
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]: notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]: notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)
Pacemaker doesn't consider the fence action complete until the API has declared the reset action complete. This means the fenced node may reboot and rejoin the cluster before the fence action is declared a success.
Then, when Pacemaker declares the fence operation a success, the fenced node shuts down Pacemaker and Corosync because it has received notification that it was fenced.
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]: crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
API logs can be found in Bug 1906978 - fence_gce: method=cycle doesn't complete before fenced node rejoins cluster [RHEL 8].
Diagnostic Steps
See Issue.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.