A node shuts down pacemaker after getting fenced and restarting corosync and pacemaker

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 9 (with the High Availability Add-on)
  • Google Cloud Platform (optional)

Issue

  • In theory, this issue can happen on any platform if timing is unlucky, though it may be more likely on Google Cloud Platform due to the way the fence_gce fence agent performs a reboot.
    • Generic case: A node left the corosync membership due to token loss. After a stonith action against the node was initiated and before the node was rebooted, the node rejoined the corosync membership. After the node rebooted and started cluster services, it received a "We were allegedly just fenced" message and shut down its pacemaker and corosync services.

        # # In this example, token loss occurred at 01:27:57 due to a network issue, after the token timeout expired.
        # # A new one-node membership reflecting token loss formed at 01:28:21, after the consensus timeout expired.
        # # Node 1 initiated a stonith action against node 2.
        # # Node 2 rejoined the corosync membership at 01:28:23, when the network issue was resolved.
        # # A new two-node membership formed, with node 2 back in the CPG group.
        May  4 01:27:57 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
        May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
        May  4 01:28:21 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
        ...
        May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
        ...
        May  4 01:28:22 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using xvm2 for 20s
        May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
        May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
        May  4 01:28:23 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
        
        # # At 01:28:45, node 1 received confirmation that node 2 had been successfully rebooted.
        May  4 01:28:45 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
        
        # # In order to fully complete the stonith action, it needed to deliver the confirmation message to
        # # all nodes in the CPG group. Node 2 was still in the CPG group from the rejoin at 01:28:23.
        # # A new membership without node 2 had not yet been formed, because
        # # (token timeout + consensus timeout) had not yet expired since the reboot.
        # # So the message was not delivered until node 2 started cluster services after boot.
        # # In receiving this message, node 2 received notification that it had been fenced.
        # # So it shut itself down in response.
        # # Node 1:
        May  4 01:29:01 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A processor failed, forming new configuration.
        May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
        May  4 01:29:09 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[2]: 1 2
        May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
        May  4 01:29:09 fastvm-rhel-8-0-23 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
        ...
        May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [CFG   ] Node 2 was shut down by sysadmin
        May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [TOTEM ] A new membership (1.11700) was formed. Members left: 2
        May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [QUORUM] Members[1]: 1
        May  4 01:29:10 fastvm-rhel-8-0-23 corosync[1722]:  [MAIN  ] Completed service synchronization, ready to provide service.
        
        # # Node 2:
        May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
        May 04 01:29:09 [1155] fastvm-rhel-8-0-24 corosync notice  [QUORUM] Members[2]: 1 2
        May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-fenced    [1319] (remote_op_done)  notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK | id=b69b57a1
        May 04 01:29:09 fastvm-rhel-8-0-24 pacemaker-controld  [1323] (tengine_stonith_notify)  crit: We were allegedly just fenced by node1 for node1!
      
    • GCP case: A Google Compute Engine (GCE) VM got fenced by the fence_gce agent and rebooted. It rejoined the cluster before the fence action completed. Shortly thereafter, it shut down its pacemaker and corosync services and left the cluster.

        # # In this example, node 2 was rebooted at 23:27:15.
        # # It rejoined the cluster at 23:27:23.
        # # Then at 23:28:12, the fence action was declared complete,
        # # and node 2 shut down its cluster services.
        
        # # Node 1
        Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
        Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
        Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
        Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
        Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
        Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
        Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
        Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
        Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
        Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
        ...
        Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
        ...
        Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)
        
        # # Node 2
        Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
        Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
        ...
        Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
        Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
        Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
        Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
        Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
      

Resolution

Red Hat Enterprise Linux 8

If you encounter this issue, see the workaround below regardless of which release or platform you're using.

Note: Google has improved the API so that it declares a reset operation to be complete more quickly. In our testing thus far, the reset operation is now declared complete before the fenced node boots up. Thus, the issue described in the Root Cause section may no longer occur. However, this is not guaranteed, so you may still choose to add a corosync startup delay as shown in the workaround for added safety.

Workaround


Set a delay before the `corosync` service's startup.
  • In the generic case, this should allow time for a new CPG membership to form without the fenced node, completing the stonith reboot operation by allowing the completion message to be delivered to all nodes in the CPG membership.
  • In the GCP case, this should allow time for Google Cloud to send the "reset complete" notification to the fence agent, completing the stonith reboot operation.

To do this, create a systemd drop-in file by running systemctl edit corosync.service, adding the options as shown below, and saving and exiting the text editor.

# systemctl edit corosync.service
[Service]
ExecStartPre=/bin/sleep 60

NOTE: If the systemctl edit corosync.service command above fails with the error Editing "/etc/systemd/system/corosync.service.d/override.conf" canceled: temporary file is empty., use the command with the --full option instead:

# systemctl --full edit corosync.service
[Service]
ExecStartPre=/bin/sleep 60

Then reload the systemd manager configuration.

# systemctl daemon-reload

For more information, see the following solution: How to add customizations to a service unit or override the unit without modifying the original.

Note: The default timeout for a systemd service start operation is 90 seconds. If the sleep value is so large that corosync.service cannot complete its start operation within that 90-second window, corosync.service will itself reach a failed state:

Sep 10 14:13:42 rhel8-10_1 systemd[1]: Starting Corosync Cluster Engine...
Sep 10 14:15:12 rhel8-10_1 systemd[1]: corosync.service: start-pre operation timed out. Terminating.
Sep 10 14:15:12 rhel8-10_1 systemd[1]: corosync.service: Failed with result 'timeout'.
Sep 10 14:15:12 rhel8-10_1 systemd[1]: Failed to start Corosync Cluster Engine.
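If a sleep longer than 90 seconds is needed, the start timeout can be raised in the same drop-in. TimeoutStartSec is a standard systemd [Service] option; the values below are illustrative, not recommendations:

```ini
# systemctl edit corosync.service
[Service]
# Illustrative values: a 120s pre-start sleep, with the start timeout
# raised to 300s so the sleep cannot trip the 90s default timeout.
TimeoutStartSec=300
ExecStartPre=/bin/sleep 120
```

Reload the systemd manager configuration afterward, as with the original drop-in.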

If the sleep 60 workaround still falls short, it can be combined with an additional delay from increasing the GRUB menu selection timeout.


If sbd is used for fencing, SBD_DELAY_START can be set to twice the corosync token timeout. This option can be set when sbd is enabled, or by editing /etc/sysconfig/sbd on all cluster nodes.
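As a worked example (illustrative numbers, not a recommendation): with a 3-second token timeout, twice the token timeout is 6 seconds. In /etc/sysconfig/sbd the option appears as SBD_DELAY_START, where an integer value is treated as a startup delay in seconds:

```ini
# /etc/sysconfig/sbd (illustrative: 3s token timeout, so 2 * 3s = 6s)
SBD_DELAY_START=6
```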

Root Cause

Generic case

In order to complete a stonith action, pacemaker must deliver a remote_op_done message (indicating completion) to all nodes in the corosync CPG membership.

When node1 stops receiving corosync communication from node2, two timeouts are of interest:

  • The totem token timeout determines how long node1 waits before declaring that node2's token has been lost.
  • The consensus timeout determines how long node1 waits before starting a new round of membership configuration. It defaults to 1.2 * (token timeout).
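Both timeouts are configured (in milliseconds) in the totem section of /etc/corosync/corosync.conf. The values below are illustrative, and consensus is normally left unset so that it tracks the token timeout:

```ini
# /etc/corosync/corosync.conf (illustrative values, in milliseconds)
totem {
    token: 3000        # wait 3s before declaring token loss
    consensus: 3600    # defaults to 1.2 * token when not set explicitly
}
```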

If the timing of events is particularly unfortunate, the following sequence is possible:

  1. node1 declares that node2's token has been lost (likely due to a network issue).
  2. The network issue gets resolved but corosync takes some time to confirm this.
  3. A new CPG membership is formed on node1, with node2 absent.
  4. node1 initiates a stonith action against node2 as a result of (3).
  5. node2 rejoins the CPG membership (after the network issue was resolved in (2)).
  6. node1 successfully reboots node2. (Rebooting node2 causes it to stop sending corosync communication. So there will eventually be another token loss event and node2 will exit the CPG membership. But this can happen only after the timeouts expire, which has not yet occurred.)
  7. A remote_op_done confirmation message must be delivered to every node in the CPG membership in order to complete the stonith action. node2 is still part of the CPG membership even though it's offline, because the timeouts have not expired. So node1 repeatedly and unsuccessfully tries to deliver the confirmation message to node2. The stonith action remains incomplete in the meantime.
  8. While node1 is still trying to deliver the confirmation message to node2 (before the consensus timeout expires), node2 boots back up and starts cluster services.
  9. node2 receives the stonith confirmation message. This message essentially says, "node2 was just fenced".
  10. node2's pacemaker daemon, having just received a message stating that node2 was fenced, shuts itself and corosync down for safety.

Google Cloud Platform case

Note: Google has improved the API so that it declares a reset operation to be complete more quickly. In our testing thus far, the reset operation is now declared complete before the fenced node boots up. Thus, the issue described below in this section may no longer occur.

Power fencing agents have an attribute called method that specifies how to perform a reboot operation. The allowed values are onoff and cycle, and the default within the fencing library is onoff. The fence_gce fence agent, however, overrides this default and uses cycle.

When method=cycle is used, fence_gce sends a reset call to the Google Cloud API. Then once per second, it checks the status of the reset action that it spawned.

The problem is that the Google Cloud API doesn't declare the reset call to be a success until long after the node has already been powered back on. In our testing, the reset call resets the node within about 20 seconds, allowing it to boot back up and rejoin the cluster (23:27:36 in the example below). But the API takes about 60 seconds to declare the reset action complete (23:28:12 in the example below).

Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)

Pacemaker doesn't consider the fence action complete until the API has declared the reset action complete. This means the fenced node may reboot and rejoin the cluster before the fence action is declared a success.

Then, when Pacemaker declares the fence operation a success, the fenced node shuts down Pacemaker and Corosync because it has received notification that it was fenced.

Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366@nwahl-rhel7-node1.c3382af8: OK
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure

API logs can be found in Bug 1906978 - fence_gce: method=cycle doesn't complete before fenced node rejoins cluster [RHEL 8].

Diagnostic Steps

See Issue.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.