Split-brain occurs on a two-node Pacemaker cluster when stonith is disabled and token loss occurs


Environment

  • Red Hat Enterprise Linux Server 7 and later (with the High Availability Add-On and Resilient Storage Add-On)
  • Pacemaker Cluster
  • Two Node Cluster

Issue

  • In a "two-node" cluster, each node reports the other node as offline, and both nodes run the full set of resources:

    • Node 01:
    [root@rhel8-node1 ~]# pcs status
    Cluster name: rhel8-cluster
    Cluster Summary:
    -----------------------------------------8<----------------------------------------- 
    Node List:
      * Online: [ rhel8-node1 ]
      * OFFLINE: [ rhel8-node2 ] <--- Node1 sees node2 as offline
    
    Full List of Resources:
    -----------------------------------------8<-----------------------------------------   
      * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node1 <--- all resources active on each
      * xvmfence	(stonith:fence_xvm):	 Started rhel8-node1          individual "cluster instance"
      * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node1
      * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node1
      * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1
    
    • Node 02:
    [root@rhel8-node2 ~]# pcs status
    Cluster name: rhel8-cluster
    Cluster Summary:
    -----------------------------------------8<----------------------------------------- 
    Node List:
      * Online: [ rhel8-node2 ]
      * OFFLINE: [ rhel8-node1 ]  <--- Node2 sees node1 as offline
    
    Full List of Resources:
    -----------------------------------------8<----------------------------------------- 
      * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node2
      * xvmfence	(stonith:fence_xvm):	 Stopped <--- because network down
      * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node2
      * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node2
      * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node2
    

Resolution

An enabled, working stonith device prevents this type of issue. A functioning stonith configuration is a requirement for support, both because it prevents split-brain scenarios such as this one and because other unpredictable behaviors can occur when stonith is disabled. To avoid this issue, set the stonith-enabled property to true and configure a known-working stonith device in the cluster:

$ pcs property set stonith-enabled=true
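A working fence device must also be configured. The following is an illustrative sketch using fence_xvm, as used in the examples in this article; the device name, agent, and key-file path are assumptions to adapt for your platform:

```
# Configure a fence device (names and options here are illustrative)
$ pcs stonith create xvmfence fence_xvm key_file=/etc/cluster/fence_xvm.key

# Re-enable stonith and verify the device is running
$ pcs property set stonith-enabled=true
$ pcs stonith status
```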

Root Cause

During a token-loss event, when cluster nodes lose communication with one another, the cluster splits into partitions: multiple independent instances of the full cluster. By design, quorum remains with whichever partition holds a majority of all votes ( 50% + 1 votes ), and only that partition retains the ability to perform cluster actions such as stonith and other recovery actions. The partition without quorum can only be fenced, or, where fencing is disabled, is subject to the action set by the no-quorum-policy property ( stop all resources, by default ).
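The majority rule above can be sketched as follows (a minimal illustration of the vote arithmetic, not Pacemaker code):

```python
def votes_needed(total_votes: int) -> int:
    # A partition needs a strict majority of all configured votes:
    # more than half, i.e. 50% + 1 vote for even-sized clusters.
    return total_votes // 2 + 1

def has_quorum(partition_votes: int, total_votes: int) -> bool:
    return partition_votes >= votes_needed(total_votes)

# Five-node cluster split 3 / 2: only the larger partition keeps quorum
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False

# Two-node cluster split 1 / 1: with plain majority rules,
# NEITHER partition keeps quorum
print(has_quorum(1, 2))  # False
```

This is why a two-node cluster needs special handling: an even split can never produce a majority.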

Two-node clusters, however, are a special case. Because a network split can only ever produce a 50/50 split, never a majority, the special two_node option exists so that each partition of a two-node cluster keeps quorum and can perform the fencing actions needed to recover from the split.
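On RHEL, pcs cluster setup enables this option automatically for two-node clusters. The resulting quorum stanza in /etc/corosync/corosync.conf looks roughly like the following (a sketch; the exact contents depend on the pcs version and setup options):

```
# /etc/corosync/corosync.conf (excerpt)
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

Note that setting two_node: 1 also implicitly enables wait_for_all, so after a full cluster stop, quorum is not granted until both nodes have been seen at least once.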

In the example cases in the "Diagnostic Steps" section, the same network split is run with the stonith-enabled option both enabled and disabled. In the disabled case, instead of each node trying to fence the other, both nodes keep quorum and skip the stonith process entirely.

  • Since both nodes have quorum, each becomes a fully promoted cluster instance and runs its own copy of every resource.
  • This is unsupported and can lead to data corruption on any cluster-managed storage and filesystem resources that become active on both nodes when they should not be.

To avoid this "split" cluster-instance issue, the stonith option must be kept enabled.

Diagnostic Steps

Two-node example with stonith enabled and with stonith disabled ( the disabled case is unsupported ):

Stonith enabled case ( expected case ):

Resource activity prior to network split:

[root@rhel8-node1 ~]# pcs status
Cluster name: rhel8-cluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rhel8-node1 (version 2.1.7-5.2.el8_10-0f7f88312) - partition with quorum
  * Last updated: Mon Feb 10 10:39:27 2025 on rhel8-node1
  * Last change:  Mon Feb 10 10:39:24 2025 by root via root on rhel8-node1
  * 2 nodes configured
  * 21 resource instances configured (7 DISABLED)

Node List:
  * Online: [ rhel8-node1 rhel8-node2 ]

Full List of Resources:
* postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node2
* xvmfence	(stonith:fence_xvm):	 Started rhel8-node2
* temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node1
* temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node2
* temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1

Network loss is initiated, creating a token-loss ( split-brain ) event:

[root@rhel8-node1 ~]# pcs status
Cluster name: rhel8-cluster
Cluster Summary:
-----------------------------------------8<----------------------------------------- 
Node List:
  * Node rhel8-node2: UNCLEAN (offline)
  * Online: [ rhel8-node1 ]

Full List of Resources:

  * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node2 (UNCLEAN)
  * kdump	(stonith:fence_kdump):	 Started rhel8-node1
  * xvmfence	(stonith:fence_xvm):	 Started [ rhel8-node1 rhel8-node2 ]
  * delay	(ocf::heartbeat:Delay):	 Started rhel8-node1
  * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node2 (UNCLEAN)
  * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node2 (UNCLEAN)
  * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1

Pending Fencing Actions:
  * reboot of rhel8-node2 pending: client=pacemaker-controld.2351, origin=rhel8-node1
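For reference, the split in these tests can be reproduced in a lab by blocking corosync traffic on one node. The commands and port below are assumptions ( 5405 is the corosync default; check your configuration ), and this must never be done on a production cluster:

```
# Lab-only: drop corosync (knet) traffic to force a token loss
iptables -A INPUT  -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP

# Restore communication afterwards:
iptables -D INPUT  -p udp --dport 5405 -j DROP
iptables -D OUTPUT -p udp --dport 5405 -j DROP
```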

In the case below, stonith is enabled, so both nodes attempt to fence each other, but node 1 wins the fence race:

# Node 01 detects "token loss" and goes to fence:
$ cat /var/log/messages
-----------------------------------------8<----------------------------------------- 
Feb 10 10:56:41 rhel8-node1 corosync[2321]:  [KNET  ] link: host: 2 link: 0 is down
Feb 10 10:56:41 rhel8-node1 corosync[2321]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 10 10:56:41 rhel8-node1 corosync[2321]:  [KNET  ] host: host: 2 has no active links
Feb 10 10:56:48 rhel8-node1 pacemaker-execd[2336]: notice: Could not send alert: Timed Out (Alert agent did not complete within 2m)
Feb 10 10:56:49 rhel8-node1 corosync[2321]:  [TOTEM ] Token has not been received in 15000 ms
Feb 10 10:56:54 rhel8-node1 corosync[2321]:  [TOTEM ] A processor failed, forming new configuration: token timed out (20000ms), waiting 24000ms for consensus.
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [QUORUM] Sync members[1]: 1
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [QUORUM] Sync left[1]: 2
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [TOTEM ] A new membership (1.9cb51) was formed. Members left: 2
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [TOTEM ] Failed to receive the leave message. failed: 2
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [QUORUM] Members[1]: 1
Feb 10 10:57:18 rhel8-node1 corosync[2321]:  [MAIN  ] Completed service synchronization, ready to provide service.
Feb 10 10:57:18 rhel8-node1 pacemaker-attrd[2337]: notice: Node rhel8-node2 state is now lost
Feb 10 10:57:18 rhel8-node1 pacemaker-attrd[2337]: notice: Removing all rhel8-node2 attributes for node loss
Feb 10 10:57:18 rhel8-node1 pacemaker-attrd[2337]: notice: Purged 1 peer with id=2 and/or uname=rhel8-node2 from the membership cache
Feb 10 10:57:18 rhel8-node1 pacemaker-fenced[2335]: notice: Node rhel8-node2 state is now lost
-----------------------------------------8<----------------------------------------- 
Feb 10 10:57:18 rhel8-node1 pacemaker-schedulerd[2338]: warning: Cluster node rhel8-node2 will be fenced: peer is no longer part of the cluster
# Node 02 detects "token loss" and would try to fence,
# but is pre-empted by node 01's stonith action:
$ cat /var/log/messages
-----------------------------------------8<----------------------------------------- 
Feb 10 10:56:39 rhel8-node2 corosync[2422]:  [KNET  ] link: host: 1 link: 0 is down
Feb 10 10:56:39 rhel8-node2 corosync[2422]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 10 10:56:39 rhel8-node2 corosync[2422]:  [KNET  ] host: host: 1 has no active links
Feb 10 10:56:39 rhel8-node2 corosync[2422]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 10 10:56:39 rhel8-node2 corosync[2422]:  [KNET  ] host: host: 1 has no active links
Feb 10 10:56:49 rhel8-node2 corosync[2422]:  [TOTEM ] Token has not been received in 15000 ms
Feb 10 10:56:54 rhel8-node2 corosync[2422]:  [TOTEM ] A processor failed, forming new configuration: token timed out (20000ms), waiting 24000ms for consensus.
Feb 10 10:57:28 rhel8-node2 kernel: Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-553.28.1.el8_10.case03955761.g306c.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet console=tty0 console=ttyS0,115200

Node 01 wins the fence race, and node 02 is fenced / rebooted as a result:

$ cat /var/log/messages
-----------------------------------------8<----------------------------------------- 
Feb 10 10:57:18 rhel8-node1 pacemaker-fenced[2335]: notice: Requesting peer fencing (reboot) targeting rhel8-node2
Feb 10 10:57:18 rhel8-node1 pacemaker-fenced[2335]: notice: Requesting that rhel8-node1 perform 'reboot' action targeting rhel8-node2
Feb 10 10:57:18 rhel8-node1 pacemaker-controld[2339]: notice: Requesting local execution of start operation for xvmfence on rhel8-node1
Feb 10 10:57:20 rhel8-node1 fence_xvm[4251]: Domain "rhel8-node2" is ON
Feb 10 10:57:20 rhel8-node1 pacemaker-fenced[2335]: notice: Operation 'reboot' [4251] targeting rhel8-node2 using xvmfence returned 0
Feb 10 10:57:20 rhel8-node1 pacemaker-fenced[2335]: notice: Operation 'reboot' targeting rhel8-node2 by rhel8-node1 for pacemaker-controld.2339@rhel8-node1: OK (complete)

[root@rhel8-node1 ~]# pcs status --full
Cluster name: rhel8-cluster
Cluster Summary:
-----------------------------------------8<-----------------------------------------  
Node List:
  * Node rhel8-node1 (1): online, feature set 3.19.0
  * Node rhel8-node2 (2): OFFLINE

Full List of Resources:

  * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node1
  * xvmfence	(stonith:fence_xvm):	 Started rhel8-node1
  * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1

Node Attributes:
  * Node: rhel8-node1 (1):
    * pingd                           	: 2         
    * postgresql-data-status          	: LATEST    

Migration Summary:

Fencing History:
  * reboot of rhel8-node2 successful: delegate=rhel8-node1, client=pacemaker-controld.2339, origin=rhel8-node1, completed='2025-02-10 10:57:20.055000 -07:00'

Stonith disabled case ( unsupported case ):

When stonith is disabled on a two-node cluster, however, both nodes keep quorum but neither fences the other. This combination results in a "split-brain" with two separate cluster instances:

Pcs status before network split ( w/ stonith disabled ):

[root@rhel8-node1 ~]# pcs status
Cluster name: rhel8-cluster
Cluster Summary:
-----------------------------------------8<----------------------------------------- 
Node List:
  * Online: [ rhel8-node1 rhel8-node2 ]

Full List of Resources:

  * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node1
  * xvmfence	(stonith:fence_xvm):	 Started rhel8-node1
  * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1

Network loss is initiated, creating a token-loss ( split-brain ) event. No stonith action is scheduled, though, since it is disabled:

$ grep -v -e trace: -e debug: /var/log/pacemaker/pacemaker.log
-----------------------------------------8<----------------------------------------- 
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (node_left@cpg.c:658)    info: Group attrd event 6: rhel8-node2 (node 2 pid 2381) left via cluster exit
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (crm_update_peer_proc@membership.c:1011)         info: node_left: Node rhel8-node2[2] - corosync-cpg is now offline
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (update_peer_state_iter@membership.c:1139)       notice: Node rhel8-node2 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (attrd_peer_remove@attrd_corosync.c:517)         notice: Removing all rhel8-node2 attributes for node loss | without reaping node from cache
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (crm_reap_dead_member@membership.c:358)  info: Removing node with name rhel8-node2 and id 2 from membership cache
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (reap_crm_member@membership.c:393)       notice: Purged 1 peer with id=2 and/or uname=rhel8-node2 from the membership cache
Feb 10 11:08:33 rhel8-node1 pacemaker-attrd     [2337] (pcmk_cpg_membership@cpg.c:737)  info: Group attrd event 6: rhel8-node1 (node 1 pid 2337) is member
Feb 10 11:08:33 rhel8-node1 pacemaker-controld  [2339] (quorum_notification_cb@corosync.c:288)  info: Quorum retained | membership=641899 members=1
Feb 10 11:08:33 rhel8-node1 pacemaker-controld  [2339] (update_peer_state_iter@membership.c:1139)       notice: Node rhel8-node2 state is now lost | nodeid=2 previous=member source=pcmk__reap_unseen_nodes
Feb 10 11:08:33 rhel8-node1 pacemaker-controld  [2339] (peer_update_callback@controld_callbacks.c:157)  info: Cluster node rhel8-node2 is now lost (was member)

Because node 02 also retained quorum, resources started there as well. This happens only because quorum was retained ( due to the two_node option ) while stonith was disabled. It is not the desired outcome and leaves the cluster in an unsafe state:

$ grep "Result of st" /var/log/pacemaker/pacemaker.log
-----------------------------------------8<----------------------------------------- 
Feb 10 11:08:34 rhel8-node2 pacemaker-controld  [2383] (log_executor_event) 	notice: Result of start operation for temp3 on rhel8-node2: ok | graph action confirmed; call=69 key=temp3_start_0 rc=0
Feb 10 11:08:34 rhel8-node2 pacemaker-controld  [2383] (log_executor_event) 	notice: Result of start operation for temp1 on rhel8-node2: ok | graph action confirmed; call=70 key=temp1_start_0 rc=0
Feb 10 11:08:34 rhel8-node2 pacemaker-controld  [2383] (log_executor_event) 	notice: Result of start operation for temp2 on rhel8-node2: ok | graph action confirmed; call=68 key=temp2_start_0 rc=0
Feb 10 11:08:34 rhel8-node2 pacemaker-controld  [2383] (log_executor_event) 	notice: Result of start operation for pgvip on rhel8-node2: ok | graph action confirmed; call=67 key=pgvip_start_0 rc=0
Feb 10 11:08:35 rhel8-node2 pacemaker-controld  [2383] (log_executor_event) 	notice: Result of start operation for postgresql on rhel8-node2: ok | graph action confirmed; call=80 key=postgresql_start_0 rc=0

Both nodes end up running their own instances of the same resources, which can lead to data corruption:

[root@rhel8-node1 ~]# pcs status
Cluster name: rhel8-cluster
Cluster Summary:
-----------------------------------------8<----------------------------------------- 
Node List:
  * Online: [ rhel8-node1 ]
  * OFFLINE: [ rhel8-node2 ] <--- Node1 sees node2 as offline

Full List of Resources:
-----------------------------------------8<-----------------------------------------   
  * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node1 <--- all resources active on each
  * xvmfence	(stonith:fence_xvm):	 Started rhel8-node1          individual "cluster instance"
  * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node1
  * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node1
[root@rhel8-node2 ~]# pcs status
Cluster name: rhel8-cluster
Cluster Summary:
-----------------------------------------8<----------------------------------------- 
Node List:
  * Online: [ rhel8-node2 ]
  * OFFLINE: [ rhel8-node1 ]  <--- Node2 sees node1 as offline

Full List of Resources:
-----------------------------------------8<----------------------------------------- 
  * postgresql	(ocf::heartbeat:pgsql):	 Started rhel8-node2 <--- all resources active on each
  * xvmfence	(stonith:fence_xvm):	 Stopped                  cluster instance
  * temp1	(ocf::heartbeat:Dummy):	 Started rhel8-node2
  * temp2	(ocf::heartbeat:Dummy):	 Started rhel8-node2
  * temp3	(ocf::heartbeat:Dummy):	 Started rhel8-node2

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.