What is the proper way to simulate a network failure on a RHEL Cluster?
Environment
- Red Hat Enterprise Linux Server 5, 6, 7, 8, or 9 (with the High Availability Add-On)
Issue
- What is the proper way to simulate a network failure on a RHEL Cluster?
- Is doing an `ifdown` and `ifup` a proper way to simulate that the network is down on the heartbeat network for a RHEL cluster node?
- Why didn't my service group fail over when I performed an `ifdown` on the heartbeat interface?
Resolution
There are a few ways to simulate a network failure. For all of the testing procedures below (or any others not documented here), there is one requirement for the procedure to be a valid method of testing a network interface failure: the network interface must always keep the UP and LOWER_UP flags set.
For example, here is output showing the different flags for a network device:
# ip link show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:3a:ce:bd brd ff:ff:ff:ff:ff:ff
The LOWER_UP flag refers to the physical layer link (also known as layer 1): if LOWER_UP is set, the Ethernet cable is plugged in and connected to a network (another connected device can be seen). The UP flag refers to the data link layer (also known as layer 2), which means the interface can send and receive Ethernet packets.
On virtual machines (KVM), LOWER_UP is always enabled, so it is an unreliable way to determine the actual state of the physical link. The only reliable testing methods are the ones outlined below for virtual and physical machines, which leave the network interfaces in a valid state for testing network faults.
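As a quick pre-check before running any fault-injection procedure, the flag list printed by `ip link show` can be parsed to confirm both flags are present. Below is a minimal sketch, not a supported tool: the `valid_for_test` helper name is illustrative, and `eth1` is an example interface.

```shell
# valid_for_test -- confirm an interface's flag list contains both UP and
# LOWER_UP, i.e. the interface is in a valid state for network fault testing.
# Takes the flag list from `ip link show`, e.g. "BROADCAST,MULTICAST,UP,LOWER_UP".
valid_for_test() {
  case ",$1," in
    *,UP,*) : ;;                  # data link layer is up
    *) return 1 ;;
  esac
  case ",$1," in
    *,LOWER_UP,*) return 0 ;;     # physical link is up
    *) return 1 ;;
  esac
}

# Example (eth1 is illustrative):
#   FLAGS=$(ip link show eth1 | sed -n 's/.*<\(.*\)>.*/\1/p')
#   valid_for_test "$FLAGS" && echo "eth1 is in a valid state for fault testing"
```

If either flag is missing, the interface is not in a valid state and any subsequent "failure" observed is not a meaningful test result.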
List of supported testing procedures to simulate a faulty network
In all methods to simulate a faulty network, if the cluster is using a [redundant heartbeat network or RRP](/solutions/61832) or [network bonding](/articles/3068841), then all network paths will need to be made faulty in order for the cluster node to lose communication with the other nodes.
All commands listed below must be run as the root user.
Physical machines only
- Pull the network cable that is used for cluster communication on one of the cluster nodes.
- Bring down the corresponding port on the switch.
- Use a firewall to block the ports used by `corosync`.
Virtual machines only
- On the virtual machine's physical host, disconnect the network interface of the virtual machine (that is a member of the cluster) from the bridge, so traffic on that interface goes nowhere. For example, run the following command with the correct arguments on the physical host:

# brctl delif <brname> <ifname>

- Use a firewall to block the ports used by `corosync`.
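The bridge detach/re-attach steps can be wrapped in small helper functions so the interface can be restored after the test. This is a hypothetical sketch, not part of the product: the `break_net`/`fix_net` names, the `DRYRUN` switch, and the `br0`/`vnet0` device names are all illustrative. It assumes the legacy `brctl` tool is installed on the physical host and that the commands are run there as root:

```shell
# bridge_fault.sh -- helpers to simulate a VM network failure by detaching
# the VM's interface from its bridge on the physical host (hypothetical names).
# Set DRYRUN=1 to print the brctl commands instead of executing them.
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

break_net() { run brctl delif "$1" "$2"; }   # traffic on the interface now goes nowhere
fix_net()   { run brctl addif "$1" "$2"; }   # re-attach the interface to restore traffic

# Example (as root on the VM's physical host):
#   . ./bridge_fault.sh
#   break_net br0 vnet0    # simulate the network failure
#   fix_net br0 vnet0      # restore connectivity
```

Because the guest's interface stays UP and LOWER_UP throughout, this method satisfies the flag requirement described above.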
Simulate a network failure with a firewall
Physical and Virtual machines on RHEL 5 and RHEL 6
Start `iptables` and use `iptables` to drop both incoming and outgoing traffic for the IP associated with heartbeat traffic for the particular cluster node. Below is an example script called `net_breaker.sh` that drops that traffic. *If you are using a [redundant heartbeat network or RRP](/solutions/61832), then this script will need to be run twice because there are two heartbeats.*
The script assumes that iptables is being used as the firewall solution and is running.
#!/bin/sh
# Usage:
#   ~/bin/net_breaker BreakCommCmd 192.168.1.101
#   ~/bin/net_breaker FixCommCmd 192.168.1.101
set -e
if [ "$1" = "BreakCommCmd" ]; then
    # Drop inbound traffic coming from the heartbeat IP.
    iptables -A INPUT -s "$2" -j DROP
    # Drop outbound traffic destined to the heartbeat IP (note -d, not -s).
    iptables -A OUTPUT -d "$2" -j DROP
    # Also drop inbound multicast, which corosync may use for heartbeat traffic.
    iptables -A INPUT -m pkttype --pkt-type multicast -j DROP
fi
if [ "$1" = "FixCommCmd" ]; then
    # Flush all rules, removing the DROP rules added above.
    iptables -F
fi
exit 0
Here is an example of how to run the script, where one of the rings used for corosync is 192.168.1.101.
# chmod 700 ~/bin/net_breaker
# ~/bin/net_breaker BreakCommCmd 192.168.1.101
Then to remove the rule run the following:
# ~/bin/net_breaker FixCommCmd 192.168.1.101
Physical and Virtual machines on RHEL 7 or later
For RHEL 7 or later use `firewalld` to simulate a network failure.
In order to effectively simulate a network block (and thereby a network failure), both of the following must be done (merely removing the high-availability service alone would produce sporadic results):
- Remove the `high-availability` service in `firewalld`.
- Add a drop rule for outbound and inbound traffic for any UDP streams where the cluster uses the same local port as remote port (`corosync` port: 5405).
Start `firewalld` if the service is not already running.
# systemctl start firewalld
# systemctl status firewalld
Note that the firewall-cmd commands below do not save the changes to a file; they modify only the running (in-memory) instance of firewalld. When the host reboots (or firewalld is restarted), firewalld will load the rules and services that were saved before the firewall-cmd commands below were run.
Cluster member nodes
Verify that all the cluster member nodes have joined the cluster.
# corosync-quorumtool
Get the active zone that is currently in use and verify that the interface used for corosync is included in that zone's list of interfaces.
# firewall-cmd --get-active-zones
Run the following command to drop the ports used for cluster communication:
# firewall-cmd --zone=<zone> --remove-service=high-availability
Add the rules to drop outbound and inbound traffic on port 5405 for corosync. The cluster node will lose the token and be fenced once firewalld starts dropping packets on port 5405:
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
# firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
To add the high-availability service back and verify it was added, run the following commands:
# firewall-cmd --zone=<zone> --add-service=high-availability
# firewall-cmd --zone=<zone> --list-all
To remove the rules above and verify they were removed, run the following commands:
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
# firewall-cmd --direct --get-all-rules
# firewall-cmd --remove-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
# firewall-cmd --list-rich-rules
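For convenience, the firewall-cmd steps above can be collected into a RHEL 7+ analogue of the `net_breaker.sh` script. This is a sketch under assumptions: the `break_comm`/`fix_comm` function names and the `DRYRUN` switch are illustrative, and the zone defaults to `public`, which must be replaced with your active zone. Run as root on the cluster node under test:

```shell
# fw_breaker.sh -- RHEL 7+ analogue of net_breaker.sh using firewall-cmd.
# ZONE must match the active zone on the node; set DRYRUN=1 to print the
# commands instead of executing them (useful for reviewing before a test).
ZONE=${ZONE:-public}
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

break_comm() {
  # Remove the high-availability service, then drop corosync's UDP port 5405
  # in both directions (direct rule for outbound, rich rule for inbound).
  run firewall-cmd --zone="$ZONE" --remove-service=high-availability
  run firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
  run firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
}

fix_comm() {
  # Reverse the three changes made by break_comm.
  run firewall-cmd --zone="$ZONE" --add-service=high-availability
  run firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
  run firewall-cmd --remove-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
}
```

As with the individual commands, none of these changes are permanent; a reboot or firewalld restart restores the saved configuration.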
Cluster Remote Nodes
Run the following command to drop the ports used for cluster communication:
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p tcp --sport=3121 -j DROP
To remove the rule above and verify it was removed, run the following commands:
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p tcp --sport=3121 -j DROP
# firewall-cmd --direct --get-all-rules
Example for Cluster Members
In this example the zone is called `public`. We do not make the rules permanent (save them), so the rules added below will not be present after the host reboots.
# firewall-cmd --get-active-zones
public
interfaces: ens6 ens3
# firewall-cmd --zone=public --remove-service=high-availability
success
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
success
# firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
success
Related Articles
- Support Policies for RHEL High Availability clusters - Transport Protocols
- Support Policies for RHEL High Availability Clusters - Cluster Interconnect Network Interfaces
- Support Policies for RHEL Resilient Storage - Cluster Network Interconnect Interfaces
- RHEL 7
- RHEL 8
- RHEL 9
Root Cause
The corosync service on Red Hat Enterprise Linux (RHEL) 6 is designed to run in an environment where interfaces can be taken out of service while messaging continues. When an interface state changes (up/down), virtual synchrony guarantees are occasionally violated for other nodes in the cluster, which could result in a non-operable cluster. More specifically, corosync rebinds to the localhost network so it can continue transmitting messages; it ignores the fact that the interface went down and continues to filter messages up and down the stack, while totem periodically attempts to rebind to the interface.
corosync behaves in this manner so that virtual synchrony is maintained (all cluster nodes that were connected to the ifdowned node would see that node as having disappeared), and the cluster node isolated with ifdown would behave as though the network were still operational on its local NIC, while being unable to see the other cluster nodes.
Using ifdown or ifup on a network interface that is used for cluster communication is not a valid way of testing how the cluster behaves in the event of a network failure (and unexpected results will occur using this method).
Openais in RHEL 5 Clustering behaves in a similar fashion.
For more information, see the article Corosync and ifdown on active network interface.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.