Is there a way to limit multipath failover times in order to avoid Oracle RAC cluster evictions?
Environment
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 8
- Oracle RAC
- Fibre Channel SAN storage
Issue
- Multipath takes too long to react during SAN failures, exceeding Oracle RAC cluster timeouts and triggering evictions.
- The voting disk timeout (disktimeout) is 200s.
- The network heartbeat timeout (css_misscount) is 30s.
- The SDTO (short disk timeout) is 27s.
- The SDTO is not publicly documented.
| Network Ping | Disk Ping | Reboot |
|---|---|---|
| Completes within misscount seconds | Completes within misscount seconds | N |
| Completes within misscount seconds | Takes more than misscount seconds but less than disktimeout seconds | N |
| Completes within misscount seconds | Takes more than disktimeout seconds | Y |
| Takes more than misscount seconds | Completes within misscount seconds | Y |
These messages show the SDTO:

```
[ CSSD][xxxxxxx]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
```

This indicates that more than the disk timeout of 27000 ms has passed since the last NHB (network heartbeat).
Resolution
Note: The following recommendations are provided as an initial configuration for tuning Red Hat Enterprise Linux components in combination with Oracle RAC clusters. The combination of settings documented below has been found to work in a number of cases; however, we advise testing these configurations, as fine-tuning may be required to match specific environments.
- The following settings in `/etc/multipath.conf` can limit the time required to detect and handle several types of path failures (mainly the ones detected by the multipath path checkers and the ones detected at the FC layer):

  ```
  defaults {
      user_friendly_names yes
      find_multipaths yes
      max_fds 8192
      checker_timeout 5
      polling_interval 5     # FN.1
      no_path_retry 2
      dev_loss_tmo 10        # FN.2 >= polling_interval * no_path_retry
      fast_io_fail_tmo 5
  }
  ```

  Footnotes:
  - FN.1: In environments with a large number (e.g. thousands) of LUNs and paths, `polling_interval` needs to be increased (e.g. to 10). When `polling_interval` is increased, the minimum value of `dev_loss_tmo` also changes. More details can be found in the next note.
  - FN.2: In recent versions of device-mapper-multipath, the minimum possible value of `dev_loss_tmo` is `no_path_retry * polling_interval`. This is described in more detail in the article: Unable to set custom 'dev_loss_tmo' value in RHEL7. The value of `dev_loss_tmo` is a trade-off between reducing the overhead of continuously monitoring failed paths and the load of re-detecting deleted paths once they re-appear. There are situations (usually when there is a very large number of paths and re-detecting deleted paths can cause a non-negligible load spike) in which it is better to increase `dev_loss_tmo` in order to avoid deleting and re-detecting paths. The `dev_loss_tmo` value should never be set equal to or greater than the device's `io_timeout` value. Testing will be required to determine the most suitable value.
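As a quick sanity check of the relationship described in FN.2, the minimum acceptable `dev_loss_tmo` for the values used in the example above can be computed directly. This is an illustrative sketch; the shell variables below are stand-ins for the `multipath.conf` options, not values read from the system:

```shell
# Illustrative sanity check: with the example multipath.conf values,
# dev_loss_tmo must be at least no_path_retry * polling_interval.
polling_interval=5
no_path_retry=2
min_dev_loss_tmo=$(( no_path_retry * polling_interval ))
echo "dev_loss_tmo must be >= ${min_dev_loss_tmo} seconds"
```

Note that with `polling_interval` raised to 10 (as suggested in FN.1 for large environments), the minimum becomes 20, so the example value of `dev_loss_tmo 10` would no longer be accepted.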
- After making changes in `multipath.conf`, reload the `multipathd` service to make the changes effective:

  ```
  # service multipathd reload
  ```
- If multipath is included in the initramfs, rebuilding the initramfs and rebooting with the new initramfs is also needed. The reboot is required to make sure that the changes are applied at boot time; for the currently running system, reloading the `multipathd` service is enough.

  ```
  # cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.$(date +%m-%d-%H%M%S).bak    <--- first create a backup of the existing initramfs
  # dracut -f -v                                                                                    <--- rebuild the initramfs
  ```
- The parameters can be verified using:

  - `fast_io_fail_tmo`:

    ```
    # for f in /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
    ```

  - `dev_loss_tmo`:

    ```
    # for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do d=$(dirname $f); echo $(basename $d):$(cat $d/node_name):$(cat $f); done
    ```
- More details on these parameters can be found in the multipath.conf(5) man page:

  ```
  # man multipath.conf
  ```
- The following settings limit the time required for the SCSI layer to time out and complete error handling:
  - SCSI command timeout: This can be changed using a udev rule. It is not recommended to set this lower than 3 seconds under any circumstances (5 for iSCSI), otherwise error handling can be triggered even when it is not necessary. The default value is 30 in recent Red Hat Enterprise Linux releases; a value of 20 can be a starting point. It is good to confirm with the storage vendor that the storage array can handle values lower than the default. This can be set by creating the file `/etc/udev/rules.d/90-scsitimeout.rules` (or modifying it if it already exists) with the following content:

    ```
    ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0|7|14", \
        RUN+="/bin/sh -c 'echo 20 > /sys$$DEVPATH/timeout'"
    ```
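After the rule is in place (and the devices have been re-added, or `udevadm trigger` has been run), the applied command timeout can be read back from sysfs. The loop below is a sketch; `SYSFS_ROOT` is a hypothetical variable used only so the logic can be exercised against a test tree — on a live system it is simply `/sys`:

```shell
# Print the current SCSI command timeout for every sd* block device.
# SYSFS_ROOT defaults to /sys on a live system; it is parameterized
# here purely for illustration and testing.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}
for t in "$SYSFS_ROOT"/block/sd*/device/timeout; do
    [ -e "$t" ] || continue
    echo "$t: $(cat "$t")"
done
```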
  - Error handling timeouts: These are controlled by two parameters, `eh_timeout` and `eh_deadline`. It is recommended to leave `eh_timeout` at the default value of 10. `eh_deadline` is not set by default; a starting value can be 15. It can be set using the following loop:

    ```
    # for i in $(ls /sys/class/fc_host/); do echo 15 > /sys/class/scsi_host/$i/eh_deadline; done
    ```

    Alternatively, a custom udev rule can be used (in order to have the setting applied during boot), as described in the article: How to set eh_deadline and eh_timeout persistently, using a udev rule
  - The change can be confirmed with the following loop (the number of hosts will vary in your environment):

    ```
    # for i in $(ls /sys/class/fc_host/); do echo $i; cat /sys/class/scsi_host/$i/eh_deadline; done
    host0
    15
    host1
    15
    ```
- More details on the timeouts mentioned in this section can be found in the article: Limiting path failover time for SCSI devices
- Tuning at the application layer:

  Note: The following settings are related to a 3rd party product (Oracle RAC). You will need to reach out to the 3rd party vendor (Oracle) for more details.
  - `css misscount` needs to be increased (e.g. to 90s). Oracle RAC defaults to a 30s misscount (network heartbeat timeout), which is too tight compared to the timeouts discussed in the multipath and SCSI sections above. When a server is in recovery and the Oracle binaries are on the SAN, the time multipath spends blocked in recovery can exceed the Oracle misscount. The network heartbeat daemons will not run while the filesystem is blocked, and this can cause evictions. Setting the Oracle misscount to 90s to align with the timeouts in the lower layers is a common setting.
  - `_asm_hbeatiowait` needs to be increased (e.g. to 120s). The default value of 15 can cause an ASM heartbeat timeout on an ASM disk during path failover or the SCSI error handling process, which would evict the cluster node before multipath is able to react.
  - Ensure that all cluster nodes run the same package versions.
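For reference, the misscount change discussed above is typically made with Oracle's own `crsctl` utility. This is an illustrative third-party command transcript only, under the assumption of a standard Grid Infrastructure deployment; verify the exact procedure for your Oracle version with Oracle before changing these values (`_asm_hbeatiowait` is a hidden ASM parameter that is set from within the ASM instance, not via `crsctl`):

```shell
# Illustrative only: query and raise the CSS misscount with Oracle's
# crsctl utility (run as a Grid Infrastructure administrator).
crsctl get css misscount
crsctl set css misscount 90
```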
- Also see the following for additional details on shortening failover time to surviving paths in a Fibre Channel environment:
  - How to set dev_loss_tmo and fast_io_fail_tmo persistently, using a udev rule
  - Multipath is not detecting path failures fast enough which results in application failure and system reboots
To lengthen failover timeouts in order to help prevent filesystems from entering read-only mode:
Root Cause
Default timeouts configured at the application layer are too low compared with the very relaxed default timeouts at the lower layers of the operating system. As a result, the application (Oracle RAC) evicts nodes before the OS completes its error handling procedures (sometimes even before error handling starts), while the OS is still waiting for its timeouts to expire before taking action.
Diagnostic Steps
- Such events frequently leave few traces in the logs of the evicted node, because the eviction happens too soon, before the OS detects the problem.
- If the underlying problem that caused the eviction also affected nodes that were not evicted, then logs on the nodes that remained online can provide additional details.
- In some cases the event is also detected by the evicted node, but the corresponding messages are still in memory at the moment of the eviction and are lost in the reboot. In such cases a serial console can be of help. If serial console logging needs to be enabled, it is recommended to set the serial console log level to 4; this avoids a known issue where excessive logging leads to a system hang. This can be done with:

  ```
  # dmesg -n 4
  ```

  Or by setting it persistently in grub.conf using the loglevel directive.
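A sketch of making the log level persistent, assuming a RHEL 7/8 layout (on RHEL 6, the `loglevel=4` parameter is instead appended to the `kernel` line in `/boot/grub/grub.conf`):

```shell
# /etc/default/grub (RHEL 7/8): add loglevel=4 to the kernel command line.
# The "..." stands for whatever parameters are already present.
GRUB_CMDLINE_LINUX="... loglevel=4"

# Then regenerate the grub configuration:
# grub2-mkconfig -o /boot/grub2/grub.cfg
```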
- Cluster logs can reveal why the cluster decided to evict the node.
- Reviewing the timeouts at the OS level as described in the "Resolution" section and comparing them to the timeouts in the application can reveal the mismatch.
- If none of the steps above provides any valuable information, it is possible to trigger a crash instead of power-cycling the node during a cluster eviction, as described in: How to capture kernel crash dump (vmcore) upon Oracle RAC 11gR2 or Oracle RAC 12 node eviction? and How to capture crash dump (vmcore) upon Oracle RAC 11g R1 (or earlier) node eviction?. The aim is to generate a vmcore in order to understand the system state at the moment of the eviction.
- Additional (performance related) tuning that can affect the functionality of the cluster and the database can be found in: Tuning Red Hat Enterprise Linux for Oracle and Oracle RAC performance.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.