QDisk heuristic using ping is timing out when there are no other noticeable issues with the network in a RHEL cluster


Environment

  • Red Hat Cluster Suite 4+
  • Red Hat Enterprise Linux Server 5 (with the High Availability Add-On)
  • Red Hat Enterprise Linux Server 6 (with the High Availability Add-On)
  • A cluster configuration using QDisk and a ping heuristic
    • Heuristic does not use the -w option on ping

Issue

  • We have a cluster using QDisk with a heuristic pinging the default gateway. This heuristic is timing out intermittently, but there are no other signs of issues on that network (such as token losses) at the time of the problem.
  • What timeout value does a heuristic use? The amount of time reported in the logs when it times out does not match the heuristic's tko*interval value.
  • I have a ping heuristic of the following form, and occasionally I see a heuristic timeout in /var/log/messages, followed by the cluster node being evicted and fenced:
<heuristic interval="2" program="ping -c1 -t1 192.168.2.1" score="1" tko="3"/>

Oct  4 00:15:12 node1 qdiskd[6854]: <info> Heuristic: 'ping -c1 -t1 192.168.2.1' DOWN - Exceeded timeout of 9 seconds
Oct  4 00:15:12 node1 qdiskd[6854]: <notice> Score insufficient for master operation (0/1; required=1); downgrading
  • Cluster services fail over and a node is rebooted unexpectedly in a two-node cluster that uses QDisk with a heuristic configured. Some qdiskd messages were logged. What is causing the GFS2 cluster to crash?

Resolution

  • Configure the heuristic to use ping's deadline (-w) option to ensure that if a ping response is not received in a reasonable amount of time, it will fail rather than time out.
  • Use a heuristic tko greater than 1 (the default) so that in the event of a ping failure, it will be retried multiple times before determining the heuristic is down.
  • An example of a <quorumd> heuristic that uses the ping command with -w is as follows:
  <heuristic interval="2" program="ping -c1 -w1 192.168.2.1" score="1" tko="3"/>

Root Cause

  • When using a QDisk heuristic, there are two ways in which it can fail:

    • The specified program returns a non-zero value in tko consecutive attempts, or
    • The specified program does not return within the amount of time determined by quorumd's interval*(tko-1). 
  • The time it takes a heuristic to fail and the time it takes a heuristic to time out are determined by different settings. The former is the heuristic's tko*interval, while the latter is quorumd's interval*(tko-1).

  • When using ping, occasionally a response may not be received due to network factors (congestion, drops, etc.). Often this is an intermittent problem and subsequent pings will succeed. However, if you are pinging without a timeout or deadline (-w or -W), ping will wait indefinitely for a response.

  • In the context of QDisk, if only a single ping is sent (-c1) and no response is received, qdiskd must wait until ping exits with a success or failure code, which in most cases takes longer than quorumd's interval*(tko-1). So instead of the heuristic being tried tko times, with a chance to succeed on a subsequent attempt, only a single attempt is made, and it results in a timeout. Adding the deadline (-w) option ensures that ping fails quickly when no response is received rather than waiting indefinitely. Failing quickly gives QDisk the opportunity to try again, and a later attempt may succeed.
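
The two time windows and the retry behavior described above can be worked through in a few lines of shell. This is only a sketch of the arithmetic and a simplified model of the tko counting, not qdiskd's actual implementation. The function names are illustrative, and the quorumd values interval=1 and tko=10 are assumptions that do not appear in the configuration shown in this article; they were chosen because they produce the 9 seconds seen in the log message.

```shell
#!/bin/sh
# Illustrative arithmetic only; this is not qdiskd code.

# Seconds for a heuristic to be declared DOWN through repeated failures:
# heuristic tko * heuristic interval.
heuristic_fail_seconds() {
    h_interval=$1
    h_tko=$2
    echo $(( h_tko * h_interval ))
}

# Seconds after which a heuristic that has not returned is timed out:
# quorumd interval * (quorumd tko - 1).
heuristic_timeout_seconds() {
    q_interval=$1
    q_tko=$2
    echo $(( q_interval * (q_tko - 1) ))
}

# Simplified model (an assumption) of the failure path: the heuristic is
# DOWN once tko consecutive runs return non-zero; a success resets the count.
# Usage: heuristic_is_down <tko> <exit code> [<exit code> ...]
heuristic_is_down() {
    tko=$1
    shift
    misses=0
    for rc in "$@"; do
        if [ "$rc" -eq 0 ]; then
            misses=0
        else
            misses=$(( misses + 1 ))
        fi
        if [ "$misses" -ge "$tko" ]; then
            echo "DOWN"
            return 0
        fi
    done
    echo "UP"
}

# Heuristic from this article: interval="2" tko="3".
heuristic_fail_seconds 2 3        # prints 6
# Assumed quorumd settings: interval=1, tko=10.
heuristic_timeout_seconds 1 10    # prints 9
# One dropped ping followed by a success does not take the heuristic down.
heuristic_is_down 3 1 0           # prints UP
```

In this model, a ping that exits quickly with a non-zero code counts as one miss and leaves room for further attempts, whereas a ping that never returns uses up the whole timeout window in a single attempt.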

Diagnostic Steps

  • Determine if -w is used on the ping heuristic.
  • Determine if the heuristic is timing out or failing. For example, if it is timing out, qdiskd will report that it "Exceeded timeout":
Oct  4 00:15:12 node1 qdiskd[6854]: <info> Heuristic: 'ping -c1 -t1 192.168.2.1' DOWN - Exceeded timeout of 9 seconds
  • Whereas if the heuristic actually fails:
Dec 20 15:24:25 node1 qdiskd[8279]: <info> Heuristic: 'ping -c 1 -w 1 192.168.2.1' DOWN (10/10)
  • If it is failing rather than timing out, the ping is returning but with an error code, so adding -w will not solve the problem.
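
The checks above can be sketched as two small shell helpers. This is a rough aid under stated assumptions, not a Red Hat tool: the function names are hypothetical, the matching relies only on the two message formats quoted in this article, and the -w check does not handle flags bundled together in a single argument.

```shell
#!/bin/sh
# Hypothetical helpers for reading qdiskd heuristic messages.

# Classify one log line as a heuristic timeout, a heuristic failure,
# or something else, based on the message formats shown in this article.
classify_heuristic_line() {
    case "$1" in
        *"Heuristic:"*"Exceeded timeout"*) echo "timeout" ;;
        *"Heuristic:"*"DOWN"*)             echo "failure" ;;
        *)                                  echo "other" ;;
    esac
}

# Succeed (exit 0) if the heuristic program string passes a deadline to
# ping, written as -wN or -w N. The match is case sensitive, so -W alone
# does not count.
has_ping_deadline() {
    printf '%s\n' "$1" | grep -Eq '(^|[[:space:]])-w[[:space:]]*[0-9]+'
}

# Example with the heuristic from the Issue section.
if has_ping_deadline "ping -c1 -t1 192.168.2.1"; then
    echo "deadline set"
else
    echo "no deadline: add -w"
fi
```

To apply the classifier to a whole log, the lines could be fed in one at a time, for example with `grep qdiskd /var/log/messages | while IFS= read -r line; do classify_heuristic_line "$line"; done`.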

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.