Understanding EgressIP Failover timings in OpenShift 4 for SDN and OVN

Overview:

This article discusses the expected behavior and handling of OpenShift EgressIP on both the OpenShift-SDN and OVN-Kubernetes network plugins. It primarily addresses failover timings and behavior as the egressIP transfers between nodes, and what to expect from applications that rely on the egress addresses to communicate with external hosts.

The contents of this article are written for OpenShift versions 4.1 through 4.15. Future versions of the platform may use different values.

Table of contents:

  • OpenShift-SDN Failover logic and handling
  • OVN-Kubernetes Failover logic
  • Understanding High-Availability with multiple EgressIP
  • Testing EgressIP Failover times

OpenShift-SDN Failover logic and handling:

  • OpenShift-SDN polls each egress node for reachability on a fixed 5-second interval, defined in the SDN source:

	// DefaultPollInterval default poll interval used for egress node reachability check
	DefaultPollInterval = 5 * time.Second

=====================================================

OVN-Kubernetes Failover logic:

  • OpenShift-OVN-Kubernetes takes approximately 12-15 seconds to complete an egressIP failover to a new host node. A 5-second window is defined to confirm a timeout, the check is run twice (10 seconds), and the re-assignment process then takes about two seconds, plus the overhead of actually completing the reconcile to add the egressIP to the new node and update the logical flows.

  • Unlike SDN, OVN-Kubernetes exposes a tunable timeout value for this reachability check:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: 
        reachabilityTotalTimeoutSeconds: 5 ##<---- tunable
      gatewayConfig:
        routingViaHost: false
      genevePort: 6081

*The egressIPConfig field holds the configuration options for the EgressIP object. Changing these configurations allows you to tune the behavior of the EgressIP object.

**The reachabilityTotalTimeoutSeconds field accepts integer values from 0 to 60. A value of 0 disables the reachability check for egressIP nodes. Values of 1 to 60 set the total timeout, in seconds, for the egressIP node reachability check: if a node cannot be reached within that window, the check is considered failed.

const (
	egressIPReachabilityCheckInterval = 5 * time.Second
)
// Blatant copy from: https://github.com/openshift/sdn/blob/master/pkg/network/common/egressip.go#L499-L505
// Ping a node and return whether or not we think it is online. We do this by trying to
// open a TCP connection to the "discard" service (port 9); if the node is offline, the
// attempt will either time out with no response, or else return "no route to host" (and
// we will return false). If the node is online then we presumably will get a "connection
// refused" error; but the code below assumes that anything other than timeout or "no
// route" indicates that the node is online.
func (e *egressIPDial) dial(ip net.IP, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip.String(), "9"), timeout)
	if conn != nil {
		conn.Close()
	}
	if opErr, ok := err.(*net.OpError); ok {
		if opErr.Timeout() {
			return false
		}
		if sysErr, ok := opErr.Err.(*os.SyscallError); ok && sysErr.Err == syscall.EHOSTUNREACH {
			return false
		}
	}
	return true
}
// Probe checks the health of egress ip service using a connected gRPC session.
func (ehc *egressIPHealthClient) Probe(dialCtx context.Context) bool {
	if ehc.conn == nil {
		// should never happen
		klog.Warningf("Unexpected probing before connecting %s", ehc.nodeName)
		return false
	}

	response, err := NewHealthClient(ehc.conn).Check(dialCtx, &HealthCheckRequest{Service: serviceEgressIPNode})
	if err != nil {
		// check failed. What we will return here will depend on ehc.probeFailed. If this is the first failure,
		// let's tolerate it to account for cases where session went down and we just need it re-established.
		// Otherwise, declare it failed.
		klog.V(5).Infof("Probe failed %s (%s): %s", ehc.nodeName, ehc.nodeAddr, err)
		ehc.Disconnect()
		prevProbeFailed := ehc.probeFailed
		ehc.probeFailed = true
		return !prevProbeFailed
	}
	// NOTE: the success path of this function is elided here; see the full
	// implementation in the upstream ovn-kubernetes source.
}
  • The failover itself takes somewhere around 1-2 seconds to update the flow rules on nodes (depending on cluster size, this may be faster or slower), which brings the total failover time to roughly 12-15 seconds. That window can be reduced by lowering the timeout period used during the checks above.

  • Here is a relevant bug opened to try and reduce the timeout logic further:
    https://issues.redhat.com/browse/SDN-4470
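Putting the numbers above together, here is a rough back-of-envelope for the failover window as a function of the configured reachability timeout. estimateFailoverSeconds is an illustrative helper, not an OVN-Kubernetes API; it assumes two missed check windows plus roughly two seconds of re-assignment, and ignores reconcile overhead, so treat it as a floor:

```go
package main

import "fmt"

// estimateFailoverSeconds estimates the failover window: the reachability
// check must miss twice (2 x timeout), then re-assignment takes ~2 seconds.
func estimateFailoverSeconds(reachabilityTotalTimeoutSeconds int) int {
	const reassignSeconds = 2
	return 2*reachabilityTotalTimeoutSeconds + reassignSeconds
}

func main() {
	fmt.Println(estimateFailoverSeconds(5)) // default 5s timeout -> 12, consistent with the observed 12-15s
	fmt.Println(estimateFailoverSeconds(1)) // an aggressive 1s timeout -> 4
}
```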

=======================================================================

Understanding High-Availability with multiple EgressIP:

  • For high-availability requirements, always consider using multiple egress addresses to reduce the impact when an egress IP goes offline.
  • Using a single EgressIP in production is not advised, as the failover timeframe will lead to a downtime window roughly equivalent to the length of time it takes to fail over to a neighbor node.
  • A namespace configured with 2 or more egress IP addresses will utilize all available egress addresses roughly equally: each new connection is distributed across the available egress hosts at roughly the same rate.
  • The selection process both SDN and OVN use to choose between available egressIPs is approximately random, based on selection logic similar to that used for distributing service traffic to backends. For more information on this process, see the article covering selection methods for SDN and OVN.
  • During an outage/failover event, the newly unavailable egressIP remains a candidate for the pod's traffic until the failover completes. Until then, each new connection has a roughly equal chance of selecting the remaining online host or the downed egressIP. As a result, expect new connections to see a partial interruption, since there is a chance the pod will egress via the now-unavailable address while it fails over to a new neighbor node. See the testing section below for an example of what this looks like from the pod's perspective.
  • You can reduce the wait in such scenarios by lowering your application timeouts and increasing retry sensitivity to force a new connection attempt immediately when a connection breaks because an egressIP went offline. If the pod can retry near-immediately on a no-route-to-host or connection timeout, the next attempt has a chance to select the egressIP that is still online, reducing downtime between calls.
  • On OVN, reducing the value of reachabilityTotalTimeoutSeconds speeds up how quickly nodes are determined to be down, shrinking the failover window even further.

=======================================================================

Testing EgressIP Failover times

  • You may deploy an egressIP test pod that calls an upstream resource to observe the traffic spread during the failover timeframe. The pod below is designed to curl an external target via the egressIP, with specific curl options to ensure we do not wait indefinitely for a missing reply; without them, the egress failover can appear to take several minutes, when in reality we are just waiting for a failed curl to time out.
apiVersion: v1
kind: Pod
metadata:
  name: egress-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-name>
  containers: 
  - command: ['bash', '-c', 'while true; do  echo -n $(date) -- ; curl --connect-timeout 1 -m 2 -k -w "response: %{response_code}\n" -o /dev/null -s <address>:[port]; done']
    image: registry.redhat.io/rhel7/rhel-tools
    imagePullPolicy: IfNotPresent
    name: test-egress-pod
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  • Highlighting the curl syntax:
curl --connect-timeout 1 -m 2 -k -w "response: %{response_code}\n" -o /dev/null -s <address>:[port]

#options:
--connect-timeout 1 : maximum time allowed for the TCP (and TLS) handshake to complete, set to 1 second
-m 2 : maximum time for the whole transfer to complete, set to 2 seconds (in our test it is assumed this is sufficient time to return a 200)
-w "response: %{response_code}\n" : print only the response code (expected to be "response: 200" given a 200 reply from our target)
-o /dev/null -s : discard the response body and silence progress output
  • Output is expected to look similar to the below, highlighting a failover event on an OpenShift-SDN cluster with two egressIP addresses that lasts approximately 12 seconds. In the example, the curl results from the egress pod go from stable, continuous 200 replies to intermittent failures as the pod tries both egress pathways, failing to connect on the unavailable one and succeeding on the stable/remaining host until the rollover is completed:
Tue Mar 19 17:13:05 UTC 2024 -- response: 200
Tue Mar 19 17:13:05 UTC 2024 -- response: 200
Tue Mar 19 17:13:06 UTC 2024 -- response: 200
Tue Mar 19 17:13:06 UTC 2024 -- response: 200 
Tue Mar 19 17:13:06 UTC 2024 -- response: 000 #failover start - observe inconsistent response pattern begin:
Tue Mar 19 17:13:07 UTC 2024 -- response: 000
Tue Mar 19 17:13:08 UTC 2024 -- response: 200
Tue Mar 19 17:13:08 UTC 2024 -- response: 000
Tue Mar 19 17:13:09 UTC 2024 -- response: 200
Tue Mar 19 17:13:09 UTC 2024 -- response: 000
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 000
Tue Mar 19 17:13:11 UTC 2024 -- response: 000
Tue Mar 19 17:13:12 UTC 2024 -- response: 000
Tue Mar 19 17:13:13 UTC 2024 -- response: 000
Tue Mar 19 17:13:14 UTC 2024 -- response: 000
Tue Mar 19 17:13:15 UTC 2024 -- response: 200
Tue Mar 19 17:13:15 UTC 2024 -- response: 200
Tue Mar 19 17:13:15 UTC 2024 -- response: 000
Tue Mar 19 17:13:16 UTC 2024 -- response: 000
Tue Mar 19 17:13:17 UTC 2024 -- response: 000
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 000
Tue Mar 19 17:13:19 UTC 2024 -- response: 200 #failover completed - observe response stability resume
Tue Mar 19 17:13:19 UTC 2024 -- response: 200
Tue Mar 19 17:13:19 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200