Understanding EgressIP Failover timings in OpenShift 4 for SDN and OVN

Overview:

This article discusses the expected behavior and handling of OpenShift EgressIP on both the OpenShift-SDN and OVN-Kubernetes network plugins. It primarily addresses failover timings and behavior as the egressIP transfers between nodes, and what to expect from applications that rely on the egress addresses to communicate with external hosts.

The contents of this article are written for OpenShift versions 4.1 through 4.15. Future versions of the platform may use different values.

Table of contents:

  • OpenShift-SDN Failover logic and handling
  • OVN-Kubernetes Failover logic
  • Understanding High-Availability with multiple EgressIP
  • Testing EgressIP Failover times

OpenShift-SDN Failover logic and handling:

  • OpenShift-SDN polls each egress node for reachability on a fixed 5-second interval, defined in the SDN source:

	// DefaultPollInterval default poll interval used for egress node reachability check
	DefaultPollInterval = 5 * time.Second

=====================================================

OVN-Kubernetes Failover logic:

  • OpenShift-OVN-Kubernetes takes approximately 12-15 seconds to complete an egressIP failover to a new host node. A 5-second window is defined to confirm a timeout, the check is run twice (10 seconds), and the re-assignment process then takes about two seconds, plus the overhead of actually completing the reconcile to add the egressIP to the new node and update the logical flows.

  • Unlike SDN, OVN-Kubernetes exposes a tunable timeout value for this reachability check:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: 
        reachabilityTotalTimeoutSeconds: 5 ##<---- tunable
      gatewayConfig:
        routingViaHost: false
      genevePort: 6081

*The egressIPConfig field holds the configuration options for the EgressIP object. Changing these configurations allows you to tune the behavior of the EgressIP object.

**The reachabilityTotalTimeoutSeconds field accepts integer values from 0 to 60. A value of 0 disables the reachability check for egressIP nodes. Values of 1 to 60 set the total timeout, in seconds, for the egressIP node reachability check: if a node cannot be reached within that window, the check is considered failed.

const (
	egressIPReachabilityCheckInterval = 5 * time.Second
)
// Blatant copy from: https://github.com/openshift/sdn/blob/master/pkg/network/common/egressip.go#L499-L505
// Ping a node and return whether or not we think it is online. We do this by trying to
// open a TCP connection to the "discard" service (port 9); if the node is offline, the
// attempt will either time out with no response, or else return "no route to host" (and
// we will return false). If the node is online then we presumably will get a "connection
// refused" error; but the code below assumes that anything other than timeout or "no
// route" indicates that the node is online.
func (e *egressIPDial) dial(ip net.IP, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip.String(), "9"), timeout)
	if conn != nil {
		conn.Close()
	}
	if opErr, ok := err.(*net.OpError); ok {
		if opErr.Timeout() {
			return false
		}
		if sysErr, ok := opErr.Err.(*os.SyscallError); ok && sysErr.Err == syscall.EHOSTUNREACH {
			return false
		}
	}
	return true
}
// Probe checks the health of egress ip service using a connected gRPC session.
func (ehc *egressIPHealthClient) Probe(dialCtx context.Context) bool {
	if ehc.conn == nil {
		// should never happen
		klog.Warningf("Unexpected probing before connecting %s", ehc.nodeName)
		return false
	}

	response, err := NewHealthClient(ehc.conn).Check(dialCtx, &HealthCheckRequest{Service: serviceEgressIPNode})
	if err != nil {
		// check failed. What we will return here will depend on ehc.probeFailed. If this is the first failure,
		// let's tolerate it to account for cases where session went down and we just need it re-established.
		// Otherwise, declare it failed.
		klog.V(5).Infof("Probe failed %s (%s): %s", ehc.nodeName, ehc.nodeAddr, err)
		ehc.Disconnect()
		prevProbeFailed := ehc.probeFailed
		ehc.probeFailed = true
		return !prevProbeFailed
	}
	// NOTE: the success path of this function is elided here; see the full
	// implementation in the upstream ovn-kubernetes source.
}
  • The failover itself takes somewhere around 1-2 seconds to update the flow rules on nodes (depending on cluster size, this may be faster or slower), which brings the total failover time to roughly 12-15 seconds. That window can be reduced by lowering the timeout period used during the checks above.

  • Here is a relevant bug opened to try and reduce the timeout logic further:
    https://issues.redhat.com/browse/SDN-4470
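Putting the numbers above together, here is a rough back-of-envelope for the failover window as a function of the configured reachability timeout. estimateFailoverSeconds is an illustrative helper, not an OVN-Kubernetes API; it assumes two missed check windows plus roughly two seconds of re-assignment, and ignores reconcile overhead, so treat it as a floor:

```go
package main

import "fmt"

// estimateFailoverSeconds estimates the failover window: the reachability
// check must miss twice (2 x timeout), then re-assignment takes ~2 seconds.
func estimateFailoverSeconds(reachabilityTotalTimeoutSeconds int) int {
	const reassignSeconds = 2
	return 2*reachabilityTotalTimeoutSeconds + reassignSeconds
}

func main() {
	fmt.Println(estimateFailoverSeconds(5)) // default 5s timeout -> 12, consistent with the observed 12-15s
	fmt.Println(estimateFailoverSeconds(1)) // an aggressive 1s timeout -> 4
}
```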

=======================================================================

Understanding High-Availability with multiple EgressIP:

  • For high-availability requirements, always consider using multiple egress addresses to reduce the impact when an egress IP goes offline.
  • Using a single EgressIP in production is not advised, as the failover timeframe will lead to a downtime window roughly equivalent to the length of time it takes to fail over to a neighbor node.
  • A namespace configured with 2 or more egress IP addresses will utilize all available egress addresses roughly equally: each new connection is distributed across the available egress hosts at roughly the same rate.
  • The selection process both SDN and OVN use to choose between available egressIPs is approximately random, based on selection logic similar to that used for distributing service traffic to backends. For more information on this process, see the article covering selection methods for SDN and OVN.
  • During an outage/failover event, the newly unavailable egressIP remains a candidate for the pod's traffic until the failover completes. Until then, each new connection has a roughly equal chance of selecting the remaining online host or the downed egressIP. As a result, expect new connections to see a partial interruption, since there is a chance the pod will egress via the now-unavailable address while it fails over to a new neighbor node. See the testing section below for an example of what this looks like from the pod's perspective.
  • You can reduce the wait in such scenarios by lowering your application timeouts and increasing retry sensitivity to force a new connection attempt immediately when a connection breaks because an egressIP went offline. If the pod can retry near-immediately on a no-route-to-host or connection timeout, the next attempt has a chance to select the egressIP that is still online, reducing downtime between calls.
  • On OVN, reducing the value of reachabilityTotalTimeoutSeconds speeds up how quickly nodes are determined to be down, shrinking the failover window even further.

=======================================================================

Testing EgressIP Failover times

  • You may deploy an egressIP test pod that calls an upstream resource to observe the traffic spread during the failover timeframe. The pod below is designed to curl an external target via the egressIP, with specific curl options to ensure we do not wait indefinitely for a missing reply; without them, the egress failover can appear to take several minutes, when in reality we are just waiting for a failed curl to time out.
apiVersion: v1
kind: Pod
metadata:
  name: egress-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-name>
  containers: 
  - command: ['bash', '-c', 'while true; do  echo -n $(date) -- ; curl --connect-timeout 1 -m 2 -k -w "response: %{response_code}\n" -o /dev/null -s <address>:[port]; done']
    image: registry.redhat.io/rhel7/rhel-tools
    imagePullPolicy: IfNotPresent
    name: test-egress-pod
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  • Highlighting the curl syntax:
curl --connect-timeout 1 -m 2 -k -w "response: %{response_code}\n" -o /dev/null -s <address>:[port]

#options:
--connect-timeout 1 : maximum time allowed for the TCP (and TLS) handshake to complete, set to 1 second
-m 2 : maximum time for the whole transfer to complete, set to 2 seconds (in our test it is assumed this is sufficient time to return a 200)
-w "response: %{response_code}\n" : print only the response code (expected to be "response: 200" given a 200 reply from our target)
-o /dev/null -s : discard the response body and silence progress output
  • Output is expected to look similar to the below, highlighting a failover event on an OpenShift-SDN cluster with two egressIP addresses that lasts approximately 12 seconds. In the example, the curl results from the egress pod go from stable, continuous 200 replies to intermittent failures as the pod tries both egress pathways, failing to connect on the unavailable one and succeeding on the stable/remaining host until the rollover is completed:
Tue Mar 19 17:13:05 UTC 2024 -- response: 200
Tue Mar 19 17:13:05 UTC 2024 -- response: 200
Tue Mar 19 17:13:06 UTC 2024 -- response: 200
Tue Mar 19 17:13:06 UTC 2024 -- response: 200 
Tue Mar 19 17:13:06 UTC 2024 -- response: 000 #failover start - observe inconsistent response pattern begin:
Tue Mar 19 17:13:07 UTC 2024 -- response: 000
Tue Mar 19 17:13:08 UTC 2024 -- response: 200
Tue Mar 19 17:13:08 UTC 2024 -- response: 000
Tue Mar 19 17:13:09 UTC 2024 -- response: 200
Tue Mar 19 17:13:09 UTC 2024 -- response: 000
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 200
Tue Mar 19 17:13:10 UTC 2024 -- response: 000
Tue Mar 19 17:13:11 UTC 2024 -- response: 000
Tue Mar 19 17:13:12 UTC 2024 -- response: 000
Tue Mar 19 17:13:13 UTC 2024 -- response: 000
Tue Mar 19 17:13:14 UTC 2024 -- response: 000
Tue Mar 19 17:13:15 UTC 2024 -- response: 200
Tue Mar 19 17:13:15 UTC 2024 -- response: 200
Tue Mar 19 17:13:15 UTC 2024 -- response: 000
Tue Mar 19 17:13:16 UTC 2024 -- response: 000
Tue Mar 19 17:13:17 UTC 2024 -- response: 000
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 200
Tue Mar 19 17:13:18 UTC 2024 -- response: 000
Tue Mar 19 17:13:19 UTC 2024 -- response: 200 #failover completed - observe response stability resume
Tue Mar 19 17:13:19 UTC 2024 -- response: 200
Tue Mar 19 17:13:19 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200
Tue Mar 19 17:13:20 UTC 2024 -- response: 200