Receiving intermittent 503 HTTP Response codes from RHOCP 4 routes

Solution Unverified - Updated

Environment

  • Red Hat Openshift Container Platform (RHOCP) 4
  • HAProxy 2
  • IngressController

Issue

  • Why am I receiving HTTP 503 responses when traffic goes to my application pod running in Openshift Container Platform?
  • How can I tell if the HTTP 503 response code is derived from Openshift Container Platform?
  • What does a 503 HTTP response code mean?
  • Is there a document that shows why varying 503's may occur from the Openshift Container Platform Ingress Controllers?

Resolution

You can determine if the 503 is derived from the Openshift Ingress Controllers hosting haproxy in two ways:

  1. By default, Openshift configures haproxy with a 503 response body, shown below:
Application is not available

The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

Possible reasons you are seeing this page:

    The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
    The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
    Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.

Note: This response body can be customized. See the latest Documentation.
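As a sketch of the customization step (the config map name is a placeholder; the file names and the httpErrorCodePages field follow recent 4.x documentation, so verify against your cluster version), a custom 503 page can be supplied via a config map referenced by the IngressController:

```shell
# Create a config map in openshift-config containing the custom error pages.
# The files must be named error-page-503.http and error-page-404.http and
# contain a full raw HTTP response (status line, headers, blank line, body).
oc -n openshift-config create configmap my-custom-error-code-pages \
    --from-file=error-page-503.http \
    --from-file=error-page-404.http

# Point the default IngressController at the config map.
oc -n openshift-ingress-operator patch ingresscontroller/default \
    --type=merge \
    -p '{"spec":{"httpErrorCodePages":{"name":"my-custom-error-code-pages"}}}'
```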

  2. Enable haproxy access logging as specified in the Documentation.
    Note: If the 503 originates from a backend application pod that terminates TLS itself (for example, on a passthrough route), the response is encrypted end to end, so the 503 will not be visible in the haproxy access logs.
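For reference, access logging can be enabled on the default IngressController by setting a logging destination (a sketch based on the IngressController API; adjust to your environment):

```shell
# Send haproxy access logs to a sidecar container named "logs" in each
# router pod. spec.logging.access.destination is part of the
# IngressController API.
oc -n openshift-ingress-operator patch ingresscontroller/default \
    --type=merge \
    -p '{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'

# Tail the access logs from the router pods afterwards:
oc -n openshift-ingress logs deploy/router-default -c logs
```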

If it is determined that the Openshift Ingress controller is the source of the 503, there are numerous reasons why this may occur; see the Root Cause section. Below are potential solutions based on root cause:


Potential Solutions to Common Issues with HTTP Response Code 503 from Openshift's Default Ingress Controllers

Host does not exist
  • If the URL was entered correctly on the client side, ensure the route exists (see the Diagnostic Steps section), and create it if it does not.
Host exists but does not have a matching path
  • If the URL and path were entered correctly on the client side, ensure that the path was added to the route.
Backend pods are down
  • Check the health of the backend pods by inspecting pod status and project events.
  • If backend pods are showing intermittent failures, inspect previous pod logs to understand why the pod is failing:
$ oc logs <pod> -p -n <namespace>
  • Backend application pods can fail for many reasons:
    • The issue is often with the application itself and mistaken for an Openshift issue
    • Check if application is exceeding a set threshold like cpu/memory limits
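For example, resource pressure on a backend pod can be checked with commands like the following (pod and namespace names are placeholders):

```shell
# Show current CPU/memory usage of the pods (requires cluster
# monitoring/metrics to be available).
oc adm top pod -n <namespace>

# Check whether a container was previously killed for exceeding its
# memory limit (Reason: OOMKilled) or restarted on failure.
oc describe pod <pod> -n <namespace> | grep -A5 'Last State'
```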
Router pods are failing to accept traffic
  • Ensure the nodes hosting the router pods have sufficient resources (CPU/memory); see the Root Cause and Diagnostic Steps sections.
Passthrough connection which terminates TLS at haproxy backend has no SNI
  • Inspect client traffic to understand why no SNI is configured on TLS client hello.
  • If SNI is configured by the client application, check whether an intermediate component (such as a load balancer or proxy) is stripping the SNI.
tlsInspectDelay tuning may need to be implemented
  • If it is determined that there is a delay between the initial TCP handshake and the TLS client hello, you can tune the tlsInspectDelay value. See the documentation. The value can be defined by editing the default ingresscontroller object in the openshift-ingress-operator namespace and setting spec.tuningOptions.tlsInspectDelay to something longer than the default 5s.
  • Note: Long delays can make haproxy more vulnerable to DoS attacks, so tune accordingly, or troubleshoot to find the bottleneck in your infrastructure.
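A sketch of the tuning step (the 15s value is only an example; pick a value based on the observed delay):

```shell
# Raise the TLS inspect delay on the default IngressController from the
# default 5s. spec.tuningOptions.tlsInspectDelay is part of the
# IngressController API.
oc -n openshift-ingress-operator patch ingresscontroller/default \
    --type=merge \
    -p '{"spec":{"tuningOptions":{"tlsInspectDelay":"15s"}}}'
```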

Root Cause

HTTP Response Code 503 is a server-side response code indicating 'Service Unavailable'. Typically it means that the server is unable to handle the request because it is down or overloaded. If the issue is specific to the Openshift Ingress controllers, then it is likely that the containerized haproxy process (which runs inside the Openshift Ingress Controller pods) is sending the 503. This haproxy instance may send the 503 for the following reasons:

Potential Causes to Common Issues with HTTP Response Code 503 from Openshift's Default Ingress Controllers

Host does not exist (Most obvious)
  • This means that the route is non-existent in the haproxy.config. Either it was typed incorrectly, or it is failing to be set up. As routes are set up with the *.apps domain established on cluster install, you can easily see this one for yourself by running the following:
$ curl -v http://test.apps.<cluster>.<basedomain>
Host exists but does not have a matching path
  • Ensure that the path was configured correctly via the route object as specified in the Documentation.
  • Check if the path was typed correctly when client goes to the specified route.
Backend pods are down
  • Haproxy maps a frontend route to have traffic directed to backend pods. If these pods are not accepting traffic then haproxy will respond to the client with a 503.
Router pods are failing to accept traffic
  • This is often caused by router pods failing liveness and readiness probes because the node hosting the router pod lacks sufficient resources (memory/CPU) to handle all the requests.
  • The liveness probe runs an HTTP GET request on localhost:1936/healthz:
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  • The readiness probe does an HTTP GET request on localhost:1936/healthz/ready:
    readinessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz/ready
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
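The same health endpoints can be queried manually from inside a router pod to confirm they respond (pod name is a placeholder, and this assumes curl is available in the router image):

```shell
# Query the liveness and readiness endpoints directly; both should
# return HTTP 200 on a healthy router pod.
oc -n openshift-ingress exec <router-pod> -- \
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:1936/healthz
oc -n openshift-ingress exec <router-pod> -- \
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:1936/healthz/ready
```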
Connection timeouts
  • If haproxy cannot establish a connection to a backend pod before its connect timeout expires, it responds to the client with a 503.
Passthrough connection which terminates TLS at haproxy backend has no SNI
  • Openshift utilizes SNI on encrypted traffic to determine the correct backend to which to forward the traffic. Without this TLS extension, Openshift sends the traffic to the fe_no_sni frontend, and it is forwarded to the openshift_default backend with no server (<NOSRV>).
<DATE> compute-0 compute-0.cluster.example.com haproxy[XX]: XX.XX.XX.XX:xxxxx [<DATE>] fe_no_sni~ openshift_default/<NOSRV> 0/-1/-1/-1/0 503 2660 - - SC-- X/X/X/X/X 0/0 "GET /healthz HTTP/1.1"
tlsInspectDelay tuning may need to be implemented
  • haproxy has a setting called tcp-request inspect-delay which places a timer on a specific layer of haproxy.
  • This layer is the tcp layer in Openshift 4.x and the rule can be observed here:
  tcp-request inspect-delay {{ firstMatch $timeSpecPattern (env "ROUTER_INSPECT_DELAY") "5s" }}
  tcp-request content accept if { req_ssl_hello_type 1 }
  • Essentially this demonstrates that a tcp request destined for edge, re-encrypt or passthrough routes has an inspect-delay of 5 seconds; if no TLS client hello arrives within that window, the request is marked invalid and the traffic is sent to the fe_no_sni frontend and the openshift_default backend.
  • A 503 can be observed when the delay between the initial TCP handshake and the TLS client hello exceeds 5 seconds.
  • This particular parameter should be handled with care because larger delay values may make haproxy vulnerable to Denial of Service attacks. This can be tuned, but you should also look into what is causing the delay.

Diagnostic Steps

Most 503's in Openshift can be diagnosed by enabling access logging and running a tcpdump from the nodes hosting the ingress controllers. Together these provide details from the haproxy perspective as well as packet captures for inspecting traffic behavior on the network. If specifics on backend pods are needed after analysis, follow up with packet captures from the backend pods as specified here. Below are more specific examples:

Diagnosing Common Issues with HTTP Response Code 503 from Openshift's Default Ingress Controllers

Host does not exist
  • Check to ensure the route exists:
$ oc get route -n <namespace>
Host exists but does not have a matching path
  • Path can be inspected through the route object as well:
$ oc get route -n <namespace>
$ oc describe route <route_name> -n <namespace>
Backend pods are down
  • Inspect the backing application pods.
  • Check to ensure the containers are not restarting:
$ oc get pods -o wide -n <namespace>
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                                             
mypod-xxxxxxxxxx-xxxxx    1/1     Running   2          8h    xx.xx.xx.xx   worker-0.cluster.example.com 
                                            ^
Router pods are failing to accept traffic
  • Check the events of the openshift-ingress namespace to see if probe failures are occurring.
$ oc get events -n openshift-ingress --sort-by '.lastTimestamp'
  • Check the logs of the ingress-controller pods to ensure there are no problems.
$ oc logs <pod> -n openshift-ingress
  • If router pods are consistently being restarted, check to ensure that the node host has enough CPU, Memory, and also inspect Network statistics to ensure that the host is not dropping packets, or suffering from full socket buffers.
$ oc debug node/<node_name>
$ chroot /host
$ top                             # Check for high cpu load
$ free -m                         # Check for memory saturation
$ netstat -s                      # Check network statistics for RX/TX drops, errors, or socket buffer full logs
Connection timeouts
  • Check to see if the 503 can be reproduced with curl:
while true; do
  echo -e '\n' "==========" '\n'
  date "+%d.%m.%y %H:%M:%S:%N"
  curl -k -s --show-error -o /dev/null \
    -w "\n HTTP Response Code: %{http_code} \n\n namelookup: %{time_namelookup}\n connect: %{time_connect}\n appconnect: %{time_appconnect}\n ---\n total: %{time_total}\n\n" \
    https://myapp.apps.cluster.example.com
  sleep 1
done
  • If the 503 can be reproduced, check the response time and compare it to the timeouts specified in the Documentation.
Passthrough connection which terminates TLS at haproxy backend has no SNI
  • Via ingress access logging, check whether your traffic is being sent to the fe_no_sni frontend
  • A packet capture can be collected and uploaded to Red Hat support for review
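As a quick filter, grep the router access logs for the fe_no_sni frontend. The sketch below runs the filter against a redacted sample line; on a live cluster you would pipe `oc logs <router-pod> -n openshift-ingress -c logs` into the same grep:

```shell
# Redacted sample access-log line of the kind produced when no SNI is
# present; real lines come from the router pod's access logs.
sample='<DATE> compute-0 haproxy[XX]: XX.XX.XX.XX:xxxxx [<DATE>] fe_no_sni~ openshift_default/<NOSRV> 0/-1/-1/-1/0 503 2660 - - SC--'

# Traffic that hit the no-SNI path is easy to isolate:
echo "$sample" | grep 'fe_no_sni'
```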
tlsInspectDelay tuning may need to be implemented
  • This is a difficult one to observe - upload a packet capture to Red Hat support for review
  • In the packet capture, the trend for this issue is that there will be a long delay (5+ seconds) after the initial TCP handshake and before the TLS client hello
  • Access logs will show something similar to the following:
<DATE> compute-0 compute-0.cluster.example.com haproxy[XX]: XX.XX.XX.XX:xxxxx [<DATE>] fe_no_sni~ openshift_default/<NOSRV> 0/-1/-1/-1/0 503 2660 - - SC-- X/X/X/X 0/0 "GET /healthz HTTP/1.1"
<DATE> compute-0 compute-0.cluster.example.com haproxy[XX]: XX.XX.XX.XX:xxxxx [<DATE>] public_ssl be_no_sni/fe_no_sni X/X/X XXX SD X/X/X/X/X 0/0
  • In the packet capture, you will see behavior similar to this:
 00:00:30.00000   xx.xx.xx.x1 → xx.xx.xx.x2  xxxxx → 443 [SYN]
 00:00:30.00500   xx.xx.xx.x2 → xx.xx.xx.x1  443 → xxxxx [SYN, ACK]
 00:00:30.00600   xx.xx.xx.x1 → xx.xx.xx.x2  xxxxx → 443 [ACK]
 00:00:52.00100   xx.xx.xx.x1 → xx.xx.xx.x2  TLSv1 Client Hello

 ** 22 seconds pass between TCP handshake and TLS Client Hello

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.