TLS handshake fails due to large packets discarded for OpenShift 4 on Azure

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • OpenShift SDN
      • 4.6
      • 4.7
    • OVN-Kubernetes
      • 4.8
      • 4.9
      • 4.10
  • Azure

Issue

  • TLS handshake errors occur although TCP communication is possible. A traffic capture shows large packets being discarded (see "Diagnostic Steps").
  • Unexpected ICMP "fragmentation needed" messages are received for direct (non-VXLAN-encapsulated) communication between OpenShift nodes. The MTU requested by these messages is lower than the MTU configured on both ends and/or required by any intermediate element.
  • Routing cache shows bad entries as described in "Diagnostic Steps".
  • After some time, the OpenShift Cluster becomes very slow and many operators start to become unhealthy (degraded state).

Resolution

There is a known issue with the MTU in Azure, and there are fixes already implemented for several OpenShift releases.

OpenShift SDN

Target Minor Release | Bug        | Fixed Version | Errata
4.6                  | BZ 1851549 | 4.6.38        | RHBA-2021:2641
4.7                  | BZ 1967994 | 4.7.18        | RHBA-2021:2502

OVN-Kubernetes

Target Minor Release | Bug        | Fixed Version | Errata
4.8                  | BZ 2038295 | 4.8.39        | RHBA-2022:1427
4.9                  | BZ 2038252 | 4.9.22        | RHSA-2022:0561
4.10                 | BZ 1988483 | 4.10.3        | RHSA-2022:0056

IMPORTANT: If the workaround for this issue was applied before the upgrade, you must remove the routefix DaemonSet and the whole azure-routefix namespace (for example: oc delete namespace azure-routefix). The new issue is tracked in BZ 1979312.

Workaround for previous versions

IMPORTANT NOTE: This workaround applies only to Azure, and the issue is already fixed in currently supported OpenShift releases. For other platforms and recent releases, please open a Support Case.

Check whether either the source or the destination of the failing communication has the bad ip route cache entries described in "Diagnostic Steps"; otherwise, a different issue is being hit and this solution does not apply.

Manual Workaround

To restore communication in a failing scenario, run the following as root on any affected Red Hat OpenShift Container Platform node:

# ip route flush cache

If you find the bad route cache entries and the workaround above solves the issue, please open a Support Case so Red Hat can assist and track it.

Also note that the workaround's effects are temporary, so you may need to re-apply it if the issue reappears.
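The manual check-and-flush can be wrapped in a short script. This is only a sketch: the count_mtu_overrides helper name and the grep pattern are assumptions based on the cache output format shown in "Diagnostic Steps", and the flush still requires root.

```shell
#!/bin/sh
# Sketch only: flush the route cache when cached entries pin a per-route
# MTU, matching the output format shown in "Diagnostic Steps".
# count_mtu_overrides is a hypothetical helper name, not a Red Hat tool.

# Count route-cache lines that carry a pinned MTU (e.g. "... mtu 1450").
# Reads `ip route show cache` output on stdin so it can be tested offline.
count_mtu_overrides() {
    grep -c ' mtu ' || true
}

# Flush only when such entries exist; needs iproute2 and root privileges.
if command -v ip >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    if [ "$(ip route show cache | count_mtu_overrides)" -gt 0 ]; then
        echo "route cache contains pinned MTU entries; flushing"
        ip route flush cache
    fi
fi
```

Note that a pinned MTU is not always wrong; compare the value against the expected interface MTU before concluding this issue applies.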

Daemonset Workaround

A more consistent way to address this issue for all OpenShift nodes is by creating a DaemonSet. Steps for that are:

  1. Create the azure-routefix namespace:
$ oc new-project azure-routefix
  2. Create a new SCC (SecurityContextConstraints):
allowHostDirVolumePlugin: true
allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities:
- '*'
allowedUnsafeSysctls:
- '*'
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
kind: SecurityContextConstraints
metadata:
  name: privileged-routefix
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: null
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:azure-routefix:routefix
volumes:
- '*'
  3. Create the ServiceAccount:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: routefix
  namespace: azure-routefix
  4. Create a ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: routefix
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openshift-sdn-controller
subjects:
- kind: ServiceAccount
  name: routefix
  namespace: azure-routefix
  5. Create the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: add-iptables
data:
  add_iptables.sh: |
   #!/bin/sh
   echo "Adding ICMP drop rule for '$2' "
   #iptables -C CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION || iptables -A CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION
   if iptables -C CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION
   then
       echo "iptables already set for $2"
   else
       iptables -A CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION
   fi
   #iptables -nvL
  6. Finally, create the DaemonSet:

IMPORTANT: After applying the DaemonSet, make sure to reboot every OpenShift node (master, worker, infra) one by one.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '1'
  name: routefix
  namespace: azure-routefix
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: routefix
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: routefix
    spec:
      serviceAccount: routefix
      serviceAccountName: routefix
      hostNetwork: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      containers:
      - name: drop-icmp
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcc1fb20f06f00829727cb46ff21e22103fd4c737fdcbbf2fab13121f31ebcbd
        args:
        - /bin/bash
        - -c
        - |
          set -xe
          echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
          iptables -X CHECK_ICMP_SOURCE || true
          iptables -N CHECK_ICMP_SOURCE || true
          iptables -F CHECK_ICMP_SOURCE
          iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE || true
          iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE
          iptables -N ICMP_ACTION || true
          iptables -F ICMP_ACTION
          iptables -A ICMP_ACTION -j LOG
          iptables -A ICMP_ACTION -j DROP
          /host/usr/bin/oc observe nodes -a '{ .status.addresses[1].address }' -- /tmp/add_iptables.sh
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - echo drop-icmp done
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host
          name: host
        - mountPath: /tmp/add_iptables.sh
          name: add-iptables
          subPath: add_iptables.sh
        resources:
          requests:
            cpu: 10m
            memory: 300Mi
        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
      - args:
        - /bin/bash
        - -c
        - |
          while true;
          do
              NOW=$(date "+%Y-%m-%d %H:%M:%S")
              DROPPED_PACKETS=$(ovs-ofctl -O OpenFlow13 dump-flows unix:/host/var/run/openvswitch/br0.mgmt | sed -ne '/table=10,.* actions=drop/ { s/.* n_packets=//; s/,.*//; p }')
              if [ "$DROPPED_PACKETS" != "" ] && [ "$DROPPED_PACKETS" -gt 1000 ];
              then
                  echo "$NOW table=10 actions=drop packets=$DROPPED_PACKETS broken=true"
              else
                  echo "$NOW table=10 actions=drop packets=$DROPPED_PACKETS broken=false"
              fi
              sleep 60
          done
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcc1fb20f06f00829727cb46ff21e22103fd4c737fdcbbf2fab13121f31ebcbd
        imagePullPolicy: IfNotPresent
        name: detect
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host
          name: host
          readOnly: true
      volumes:
      - name: host
        hostPath:
          path: /
          type: ''
      - name: add-iptables
        configMap:
          defaultMode: 365
          name: add-iptables
      tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists


Root Cause

Spurious ICMP "fragmentation needed" messages received on Azure cause the kernel to cache a per-route MTU lower than the real one, so large packets are discarded as if the MTU were lower than it actually is.

Diagnostic Steps

The main symptom of this issue is that large packets are discarded, as if the MTU was lower than it actually is.

A typical manifestation of this issue is TLS handshakes failing even though the TCP connection succeeds. A capture shows that shorter packets go through, but longer packets (like the TLS Server Hello) don't.

A good way to demonstrate this issue is to send pings with the Don't Fragment (DF) bit set and a payload size of MTU-28 (20 bytes for the IP header and 8 bytes for the ICMP header). Without the bug, such a ping succeeds, as packets are exactly MTU-sized. With this issue, such pings fail with "Message too long", as if the MTU were lower at either end or at an intermediate element (even though that is not the case).

For example, with the default MTU of 1500, a ping with size 1472 should succeed, but instead the following happens:

$ ping -M do -c4 -s 1472 192.168.99.99
PING 192.168.99.99 (192.168.99.99) 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

ping reports an error as if the MTU were 1450 instead of 1500.
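The MTU-28 arithmetic used above can be captured in a tiny helper; icmp_payload is a hypothetical name introduced here for illustration:

```shell
#!/bin/sh
# icmp_payload: ICMP echo payload size that makes the packet exactly MTU
# bytes on the wire (20-byte IPv4 header + 8-byte ICMP header).
# Hypothetical helper for illustration only.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Usage (destination address is an example; adjust for your environment):
# ping -M do -c4 -s "$(icmp_payload 1500)" 192.168.99.99
```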

The reason is that the ip route cache contains entries with a wrong MTU (lower than expected), as in this example:

# ip route show cache
10.0.0.99 dev eth0 
    cache expires 111sec mtu 1450 

The MTU to that destination should be 1500, but the cache entry pins it at 1450.
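To spot such entries systematically, each cached MTU can be compared against the expected interface MTU. This is a sketch assuming the `ip route show cache` output format shown above; bad_cache_entries is a hypothetical helper name.

```shell
#!/bin/sh
# bad_cache_entries: print route-cache lines whose pinned MTU is below
# the expected value given as $1. Hypothetical helper, sketch only.
# Reads `ip route show cache` output on stdin.
bad_cache_entries() {
    awk -v want="$1" '/ mtu / {
        for (i = 1; i < NF; i++)
            if ($i == "mtu" && $(i + 1) + 0 < want) { print; break }
    }'
}

# Usage on a node whose interface MTU should be 1500:
# ip route show cache | bad_cache_entries 1500
```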

