TLS handshake fails due to large packets being discarded on OpenShift 4 on Azure
Environment
- Red Hat OpenShift Container Platform (RHOCP)
  - 4.6
  - 4.7
  - 4.8
  - 4.9
  - 4.10
- OpenShift SDN
- OVN-Kubernetes
- Azure
Issue
- TLS handshake errors occur although TCP communication is possible. A traffic capture shows large packets being discarded (see "Diagnostic Steps").
- Unexpected ICMP "fragmentation needed" messages are received for direct communication between OpenShift nodes (that is, traffic without VXLAN encapsulation). The MTU requested by these messages is lower than the MTU configured on both ends and/or required by any intermediate element.
- The routing cache shows bad entries, as described in "Diagnostic Steps".
- After some time, the OpenShift cluster becomes very slow and many cluster operators become unhealthy (Degraded state).
Resolution
There is a known issue with the MTU in Azure, and there are fixes already implemented for several OpenShift releases.
OpenShift SDN
| Target Minor Release | Bug | Fixed Version | Errata |
|---|---|---|---|
| 4.6 | BZ 1851549 | 4.6.38 | RHBA-2021:2641 |
| 4.7 | BZ 1967994 | 4.7.18 | RHBA-2021:2502 |
OVN-Kubernetes
| Target Minor Release | Bug | Fixed Version | Errata |
|---|---|---|---|
| 4.8 | BZ 2038295 | 4.8.39 | RHBA-2022:1427 |
| 4.9 | BZ 2038252 | 4.9.22 | RHSA-2022:0561 |
| 4.10 | BZ 1988483 | 4.10.3 | RHSA-2022:0056 |
IMPORTANT: If the workaround for this issue was applied before the upgrade, the routefix DaemonSet and the whole azure-routefix namespace must be removed after upgrading. The issue caused by leaving them in place is tracked in BZ 1979312.
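To confirm whether a cluster already includes the fix, compare its version against the tables above (the output below is illustrative):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.30    True        False         42d     Cluster version is 4.6.30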
Workaround for previous versions
IMPORTANT NOTE: This workaround applies only to Azure on the affected releases listed above; the issue is already fixed in currently supported OpenShift releases. For other platforms or more recent releases, please open a Support Case.
Before applying it, check whether either the source or the destination of the failing communication has the bad IP route cache entries described in "Diagnostic Steps" (if neither does, you are hitting a different issue and this solution does not apply).
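A quick way to run that check on a node (as root; a bad entry looks like the example in "Diagnostic Steps"):
# ip route show cache | grep -B1 mtu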
Manual Workaround
To restore communication in a failing scenario, run the following as root on any affected Red Hat OpenShift Container Platform node:
# ip route flush cache
If you find the bad route cache entries and the workaround above solves the issue, please open a Support Case so Red Hat can assist and track it.
Also note that the effect of this workaround is temporary, so you may need to re-apply it if the issue reappears.
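If many nodes are affected, the flush can be applied cluster-wide with oc debug; a sketch, assuming cluster-admin access:
$ for node in $(oc get nodes -o name); do oc debug "$node" -- chroot /host ip route flush cache; done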
DaemonSet Workaround
A more persistent way to address this issue on all OpenShift nodes at once is to create a DaemonSet. The steps are as follows:
- First, create the azure-routefix namespace:
$ oc new-project azure-routefix
- Create a new SCC:
allowHostDirVolumePlugin: true
allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities:
- '*'
allowedUnsafeSysctls:
- '*'
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
kind: SecurityContextConstraints
metadata:
  name: privileged-routefix
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: null
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:azure-routefix:routefix
volumes:
- '*'
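Save the manifest above to a file (for example scc.yaml, any name works) and apply it; the same oc apply pattern works for each of the remaining manifests. Note that the users field already grants this SCC to the routefix ServiceAccount created in the next step:
$ oc apply -f scc.yaml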
- Create the ServiceAccount:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: routefix
  namespace: azure-routefix
- Create a ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: routefix
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openshift-sdn-controller
subjects:
- kind: ServiceAccount
  name: routefix
  namespace: azure-routefix
- Create the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: add-iptables
  namespace: azure-routefix
data:
  add_iptables.sh: |
    #!/bin/sh
    # Invoked by 'oc observe nodes' from the DaemonSet below:
    # $1 is the node name, $2 is the observed node address.
    echo "Adding ICMP drop rule for '$2' "
    if iptables -C CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION
    then
        echo "iptables already set for $2"
    else
        iptables -A CHECK_ICMP_SOURCE -p icmp -s $2 -j ICMP_ACTION
    fi
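For reference, oc observe (run by the DaemonSet below) invokes this script with the node name and the observed address as positional arguments, so a manual test of the script logic would look roughly like this (hypothetical node name and IP):
$ ./add_iptables.sh worker-0 10.0.32.5
Adding ICMP drop rule for '10.0.32.5'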
- Finally, create the DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: '1'
  name: routefix
  namespace: azure-routefix
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: routefix
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: routefix
    spec:
      serviceAccount: routefix
      serviceAccountName: routefix
      hostNetwork: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      containers:
      - name: drop-icmp
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcc1fb20f06f00829727cb46ff21e22103fd4c737fdcbbf2fab13121f31ebcbd
        args:
        - /bin/bash
        - -c
        - |
          set -xe
          echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
          # (Re)create the chains: CHECK_ICMP_SOURCE classifies incoming
          # ICMP "fragmentation needed" packets per source node, and
          # ICMP_ACTION logs and drops them
          iptables -X CHECK_ICMP_SOURCE || true
          iptables -N CHECK_ICMP_SOURCE || true
          iptables -F CHECK_ICMP_SOURCE
          iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE || true
          iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE
          iptables -N ICMP_ACTION || true
          iptables -F ICMP_ACTION
          iptables -A ICMP_ACTION -j LOG
          iptables -A ICMP_ACTION -j DROP
          # Watch the cluster nodes and call the ConfigMap script with
          # each node address, adding one drop rule per node
          /host/usr/bin/oc observe nodes -a '{ .status.addresses[1].address }' -- /tmp/add_iptables.sh
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - echo drop-icmp done
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host
          name: host
        - mountPath: /tmp/add_iptables.sh
          name: add-iptables
          subPath: add_iptables.sh
        resources:
          requests:
            cpu: 10m
            memory: 300Mi
        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
      - name: detect
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcc1fb20f06f00829727cb46ff21e22103fd4c737fdcbbf2fab13121f31ebcbd
        args:
        - /bin/bash
        - -c
        - |
          # Every minute, log the packet count of the drop rule in
          # table 10 of the SDN bridge; a count above 1000 is reported
          # as broken=true
          while true;
          do
            NOW=$(date "+%Y-%m-%d %H:%M:%S")
            DROPPED_PACKETS=$(ovs-ofctl -O OpenFlow13 dump-flows unix:/host/var/run/openvswitch/br0.mgmt | sed -ne '/table=10,.* actions=drop/ { s/.* n_packets=//; s/,.*//; p }')
            if [ "$DROPPED_PACKETS" != "" ] && [ "$DROPPED_PACKETS" -gt 1000 ];
            then
              echo "$NOW table=10 actions=drop packets=$DROPPED_PACKETS broken=true"
            else
              echo "$NOW table=10 actions=drop packets=$DROPPED_PACKETS broken=false"
            fi
            sleep 60
          done
        imagePullPolicy: IfNotPresent
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host
          name: host
          readOnly: true
      volumes:
      - name: host
        hostPath:
          path: /
          type: ''
      - name: add-iptables
        configMap:
          defaultMode: 365
          name: add-iptables
      tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
IMPORTANT: After applying the DaemonSet, please make sure to reboot every OpenShift node (master, worker, infra) one at a time.
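Once the routefix pods are running and the nodes have been rebooted, a couple of quick checks can confirm the workaround is active (a sketch; substitute a real node name):
$ oc get pods -n azure-routefix -o wide
$ oc debug node/<node-name> -- chroot /host iptables -nvL CHECK_ICMP_SOURCE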
Root Cause
On Azure, nodes can receive unexpected ICMP "fragmentation needed" messages, which install entries with a lower-than-expected MTU in the IP route cache. As a result, large packets are discarded as if the MTU were lower than it actually is.
Diagnostic Steps
The main symptom of this issue is that large packets are discarded, as if the MTU were lower than it actually is.
A typical manifestation is a TLS handshake that fails although the TCP connection succeeds. A traffic capture shows that short packets go through, but longer packets (like the TLS Server Hello) do not.
A good way to demonstrate the issue is to send pings with the Don't Fragment (DF) bit set and a payload size of MTU-28 (20 bytes for the IP header plus 8 bytes for the ICMP header). Without the bug, such a ping succeeds, because the resulting packets exactly match the MTU. With the bug, these pings fail with "Message too long", as if the MTU were lower at one of the ends or at an intermediate hop (even though it is not).
For example, with the default MTU of 1500, a ping with a 1472-byte payload should succeed, but instead the following happens:
$ ping -M do -c4 -s 1472 192.168.99.99
PING 192.168.99.99 (192.168.99.99) 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
The error reports an MTU of 1450 even though the configured MTU is 1500. The reason is that the IP route cache contains entries with a wrong (lower than expected) MTU, as in this example:
# ip route show cache
10.0.0.99 dev eth0
    cache expires 111sec mtu 1450
The MTU to that destination should be 1500, but the cached entry shows 1450.
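To contrast the configured interface MTU with the cached one for a specific destination, something like the following can help (addresses and values are illustrative):
# ip link show eth0 | grep -o 'mtu [0-9]*'
mtu 1500
# ip route get 10.0.0.99 | grep -o 'mtu [0-9]*'
mtu 1450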
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.