kube-controller-manager timeout is exceeded by validating webhook during CNI restart leading to degraded cluster state
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.11+
- An operator that defines rules for configmaps with a timeout greater than 5s is installed.
  - In the examples below, the Aqua operator is used.
Issue
- Kube Controller Manager pods fail to complete leader election.
- Networking is degraded and Kube Controller Manager is in crashloopbackoff state.
- Configmaps take longer than 5s to create, but deletion occurs in under a second as expected.
- The reported status of resources does not match reality:
  - After following guidance to restart OVNKube-Node pods as part of a reset of the OVN databases, the daemonset reports all pods in READY state, but no pods are running on the corresponding nodes.
  - The nodes report Ready status, but they are not actually ready.
Resolution
- Search for webhooks (validatingwebhookconfiguration) that affect the resource configmaps and also have a timeout higher than 5 seconds:

    $ oc get validatingwebhookconfiguration -o yaml | grep -E 'name:|timeoutSeconds:|configmaps'

- Modify the matching validatingwebhookconfiguration objects and reduce the timeoutSeconds value to 3s:

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    ...
      rules:
      - apiGroups:
        - '*'
        apiVersions:
        - '*'
        operations:
        - CREATE
        - UPDATE
        resources:
        - pods
        - deployments
        - replicasets
        - replicationcontrollers
        - statefulsets
        - daemonsets
        - jobs
        - cronjobs
        - configmaps   <==
        - services
        - roles
        - rolebindings
        - clusterroles
        - clusterrolebindings
        - customresourcedefinitions
        scope: '*'
      sideEffects: None
      timeoutSeconds: 3

- NOTE: This issue can also be caused by timeouts when resolving api-int.<yourcluster>.<yourdomain> from within the kube-controller-manager pods. Verify that all nameservers for the cluster are online and accessible, because a DNS failure can present exactly the same way as a blocking webhook. To check whether DNS is the cause, see this related KCS: https://access.redhat.com/solutions/7055053
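The timeout change can also be applied with a JSON patch instead of editing the object by hand. The following is a minimal sketch, assuming the affected entry is addressed by its position in the object's webhooks list; the helper name set_webhook_timeout and the placeholder object name are hypothetical and not part of the original solution.

```shell
# Hypothetical helper: set timeoutSeconds on one entry of a
# ValidatingWebhookConfiguration.
#   $1 = name of the ValidatingWebhookConfiguration object
#   $2 = zero-based index of the entry in .webhooks[]
#   $3 = new timeout in seconds
set_webhook_timeout() {
  oc patch validatingwebhookconfiguration "$1" --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/webhooks/$2/timeoutSeconds\",\"value\":$3}]"
}

# Example call (replace the placeholder with the real object name):
# set_webhook_timeout <webhook-config-name> 0 3
```

If an object defines several webhooks, check each index separately, since timeoutSeconds is set per webhook entry.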
Root Cause
- Kube Controller Manager has a hard-coded 5s timeout on certain API requests. Calls that take longer than that are dropped or rejected.
- Reducing the webhook timeout to a value shorter than this 5s response window of Kube Controller Manager allows the requests to succeed, which in turn allows the kube-controller-manager-operator to deploy.
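To see which webhooks conflict with the 5s window described above, the configured timeouts can be listed and filtered. This is a sketch only: the helper name flag_slow_webhooks is hypothetical, and it flags every webhook at or above the threshold regardless of which resources its rules cover, so the rules of each flagged webhook still need to be checked for configmaps.

```shell
# Hypothetical helper: read "name<TAB>timeoutSeconds" lines on stdin and
# print the webhooks whose timeout is at or above the limit (default 5).
flag_slow_webhooks() {
  awk -F '\t' -v limit="${1:-5}" \
    '$2 + 0 >= limit + 0 { print $1 " timeoutSeconds=" $2 }'
}

# Usage against a live cluster:
# oc get validatingwebhookconfiguration \
#   -o jsonpath='{range .items[*]}{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\n"}{end}{end}' \
#   | flag_slow_webhooks 5
```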
Diagnostic Steps
- Kube Controller Manager pod may be in crashloopbackoff state.
- Pods are deleted but are not re-created automatically by the operator or daemonset.
- openshift-apiserver pods are crash-looping, but openshift-kube-apiserver pods are in RUNNING/available state.
- The API appears to stall on requests to create new resources, but deleting resources completes immediately.
- ETCD appears healthy and is not in READ-ONLY state.
- Master nodes are in READY state and API/API-INT is consistently reachable from both bastion and master nodes (API not flapping).
- The creation of a configmap, when timed, consistently takes 5s+ to complete, but deletion of said configmap takes under 1s.
- Check the kube-apiserver container logs on the control-plane nodes (client-side output cannot be trusted):

    2023-01-24T17:59:14.361328698Z W0124 17:59:14.361306 17 dispatcher.go:142] Failed calling webhook, failing open imageassurance.aquasec.com: failed calling webhook "imageassurance.aquasec.com": failed to call webhook: Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded
    2023-01-24T17:59:14.361353032Z E0124 17:59:14.361331 17 dispatcher.go:149] failed calling webhook "imageassurance.aquasec.com": failed to call webhook: Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded
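The last two checks can be scripted roughly as follows. This is a sketch under assumptions: the helper names, the test ConfigMap name webhook-timing-test, and the default namespace are hypothetical, and the kube-apiserver pod name is left as a placeholder.

```shell
# Hypothetical helper: print how many whole seconds a command took.
# The command's own output is sent to stderr so only the number is captured.
elapsed_seconds() {
  start=$(date +%s)
  "$@" 1>&2
  end=$(date +%s)
  echo $((end - start))
}

# Hypothetical helper: keep only log lines that report a failed webhook call.
webhook_failures() {
  grep -i 'failed calling webhook'
}

# Compare creation and deletion time for a throwaway ConfigMap:
# elapsed_seconds oc create configmap webhook-timing-test -n default --from-literal=probe=1
# elapsed_seconds oc delete configmap webhook-timing-test -n default

# Filter the kube-apiserver container log of one control-plane pod:
# oc logs -n openshift-kube-apiserver <kube-apiserver-pod> -c kube-apiserver | webhook_failures
```

A creation time of 5 seconds or more next to a near-instant deletion, together with matching "context deadline exceeded" lines in the log, points to the webhook timeout described in this solution.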