New kube-apiserver-operator webhook controller validating health of webhook in OpenShift Container Platform 4
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4.10 and later
Issue
- What is the rationale for an operator calling webhooks directly, without going through the API? After updating to OCP 4.10, we see the kube-apiserver-operator trying to access the configured webhooks. Because we have NetworkPolicies for the namespace hosting the webhook, the kube-apiserver-operator cannot reach them, and the following errors are logged:

  $ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator
  [...]
  E0622 13:54:11.540867 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:13.541971 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:17.544072 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:19.544769 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:23.561043 1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
  E0622 13:54:25.562466 1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
Resolution
- Starting with Red Hat OpenShift Container Platform 4.10, a new webhook validation controller was introduced in the kube-apiserver-operator, which runs in the openshift-kube-apiserver-operator namespace, to help validate the health and availability of third-party admission plugins.
- As of Red Hat OpenShift Container Platform 4.10, the webhook validation controller in the kube-apiserver-operator only logs problematic admission plugins to STDOUT and does not set the kube-apiserver Cluster Operator to degraded.
In 4.10 and later, when configuring multitenant isolation with network policy, if you have configured default network policies that limit ingress access to each project, configure the following network policy in every tenant project to allow the kube-apiserver-operator ingress access, especially in projects where webhook pods are deployed:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-kube-apiserver-operator
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-kube-apiserver-operator
      podSelector:
        matchLabels:
          app: kube-apiserver-operator
  policyTypes:
  - Ingress
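Because this policy has to exist in every tenant project, it can help to render it once per namespace. The following is a minimal sketch and not part of the original solution: the function name `render_policy` and the project names in the usage comment are hypothetical, and the manifest is the policy above with only `metadata.namespace` added.

```shell
#!/bin/sh
# Hypothetical helper: print the allow-from-kube-apiserver-operator
# NetworkPolicy for a single namespace passed as the first argument.
render_policy() {
  ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-kube-apiserver-operator
  namespace: ${ns}
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-kube-apiserver-operator
      podSelector:
        matchLabels:
          app: kube-apiserver-operator
  policyTypes:
  - Ingress
EOF
}

# Example usage on a real cluster (tenant-a and tenant-b are placeholders):
#   for ns in tenant-a tenant-b; do render_policy "$ns" | oc apply -f -; done
```

Rendering to standard output keeps the helper free of side effects, so the manifest can be reviewed before it is piped to `oc apply -f -`.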
Root Cause
Non-functioning admission plugins can impact the availability and stability of the Red Hat OpenShift Container Platform 4 API. To quickly detect such problematic conditions, a webhook controller was added to the kube-apiserver-operator, which runs in the openshift-kube-apiserver-operator namespace, to report problematic admission plugins in the kube-apiserver-operator logs.
Diagnostic Steps
- The kube-apiserver-operator running in the openshift-kube-apiserver-operator namespace may log problems in its logs, as shown below:

  $ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator
  [...]
  E0622 13:54:11.540867 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:13.541971 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:17.544072 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:19.544769 1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
  E0622 13:54:23.561043 1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
  E0622 13:54:25.562466 1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
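To see at a glance which webhook endpoints the operator cannot reach, the timeout lines can be summarized per address. This is a minimal sketch that assumes the log format shown above and IPv4 endpoints; `summarize_webhook_timeouts` is a hypothetical helper name, and it relies only on standard `grep`, `sort`, and `uniq`.

```shell
#!/bin/sh
# Hypothetical helper: read operator log lines on stdin and count the
# degraded_webhook.go dial errors per "address:port" endpoint.
summarize_webhook_timeouts() {
  grep 'degraded_webhook.go' \
    | grep -o 'dial tcp [0-9.]*:[0-9]*' \
    | sort \
    | uniq -c
}

# Example usage on a real cluster:
#   oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator \
#     | summarize_webhook_timeouts
```

Each output line holds a count followed by the endpoint. The addresses can then be compared with the output of `oc get svc -A` or `oc get pods -A -o wide` to identify the webhook, and therefore the project, that needs the network policy.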