New kube-apiserver-operator webhook controller validating webhook health in OpenShift Container Platform 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP) 4.10 and later

Issue

  • What is the rationale for an operator calling webhooks directly, without going through the API? After updating to OCP 4.10, the kube-apiserver-operator tries to access configured webhooks. Because NetworkPolicies restrict the namespace hosting the webhook, the kube-apiserver-operator cannot reach them, and the following errors are logged:

      $ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator
      [...]
      E0622 13:54:11.540867       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:13.541971       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:17.544072       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:19.544769       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:23.561043       1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
      E0622 13:54:25.562466       1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
    

Resolution

  • Starting with Red Hat OpenShift Container Platform 4.10, a new webhook validation controller was introduced in the kube-apiserver-operator, running in the openshift-kube-apiserver-operator namespace, to validate the health and availability of third-party admission plugins.
  • As of Red Hat OpenShift Container Platform 4.10, this webhook validation controller only logs problematic admission plugins to STDOUT; it does not set the kube-apiserver Cluster Operator to Degraded.

In OpenShift Container Platform 4.10 and later, if multitenant isolation is configured with default network policies limiting ingress access to each project, the following NetworkPolicy should be added to every tenant project, especially projects hosting webhook pods, so that the kube-apiserver-operator has ingress access to them:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-kube-apiserver-operator
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-kube-apiserver-operator
      podSelector:
        matchLabels:
          app: kube-apiserver-operator
  policyTypes:
  - Ingress
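The policy above can be applied and checked per tenant project. The following is a sketch only: `my-webhook-project` is a placeholder namespace, and it assumes the manifest above has been saved as policy.yaml and that the oc client is logged in to the cluster.

```shell
# Sketch only: apply the NetworkPolicy above to a tenant project and verify it.
# "my-webhook-project" is a placeholder; substitute each tenant project that
# hosts webhook pods. Assumes the manifest above is saved as policy.yaml.
NS="my-webhook-project"

if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # Create the policy in the tenant namespace.
  oc apply -f policy.yaml -n "$NS"
  # Confirm the policy exists and inspect its ingress rules.
  oc describe networkpolicy allow-from-kube-apiserver-operator -n "$NS"
else
  echo "oc client not available or not logged in; skipping"
fi
```

Repeat for every tenant project that hosts webhook pods; the policy only needs to allow ingress, so existing deny-all policies in the project can remain in place.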

Root Cause

Non-functioning admission plugins can impact the availability and stability of the Red Hat OpenShift Container Platform 4 API. To quickly detect such a problematic condition, a webhook controller was added to the kube-apiserver-operator, running in the openshift-kube-apiserver-operator namespace, which reports the problematic admission plugins in the kube-apiserver-operator logs.

Diagnostic Steps

  • The kube-apiserver-operator running in the openshift-kube-apiserver-operator namespace may report problems such as the following in its logs:

      $ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator
      [...]
      E0622 13:54:11.540867       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:13.541971       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:17.544072       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:19.544769       1 degraded_webhook.go:128] dial tcp 10.1.1.249:443: i/o timeout
      E0622 13:54:23.561043       1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
      E0622 13:54:25.562466       1 degraded_webhook.go:128] dial tcp 10.1.1.101:443: i/o timeout
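The timing-out address in these log lines can be traced back to the webhook that owns it. The following is a hedged sketch: 10.1.1.249 is simply the example address from the logs above, and the output depends on which webhook configurations exist in your cluster.

```shell
# Sketch only: map a timing-out address from the operator logs back to a
# webhook. 10.1.1.249 is the example address taken from the logs above.
IP="10.1.1.249"

if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # List each webhook configuration together with the service it calls.
  oc get validatingwebhookconfiguration,mutatingwebhookconfiguration \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].clientConfig.service.namespace}{"/"}{.webhooks[*].clientConfig.service.name}{"\n"}{end}'
  # Find the service (or endpoint) that owns the address from the logs.
  oc get services,endpoints -A | grep "$IP"
else
  echo "oc client not available or not logged in; skipping"
fi
```

Once the owning service and its namespace are identified, check whether that namespace has NetworkPolicies blocking ingress from openshift-kube-apiserver-operator, and apply the policy from the Resolution section if so.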
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.