kube-controller-manager timeout is exceeded by validating webhook during CNI restart leading to degraded cluster state

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.11+
  • An operator is installed whose validating webhook has rules for configmaps and a timeout of 5s or longer.
    • In the examples below the Aqua operator is used.

Issue

  • Kube Controller Manager pods fail to complete leader election.
  • Networking is degraded and Kube Controller Manager is in CrashLoopBackOff state.
  • Configmaps take longer than 5s to create, but deletion completes in under a second as expected.
  • The reported status of resources does not match reality:
    • After following guidance to restart OVNKube-Node pods as part of a reset of OVN databases, the daemonset reports all pods in READY state, but no pods are running on the corresponding nodes.
    • The nodes report a Ready status, but they are not actually ready.

Resolution

  • Search for validating webhooks (validatingwebhookconfiguration objects) that apply to configmaps and have a timeout of 5 seconds or longer:

    $ oc get validatingwebhookconfiguration -o yaml | grep -E 'name:|timeoutSeconds:|configmaps'
    
  • Modify the matching validatingwebhookconfiguration objects and reduce the timeout value to 3s:

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    ...
    rules:
      - apiGroups:
        - '*'
        apiVersions:
        - '*'
        operations:
        - CREATE
        - UPDATE
        resources:
        - pods
        - deployments
        - replicasets
        - replicationcontrollers
        - statefulsets
        - daemonsets
        - jobs
        - cronjobs
        - configmaps   # <==
        - services
        - roles
        - rolebindings
        - clusterroles
        - clusterrolebindings
        - customresourcedefinitions
        scope: '*'
      sideEffects: None
      timeoutSeconds: 3
    
  • NOTE: The same symptoms can also be caused by timeouts when resolving api-int.<yourcluster>.<yourdomain> from within the kube-controller-manager pods, and the failure looks exactly like a blocking webhook. Verify that all nameservers used by the cluster are online and reachable. To check whether DNS is the cause, see the related KCS: https://access.redhat.com/solutions/7055053
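
The first two steps above can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not a supported tool: it expects the JSON form of the webhook list (for example from `oc get validatingwebhookconfiguration -o json`), it treats a missing timeoutSeconds as the admissionregistration.k8s.io/v1 default of 10s, and it flags timeouts of 5s or more because the log excerpt in Diagnostic Steps shows a 5s webhook timing out. The function names, the `SAMPLE` data and the configuration name `example-webhook-config` are hypothetical.

```python
import json

# admissionregistration.k8s.io/v1 applies 10s when timeoutSeconds is unset.
DEFAULT_TIMEOUT_SECONDS = 10


def matches_configmaps(webhook):
    """Return True if any rule of the webhook lists configmaps or a wildcard.

    Simplification: apiGroups, scope and namespaceSelector are ignored,
    so this check may over-report.
    """
    for rule in webhook.get("rules", []):
        resources = rule.get("resources", [])
        if any(r in ("configmaps", "*", "*/*") for r in resources):
            return True
    return False


def find_slow_webhooks(config_list, threshold=5):
    """Return (config name, webhook index, webhook name, timeout) tuples
    for webhooks that cover configmaps with a timeout >= threshold."""
    found = []
    for config in config_list.get("items", []):
        for index, webhook in enumerate(config.get("webhooks", [])):
            timeout = webhook.get("timeoutSeconds", DEFAULT_TIMEOUT_SECONDS)
            if timeout >= threshold and matches_configmaps(webhook):
                found.append(
                    (config["metadata"]["name"], index, webhook["name"], timeout)
                )
    return found


def build_patch(index, new_timeout=3):
    """Build a JSON Patch that sets timeoutSeconds on one webhook entry.

    "add" replaces the member if it already exists (RFC 6902).
    """
    return [
        {"op": "add", "path": f"/webhooks/{index}/timeoutSeconds", "value": new_timeout}
    ]


# Hypothetical sample shaped like `oc get validatingwebhookconfiguration -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "example-webhook-config"},
            "webhooks": [
                {
                    "name": "imageassurance.aquasec.com",
                    "timeoutSeconds": 5,
                    "rules": [{"resources": ["pods", "configmaps"]}],
                },
                {
                    "name": "already-fixed.example.com",
                    "timeoutSeconds": 3,
                    "rules": [{"resources": ["configmaps"]}],
                },
            ],
        }
    ]
}

for config_name, index, webhook_name, timeout in find_slow_webhooks(SAMPLE):
    print(f"{config_name}: webhooks[{index}] {webhook_name} timeoutSeconds={timeout}")
    print("  patch:", json.dumps(build_patch(index)))
```

To run this against a cluster, replace `SAMPLE` with the parsed output of the `oc` command. Each printed patch can then be passed to `oc patch validatingwebhookconfiguration <name> --type=json -p '<patch>'`. If the webhook configuration is managed by an operator, the operator may revert the change, in which case the timeout has to be set through the operator's own configuration.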

Root Cause

  • Kube Controller Manager has a hard-coded 5s timeout on certain API requests. Calls that take longer than that are dropped or rejected.
  • When the webhook does not respond, the API server waits for the full webhook timeout before failing open (see the log excerpt in Diagnostic Steps), so a webhook timeout of 5s or more delays each affected request past the Kube Controller Manager limit. Reducing the webhook timeout to a value below 5s allows the requests to succeed, which in turn allows the kube-controller-manager-operator to deploy.

Diagnostic Steps

  • The Kube Controller Manager pod may be in CrashLoopBackOff state.

  • Pods are deleted but are not being re-created automatically by the operator or daemonset.

  • openshift-apiserver pods are crash-looping, but openshift-kube-apiserver pods are in RUNNING/available state.

  • The API stalls on requests to create new resources, while requests to delete resources complete immediately.

  • etcd appears healthy and is not in read-only state.

  • Master nodes are in READY state, and API/API-INT is consistently reachable from both the bastion and the master nodes (the API is not flapping).

  • Creating a configmap, when timed, consistently takes 5s or more, but deleting the same configmap takes under 1s.

  • Check the kube-apiserver container logs directly on the control-plane nodes (the client output cannot be trusted in this state) for webhook timeout errors:

    2023-01-24T17:59:14.361328698Z W0124 17:59:14.361306     17 dispatcher.go:142] Failed calling webhook, failing open imageassurance.aquasec.com: failed calling webhook "imageassurance.aquasec.com": failed to call webhook: Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded 
    2023-01-24T17:59:14.361353032Z E0124 17:59:14.361331     17 dispatcher.go:149] failed calling webhook "imageassurance.aquasec.com": failed to call webhook: Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded
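
When the logs are large, the webhook timeout messages shown above can be summarized with a short script. The following is a minimal Python sketch that assumes the log lines have the same shape as the two lines in the excerpt; the regular expression and the function name `summarize_webhook_timeouts` are illustrative and not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Matches dispatcher messages of the form shown in the excerpt above:
#   failed calling webhook "<name>": ... Post "<url>?timeout=<t>": context deadline exceeded
WEBHOOK_TIMEOUT = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)".*'
    r'Post "(?P<url>[^"?]+)\?timeout=(?P<timeout>\w+)": context deadline exceeded'
)


def summarize_webhook_timeouts(lines):
    """Count timed-out webhook calls per (webhook name, service URL, timeout)."""
    counts = Counter()
    for line in lines:
        match = WEBHOOK_TIMEOUT.search(line)
        if match:
            counts[(match["webhook"], match["url"], match["timeout"])] += 1
    return counts


# The two log lines from the excerpt above, used as sample input.
SAMPLE_LINES = [
    '2023-01-24T17:59:14.361328698Z W0124 17:59:14.361306     17 dispatcher.go:142] '
    'Failed calling webhook, failing open imageassurance.aquasec.com: '
    'failed calling webhook "imageassurance.aquasec.com": failed to call webhook: '
    'Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded',
    '2023-01-24T17:59:14.361353032Z E0124 17:59:14.361331     17 dispatcher.go:149] '
    'failed calling webhook "imageassurance.aquasec.com": failed to call webhook: '
    'Post "https://aqua-kube-enforcer.aqua.svc:443/?timeout=5s": context deadline exceeded',
]

for (webhook, url, timeout), count in summarize_webhook_timeouts(SAMPLE_LINES).items():
    print(f"{webhook} via {url} timeout={timeout}: {count} timed-out call(s)")
```

In practice `SAMPLE_LINES` would be replaced by the lines of a saved kube-apiserver container log. A webhook that appears here with a timeout of 5s or more and that has rules for configmaps is a candidate for the change described in the Resolution section.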
    
