net-kourier-controller is unable to start due to a liveness probe error
Environment
- OpenShift 4.x
- OpenShift Serverless 1.29.0
Issue
- net-kourier-controller cannot start due to the following liveness probe error:
Warning BackOff 10m (x2439 over 11h) kubelet Back-off restarting failed container controller in pod net-kourier-controller-68654c74f8-vdfl8_knative-serving-ingress(e37d1706-e9d9-4490-b694-2f7472b9ac5e)
Warning Unhealthy 4m58s (x3041 over 11h) kubelet (combined from similar events): Liveness probe failed: 2023/06/13 05:31:49 failed to connect to service at ":18000": context deadline exceeded
- net-kourier-controller has the following logs:
2023/06/13 13:13:42 Registering 3 clients
2023/06/13 13:13:42 Registering 4 informer factories
2023/06/13 13:13:42 Registering 5 informers
2023/06/13 13:13:42 Registering 1 controllers
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.44958093Z","logger":"net-kourier-controller","caller":"profiling/server.go:64","message":"Profiling enabled: false","commit":"d2ebd7e-dirty"}
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.452661244Z","logger":"net-kourier-controller","caller":"leaderelection/context.go:47","message":"Running with Standard leader election","commit":"d2ebd7e-dirty"}
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.612083567Z","logger":"net-kourier-controller","caller":"ingress/controller.go:211","message":"Priming the config with 192 ingresses","commit":"d2ebd7e-dirty"}
{"severity":"EMERGENCY","timestamp":"2023-06-13T13:14:42.17852954Z","logger":"net-kourier-controller","caller":"ingress/controller.go:233","message":"Failed prewarm ingress","commit":"d2ebd7e-dirty","error":"failed to translate ingress: failed to fetch endpoints 'kb-fast-0/receiver-00001': client rate limiter Wait returned an error: context canceled","stacktrace":"knative.dev/net-kourier/pkg/reconciler/ingress.NewController\n\t/app/pkg/reconciler/ingress/controller.go:233\nknative.dev/pkg/injection/sharedmain.ControllersAndWebhooksFromCtors\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:412\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:252\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:191\nmain.main\n\t/app/cmd/kourier/main.go:45\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
- The cluster deploys a large number of Knative Services (around 100 ksvc resources).
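To check whether a cluster falls into this situation, the cluster-wide ksvc count can be inspected; this is a hedged sketch (the exact output depends on your cluster), since each service's ingresses add to the work net-kourier-controller performs at startup:

```shell
# Count Knative Services across all namespaces (output is cluster-dependent).
oc get ksvc --all-namespaces --no-headers | wc -l
```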
Resolution
- This is a known issue in OpenShift Serverless 1.29.0.
- Here is a workaround:
- Edit your KnativeServing CR:
$ oc edit -n knative-serving knativeserving ${YOUR_CR_NAME}
- Add a spec.deployments override with a higher failureThreshold:
spec:
  deployments:
  - name: net-kourier-controller
    livenessProbes:
    - container: controller
      failureThreshold: 10000
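Once the operator reconciles the CR, the override can be confirmed on the deployment itself. The sketch below assumes the default knative-serving-ingress namespace used by the Serverless operator; adjust it if your installation differs:

```shell
# Verify the liveness probe override reached the controller container
# (expected to print 10000 once reconciled; output is cluster-dependent).
oc get deployment net-kourier-controller -n knative-serving-ingress \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="controller")].livenessProbe.failureThreshold}'
```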
Updated on July 27, 2023
- Serverless 1.29.1 (scheduled for release in early August 2023) contains the bug fix.
Root Cause
- The root cause is the slow startup of net-kourier-controller.
- A liveness probe was added to net-kourier-controller in the 1.29.0 release. On clusters with many ingresses, startup takes longer (the log above shows the controller priming its config with 192 ingresses), so the probe's failureThreshold is reached and the container is restarted before startup completes.
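To illustrate why raising failureThreshold helps: the kubelet tolerates roughly failureThreshold × periodSeconds of consecutive probe failures before restarting the container. The numbers below use the illustrative Kubernetes defaults (3 failures, 10-second period), not the exact values shipped in 1.29.0:

```shell
# Approximate restart budget = failureThreshold * periodSeconds.
failure_threshold=3   # illustrative Kubernetes default
period_seconds=10     # illustrative Kubernetes default
echo $(( failure_threshold * period_seconds ))       # prints 30 (seconds)

# With the workaround's failureThreshold of 10000, the budget becomes
# large enough that a slow controller startup cannot exhaust it:
echo $(( 10000 * period_seconds ))                   # prints 100000 (seconds)
```

If priming all ingresses takes longer than the default budget, the container is killed in a loop, which matches the BackOff events shown in the Issue section.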
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.