net-kourier-controller is unable to start due to a liveness probe error
Environment
- OpenShift 4.x
- OpenShift Serverless 1.29.0
Issue
- net-kourier-controller cannot start due to the following liveness probe error:
Warning BackOff 10m (x2439 over 11h) kubelet Back-off restarting failed container controller in pod net-kourier-controller-68654c74f8-vdfl8_knative-serving-ingress(e37d1706-e9d9-4490-b694-2f7472b9ac5e)
Warning Unhealthy 4m58s (x3041 over 11h) kubelet (combined from similar events): Liveness probe failed: 2023/06/13 05:31:49 failed to connect to service at ":18000": context deadline exceeded
- net-kourier-controller has the following logs:
2023/06/13 13:13:42 Registering 3 clients
2023/06/13 13:13:42 Registering 4 informer factories
2023/06/13 13:13:42 Registering 5 informers
2023/06/13 13:13:42 Registering 1 controllers
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.44958093Z","logger":"net-kourier-controller","caller":"profiling/server.go:64","message":"Profiling enabled: false","commit":"d2ebd7e-dirty"}
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.452661244Z","logger":"net-kourier-controller","caller":"leaderelection/context.go:47","message":"Running with Standard leader election","commit":"d2ebd7e-dirty"}
{"severity":"INFO","timestamp":"2023-06-13T13:13:42.612083567Z","logger":"net-kourier-controller","caller":"ingress/controller.go:211","message":"Priming the config with 192 ingresses","commit":"d2ebd7e-dirty"}
{"severity":"EMERGENCY","timestamp":"2023-06-13T13:14:42.17852954Z","logger":"net-kourier-controller","caller":"ingress/controller.go:233","message":"Failed prewarm ingress","commit":"d2ebd7e-dirty","error":"failed to translate ingress: failed to fetch endpoints 'kb-fast-0/receiver-00001': client rate limiter Wait returned an error: context canceled","stacktrace":"knative.dev/net-kourier/pkg/reconciler/ingress.NewController\n\t/app/pkg/reconciler/ingress/controller.go:233\nknative.dev/pkg/injection/sharedmain.ControllersAndWebhooksFromCtors\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:412\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:252\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/app/vendor/knative.dev/pkg/injection/sharedmain/main.go:191\nmain.main\n\t/app/cmd/kourier/main.go:45\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
- The cluster deploys a large number of Knative Services (around 100 ksvc resources).
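To check whether a cluster falls into this situation, the cluster-wide ksvc count can be inspected; this is a hedged sketch (the exact output depends on your cluster), since each service's ingresses add to the work net-kourier-controller performs at startup:

```shell
# Count Knative Services across all namespaces (output is cluster-dependent).
oc get ksvc --all-namespaces --no-headers | wc -l
```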
Resolution
- This is a known issue in OpenShift Serverless 1.29.0.
- Here is a workaround:
- Edit your KnativeServing CR:
$ oc edit -n knative-serving knativeserving ${YOUR_CR_NAME}
- Add a spec.deployments override with a higher failureThreshold:
spec:
  deployments:
  - name: net-kourier-controller
    livenessProbes:
    - container: controller
      failureThreshold: 10000
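Once the operator reconciles the CR, the override can be confirmed on the deployment itself. The sketch below assumes the default knative-serving-ingress namespace used by the Serverless operator; adjust it if your installation differs:

```shell
# Verify the liveness probe override reached the controller container
# (expected to print 10000 once reconciled; output is cluster-dependent).
oc get deployment net-kourier-controller -n knative-serving-ingress \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="controller")].livenessProbe.failureThreshold}'
```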
Updated on July 27, 2023
- Serverless 1.29.1 (scheduled for release in early August 2023) contains the bug fix.
Root Cause
- The root cause is the slow startup of net-kourier-controller.
- A liveness probe was added to net-kourier-controller in the 1.29.0 release. On clusters with many ingresses, startup takes longer (the log above shows the controller priming its config with 192 ingresses), so the probe's failureThreshold is reached and the container is restarted before startup completes.
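To illustrate why raising failureThreshold helps: the kubelet tolerates roughly failureThreshold × periodSeconds of consecutive probe failures before restarting the container. The numbers below use the illustrative Kubernetes defaults (3 failures, 10-second period), not the exact values shipped in 1.29.0:

```shell
# Approximate restart budget = failureThreshold * periodSeconds.
failure_threshold=3   # illustrative Kubernetes default
period_seconds=10     # illustrative Kubernetes default
echo $(( failure_threshold * period_seconds ))       # prints 30 (seconds)

# With the workaround's failureThreshold of 10000, the budget becomes
# large enough that a slow controller startup cannot exhaust it:
echo $(( 10000 * period_seconds ))                   # prints 100000 (seconds)
```

If priming all ingresses takes longer than the default budget, the container is killed in a loop, which matches the BackOff events shown in the Issue section.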
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.