coredns and keepalived Pods in a non-ready state in RHOCP 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    o 4.8

Issue

  • The following alert is firing: Pod openshift-ovirt-infra/coredns-ocp-worker-0 has been in a non-ready state for longer than 15 minutes.

  • The coredns and keepalived Pods in the openshift-ovirt-infra namespace are in Pending status on the same worker node.

  • The Init Containers render-config-keepalived and render-config-coredns have not reached the Terminated state.

  • The kubelet journal log on the worker where the Pods are Pending shows the following error:

    Apr 07 00:16:08 ocp-worker-0 hyperkube[1307937]: E0407 00:16:08.595464 1307937 kubelet.go:1695] "Failed creating a mirror pod for" err="pods \"coredns-ocp-worker-0\" already exists" pod="openshift-ovirt-infra/coredns-ocp-worker-0"

Resolution

The coredns and keepalived Pods are static Pods. Their YAML definitions are stored on the node under /etc/kubernetes/manifests/, for example /etc/kubernetes/manifests/coredns.yaml.

Drain the worker node where the Pods are in Pending status, following the official documentation. After the drain, the node reports SchedulingDisabled.

$ oc get nodes ocp-worker-0
ocp-worker-0   Ready,SchedulingDisabled   pre,worker          100d   v1.21.6+bb8d50a
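
The drain step can be sketched as follows. This is a minimal sketch, not the procedure from the official documentation: the function names are hypothetical, and the flags --ignore-daemonsets and --delete-emptydir-data are assumptions about what a typical worker needs, not values prescribed by this article. The article also does not state when to uncordon the node; the sketch assumes this is done once the static Pods are Running again.

```shell
# Hypothetical helpers; the node name is passed as the first argument.

# Cordon the node and evict its workloads.
# The two flags below are assumptions, not taken from this article.
drain_node() {
  oc adm drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Make the node schedulable again (assumed to be done after the fix).
uncordon_node() {
  oc adm uncordon "$1"
}

# Usage:
#   drain_node ocp-worker-0
#   uncordon_node ocp-worker-0
```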

Move the coredns.yaml file to a temporary location such as /tmp.

# oc debug node/<worker>
sh-4.4# chroot /host
sh-4.4# cd /etc/kubernetes/manifests/
sh-4.4# mv coredns.yaml /tmp/

Check that the coredns Pod that was in Pending Status on the worker is no longer present.

# oc get po -n openshift-ovirt-infra

Move the coredns.yaml file back from /tmp to /etc/kubernetes/manifests.

sh-4.4# mv /tmp/coredns.yaml /etc/kubernetes/manifests/

Check the Pod status again; it should now be Running.

$ oc get pods -n openshift-ovirt-infra -o wide | grep -i ocp-worker-0 | grep dns
coredns-ocp-worker-0      2/2     Running   0          8m44s   10.0.0.1   ocp-worker-0   <none>           <none>

Repeat the same steps for the keepalived Pod.
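
The move-out and move-back steps can be wrapped in a small helper so that the same sequence is reused for both Pods. This is a minimal sketch meant to run on the node inside chroot /host. The function name, the fixed wait, and the keepalived.yaml file name in the usage comment are assumptions; the article only names coredns.yaml and verifies the Pod removal with oc get po instead of sleeping.

```shell
# Hypothetical helper, to be run on the node inside `chroot /host`.
# recreate_static_pod MANIFEST [MANIFEST_DIR] [TMP_DIR] [WAIT_SECONDS]
recreate_static_pod() {
  manifest="$1"
  manifest_dir="${2:-/etc/kubernetes/manifests}"
  tmp_dir="${3:-/tmp}"
  wait_seconds="${4:-30}"   # assumed pause; not a value from the article

  if [ ! -f "$manifest_dir/$manifest" ]; then
    echo "manifest not found: $manifest_dir/$manifest" >&2
    return 1
  fi

  # Moving the manifest out makes the kubelet remove the static Pod.
  mv "$manifest_dir/$manifest" "$tmp_dir/" || return 1

  # Give the kubelet time to react. Checking with
  # `oc get po -n openshift-ovirt-infra` is the more reliable confirmation.
  sleep "$wait_seconds"

  # Moving the manifest back makes the kubelet create the Pod again.
  mv "$tmp_dir/$manifest" "$manifest_dir/"
}

# Usage:
#   recreate_static_pod coredns.yaml
#   recreate_static_pod keepalived.yaml   # file name is an assumption
```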

Several bugs were opened to address the CRI-O panic across different releases. Refer to [KCS#6814121](https://access.redhat.com/solutions/6814121 "Crio panics with 'panic: close of closed channel' after attempting to stop a container in OpenShift Container Platform 4").

Root Cause

After a CRI-O panic/crash and the subsequent kubelet restart, the static Pods were not re-created successfully. As a result, their Init Containers did not run to completion.

Diagnostic Steps

The Pods are in Pending status on the same worker.

$ oc get po -A -o wide | grep -v "Running" | grep -Ei 'coredns|keepalived'
openshift-ovirt-infra                             coredns-ocp-worker-0                           2/2    Pending    0         18h    10.0.0.1   ocp-worker-0   
openshift-ovirt-infra                             keepalived-ocp-worker-0                          2/2    Pending    0         23h    10.0.0.1    ocp-worker-0   

The Pod's .status.conditions show the reason ContainersNotInitialized for the render-config-* Init Container.

$ oc -n openshift-ovirt-infra get po coredns-ocp-worker-0 -o jsonpath='{.status.conditions}'
[{"lastProbeTime":null,"lastTransitionTime":"2022-04-07T12:39:40Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2022-04-07T12:39:48Z","message":"containers with incomplete status: [render-config-coredns]","reason":"ContainersNotInitialized","status":"False","type":"Initialized"}]
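
To see which Init Container is stuck, the per-container state can be read from .status.initContainerStatuses. This is a minimal sketch; the function name is hypothetical, and the output format depends on the oc client version.

```shell
# Hypothetical helper: print each Init Container name and its current state.
# Arguments: NAMESPACE POD
init_container_states() {
  oc -n "$1" get pod "$2" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
}

# Usage:
#   init_container_states openshift-ovirt-infra coredns-ocp-worker-0
```

An Init Container that finished shows a terminated state; one that never ran typically shows waiting.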

CRI-O panicked and crashed before the coredns and/or keepalived Pods were restarted and moved into Pending status.

# oc adm node-logs -u crio ocp-worker-0
Apr 07 00:15:46 ocp-worker-0 crio[2049]: goroutine 5727389 [running]:
Apr 07 00:15:46 ocp-worker-0 crio[2049]: panic(0x555ab05e5500, 0x555ab0880208)
Apr 07 00:15:46 ocp-worker-0 crio[2049]:         /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc00123b530 sp=0xc00123b468 pc=0x555aae769745
Apr 07 00:15:46 ocp-worker-0 crio[2049]: runtime.closechan(0xc0016ef800)
Apr 07 00:15:46 ocp-worker-0 crio[2049]:         /usr/lib/golang/src/runtime/chan.go:363 +0x3f5 fp=0xc00123b570 sp=0xc00123b530 pc=0x555aae737a55
Apr 07 00:15:46 ocp-worker-0 crio[2049]: github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer.func1(0xc00123b678, 0xc001dab800)
Apr 07 00:15:46 ocp-worker-0 crio[2049]:         /builddir/build/BUILD/cri-o-aebb17b9285b6f8100c2e5aa1509dd8bc6414f6f/_output/src/github.com/cri-o/cri-o/internal/oci/runtime_oci.go:688 +0x49 fp=0xc00123b588 sp=0xc00123b570 pc=0x555aafdab629
...
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Main process exited, code=killed, status=6/ABRT
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Failed with result 'signal'.
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Consumed 2h 59min 50.041s CPU time
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Service RestartSec=100ms expired, scheduling restart.
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Scheduled restart job, restart counter is at 1.
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio.service: Consumed 2h 59min 50.041s CPU time
Apr 07 00:15:48 ocp-worker-0  crio[1305511]: time="2022-04-07 00:15:48.636009465Z" level=info msg="Starting CRI-O, version: 1.21.4-9.rhaos4.8.gitaebb17b.el8, git: ()"
Apr 07 00:15:48 ocp-worker-0 systemd[1]: crio-wipe.service: Succeeded.
Apr 07 00:15:48 ocp-worker-0  systemd[1]: crio-wipe.service: Consumed 94ms CPU time
...
Apr 07 00:15:49 ocp-worker-0 systemd-coredump[1305481]: Process 2049 (crio) of user 0 dumped core.
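
The CRI-O journal is usually long, so it can help to keep only the crash indicators. This is a minimal sketch; the function name is hypothetical and the patterns are taken from the excerpt above, so other CRI-O versions may log different messages.

```shell
# Hypothetical filter: reads CRI-O journal lines on stdin and keeps only
# the crash indicators seen in the excerpt above.
crio_crash_lines() {
  grep -E 'panic\(|crio\.service: Main process exited|crio\.service: Failed with result|dumped core'
}

# Usage:
#   oc adm node-logs -u crio ocp-worker-0 | crio_crash_lines
```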

The kubelet failed to re-create the coredns and/or keepalived Pods on the worker.

# oc adm node-logs -u kubelet ocp-worker-0
Apr 07 00:16:08 ocp-worker-0 hyperkube[1307937]: E0407 00:16:08.595464 1307937 kubelet.go:1695] "Failed creating a mirror pod for" err="pods \"coredns-ocp-worker-0\" already exists" pod="openshift-ovirt-infra/coredns-ocp-worker-0"
---
Apr 07 00:16:08 ocp-worker-0 hyperkube[1307937]: E0407 00:16:08.593200 1307937 kubelet.go:1695] "Failed creating a mirror pod for" err="pods \"keepalived-ocp-worker-0\" already exists" pod="openshift-ovirt-infra/keepalived-ocp-worker-0"
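
To list which Pods are affected by the mirror Pod conflict, the kubelet journal can be reduced to the unique namespace/name pairs. This is a minimal sketch; the function name is hypothetical and the parsing assumes the message format shown above.

```shell
# Hypothetical filter: reads kubelet journal lines on stdin and prints the
# unique namespace/name of every Pod whose mirror Pod could not be created.
mirror_pod_conflicts() {
  grep -F 'Failed creating a mirror pod' |
    sed -n 's/.* pod="\([^"]*\)"[[:space:]]*$/\1/p' |
    sort -u
}

# Usage:
#   oc adm node-logs -u kubelet ocp-worker-0 | mirror_pod_conflicts
```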

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.