OVS and SDN Pods in CrashLoopBackOff after upgrade to OpenShift 4.6
Environment
- Red Hat OpenShift Container Platform
- 4.6.1 -> 4.6.6
Issue
- Some OVS Pods in the openshift-sdn namespace are in CrashLoopBackOff after upgrading to OpenShift 4.6.
# oc get pods -n openshift-sdn -o wide
ovs-gppl4 0/1 CrashLoopBackOff 6 43h 10.20.196.191 node01.example.com <none> <none>
sdn-97njs 0/2 CrashLoopBackOff 6 43h 10.20.196.191 node01.example.com <none> <none>
Resolution
This issue is reported as resolved in OpenShift 4.6.12, which included a fix for a kernel entropy related issue.
This issue was reviewed by the OpenShift engineering team in Bugzilla 1895024.
If you are seeing this issue after a reboot of a node, check whether the openvswitch and ovs-configuration services are running. If they are not, you can restart them manually to work around the issue.
DO NOT DO THIS DURING AN UPGRADE ON A NODE
Check that your node has booted into the new osImage first. If it is at a "CoreOS 46..." version, then proceed with restarting the services:
$ oc get node worker-0.example.com -o template='{{.status.nodeInfo.osImage}}{{"\n"}}'
Red Hat Enterprise Linux CoreOS 46.82.202011111640-0 (Ootpa)
$ systemctl status openvswitch.service
$ systemctl status ovs-configuration.service
$ systemctl restart openvswitch.service
$ systemctl restart ovs-configuration.service
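The restart sequence above can be wrapped in a small script. This is an illustrative sketch only, assuming it is run on the affected node after the osImage check (and, per the warning above, never during an upgrade); the DRY_RUN guard is a hypothetical safety default that only prints the commands:

```shell
#!/bin/sh
# Sketch of the manual workaround above. With DRY_RUN=1 (the default here)
# each command is only printed; set DRY_RUN=0 on the affected node to
# actually restart the services.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"                     # show the command being (or to be) run
    [ "$DRY_RUN" = "1" ] || "$@"    # execute only when DRY_RUN is disabled
}
run systemctl restart openvswitch.service
run systemctl restart ovs-configuration.service
```

Run as-is it only prints the two restart commands, each prefixed with "+".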
Root Cause
- At this point, research into this issue shows that the OVS related services are timing out during startup and failing to recover afterwards.
- A possible root cause is low entropy on the system.
Diagnostic Steps
- Check if OVS related services are running using the following systemctl command, or check sos_commands/systemd inside of a Sosreport:
$ systemctl status ovs-configuration openvswitch ovsdb-server ovs-vswitchd
- Check if OVS related services failed during startup using the following journalctl command, or by checking sos_commands/logs inside of a Sosreport:
$ journalctl --no-pager | egrep 'systemd.* (ovsdb-server|openvswitch|ovs-configuration|ovs-vswitchd)\.service'
Dec 05 01:21:04 node01.example.com systemd[1]: openvswitch.service: Consumed 2ms CPU time
Dec 05 01:21:04 node01.example.com systemd[1]: ovs-vswitchd.service: Consumed 1min 58.508s CPU time
Dec 05 01:21:05 node01.example.com systemd[1]: ovsdb-server.service: Consumed 13.805s CPU time
Dec 05 01:23:24 localhost systemd[1]: ovsdb-server.service: Start operation timed out. Terminating.
Dec 05 01:24:06 localhost systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
Dec 05 01:24:06 localhost systemd[1]: ovs-configuration.service: Job ovs-configuration.service/start failed with result 'dependency'.
Dec 05 01:24:06 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.
Dec 05 01:24:06 localhost systemd[1]: ovs-vswitchd.service: Job ovs-vswitchd.service/start failed with result 'dependency'.
Dec 05 01:24:06 localhost systemd[1]: ovsdb-server.service: Consumed 189ms CPU time
Dec 05 01:24:07 localhost systemd[1]: ovsdb-server.service: Service RestartSec=100ms expired, scheduling restart.
Dec 05 01:24:07 localhost systemd[1]: ovsdb-server.service: Scheduled restart job, restart counter is at 1.
Dec 05 01:24:07 localhost systemd[1]: ovsdb-server.service: Consumed 0 CPU time
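The start-failure records in output like the above can be isolated with a simple filter. A minimal sketch, demonstrated here against an excerpt of the sample journal lines rather than a live journal (on a node, the here-document would be replaced by journalctl --no-pager):

```shell
#!/bin/sh
# Keep only the lines where a unit start failed with 'timeout' or
# 'dependency'; the here-document stands in for real journalctl output.
cat <<'EOF' | grep -iE "failed with result '(timeout|dependency)'"
Dec 05 01:23:24 localhost systemd[1]: ovsdb-server.service: Start operation timed out. Terminating.
Dec 05 01:24:06 localhost systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
Dec 05 01:24:06 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.
EOF
```

Only the 'timeout' and 'dependency' result lines survive the filter; the "Start operation timed out" line is dropped.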
- Check if the system has low entropy in the following file (the file location is the same in a Sosreport). Note that low entropy is difficult to determine exactly, but an entropy value lower than 1,000 is generally accepted as "low" and could lead to process hangs:
$ cat /proc/sys/kernel/random/entropy_avail
815
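The check can be scripted against the same file. A minimal sketch; the 1,000 threshold is the rule of thumb mentioned above, not a hard kernel limit:

```shell
#!/bin/sh
# Compare the kernel's available-entropy estimate against the informal
# ~1,000 "low" threshold discussed above.
THRESHOLD=1000
entropy=$(cat /proc/sys/kernel/random/entropy_avail)
if [ "$entropy" -lt "$THRESHOLD" ]; then
    echo "LOW entropy: $entropy (below $THRESHOLD)"
else
    echo "OK entropy: $entropy"
fi
```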
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.