How to troubleshoot CRI-O and gather a CRI-O goroutine stack
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
- containers not getting created/deleted
- crictl commands not responding
- CRI-O using much more memory than usual
- how to gather a CRI-O goroutine stack
Resolution
- In general it is good to start with a baseline: collect an sosreport from the node having issues. It contains the general health of the node, the journal logs (which include the crio logs), and the service statuses. See gather an sosreport from the node.
- The sosreport should help, but in some cases further data is needed. Setting debug logging for CRI-O generates many more logs, which should help pinpoint issues as they arise. Note that this restarts the process, which may hide the current problem. See How to configure CRI-O logLevel in OpenShift 4.
- If CRI-O is not performing certain operations, or is using much more memory than usual, it may have goroutines that are not completing even though CRI-O itself is still responsive. In that case Support may ask for the goroutine stacks to be printed by running one of the following commands on the node. Neither command terminates the process; each only sends a USR1 signal to it. Replace <crio-pid> with the PID of the crio process.
kill -USR1 <crio-pid>
systemctl kill -s USR1 crio.service
CRI-O will catch the signal and write the goroutine stacks to /tmp/crio-goroutine-stacks-${timestamp}.log
Attach the file to the case/bugzilla/issue.
- If the process is entirely non-responsive, it may be necessary to collect an strace or a CRI-O core dump.
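The goroutine dump step can be wrapped in a small helper, sketched below for illustration. It assumes bash, root access on the node, and that CRI-O writes its dump to /tmp as described above. The function names and the two-second wait are hypothetical choices, not part of CRI-O.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and the sleep duration are illustrative
# assumptions, not part of CRI-O or OpenShift.

# Print the most recently modified goroutine stack file in a directory
# (defaults to /tmp). Prints nothing if no matching file exists.
latest_stack_dump() {
    local dir="${1:-/tmp}"
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/crio-goroutine-stacks-*.log 2>/dev/null | head -n 1
}

# Send USR1 to the crio service, wait briefly, then print the newest dump.
dump_crio_goroutines() {
    systemctl kill -s USR1 crio.service || return 1
    sleep 2   # arbitrary pause so CRI-O has time to write the file
    latest_stack_dump /tmp
}
```

Running dump_crio_goroutines as root on the affected node should print the path of the newest stack file, which can then be attached to the case. If no new file appears, the printed path may belong to an earlier dump, so it is worth checking the timestamp in the file name.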