Troubleshooting OpenShift Container Platform 3.x: Basics
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3
- Red Hat Enterprise Linux (RHEL) 7
- Red Hat Enterprise Linux Atomic Host
Issue
I have a problem in my OpenShift Container Platform 3 environment that I need to investigate.
- How can I troubleshoot it?
- What logs can I inspect?
- How can I modify the log level / detail that OpenShift generates?
- I need to provide supporting data to technical support for analysis. What information is needed?
Resolution
Note: For OpenShift 4.x see Troubleshooting OpenShift Container Platform 4.x
Table of Contents
- General environment information
- Detailed environment information
- Application Diagnostics
- Log levels
- Command line
- Registry issues
- Network issues
- Routing issues
- Builds
- Installer issues
- DNS issues
- Metrics issues
- Logging issues
- ETCD issues
General environment information
A starting point for data collection from an OpenShift master or node is a sosreport that includes docker and OpenShift related information. The process to collect a sosreport is the same as with any other Red Hat Enterprise Linux (RHEL) based system:
# yum update sos
# sosreport
Please see: Generating an sosreport for Atomic Host or an OpenShift 3 master/node for more detail on collecting a sosreport on RHEL / Atomic Host.
Detailed environment information
Please see: Collecting OpenShift Cluster, Master, Node, and Project information for the other command output that support may need and that a sosreport does not collect.
Application Level Debugging Commands
Please see: Troubleshooting an application's images on OpenShift 3.x for the commands to collect when debugging an application.
Server log levels
See OpenShift log configuration settings for more information on setting up logging levels.
You can control the log level of OpenShift components via their --loglevel parameter. This parameter can be set in the OPTIONS for the relevant component's configuration within /etc/sysconfig.
For example, to set the OpenShift node's log level to debug, add or edit this line in /etc/sysconfig/atomic-openshift-node:
OPTIONS=--loglevel=4
Then restart the service with systemctl restart atomic-openshift-node.
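As a sketch, the edit above can also be made non-interactively; the sed pattern below assumes OPTIONS already contains a --loglevel value, and a temporary copy is used for illustration (on a real node the target file is /etc/sysconfig/atomic-openshift-node):

```shell
# Sketch: raise an existing --loglevel to 4 (debug) in a sysconfig-style file.
# A temp file stands in for /etc/sysconfig/atomic-openshift-node here.
cfg=$(mktemp)
printf 'OPTIONS=--loglevel=2\n' > "$cfg"

# Rewrite whatever loglevel is currently set to 4
sed -i 's/--loglevel=[0-9]*/--loglevel=4/' "$cfg"
grep OPTIONS "$cfg"   # OPTIONS=--loglevel=4

# On a real node, apply the change with:
#   systemctl restart atomic-openshift-node
```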
Client command line troubleshooting
The command-line interface (CLI) command oc also accepts a --loglevel option that can help obtain additional details of what's happening during command invocation. Setting this value between 6 and 8 will provide extensive logging, including the API requests being sent (loglevel 6), their headers (loglevel 7), and the responses received (loglevel 8):
$ oc whoami --loglevel=8
Registry issues
To collect information about the pod that is running the internal registry:
# oc get pods -n default | grep registry | awk '{ print $1 }' | xargs -i oc describe pod {}
A basic health check to verify that the internal registry is running and responding on its service address is to "ping" its /healthz path. Under normal circumstances this should return a 200 HTTP response:
# RegistryAddr=$(oc get service -l docker-registry -n default -o=jsonpath='{range.items[*].spec}{.clusterIP}:{.ports[0].port}{"\n"}{end}')
# curl -vk $RegistryAddr/healthz
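When the /healthz check is scripted, the status code can be captured and evaluated. The helper below is a hypothetical sketch (check_healthz is not part of OpenShift), with the live curl shown only as a comment since it needs a running registry:

```shell
# Hypothetical helper: interpret the HTTP status code of the /healthz check.
# On a live cluster the code would come from:
#   curl -sk -o /dev/null -w '%{http_code}' "$RegistryAddr/healthz"
check_healthz() {
  code="$1"
  if [ "$code" = "200" ]; then
    echo "registry healthy"
  else
    echo "registry unhealthy (HTTP $code)"
  fi
}

check_healthz 200   # registry healthy
check_healthz 503   # registry unhealthy (HTTP 503)
```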
Network issues
Network related problems can have many different causes and symptoms.
See the following documentation on how to troubleshoot OpenShift's SDN.
Routing issues
If you are having issues with routes to your application, gather the following information:
$ oc logs dc/router -n default
$ oc get dc/router -o yaml -n default
$ oc get route <NAME_OF_ROUTE> -n <PROJECT>
$ oc get endpoints --all-namespaces
$ oc exec -it $ROUTER_POD -- ls -la
$ oc exec -it $ROUTER_POD -- find /var/lib/haproxy -regex ".*\(.map\|config.*\|.json\)" -print -exec cat {} \; > haproxy_configs_and_maps
- Use the information above to check that the basic configuration files have the proper routing information and recent time-stamps, meaning that updates are happening.
- If you see that the router configuration files are not being updated and there are no "Observed a panic" messages in the logs, collect an strace of the router process to look for such messages, which may not have been logged properly.
- Note: This assumes you are system:admin and that you are in the default project (it may also work with other users who have cluster-admin permissions).
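To illustrate what the find expression above gathers, here is a small sketch run against a throwaway directory standing in for /var/lib/haproxy inside the router pod:

```shell
# Sketch: show which file names the find regex matches, using a temp
# directory in place of /var/lib/haproxy in the router pod
hpdir=$(mktemp -d)
touch "$hpdir/haproxy.config" "$hpdir/os_http_be.map" \
      "$hpdir/routes.json" "$hpdir/notes.txt"

# Same regex as the collection command above: .map, config*, and .json match
find "$hpdir" -regex ".*\(.map\|config.*\|.json\)" | sort
# prints haproxy.config, os_http_be.map and routes.json; notes.txt is excluded
```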
If the pod is serving routes and is reachable via the standard http(s) ports, verify that the DNS names assigned to your applications resolve to the address of the node where the router is running. If your application's domain is apps.example.com, you can check with dig for an ANSWER containing that node's IP address:
$ dig +norecurse \*.apps.example.com
If certificates are not being served out correctly, check the logs for errors and, if there is a VIP or proxy in front of the routers, make sure it passes SSL through to the router (SSL passthrough) instead of terminating it. To confirm which certificates are being served, run the following:
# echo -n | openssl s_client -connect 192.168.1.5:443 -servername myawesome.apps.example.com 2>&1 | openssl x509 -noout -text
# curl -kv https://myawesome.apps.example.com
Where 192.168.1.5 is the IP of the node where the router is located, and "myawesome.apps.example.com" is the hostname assigned to the route.
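As a sketch of the inspection step, you can generate a throwaway self-signed certificate and examine it with the same openssl x509 invocation (the CN below is just an example hostname):

```shell
# Sketch: create a throwaway self-signed certificate, then inspect it the
# same way the router-served certificate is inspected above
certdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=myawesome.apps.example.com" \
  -keyout "$certdir/tls.key" -out "$certdir/tls.crt" 2>/dev/null

# Print just the subject; exact formatting varies by openssl version
openssl x509 -in "$certdir/tls.crt" -noout -subject
```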
Control loop Debugging
In OCP 3.4, 3.5, and later, the routers have the capability to enable "profile debugging". This implements an HTTP-endpoint-controlled debugger, enabled by setting OPENSHIFT_PROFILE=web.
- Due to how this feature works, in some situations it may be necessary to override the address it listens on (default 127.0.0.1) and the port (default 6061) using the OPENSHIFT_PROFILE_HOST and OPENSHIFT_PROFILE_PORT environment variables respectively.
This debugging tool is disabled by default until OPENSHIFT_PROFILE=web is set.
You can set the built-in Go tracing and profiling variables to get additional debugging output from the router:
# oc env dc/router GOTRACEBACK=2
# oc env dc/router OPENSHIFT_PROFILE=web
You then need to redeploy the routers for the changes to be picked up:
# oc deploy dc/router --latest
With these variables set, you can run the following from the node where the router is running:
# curl http://127.0.0.1:6061/debug/pprof/goroutine?debug=1 > goroutine_debug_1
# curl http://127.0.0.1:6061/debug/pprof/goroutine?debug=2 > goroutine_debug_2
# curl http://127.0.0.1:6061/debug/pprof/block?debug=1 > block_debug_1
Image builds, Source to Image (S2I / STI)
If you want to troubleshoot a particular build, you can view its logs with:
$ oc logs bc/<build_name>
You can adjust the verbosity of a build's logs by specifying a BUILD_LOGLEVEL environment variable in the buildConfig's strategy.
See OpenShift build log settings for examples and more details.
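As a hedged example, a source-strategy buildConfig with verbose build logs might look like the fragment below (the builder image name and level 5 are illustrative, not required values):

```yaml
# Sketch: buildConfig strategy with BUILD_LOGLEVEL set for verbose logs
strategy:
  type: Source
  sourceStrategy:
    from:
      kind: ImageStreamTag
      name: nodejs:latest        # example builder image
    env:
      - name: BUILD_LOGLEVEL
        value: "5"               # example verbosity level
```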
Installer issues
If you are using the advanced installation option, there are some details to collect when troubleshooting an installer problem:
- The inventory file is a key ingredient. This is /etc/ansible/hosts by default, or the file specified with the -i option when you run ansible-playbook.
- It can help to re-run a playbook with increased verbosity: run ansible-playbook again adding -vvv and redirect the output to a file.
See also this knowledge base entry with more details about troubleshooting the installer.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.