Troubleshooting OpenShift Container Platform 3.x: Basics


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3
  • Red Hat Enterprise Linux (RHEL)
    • 7
    • Atomic

Issue

I have a problem in my OpenShift Container Platform 3 environment that I need to investigate.

  • How can I troubleshoot it?
  • What logs can I inspect?
  • How can I modify the log level / detail that OpenShift generates?
  • I need to provide supporting data to technical support for analysis. What information is needed?

Resolution

Note: For OpenShift 4.x see Troubleshooting OpenShift Container Platform 4.x

General environment information

A starting point for data collection from an OpenShift master or node is a sosreport that includes Docker- and OpenShift-related information. The process to collect a sosreport is the same as with any other Red Hat Enterprise Linux (RHEL) based system:

# yum update sos
# sosreport 

Please see: Generating an sosreport for Atomic Host or an OpenShift 3 master/node for more detail on collecting a sosreport on RHEL / Atomic Host.

More Detailed environment information

Please see: Collecting OpenShift Cluster, Master, Node, and Project information for the additional commands and data that support may need beyond what a sosreport collects.

Application Level Debugging Commands

Please see: Troubleshooting an application's images on OpenShift 3.x for the commands to collect output from when debugging an application.

Server log levels

See OpenShift log configuration settings for more information on setting up logging levels.

You can control the log level of OpenShift components via their --loglevel parameter. This parameter can be set in the OPTIONS for the relevant component's configuration within /etc/sysconfig.

For example, to set the OpenShift node's log level to debug, add or edit this line in /etc/sysconfig/atomic-openshift-node:

OPTIONS=--loglevel=4

and then restart the service with systemctl restart atomic-openshift-node.
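The edit above can be scripted. The following is a minimal sketch that bumps an existing --loglevel value to 4 (or adds the line if missing); it operates on a temporary copy so the steps can be tried safely, but on a real node the target file would be /etc/sysconfig/atomic-openshift-node:

```shell
# Work on a temporary copy of a sysconfig-style file; on a real node,
# point $conf at /etc/sysconfig/atomic-openshift-node instead.
conf=$(mktemp)
printf 'OPTIONS=--loglevel=2\n' > "$conf"

# Replace any existing --loglevel value with 4, or add the line if absent.
if grep -q -- '--loglevel' "$conf"; then
  sed -i 's/--loglevel=[0-9]*/--loglevel=4/' "$conf"
else
  printf 'OPTIONS=--loglevel=4\n' >> "$conf"
fi

grep OPTIONS "$conf"   # OPTIONS=--loglevel=4
# Then: systemctl restart atomic-openshift-node
```

The restart is required because the service only reads its sysconfig OPTIONS at startup.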

Client command line troubleshooting

The command-line interface (CLI) command oc also accepts a --loglevel option that can help reveal what is happening during command invocation. Setting this value between 6 and 8 provides extensive logging, including the API requests being sent (loglevel 6), their headers (loglevel 7), and the responses received (loglevel 8):

$ oc whoami --loglevel=8

Registry issues

Troubleshooting Registry

To collect information about the pod that is running the internal registry:

# oc get pods -n default | grep registry | awk '{ print $1 }' | xargs -i oc describe pod {}

A basic health check to verify that the internal registry is running and responding to its service address is to "ping" its /healthz path. Under normal circumstances this should return a 200 HTTP response:

# RegistryAddr=$(oc get service -l docker-registry -n default -o=jsonpath='{range.items[*].spec}{.clusterIP}:{.ports[0].port}{"\n"}{end}')
# curl -vk $RegistryAddr/healthz
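The health check above can be wrapped so it fails loudly on anything other than HTTP 200. In this sketch, check_healthz is a hypothetical helper (not part of OpenShift), and the curl stub at the end only exists so the logic can be exercised offline; on a real master, drop the stub and pass in the $RegistryAddr discovered above:

```shell
# Hypothetical helper: curl the registry /healthz path and report on the
# HTTP status code. Expects "host:port" as its single argument.
check_healthz() {
  addr=$1
  code=$(curl -ks -o /dev/null -w '%{http_code}' "http://$addr/healthz")
  if [ "$code" = "200" ]; then
    echo "registry healthy ($addr)"
  else
    echo "registry UNHEALTHY ($addr): HTTP $code" >&2
    return 1
  fi
}

# Offline demo only: stub curl to return 200 as a healthy registry would.
# Remove this line to run the check against a live registry.
curl() { printf '200'; }
check_healthz 172.30.1.1:5000   # registry healthy (172.30.1.1:5000)
```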

Network issues

Network related problems can have many different causes and symptoms.

See the following documentation on how to troubleshoot OpenShift's SDN.

Routing issues

If you are having issues with routes to your application, gather the following information:

$ oc logs dc/router -n default
$ oc get dc/router -o yaml -n default
$ oc get route <NAME_OF_ROUTE> -n <PROJECT>
$ oc get endpoints --all-namespaces 
$ oc exec -it $ROUTER_POD -- ls -la 
$ oc exec -it $ROUTER_POD -- find /var/lib/haproxy -regex ".*\(.map\|config.*\|.json\)" -print -exec cat {} \; > haproxy_configs_and_maps
  • Use the information above to check that the basic configuration files contain the proper routing information and recent time-stamps, meaning that updates are happening.
  • If the router configs / files have not been updated for some time and you do not see any "Observed a panic" messages in the logs, collect an strace of the router process to look for such messages, which may not have been logged properly.
    • Note: This assumes you are system:admin and in the default project (it may also work for other users with cluster-admin permissions).
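The collection steps above can be wrapped into one directory for attaching to a support case. In this sketch, gather_router_data is a hypothetical helper name, and the oc stub at the end only lets the sketch run without a cluster; on a real master, remove the stub and run the function as-is (assuming cluster-admin permissions and the default router in the default project):

```shell
# Hypothetical helper: collect router logs, deployment config, and
# endpoints into one directory for support.
gather_router_data() {
  dir=${1:-router_debug}
  mkdir -p "$dir"
  oc logs dc/router -n default > "$dir/router.log"
  oc get dc/router -o yaml -n default > "$dir/router_dc.yaml"
  oc get endpoints --all-namespaces > "$dir/endpoints.txt"
}

# Offline demo only: stub oc so the sketch can be exercised without a
# cluster. Delete this line on a real master.
oc() { echo "stub: oc $*"; }
gather_router_data /tmp/router_debug
ls /tmp/router_debug
```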

If the pod is serving routes and is reachable via the standard http(s) ports, verify that the DNS domain names assigned to your applications resolve to the address of the node where the router is running. If your application's domain is apps.example.com, you can check with dig for an ANSWER containing that node's IP address:

$ dig +norecurse \*.apps.example.com

If certificates are not being served out correctly, check the logs for errors and make sure that any VIP or proxy in front of the routers passes SSL through to the routers (SSL passthrough) rather than terminating it. To confirm which certificates are being served out, run the following:

 # echo -n | openssl s_client -connect 192.168.1.5:443 -servername myawesome.apps.example.com 2>&1 | openssl x509 -noout -text
 # curl -kv https://myawesome.apps.example.com 

Control loop Debugging

In 3.4, 3.5, and later, the OCP routers have the capability to enable "profile debugging". This exposes an HTTP-endpoint-controlled debugger, enabled by setting OPENSHIFT_PROFILE=web.

  • Due to how this feature works in some situations it may be necessary for you to override the address it listens
    on (default is 127.0.0.1) and the port (default 6061) using the OPENSHIFT_PROFILE_HOST and OPENSHIFT_PROFILE_PORT environment variables respectively.

This debugging tool is disabled by default until OPENSHIFT_PROFILE=web is set.

You can also enable the built-in Go tracing and profiling options to get additional debugging output from the router:

# oc env dc/router GOTRACEBACK=2
# oc env dc/router OPENSHIFT_PROFILE=web

You then need to re-deploy the router(s) for the changes to take effect:

# oc deploy dc/router --latest 

With these variables set, you can run the following from the node where the router is running:

# curl http://127.0.0.1:6061/debug/pprof/goroutine?debug=1 > goroutine_debug_1
# curl http://127.0.0.1:6061/debug/pprof/goroutine?debug=2 > goroutine_debug_2
# curl http://127.0.0.1:6061/debug/pprof/block?debug=1 > block_debug_1
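The three downloads above can be collected in one pass. In this sketch, collect_pprof is a hypothetical helper; it assumes the profiler listens on the default 127.0.0.1:6061, so adjust the arguments if OPENSHIFT_PROFILE_HOST or OPENSHIFT_PROFILE_PORT are overridden:

```shell
# Hypothetical helper: fetch the standard pprof endpoints from the router's
# profiler and save each to a pprof_* file in the current directory.
collect_pprof() {
  base="http://${1:-127.0.0.1}:${2:-6061}/debug/pprof"
  for p in "goroutine?debug=1" "goroutine?debug=2" "block?debug=1"; do
    # Turn "goroutine?debug=1" into a safe filename suffix.
    out="pprof_$(echo "$p" | tr '?=' '__')"
    curl -s "$base/$p" > "$out" && echo "saved $out"
  done
}

collect_pprof   # run on the node where the router is running
```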

Image builds, Source to Image (S2I / STI)

If you want to troubleshoot a particular build of your application, you can view its logs with:

$ oc logs bc/<build_name>

You can adjust the verbosity of a build's logs by specifying a BUILD_LOGLEVEL environment variable in the buildConfig's strategy.

See OpenShift build log settings for examples and more details.
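For reference, the buildConfig strategy fragment that raises build verbosity looks like the following. This is a sketch: the fragment is printed via a heredoc so it can be inspected, and in practice you would apply it with oc edit bc/<name> or oc set env bc/<name> BUILD_LOGLEVEL=5 (a build config name of your own):

```shell
# Print the buildConfig strategy fragment that sets BUILD_LOGLEVEL=5
# (shown for a source strategy build; docker strategy builds use
# dockerStrategy instead).
frag=$(cat <<'EOF'
strategy:
  sourceStrategy:
    env:
    - name: BUILD_LOGLEVEL
      value: "5"
EOF
)
echo "$frag"
```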

Installer issues

If you are using the Advanced installation option, there are some details to collect when troubleshooting an installer problem:

  • The inventory file is a key ingredient. This is /etc/ansible/hosts by default, or the file specified with the -i option when you run ansible-playbook

  • It can help to re-run a playbook with increased verbosity: run ansible-playbook again, adding -vvv, and redirect the output to a file
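A verbose re-run can be sketched as follows. The inventory and playbook paths shown are the RPM-install defaults and may differ in your environment; the command-v guard only keeps the sketch runnable on hosts without openshift-ansible installed:

```shell
# Re-run the advanced installer playbook with maximum verbosity, keeping a
# log to attach to a support case. Paths are the RPM-install defaults.
inventory=${INVENTORY:-/etc/ansible/hosts}
playbook=${PLAYBOOK:-/usr/share/ansible/openshift-ansible/playbooks/byo/config.yml}
log=openshift_install_verbose.log

if command -v ansible-playbook >/dev/null 2>&1; then
  ansible-playbook -i "$inventory" -vvv "$playbook" 2>&1 | tee "$log"
else
  echo "ansible-playbook not found; install openshift-ansible first" | tee "$log"
fi
```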

See also this knowledge base entry with more details about troubleshooting the installer.


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.