Troubleshooting OpenShift Container Platform 4.x: UPI Installation

Definitions

MCO - Machine Config Operator
RHCOS - Red Hat Enterprise Linux CoreOS
Bootstrap Node - The node which is responsible for creating the control plane
Master Nodes - The nodes which comprise the cluster control plane
Worker Nodes - Provide compute capacity

Overview of Installation Process

Below are the various stages of installation. This list is by no means comprehensive, but it covers the stages of the install that have proven problematic for installers.

  1. Ignition files are generated by openshift-installer
  2. Bootstrap and masters install RHCOS
  3. Bootstrap node starts and sets up temporary control plane
  4. Master nodes pull their ignition configuration from the bootstrap MCO at :22623/config/master
  5. Master nodes start their kubelet, API server, and etcd.
  6. openshift-install wait-for bootstrap-complete waits for the completion of the bootstrap
  7. Bootstrap node is torn down
  8. Worker node(s) install RHCOS
  9. Worker node(s) pull their ignition configuration from the cluster MCO
  10. Worker node(s) join the cluster

The installation documentation has very specific infrastructure requirements that should be carefully reviewed.


1. Ignition files are generated by openshift-installer

Ignition files which are generated by the installer contain essential configuration information for the bootstrap node. This includes certificates which are used for authentication throughout the cluster.

Common Issues

  1. Ignition files are more than 24 hours old, so the certificates they contain have expired. See Creating the Ignition config files. A cluster will not install with expired certificates.
  2. The file install-config.yaml is inaccessible. See article for details.
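A quick way to catch the first issue before starting the install is to check the age of the generated ignition files. A minimal sketch, assuming the generated assets live in ./install_dir (adjust the path to your installation directory); files older than 24 hours must be regenerated with openshift-install create ignition-configs:

```shell
# List any .ign file older than 24 hours (1440 minutes); the bootstrap
# certificates embedded in them expire after that window.
stale=$(find install_dir -name '*.ign' -mmin +1440 2>/dev/null)
if [ -n "$stale" ]; then
  echo "stale ignition files:"
  echo "$stale"
else
  echo "no stale ignition files found"
fi
```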

2. Bootstrap and masters install RHCOS

When RHCOS is installed, it requires a few configuration elements:

  • URL of the RHCOS dependencies
  • URL of the ignition file for the node type (bootstrap, master, or worker)
  • Target storage device

Common Issues

  1. URL of the ignition file is invalid
    If the HTTP server address is valid but the URL returns a 404, the RHCOS installer will not stop. Check the HTTP server logs to confirm that RHCOS successfully retrieved the ignition file. See Ignition File not Read During UPI Installation.

  2. The HTTP server is not hosting the ignition files
    If the HTTP server is unreachable, RHCOS will fail with a warning message.

  3. The target storage device is incorrect or inaccessible
    RHCOS installation will fail with a warning message.

  4. No IP address has been assigned to the nodes
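The first two issues can be distinguished from any host with network access to the HTTP server by checking the status code it returns. A sketch; the address 192.0.2.10:8080 and the file name are placeholders for wherever you host the generated ignition files:

```shell
# A 404 means the server is reachable but the path is wrong; 000 means the
# server is unreachable. Substitute your HTTP server address and file name.
code=$(curl -s --connect-timeout 3 -o /dev/null -w '%{http_code}' \
  "http://192.0.2.10:8080/bootstrap.ign")
echo "HTTP status: $code"
```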


3. Bootstrap node starts and sets up temporary control plane

The bootstrap node creates a single-node cluster. This cluster is responsible for hosting the ignition configuration that the masters use to complete their installation of RHCOS. The health of the bootstrap node can be verified by reviewing the logs gathered from running installer-gather. Look for the presence of the following files in rendered-assets/openshift:

  • config-bootstrap.done
  • cvo-bootstrap.done
  • kube-apiserver-bootstrap.done
  • kube-controller-manager-bootstrap.done
  • kube-scheduler-bootstrap.done
  • mco-bootstrap.done

If these files do not exist, a review of the logs in bootstrap/containers and bootstrap/journals is required.
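The check above can be scripted against the gathered output. A minimal sketch, assuming it is run from the directory containing the rendered-assets/openshift tree:

```shell
# Report which bootstrap phase marker files are missing, if any.
missing=""
for f in config cvo kube-apiserver kube-controller-manager kube-scheduler mco; do
  [ -f "rendered-assets/openshift/${f}-bootstrap.done" ] || missing="$missing ${f}-bootstrap.done"
done
if [ -n "$missing" ]; then
  echo "missing:$missing"
  echo "review the logs in bootstrap/containers and bootstrap/journals"
else
  echo "all bootstrap phases completed"
fi
```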

etcd errors

It is normal for bootkube.log on the bootstrap node to show errors about not being able to connect to etcd on the master nodes.

Jul 12  02:58:41  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: Waiting for etcd cluster...
Jul 12  03:08:44  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: https://etcd-0.my.cluster.fqdn.com:2379 is unhealthy: failed to connect: dial tcp 10.0.0.51:2379: connect: connection refused
Jul 12  03:08:44  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: https://etcd-2.my.cluster.fqdn.com:2379 is unhealthy: failed to connect: dial tcp 10.0.0.53:2379: connect: connection refused
Jul 12  03:08:44  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: https://etcd-1.my.cluster.fqdn.com:2379 is unhealthy: failed to connect: dial tcp 10.0.0.52:2379: connect: connection refused
Jul 12  03:08:44  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: Error: unhealthy cluster
Jul 12  03:08:44  bootstrap.my.cluster.fqdn.com  bootkube.sh[1512]: etcdctl failed. Retrying in 5 seconds...

The bootstrap node is attempting to join the masters to its etcd cluster. The masters start etcd as part of their bootstrapping process and join the cluster when ready.

Ensure that control plane nodes have at least 16 GB of memory. Configuring control plane nodes with less than 16 GB may cause OOM issues. These memory issues may cause etcd to fail even after the etcd cluster is up.
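The memory minimum can be verified directly on each control plane node. A sketch (Linux-only, since it reads /proc/meminfo):

```shell
# Report total memory in GiB and warn when it is below the 16 GiB minimum.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_gib=$((mem_kb / 1024 / 1024))
echo "control plane memory: ${mem_gib} GiB"
if [ "$mem_gib" -lt 16 ]; then
  echo "WARNING: below the 16 GiB minimum; etcd may hit OOM issues"
fi
```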


4. Master nodes pull their ignition configuration from the bootstrap

Ignition is performed in two phases:

  1. The ignition file from the installer is provided to RHCOS via an HTTP GET to an HTTP server hosting the generated ignition files. This ignition file contains a certificate authority and a URL which points to the ignition which will be hosted by the bootstrap MCO.
  2. After RHCOS installation, the master node will reboot and begin polling the bootstrap node for the ignition hosted by the bootstrap MCO.

Common Issues

  1. The MCO endpoint is not accessible to the master. Check to see if the MCO is available from the system performing the installation.
    curl https://api-int.<cluster_name>.<domain>:22623/config/master

If it is not, check:

  • Load balancer configuration, to ensure that port 22623 is configured per the installation documentation (see NETWORK TOPOLOGY REQUIREMENTS)
  • The DNS record api-int.<cluster_name>.<base_domain> is defined and resolves to the load balancer (see User-provisioned DNS requirements)
  • Bypass the load balancer and the DNS server and check whether the MCO is serving ignition data. From the bootstrap node, run:

curl -vik --resolve api-int.<cluster_name>.<domain>:22623:<bootstrap ip> https://api-int.<cluster_name>.<domain>:22623/config/master

2. The real-time clock (or time zone, in the case of vSphere) may be inconsistent with the system that generated the ignition files. See Masters and Workers Fail to Ignite Reporting Error 'x509: certificate has expired or not yet valid'.

3. Check the bootstrap.ign for expired certificates

FILES=$(jq -r '.storage.files[].path' bootstrap.ign | grep crt)
for FILE in $FILES; do
  echo "$FILE"
  jq -r ".storage.files[] | select(.path==\"$FILE\") | .contents.source" bootstrap.ign \
    | sed 's|^data:text/plain;charset=utf-8;base64,||' \
    | base64 -d \
    | openssl x509 -text -noout \
    | grep -E 'Not (Before|After)'
done
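The openssl step at the end of that loop can be exercised on its own. As a sketch, generate a throwaway self-signed certificate and read its validity window (the /tmp paths are arbitrary):

```shell
# Create a short-lived demo certificate purely to illustrate the check.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 1 2>/dev/null
# Print the validity window; on a certificate extracted from bootstrap.ign,
# an expired notAfter date confirms the ignition files must be regenerated.
dates=$(openssl x509 -noout -dates -in /tmp/demo.crt)
echo "$dates"
```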

5. Master nodes start their kubelet, API server, and etcd

When the master nodes start, they will start etcd, the kubelet, and the API server.

etcd Common Issues

  • Are the DNS records for etcd defined (see User-provisioned DNS requirements)? If not, you will see errors in the master container logs:
    2019-07-17T20:56:36.528504751+00:00 stderr F E0717 20:56:36.528381 1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.my.cluster.fqdn.com on 10.0.0.11:53: no such host
  • If there is a primary and a secondary DNS server defined, confirm that all defined DNS servers are accessible by the masters.

If the master logs are not contained in the output from installer-gather, they can be pulled directly by running:
ssh core@<master host name> sudo tail -f /var/log/containers/*.log

Kubelet/API Server Common Issues

  1. The DNS record for the API server is not defined or is not pointing to the load balancer
    The kubelet expects to be able to communicate with the API server. It will attempt to connect to the API server on https://api-int.<cluster_name>.<base_domain>:6443. See User-provisioned DNS requirements

  2. The load balancer does not include the master nodes in the definition for port 6443. Even though the masters are not yet available at the time the load balancer is configured, they must be included. Once the API server endpoints for the masters become available, this allows services which require access to the master API server to begin communicating with it. See NETWORK TOPOLOGY REQUIREMENTS.

  3. The host name for the node was not provided via DHCP. See Networking requirements for user-provisioned infrastructure. The host name uniquely identifies the node in the cluster, so each node must have a unique host name.

If the host name is not configured, the host name will default to localhost.localdomain. The kubelet logs will contain the host name of the node.

Jul 15 20:41:18 localhost.localdomain hyperkube[918]: I0715 20:41:18.632600     918 reflector.go:160] Listing and watching *v1.Pod from k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47
Jul 15 20:41:18 localhost.localdomain hyperkube[918]: E0715 20:41:18.634059     918 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: the server could not find the requested resource (get pods)
Jul 15 20:41:18 localhost.localdomain hyperkube[918]: E0715 20:41:18.635446     918 kubelet.go:2274] node "localhost.localdomain" not found
Jul 15 20:41:18 localhost.localdomain hyperkube[918]: E0715 20:41:18.735739     918 kubelet.go:2274] node "localhost.localdomain" not found
  4. Certificates have expired. Check the master kubelet logs for entries such as:
    Jun 26 15:25:32 <node>.<cluster_name>.<base_domain> hyperkube[896]: E0626 15:25:32.806333 896 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://api-int.<cluster_name>.<base_domain>:6443/api/v1/nodes?fieldSelector=metadata.name%3D<node>.<cluster_name>.<base_domain>&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
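For issue 3 above, a node that never received a host name is easy to spot from the node itself. A minimal sketch:

```shell
# A node left at the default name will register (and fail) as
# localhost.localdomain; anything else suggests DHCP or static naming worked.
name=$(uname -n)
if [ "$name" = "localhost.localdomain" ]; then
  echo "host name not set; check DHCP or static network configuration"
else
  echo "host name is set: $name"
fi
```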

6. openshift-install wait-for bootstrap-complete waits for the completion of the bootstrap

In the context of a UPI install, openshift-install largely reports the progress of the installation. The installation actually begins the moment the bootstrap node is ignited.

Common Issues

  1. openshift-install cannot resolve api.<cluster_name>.<base_domain>.
  2. The load balancer is not configured to proxy connections on port 6443 to the bootstrap and master nodes

7. Bootstrap node is torn down

Once the master nodes are installed, the bootstrap node is no longer required. The bootstrap node should be removed from the load balancer.
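As an illustration of that removal, assuming an HAProxy-based load balancer (the backend name is hypothetical; the bootstrap address 10.0.0.50 is a placeholder, while the master addresses follow the example logs above):

```
# Hypothetical HAProxy backend: comment out or delete the bootstrap entry
# once bootstrap-complete is reported, then reload the load balancer.
backend openshift_api
    balance roundrobin
    # server bootstrap 10.0.0.50:6443 check   # remove after bootstrap completes
    server master0 10.0.0.51:6443 check
    server master1 10.0.0.52:6443 check
    server master2 10.0.0.53:6443 check
```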


8. Worker node(s) install RHCOS

Much like the bootstrap and master nodes, the worker nodes install RHCOS (RHEL is an option as well but is not covered in this guide) and load their initial ignition configuration from an accessible HTTP server.

Common Issues

  1. URL of the ignition file is invalid
    If the HTTP server address is valid but the URL returns a 404, the RHCOS installer will not stop. Check the HTTP server logs to confirm that RHCOS successfully retrieved the ignition file.

  2. The HTTP server is not hosting the ignition files
    If the HTTP server is unreachable, RHCOS will fail with a warning message.

  3. The target storage device is incorrect or inaccessible
    RHCOS installation will fail with a warning message.

  4. No IP address has been assigned to the nodes


9. Worker node(s) pull their ignition configuration from the cluster MCO

Ignition is performed in two phases:

  1. The ignition file from the installer is provided to RHCOS via an HTTP GET to an HTTP server hosting the generated ignition files. This ignition file contains a certificate authority and a URL which points to the ignition which will be hosted by the cluster MCO.
  2. After RHCOS installation, the worker node will reboot and begin polling the cluster for the ignition hosted by the cluster MCO.

Common Issues

  1. The MCO endpoint is not accessible to the worker. Check to see if the MCO is available from the system performing the installation.
    curl https://api-int.<cluster_name>.<domain>:22623/config/worker

If it is not, check the same load balancer, DNS, and certificate items described in step 4, substituting /config/worker for /config/master.
