How to debug and troubleshoot the Baremetal IPI (Installer Provisioned Infrastructure) process in Red Hat OpenShift Container Platform 4.x

Environment

Red Hat OpenShift Container Platform 4

Issue

How to debug and troubleshoot the Baremetal IPI (Installer Provisioned Infrastructure) process in Red Hat OpenShift Container Platform 4.x

  • Gathering data on the bootstrap process, where issues can vary:
    • The bootstrap VM cannot boot the cluster nodes
    • PXE connection issues
    • PXE boot not getting a DHCP lease
    • etc.
  • Troubleshooting and debugging the deployment of worker nodes with BaremetalHost (BMH)

Resolution

Troubleshooting OpenShift baremetal hosts during installation (master nodes)

Gathering the installer logs

When opening a support case for installation issues with OpenShift Installer Provisioned Infrastructure (IPI) deployments, make sure to generate and attach a log-bundle of the bootstrap logs.

On your bastion host run the following command:

./openshift-baremetal-install gather bootstrap --dir=<installation_directory> 

A tarball will be generated at <installation_directory>/log-bundle-<timestamp>.tar.gz
It is also helpful to attach the installation debug log. If the cluster failed to bootstrap:

./openshift-baremetal-install wait-for bootstrap-complete --log-level=debug

If the cluster completed bootstrapping but failed to finish the control plane and/or failed to add worker nodes:

./openshift-baremetal-install wait-for install-complete --log-level=debug
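
After a failed run, the newest log bundle under the installation directory is the artifact to attach to the support case. A minimal sketch for locating and listing it, assuming INSTALL_DIR stands in for the value you passed to --dir:

```shell
#!/usr/bin/env bash
# Locate the most recent log bundle produced by "gather bootstrap" and list
# its first few entries. INSTALL_DIR is a placeholder for your --dir value.
INSTALL_DIR=${INSTALL_DIR:-.}
bundle=$(ls -1t "$INSTALL_DIR"/log-bundle-*.tar.gz 2>/dev/null | head -n1)
if [ -n "$bundle" ]; then
  echo "newest bundle: $bundle"
  tar -tzf "$bundle" | head
else
  echo "no log bundle found under $INSTALL_DIR"
fi
```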

Debugging the bootstrap VM

It is possible to connect to the bootstrap VM.

On the provisioning host, list the libvirt guests:

[m3@rhospbl-9 ~]$ sudo virsh list
 Id   Name                     State
----------------------------------------
 1    ostest-xz92n-bootstrap   running

Connect to the console of the bootstrap VM (you will not be able to log in locally); press Enter to show the relevant host information:

[m3@rhospbl-9 ~]$ sudo virsh console ostest-xz92n-bootstrap
Connected to domain ostest-xz92n-bootstrap
Escape character is ^] (Ctrl + ])

Red Hat Enterprise Linux CoreOS 47.82.202010211043-0 (Ootpa) 4.7
SSH host key: SHA256:OBGzVDyLKYF2ddoMG92SAI7rfZLLqWs7i4Pe8JN8TpU (ED25519)
SSH host key: SHA256:K7IBf2Ov37bS3HOW4bvEysxvR89HZF8By6TsCbJIIKw (ECDSA)
SSH host key: SHA256:0gD5m3EopQSHOrkNnR6VgssdKvvbcSAcpOBS0woOdmY (RSA)
ens3:  fd2e:6f44:5dd8:c956::5
ens4: 172.23.0.2 fe80::1296:143b:1052:abb7
localhost login: 

Note: If you do not see an IP address at this stage, there is likely an issue with your DHCP configuration.

Copy the IP (in this example: 172.23.0.2) and ssh to the guest (use the ssh-key pair that was provided in install-config.yaml):

[m3@rhospbl-9 ~]$ ssh core@172.23.0.2
The authenticity of host '172.23.0.2 (172.23.0.2)' can't be established.
ECDSA key fingerprint is SHA256:K7IBf2Ov37bS3HOW4bvEysxvR89HZF8By6TsCbJIIKw.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.23.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.82.202010211043-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[core@localhost ~]$

Once on the bootstrap node (become root with sudo -i), one can check the health of the bootstrap services:

[root@localhost ~]# podman ps
CONTAINER ID  IMAGE                                                                                                                   COMMAND               CREATED             STATUS                 PORTS  NAMES
c3b1a2bd4ad4  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ffd7d3ab4a2e5e93ca6a29f33d6996f24dd653fe9bbb57fb68b13fd97844729  start --tear-down...  About a minute ago  Up About a minute ago         sweet_proskuriakova
58cc232cbebf  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9                        41 minutes ago      Up 41 minutes ago             ironic-api
77ec2c07fffb  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:13699b90c797dfcd1d7d613d09e7ea9d88ac0cbb1f9f6af5cd503c16391b91a2                        41 minutes ago      Up 41 minutes ago             ironic-inspector
dcafd77b205e  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9                        41 minutes ago      Up 41 minutes ago             ironic-conductor
f5428671efa6  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9                        45 minutes ago      Up 45 minutes ago             httpd
7b6aa07b05d0  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9                        45 minutes ago      Up 45 minutes ago             mariadb
a9d52f15a9e6  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9                        45 minutes ago      Up 45 minutes ago             dnsmasq

Gathering the ironic service logs

Verify the various ironic container logs to identify any issues:

[root@localhost ~]# podman logs -f ironic-api
Waiting for ens4 interface to be configured
# Options set from Environment variables
OS_GIT_MINOR=7
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=7d37c1a
OS_GIT_VERSION=4.7.0-202010192239.p0-7d37c1a
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
2020-11-06 00:24:58.526 1 DEBUG oslo_db.api [-] Loading backend 'sqlalchemy' from 'ironic.db.sqlalchemy.api' _load_backend /usr/lib/python3.6/site-packages/oslo_db/api.py:261
2020-11-06 00:24:58.853 1 DEBUG ironic.cmd.api [-] Guru meditation reporting is disabled because oslo.reports is not installed main /usr/lib/python3.6/site-packages/ironic/cmd/api.py:48
2020-11-06 00:24:58.854 1 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:266
2020-11-06 00:24:58.855 1 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:282
2020-11-06 00:25:00.631 1 INFO oslo.service.wsgi [-] ironic_api listening on :::6385
(...)

The same method is used to collect the logs of all three ironic services: ironic-api, ironic-inspector, and ironic-conductor.
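
To capture all three service logs in one pass, a small loop works; a sketch to run on the bootstrap VM (podman and the containers exist only there, so the loop tolerates errors elsewhere):

```shell
# Collect the logs of all three ironic services into /tmp in one pass.
# Run this on the bootstrap VM; "|| true" keeps the loop going if a
# container (or podman itself) happens to be missing.
for svc in ironic-api ironic-inspector ironic-conductor; do
  podman logs "$svc" > "/tmp/${svc}.log" 2>&1 || true
done
ls -l /tmp/ironic-*.log
```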

Gathering ironic db data

Get the randomly generated mariadb password:

[root@localhost ~]# export $(podman inspect -f "{{index .Config.Env 15}}" mariadb) && echo $MARIADB_PASSWORD
be4176e122b04006a7f68edf2c6c620c
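
Note that the index 15 into .Config.Env is position-dependent and may shift between releases. Filtering the environment by variable name is more robust; a sketch, using a sample env list standing in for the output of podman exec mariadb env:

```shell
# Sample environment listing standing in for: podman exec mariadb env
env_output='PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
MARIADB_PASSWORD=be4176e122b04006a7f68edf2c6c620c
HOSTNAME=localhost'

# Select the value by key rather than by position; substr() keeps any "="
# characters that may appear inside the value itself.
MARIADB_PASSWORD=$(printf '%s\n' "$env_output" \
  | awk -F= '$1 == "MARIADB_PASSWORD" {print substr($0, index($0, "=") + 1)}')
echo "$MARIADB_PASSWORD"   # be4176e122b04006a7f68edf2c6c620c
```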

From here, one can dump any information needed:

[root@localhost ~]# podman exec -ti mariadb mysql -u ironic --password=$MARIADB_PASSWORD ironic -e 'show tables;'
+-------------------------------+
| Tables_in_ironic              |
+-------------------------------+
| alembic_version               |
| allocations                   |
| bios_settings                 |
| chassis                       |
| conductor_hardware_interfaces |
| conductors                    |
| deploy_template_steps         |
| deploy_templates              |
| node_tags                     |
| node_traits                   |
| nodes                         |
| portgroups                    |
| ports                         |
| volume_connectors             |
| volume_targets                |
+-------------------------------+

The following example shows the first node's info:

[root@localhost ~]# podman exec -ti mariadb mysql -u ironic --password=$MARIADB_PASSWORD ironic -e 'select * from nodes limit 1\G'
*************************** 1. row ***************************
            created_at: 2020-11-06 00:25:33
            updated_at: 2020-11-06 01:22:47
                    id: 1
                  uuid: dee7881c-0e4f-4353-8ca8-05f64cba1200
         instance_uuid: NULL
            chassis_id: NULL
           power_state: NULL
    target_power_state: NULL
       provision_state: enroll
target_provision_state: NULL
            last_error: Failed to get power state for node dee7881c-0e4f-4353-8ca8-05f64cba1200. Error: HTTP GET http://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/eb638411-9e2e-488c-8bbe-53831b58831b returned code 500. Base.1.0.GeneralError: Error finding domain by name/UUID "eb638411-9e2e-488c-8bbe-53831b58831b" at libvirt URI qemu+ssh://root@[fd2e:6f44:5dd8:c956::1]/system?&keyfile=/root/ssh/id_rsa_virt_power&no_verify=1&no_tty=1": Domain not found: no domain with matching name 'eb638411-9e2e-488c-8bbe-53831b58831b'
            properties: {"capabilities": "boot_mode:uefi", "cpu_arch": "x86_64", "local_gb": "50", "root_device": {"name": "s== /dev/sda"}}
                driver: redfish
           driver_info: {"deploy_kernel": "http://172.23.0.2:80/images/ironic-python-agent.kernel", "deploy_ramdisk": "http://172.23.0.2:80/images/ironic-python-agent.initramfs", "redfish_address": "http://[fd2e:6f44:5dd8:c956::1]:8000", "redfish_password": "password", "redfish_system_id": "/redfish/v1/Systems/eb638411-9e2e-488c-8bbe-53831b58831b", "redfish_username": "admin"}
           reservation: NULL
           maintenance: 0
                 extra: {}
  provision_updated_at: 2020-11-06 01:22:47
       console_enabled: 0
         instance_info: {}
    conductor_affinity: NULL
    maintenance_reason: NULL
  driver_internal_info: {}
                  name: ostest-master-2
 inspection_started_at: NULL
inspection_finished_at: NULL
            clean_step: {}
           raid_config: {}
    target_raid_config: {}
     network_interface: noop
        resource_class: baremetal
        boot_interface: ipxe
     console_interface: no-console
      deploy_interface: direct
     inspect_interface: inspector
  management_interface: redfish
       power_interface: redfish
        raid_interface: no-raid
      vendor_interface: no-vendor
     storage_interface: noop
               version: 1.35
      rescue_interface: no-rescue
        bios_interface: no-bios
                 fault: NULL
           deploy_step: {}
       conductor_group: 
       automated_clean: NULL
             protected: 0
      protected_reason: NULL
                 owner: NULL
         allocation_id: NULL
           description: NULL
               retired: 0
        retired_reason: NULL
                lessee: NULL
          network_data: {}
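
When dumping many nodes, the fields that usually explain a stuck deployment are name, provision_state, and last_error, so filtering a saved dump keeps the output readable. A sketch, where nodes.txt is a placeholder filename for the redirected output of the query above:

```shell
# Filter a saved "select * from nodes ... \G" dump down to the fields that
# usually matter. nodes.txt is a placeholder filename for redirected output.
grep -E '^ *(name|provision_state|last_error):' nodes.txt 2>/dev/null || true
```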

Running ironic CLI for further troubleshooting

It is possible to interact directly with the ironic API. To do so, use the existing clouds.yaml file (OCP 4.7 and later) or generate a valid clouds.yaml file (pre-4.7). Then, use the openstack baremetal CLI to talk to the baremetal API.
The following commands are to be run on the bootstrap node.

Generating clouds.yaml

Create clouds.yaml for pre-OCP 4.7

Create a clouds.yaml file (the openstack CLI uses it to connect to the correct endpoints; see the OpenStack documentation for further information) containing the bootstrapProvisioningIp on the bootstrap VM. For example, if the bootstrap IP is 192.168.123.25:

[root@localhost ~]# ip a ls dev ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:06:cc:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.227/24 brd 192.168.123.255 scope global dynamic noprefixroute ens3
       valid_lft 81708sec preferred_lft 81708sec
    inet 192.168.123.20/32 scope global ens3
       valid_lft forever preferred_lft forever
    inet 192.168.123.25/24 brd 192.168.123.255 scope global secondary noprefixroute ens3
       valid_lft forever preferred_lft forever
    inet6 fe80::1d56:df92:349c:d801/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

mkdir -p ~/.config/openstack
cat <<'EOF' > ~/.config/openstack/clouds.yaml
clouds:
  metal3-bootstrap:
    auth_type: none
    baremetal_endpoint_override: http://192.168.123.25:6385
    baremetal_introspection_endpoint_override: http://192.168.123.25:5050
EOF

Note: The file is created as clouds.yaml under the user's configuration directory, ~/.config/openstack.

Use existing clouds.yaml starting with OCP 4.7

A clouds.yaml will be generated for you on the bootstrap server:

[root@bootstrap ~]# cat /var/opt/metal3/auth/clouds.yaml
clouds:
  metal3-bootstrap:
    auth_type: http_basic
    username: bootstrap-user
    password: TlR6ize2yugy8OGx
    baremetal_endpoint_override: http://192.168.123.25:6385/v1
    baremetal_introspection_endpoint_override: http://192.168.123.25:5050/v1

Copy the file into place so the client can find it:

mkdir -p ~/.config/openstack
cp /var/opt/metal3/auth/clouds.yaml ~/.config/openstack/clouds.yaml

Spawn ironic-client container

Run the ironic-client on the bootstrap VM using podman:

cp /var/opt/metal3/auth/clouds.yaml /tmp/clouds.yaml
podman run -ti --rm --entrypoint /bin/bash -v /tmp/clouds.yaml:/clouds.yaml -e OS_CLOUD=metal3-bootstrap quay.io/metal3-io/ironic-client

Once in the container, verify the clouds.yaml content and run the following command to see the status of the nodes in Ironic:

[root@517fdfc6b248 /]# cat /clouds.yaml 
clouds:
  metal3-bootstrap:
    auth_type: none
    baremetal_endpoint_override: http://192.168.123.25:6385
    baremetal_introspection_endpoint_override: http://192.168.123.25:5050

[root@1facad6bccff /]# baremetal node list
(...)

Note: Use the baremetal (...) command; the openstack baremetal (...) commands will not work. For further details, see: https://bugzilla.redhat.com/show_bug.cgi?id=1937458

Troubleshooting OpenShift baremetal hosts in the main cluster (worker nodes)

Data collection for Red Hat technical support

Source the cluster's credentials (export KUBECONFIG=(... config file ...)). Then, run the script below and provide the resulting file openshift-machine-api-resources.txt to Red Hat technical support:

(
oc project openshift-machine-api

echo "*** get ***"
echo "*** oc get clusterversion ***"
oc get clusterversion
echo "*** oc get co ***"
oc get co
echo "*** oc get nodes ***"
oc get nodes
echo "*** oc get machines ***"
oc get machines
echo "*** oc get bmh ***"
oc get bmh
echo "*** oc get provisioning ***"
oc get provisioning
echo "*** oc get all,events -o wide ***"
oc get all,events -o wide

echo "*** get -o yaml ***"
echo "*** oc get nodes -o yaml ***"
oc get nodes -o yaml
echo "*** oc get machines -o yaml ***"
oc get machines -o yaml
echo "*** oc get bmh -o yaml ***"
oc get bmh -o yaml
echo "*** oc get provisioning -o yaml ***"
oc get provisioning -o yaml
echo "*** oc get all,events -o yaml ***"
oc get all,events -o yaml

echo "*** describe ***"
echo "*** oc describe nodes ***"
oc describe nodes
echo "*** oc describe machines ***"
oc describe machines
echo "*** oc describe bmh ***"
oc describe bmh
echo "*** oc describe provisioning ***"
oc describe provisioning
echo "*** oc describe all,events ***"
oc describe all,events

echo "*** logs ***"
oc get pods -o name | while read p ; do echo "*** oc logs $p --all-containers ***" ;  oc logs $p --all-containers  ; done
) >  openshift-machine-api-resources.txt

Troubleshooting via ironic baremetal API

Note: The following examples use a RHEL 8 jumpserver / bastion host from where both the API endpoint as well as all master nodes can be reached.

Get the cluster's ironic configuration from the openshift-machine-api project.

[root@openshift-jumpserver-0 ~]#  oc project openshift-machine-api
Using project "openshift-machine-api" on server "https://api.ipi-cluster.example.com:6443".
[root@openshift-jumpserver-0 ~]# oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5899b88dd7-8csd4   2/2     Running   0          104m
cluster-baremetal-operator-6d59588c55-g8x4p    1/1     Running   0          104m
machine-api-controllers-57d664864d-5xdx4       7/7     Running   1          79m
machine-api-operator-85596b4456-cbz5n          2/2     Running   1          104m
metal3-85cfb7b9c-dhfcb                         9/9     Running   0          85m
metal3-image-cache-5zktc                       1/1     Running   0          83m
metal3-image-cache-mn5km                       1/1     Running   0          83m
metal3-image-cache-n4l5t                       1/1     Running   0          83m

Connect to the metal3 pod (modify the suffix accordingly) and inspect the ironic auth-config:

oc rsh metal3-85cfb7b9c-dhfcb  
(...)
sh-4.4# cat /auth/ironic/auth-config 
[ironic]
auth_type = http_basic
username = ironic-user
password = uwbNiPcBTvCa0NRh
sh-4.4# cat /auth/ironic-inspector/auth-config 
[inspector]
auth_type = http_basic
username = inspector-user
password = CE0UzMhDXfjPzxYT

Find the location of the metal3 pod. Here, the pod is on openshift-master-2:

[root@openshift-jumpserver-0 ~]# oc get pods -o wide | grep metal3-85cfb7b9c-kr4qg
metal3-85cfb7b9c-kr4qg                         9/9     Running   0          12h   192.168.123.202   openshift-master-2   <none>           <none>

Make sure that the master node in question can be reached from the jumpserver / bastion host on ports 6385 and 5050.

telnet openshift-master-2.example.com 6385
telnet openshift-master-2.example.com 5050
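
If telnet is not available on the jumpserver, bash's built-in /dev/tcp redirection performs the same reachability check; a sketch using the example's hostnames (adjust to your cluster):

```shell
# TCP reachability check without telnet, using bash's /dev/tcp pseudo-device.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} unreachable"
  fi
}
check_port openshift-master-2.example.com 6385
check_port openshift-master-2.example.com 5050
```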

Red Hat OpenShift Container Platform 4.7

Create a file called clouds.yaml and point it to the DNS entry or IP address of the master node in question. In this example, openshift-master-2 can be reached via DNS entry openshift-master-2.example.com.

cat <<'EOF' > clouds.yaml
clouds:
  metal3:
    auth_type: http_basic
    username: ironic-user
    password: uwbNiPcBTvCa0NRh
    baremetal_endpoint_override: http://openshift-master-2.example.com:6385/v1
    baremetal_introspection_endpoint_override: http://openshift-master-2.example.com:5050/v1
EOF

Now, on the jumpserver / bastion host, create a virtualenv and install the OpenStack clients:

yum install python3-virtualenv -y
virtualenv openstack
source openstack/bin/activate
pip install python-openstackclient
pip install python-ironicclient
export OS_CLOUD=metal3

Now, it is possible to query ironic directly via openstack baremetal.

openstack baremetal node list

Example output:

(openstack) [root@openshift-jumpserver-0 ~]# openstack baremetal node list
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name               | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| d54dc477-0c0a-476d-bfea-f74c094fb8ed | openshift-worker-0 | dd0e2cd7-5d63-469e-97cc-474df89ec285 | power on    | active             | False       |
| c974b83f-8130-49d6-8abb-ec613bd7fc8a | openshift-worker-1 | 6a335d63-ea7f-4289-a36e-e688e3edb597 | power on    | active             | False       |
| 0aa82bab-df95-4b15-98b9-36d887a3736a | openshift-master-2 | 14a7e120-a08e-4a2f-a3aa-6086de6843de | power on    | active             | False       |
| ebad307b-0255-4b15-a2ef-c23d626b5d94 | openshift-master-0 | 30522611-b050-49fe-abd8-cfdcf7f81b04 | power on    | active             | False       |
| 520d0860-b741-4186-a72c-165c0f2e7d71 | openshift-master-1 | 124fedcc-d131-4c6f-8d96-e644bb4dfa90 | power on    | active             | False       |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+

Red Hat OpenShift Container Platform 4.8 and beyond

Create a file called clouds.yaml and point it to the DNS entry or IP address of the master node in question. In this example, openshift-master-2 can be reached via DNS entry openshift-master-2.example.com.

cat <<'EOF' > clouds.yaml
clouds:
  metal3:
    auth_type: http_basic
    username: ironic-user
    password: uwbNiPcBTvCa0NRh
    baremetal_endpoint_override: https://openshift-master-2.example.com:6385/v1
    baremetal_introspection_endpoint_override: https://openshift-master-2.example.com:5050/v1
EOF

While it is possible to use virtualenv and install the needed OpenStack clients with pip just as for 4.7, it is more convenient to simply run the ironic-client container image:

podman run -ti --rm --entrypoint /bin/bash -v /path/to/clouds.yaml:/clouds.yaml:z -e OS_CLOUD=metal3 quay.io/metal3-io/ironic-client

Make sure that clouds.yaml is readable from inside the container; if it is not, check that SELinux labels and file permissions are not interfering.
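
A quick pre-flight check before starting the container can save a round of debugging; this sketch uses the same placeholder path as the podman command above:

```shell
# Verify the clouds.yaml that will be bind-mounted into the container is
# actually readable. CLOUDS is a placeholder for your /path/to/clouds.yaml.
CLOUDS=${CLOUDS:-/path/to/clouds.yaml}
if [ -r "$CLOUDS" ]; then
  echo "readable: $CLOUDS (mode $(stat -c '%a' "$CLOUDS"))"
  ls -Z "$CLOUDS" 2>/dev/null || true  # SELinux label; the :z mount option relabels it
else
  echo "missing or unreadable: $CLOUDS"
fi
```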

Now, it is possible to query ironic directly via the baremetal command.

baremetal --insecure node list

Example output:

(openstack) [root@openshift-jumpserver-0 ~]# baremetal --insecure node list
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name                                     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| a46bbc88-a695-4bb2-ac2a-29437bc0bdfc | openshift-machine-api~openshift-worker-2 | None                                 | power on    | inspect failed     | False       |
| 4fa29666-c73c-48cf-a1c9-88f71f582c75 | openshift-machine-api~openshift-worker-1 | None                                 | power on    | inspect failed     | False       |
| 710be8cf-0afe-4ce3-a511-d3fb1ba77318 | openshift-machine-api~openshift-master-2 | 63ded5c6-4d05-4522-9348-b162291e5d62 | power on    | active             | False       |
| 6b8590c3-baf4-482e-bc61-512557d95055 | openshift-machine-api~openshift-master-1 | 3890288f-1016-4bdf-8313-71f4655181c5 | power on    | active             | False       |
| 76cc4af7-d8a2-4a43-a915-ba6d97149f99 | openshift-machine-api~openshift-master-0 | 516c1f3f-4af5-45b8-ba82-e6245898ed09 | power on    | active             | False       |
| 7e2219a5-dd7b-4cae-92b1-6a86d745134e | openshift-machine-api~openshift-worker-0 | None                                 | power on    | inspect wait       | False       |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.