How to debug and troubleshoot the Baremetal IPI (Installer Provisioned Infrastructure) process in Red Hat OpenShift Container Platform 4.x
Environment
Red Hat OpenShift Container Platform 4
Issue
How to debug and troubleshoot the Baremetal IPI (Installer Provisioned Infrastructure) process in Red Hat OpenShift Container Platform 4.x
- Gathering data about the bootstrap process. Issues can vary:
- The bootstrap VM cannot boot up the cluster nodes
- PXE connection issues
- PXE boot not getting a DHCP lease
- etc.
- Troubleshooting and debugging the deployment of worker nodes with BaremetalHost (BMH)
Resolution
Troubleshooting OpenShift baremetal hosts during installation (master nodes)
Gathering the installer logs
When opening a support case for installation issues with OpenShift Installer Provisioned Infrastructure (IPI) deployments, make sure to generate and attach a log-bundle of the bootstrap logs.
On your bastion host run the following command:
./openshift-baremetal-install gather bootstrap --dir=<installation_directory>
A tarball will be generated at <installation_directory>/log-bundle-<timestamp>.tar.gz
It is also helpful to send the installation debug log. If the cluster failed to bootstrap:
./openshift-baremetal-install wait-for bootstrap-complete --log-level=debug
If the cluster completed bootstrapping but failed to finish the control plane and/or failed to add worker nodes:
./openshift-baremetal-install wait-for install-complete --log-level=debug
Debugging the bootstrap VM
It is possible to connect to the bootstrap VM.
On the provisioning host list the libvirt guests:
[m3@rhospbl-9 ~]$ sudo virsh list
Id Name State
----------------------------------------
1 ostest-xz92n-bootstrap running
Connect to the console (you will not be able to log in locally) and press Enter to show the relevant host info:
[m3@rhospbl-9 ~]$ sudo virsh console ostest-xz92n-bootstrap
Connected to domain ostest-xz92n-bootstrap
Escape character is ^] (Ctrl + ])
Red Hat Enterprise Linux CoreOS 47.82.202010211043-0 (Ootpa) 4.7
SSH host key: SHA256:OBGzVDyLKYF2ddoMG92SAI7rfZLLqWs7i4Pe8JN8TpU (ED25519)
SSH host key: SHA256:K7IBf2Ov37bS3HOW4bvEysxvR89HZF8By6TsCbJIIKw (ECDSA)
SSH host key: SHA256:0gD5m3EopQSHOrkNnR6VgssdKvvbcSAcpOBS0woOdmY (RSA)
ens3: fd2e:6f44:5dd8:c956::5
ens4: 172.23.0.2 fe80::1296:143b:1052:abb7
localhost login:
Note: If you do not see an IP address at this stage, there is likely an issue with your DHCP configuration.
Copy the IP (in this example: 172.23.0.2) and ssh to the guest (use the ssh-key pair that was provided in install-config.yaml):
[m3@rhospbl-9 ~]$ ssh core@172.23.0.2
The authenticity of host '172.23.0.2 (172.23.0.2)' can't be established.
ECDSA key fingerprint is SHA256:K7IBf2Ov37bS3HOW4bvEysxvR89HZF8By6TsCbJIIKw.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.23.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.82.202010211043-0
Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html
---
This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.
journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
NetworkManager-wait-online.service
[core@localhost ~]$
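The key pair used for this SSH login is the one supplied in install-config.yaml before the deployment. A minimal, hypothetical fragment for reference (the key value is a placeholder, not from a real cluster):

```yaml
# Hypothetical fragment of install-config.yaml; only the sshKey field is shown.
# The public half of the key pair used above for 'ssh core@<bootstrap-ip>'.
sshKey: 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... user@bastion'
```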
Once on the bootstrap node (become root with sudo -i), one can check the health of the bootstrap services:
[root@localhost ~]# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c3b1a2bd4ad4 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ffd7d3ab4a2e5e93ca6a29f33d6996f24dd653fe9bbb57fb68b13fd97844729 start --tear-down... About a minute ago Up About a minute ago sweet_proskuriakova
58cc232cbebf quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9 41 minutes ago Up 41 minutes ago ironic-api
77ec2c07fffb quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:13699b90c797dfcd1d7d613d09e7ea9d88ac0cbb1f9f6af5cd503c16391b91a2 41 minutes ago Up 41 minutes ago ironic-inspector
dcafd77b205e quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9 41 minutes ago Up 41 minutes ago ironic-conductor
f5428671efa6 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9 45 minutes ago Up 45 minutes ago httpd
7b6aa07b05d0 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9 45 minutes ago Up 45 minutes ago mariadb
a9d52f15a9e6 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3054b63cf3353f5aaf4aaec821a37109b6558174b299265fd0f86bcb3b185a9 45 minutes ago Up 45 minutes ago dnsmasq
Gathering the ironic services logs
Review the individual ironic container logs to identify any issues:
[root@localhost ~]# podman logs -f ironic-api
Waiting for ens4 interface to be configured
# Options set from Environment variables
OS_GIT_MINOR=7
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=7d37c1a
OS_GIT_VERSION=4.7.0-202010192239.p0-7d37c1a
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
2020-11-06 00:24:58.526 1 DEBUG oslo_db.api [-] Loading backend 'sqlalchemy' from 'ironic.db.sqlalchemy.api' _load_backend /usr/lib/python3.6/site-packages/oslo_db/api.py:261
2020-11-06 00:24:58.853 1 DEBUG ironic.cmd.api [-] Guru meditation reporting is disabled because oslo.reports is not installed main /usr/lib/python3.6/site-packages/ironic/cmd/api.py:48
2020-11-06 00:24:58.854 1 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:266
2020-11-06 00:24:58.855 1 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:282
2020-11-06 00:25:00.631 1 INFO oslo.service.wsgi [-] ironic_api listening on :::6385
(...)
The same method is used to collect logs for all 3 ironic services: ironic-api, ironic-inspector, ironic-conductor.
Gathering ironic db data
Get the randomly generated mariadb password:
[root@localhost ~]# export $(podman inspect -f "{{index .Config.Env 15}}" mariadb) && echo $MARIADB_PASSWORD
be4176e122b04006a7f68edf2c6c620c
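The hard-coded environment index 15 in podman inspect is fragile across image versions. An alternative sketch is to grep the variable by name; here a captured sample stands in for the real output of podman exec mariadb env on the bootstrap VM:

```shell
# On the bootstrap VM you would pipe the live command instead:
#   podman exec mariadb env
# A captured sample stands in for that output here:
env_output='PATH=/usr/local/sbin:/usr/sbin
MARIADB_PASSWORD=be4176e122b04006a7f68edf2c6c620c
HOME=/root'

# Pick the password out by variable name rather than by position.
MARIADB_PASSWORD=$(printf '%s\n' "$env_output" | awk -F= '/^MARIADB_PASSWORD=/{print $2}')
echo "$MARIADB_PASSWORD"
```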
From here, one can query any information needed:
[root@localhost ~]# podman exec -ti mariadb mysql -u ironic --password=$MARIADB_PASSWORD ironic -e 'show tables;'
+-------------------------------+
| Tables_in_ironic |
+-------------------------------+
| alembic_version |
| allocations |
| bios_settings |
| chassis |
| conductor_hardware_interfaces |
| conductors |
| deploy_template_steps |
| deploy_templates |
| node_tags |
| node_traits |
| nodes |
| portgroups |
| ports |
| volume_connectors |
| volume_targets |
+-------------------------------+
The following example shows the first node's record:
[root@localhost ~]# podman exec -ti mariadb mysql -u ironic --password=$MARIADB_PASSWORD ironic -e 'select * from nodes limit 1\G'
*************************** 1. row ***************************
created_at: 2020-11-06 00:25:33
updated_at: 2020-11-06 01:22:47
id: 1
uuid: dee7881c-0e4f-4353-8ca8-05f64cba1200
instance_uuid: NULL
chassis_id: NULL
power_state: NULL
target_power_state: NULL
provision_state: enroll
target_provision_state: NULL
last_error: Failed to get power state for node dee7881c-0e4f-4353-8ca8-05f64cba1200. Error: HTTP GET http://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/eb638411-9e2e-488c-8bbe-53831b58831b returned code 500. Base.1.0.GeneralError: Error finding domain by name/UUID "eb638411-9e2e-488c-8bbe-53831b58831b" at libvirt URI qemu+ssh://root@[fd2e:6f44:5dd8:c956::1]/system?&keyfile=/root/ssh/id_rsa_virt_power&no_verify=1&no_tty=1": Domain not found: no domain with matching name 'eb638411-9e2e-488c-8bbe-53831b58831b'
properties: {"capabilities": "boot_mode:uefi", "cpu_arch": "x86_64", "local_gb": "50", "root_device": {"name": "s== /dev/sda"}}
driver: redfish
driver_info: {"deploy_kernel": "http://172.23.0.2:80/images/ironic-python-agent.kernel", "deploy_ramdisk": "http://172.23.0.2:80/images/ironic-python-agent.initramfs", "redfish_address": "http://[fd2e:6f44:5dd8:c956::1]:8000", "redfish_password": "password", "redfish_system_id": "/redfish/v1/Systems/eb638411-9e2e-488c-8bbe-53831b58831b", "redfish_username": "admin"}
reservation: NULL
maintenance: 0
extra: {}
provision_updated_at: 2020-11-06 01:22:47
console_enabled: 0
instance_info: {}
conductor_affinity: NULL
maintenance_reason: NULL
driver_internal_info: {}
name: ostest-master-2
inspection_started_at: NULL
inspection_finished_at: NULL
clean_step: {}
raid_config: {}
target_raid_config: {}
network_interface: noop
resource_class: baremetal
boot_interface: ipxe
console_interface: no-console
deploy_interface: direct
inspect_interface: inspector
management_interface: redfish
power_interface: redfish
raid_interface: no-raid
vendor_interface: no-vendor
storage_interface: noop
version: 1.35
rescue_interface: no-rescue
bios_interface: no-bios
fault: NULL
deploy_step: {}
conductor_group:
automated_clean: NULL
protected: 0
protected_reason: NULL
owner: NULL
allocation_id: NULL
description: NULL
retired: 0
retired_reason: NULL
lessee: NULL
network_data: {}
Running ironic CLI for further troubleshooting
It is possible to interact directly with the ironic API. To do so, use the existing clouds.yaml file (OCP 4.7 and later) or generate a valid clouds.yaml file (pre-4.7). Then, use the openstack baremetal CLI to communicate with the baremetal API.
The following commands are to be run on the bootstrap node.
Generating clouds.yaml
Create clouds.yaml for pre-OCP 4.7
Create a clouds.yaml file (the openstack CLI uses it to connect to the correct endpoints; see the OpenStack documentation for further information) containing the bootstrapProvisioningIp, on the bootstrap VM. For example, if the bootstrap IP is 192.168.123.25:
[root@localhost ~]# ip a ls dev ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:06:cc:28 brd ff:ff:ff:ff:ff:ff
inet 192.168.123.227/24 brd 192.168.123.255 scope global dynamic noprefixroute ens3
valid_lft 81708sec preferred_lft 81708sec
inet 192.168.123.20/32 scope global ens3
valid_lft forever preferred_lft forever
inet 192.168.123.25/24 brd 192.168.123.255 scope global secondary noprefixroute ens3
valid_lft forever preferred_lft forever
inet6 fe80::1d56:df92:349c:d801/64 scope link noprefixroute
valid_lft forever preferred_lft forever
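In the output above, the bootstrapProvisioningIp appears as a secondary address on the provisioning interface. As a sketch, secondary IPv4 addresses can be filtered out of saved ip a output (a captured sample stands in for the live command):

```shell
# Saved output of 'ip a ls dev ens3'; on the bootstrap VM pipe the
# live command instead of using a capture.
cat > /tmp/ip-a.txt <<'EOF'
    inet 192.168.123.227/24 brd 192.168.123.255 scope global dynamic noprefixroute ens3
    inet 192.168.123.25/24 brd 192.168.123.255 scope global secondary noprefixroute ens3
EOF

# Print secondary IPv4 addresses without the prefix length.
awk '/inet .*secondary/ {sub(/\/.*/, "", $2); print $2}' /tmp/ip-a.txt
```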
mkdir -p ~/.config/openstack
cat <<'EOF' > ~/.config/openstack/clouds.yaml
clouds:
metal3-bootstrap:
auth_type: none
baremetal_endpoint_override: http://192.168.123.25:6385
baremetal_introspection_endpoint_override: http://192.168.123.25:5050
EOF
Note: We create the file under the user's config directory, ~/.config/openstack, one of the default locations the openstack client searches.
Use existing clouds.yaml starting with OCP 4.7
A clouds.yaml will be generated for you on the bootstrap server:
[root@bootstrap ~]# cat /var/opt/metal3/auth/clouds.yaml
clouds:
metal3-bootstrap:
auth_type: http_basic
username: bootstrap-user
password: TlR6ize2yugy8OGx
baremetal_endpoint_override: http://192.168.123.25:6385/v1
baremetal_introspection_endpoint_override: http://192.168.123.25:5050/v1
Copy it into place so the openstack client can find it:
mkdir -p ~/.config/openstack
cp /var/opt/metal3/auth/clouds.yaml ~/.config/openstack/clouds.yaml
Spawn ironic-client container
Run the ironic-client on the bootstrap VM using podman:
cp /var/opt/metal3/auth/clouds.yaml /tmp/clouds.yaml
podman run -ti --rm --entrypoint /bin/bash -v /tmp/clouds.yaml:/clouds.yaml -e OS_CLOUD=metal3-bootstrap quay.io/metal3-io/ironic-client
Once in the container, run the following command to see the status of the nodes on Ironic:
[root@517fdfc6b248 /]# cat /clouds.yaml
clouds:
metal3-bootstrap:
auth_type: none
baremetal_endpoint_override: http://192.168.123.25:6385
baremetal_introspection_endpoint_override: http://192.168.123.25:5050
[root@1facad6bccff /]# baremetal node list
(...)
Note: Use the baremetal (...) command. The openstack baremetal (...) commands will not work. For further details, see: https://bugzilla.redhat.com/show_bug.cgi?id=1937458
Troubleshooting OpenShift baremetal hosts in the main cluster (worker nodes)
Data collection for Red Hat technical support
Source the cluster's credentials (export KUBECONFIG=(... config file ...)). Then, run the following and provide the resulting file openshift-machine-api-resources.txt to Red Hat technical support:
(
oc project openshift-machine-api
echo "*** get ***"
echo "*** oc get clusterversion ***"
oc get clusterversion
echo "*** oc get co ***"
oc get co
echo "*** oc get nodes ***"
oc get nodes
echo "*** oc get machines ***"
oc get machines
echo "*** oc get bmh ***"
oc get bmh
echo "*** oc get provisioning ***"
oc get provisioning
echo "*** oc get all,events -o wide ***"
oc get all,events -o wide
echo "*** get -o yaml ***"
echo "*** oc get nodes -o yaml ***"
oc get nodes -o yaml
echo "*** oc get machines -o yaml ***"
oc get machines -o yaml
echo "*** oc get bmh -o yaml ***"
oc get bmh -o yaml
echo "*** oc get provisioning -o yaml ***"
oc get provisioning -o yaml
echo "*** oc get all,events -o yaml ***"
oc get all,events -o yaml
echo "*** describe ***"
echo "*** oc describe nodes ***"
oc describe nodes
echo "*** oc describe machines ***"
oc describe machines
echo "*** oc describe bmh ***"
oc describe bmh
echo "*** oc describe provisioning ***"
oc describe provisioning
echo "*** oc describe all,events ***"
oc describe all,events
echo "*** logs ***"
oc get pods -o name | while read p ; do echo "*** oc logs $p --all-containers ***" ; oc logs $p --all-containers ; done
) > openshift-machine-api-resources.txt
Troubleshooting via ironic baremetal API
Note: The following examples use a RHEL 8 jumpserver / bastion host from which both the API endpoint and all master nodes can be reached.
Get the cluster's ironic configuration from the openshift-machine-api project.
[root@openshift-jumpserver-0 ~]# oc project openshift-machine-api
Using project "openshift-machine-api" on server "https://api.ipi-cluster.example.com:6443".
[root@openshift-jumpserver-0 ~]# oc get pods
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-5899b88dd7-8csd4 2/2 Running 0 104m
cluster-baremetal-operator-6d59588c55-g8x4p 1/1 Running 0 104m
machine-api-controllers-57d664864d-5xdx4 7/7 Running 1 79m
machine-api-operator-85596b4456-cbz5n 2/2 Running 1 104m
metal3-85cfb7b9c-dhfcb 9/9 Running 0 85m
metal3-image-cache-5zktc 1/1 Running 0 83m
metal3-image-cache-mn5km 1/1 Running 0 83m
metal3-image-cache-n4l5t 1/1 Running 0 83m
Connect to the metal3 pod (modify the suffix accordingly) and inspect the ironic auth-config:
oc rsh metal3-85cfb7b9c-dhfcb
(...)
sh-4.4# cat /auth/ironic/auth-config
[ironic]
auth_type = http_basic
username = ironic-user
password = uwbNiPcBTvCa0NRh
sh-4.4# cat /auth/ironic-inspector/auth-config
[inspector]
auth_type = http_basic
username = inspector-user
password = CE0UzMhDXfjPzxYT
Find the location of the metal3 pod. Here, the pod is on openshift-master-2:
[root@openshift-jumpserver-0 ~]# oc get pods -o wide | grep metal3-85cfb7b9c-kr4qg
metal3-85cfb7b9c-kr4qg 9/9 Running 0 12h 192.168.123.202 openshift-master-2 <none> <none>
Make sure that the master node in question can be reached from the jumpserver / bastion host on ports 6385 and 5050.
telnet openshift-master-2.example.com 6385
telnet openshift-master-2.example.com 5050
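If telnet is not installed, bash's built-in /dev/tcp pseudo-device can probe the same ports. A sketch, reusing the example hostname from above:

```shell
# Probe the ironic (6385) and inspector (5050) ports; a probe succeeds only
# if a TCP connection can be established within the timeout.
host=openshift-master-2.example.com
for port in 6385 5050; do
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port unreachable"
    fi
done
```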
Red Hat OpenShift Container Platform 4.7
Create a file called clouds.yaml and point it to the DNS entry or IP address of the master node in question. In this example, openshift-master-2 can be reached via DNS entry openshift-master-2.example.com.
cat <<'EOF' > clouds.yaml
clouds:
metal3:
auth_type: http_basic
username: ironic-user
password: uwbNiPcBTvCa0NRh
baremetal_endpoint_override: http://openshift-master-2.example.com:6385/v1
baremetal_introspection_endpoint_override: http://openshift-master-2.example.com:5050/v1
EOF
Now, on the jumpserver / bastion host, create a virtualenv and install the OpenStack clients:
yum install python3-virtualenv -y
virtualenv openstack
source openstack/bin/activate
pip install python-openstackclient
pip install python-ironicclient
export OS_CLOUD=metal3
Now, it is possible to query ironic directly via openstack baremetal.
openstack baremetal node list
Example output:
(openstack) [root@openshift-jumpserver-0 ~]# openstack baremetal node list
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| d54dc477-0c0a-476d-bfea-f74c094fb8ed | openshift-worker-0 | dd0e2cd7-5d63-469e-97cc-474df89ec285 | power on | active | False |
| c974b83f-8130-49d6-8abb-ec613bd7fc8a | openshift-worker-1 | 6a335d63-ea7f-4289-a36e-e688e3edb597 | power on | active | False |
| 0aa82bab-df95-4b15-98b9-36d887a3736a | openshift-master-2 | 14a7e120-a08e-4a2f-a3aa-6086de6843de | power on | active | False |
| ebad307b-0255-4b15-a2ef-c23d626b5d94 | openshift-master-0 | 30522611-b050-49fe-abd8-cfdcf7f81b04 | power on | active | False |
| 520d0860-b741-4186-a72c-165c0f2e7d71 | openshift-master-1 | 124fedcc-d131-4c6f-8d96-e644bb4dfa90 | power on | active | False |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
Red Hat OpenShift Container Platform 4.8 and beyond
Create a file called clouds.yaml and point it to the DNS entry or IP address of the master node in question. In this example, openshift-master-2 can be reached via DNS entry openshift-master-2.example.com.
cat <<'EOF' > clouds.yaml
clouds:
metal3:
auth_type: http_basic
username: ironic-user
password: uwbNiPcBTvCa0NRh
baremetal_endpoint_override: https://openshift-master-2.example.com:6385/v1
baremetal_introspection_endpoint_override: https://openshift-master-2.example.com:5050/v1
EOF
While it is possible to use virtualenv and install the needed OpenStack clients with pip just like for 4.7, it is more convenient to simply run a container with the ironic-client container image:
podman run -ti --rm --entrypoint /bin/bash -v /path/to/clouds.yaml:/clouds.yaml:z -e OS_CLOUD=metal3 quay.io/metal3-io/ironic-client
Make sure that clouds.yaml is readable from inside the container. Otherwise, check that SELinux and/or file permissions do not interfere.
Now, it is possible to query ironic directly via the baremetal command.
baremetal --insecure node list
Example output:
(openstack) [root@openshift-jumpserver-0 ~]# baremetal --insecure node list
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| a46bbc88-a695-4bb2-ac2a-29437bc0bdfc | openshift-machine-api~openshift-worker-2 | None | power on | inspect failed | False |
| 4fa29666-c73c-48cf-a1c9-88f71f582c75 | openshift-machine-api~openshift-worker-1 | None | power on | inspect failed | False |
| 710be8cf-0afe-4ce3-a511-d3fb1ba77318 | openshift-machine-api~openshift-master-2 | 63ded5c6-4d05-4522-9348-b162291e5d62 | power on | active | False |
| 6b8590c3-baf4-482e-bc61-512557d95055 | openshift-machine-api~openshift-master-1 | 3890288f-1016-4bdf-8313-71f4655181c5 | power on | active | False |
| 76cc4af7-d8a2-4a43-a915-ba6d97149f99 | openshift-machine-api~openshift-master-0 | 516c1f3f-4af5-45b8-ba82-e6245898ed09 | power on | active | False |
| 7e2219a5-dd7b-4cae-92b1-6a86d745134e | openshift-machine-api~openshift-worker-0 | None | power on | inspect wait | False |
+--------------------------------------+------------------------------------------+--------------------------------------+-------------+--------------------+-------------+
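With many hosts it can help to filter the table for nodes that are not yet active. A sketch using awk over saved output (the sample rows below stand in for a real baremetal node list capture):

```shell
# Two sample rows saved from 'baremetal node list'; replace with a real capture.
cat > /tmp/nodes.txt <<'EOF'
| a46bbc88-a695-4bb2-ac2a-29437bc0bdfc | openshift-machine-api~openshift-worker-2 | None | power on | inspect failed | False |
| 710be8cf-0afe-4ce3-a511-d3fb1ba77318 | openshift-machine-api~openshift-master-2 | 63ded5c6-4d05-4522-9348-b162291e5d62 | power on | active | False |
EOF

# Print name and provisioning state for every node whose state is not 'active'
# (field 3 is the name, field 6 the provisioning state when split on '|').
awk -F'|' 'NF > 1 && $6 !~ /^ *active *$/ {
    gsub(/^ +| +$/, "", $3); gsub(/^ +| +$/, "", $6);
    print $3, "->", $6
}' /tmp/nodes.txt
```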
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.