Host going non operational with CPU compatibility error while upgrading hosted engine to 4.4 from 4.3
Environment
- Red Hat Virtualization 4.4.
- Red Hat Enterprise Linux 8.3
Issue
- While upgrading the hosted engine environment to 4.4 from 4.3, the host goes into non operational status with the error below.
2020-12-16 22:38:16,591Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92) [205acb5b] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host test.example.com moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : model_Skylake-Server
- In 4.3, the cluster had the CPU family Intel Skylake Server IBRS SSBD MDS.
- The node where the hosted-engine deployment script was executed is a 4.4.3 host.
Resolution
- A temporary workaround is to re-enable TSX on the RHEL 8.3 host. Once the upgrade or installation is complete, revert to disabling TSX and use Secure CPU types.
1. Enable TSX on the RHEL 8.3 hypervisor:
# grubby --update-kernel=ALL --args="tsx=on"
2. Clear the libvirt qemu capabilities cache. This avoids private Bug 1953389 - libvirt qemu capabilities cache not invalidated after TSX enable/disable:
# rm /var/cache/libvirt/qemu/capabilities/*.xml
3. Depending on the CPU model, it might be necessary to downgrade the microcode package to a version that does not disable TSX. If that is the case, lock the package version and rebuild the initramfs:
# dnf downgrade microcode_ctl-20191115-4.el8.x86_64
# dnf install 'dnf-command(versionlock)'
# dnf versionlock add microcode_ctl
# dracut -f
4. Reboot the hypervisor and try again.
5. Once the upgrade or deployment is finished, switch the Cluster CPU type to a Secure variant. Please note that even the non-Secure cluster CPU models disable TSX.
6. Power cycle all VMs so they stop using TSX and then finally revert to disabling TSX on all hypervisors.
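The final revert in step 6 is not spelled out above. The following is a minimal sketch of what it could look like on one hypervisor, assuming steps 1 to 3 were applied exactly as written. The helper names `run` and `revert_tsx_workaround` and the variables `DRY_RUN` and `MICROCODE_LOCKED` are hypothetical and only used for illustration. By default the commands are printed rather than executed, so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: undo the temporary TSX workaround on one hypervisor.
# Assumption: steps 1-3 of this article were applied as written.

# run CMD...: print the command when DRY_RUN=1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

revert_tsx_workaround() {
    # Remove the kernel argument added in step 1, so the RHEL 8.3 default
    # (TSX disabled) applies again after the next reboot.
    run grubby --update-kernel=ALL --remove-args=tsx=on

    # Clear the libvirt qemu capabilities cache again, as in step 2.
    run rm -f /var/cache/libvirt/qemu/capabilities/*.xml

    # Only needed if microcode_ctl was downgraded and locked in step 3.
    if [ "${MICROCODE_LOCKED:-0}" = "1" ]; then
        run dnf versionlock delete microcode_ctl
        run dnf upgrade -y microcode_ctl
        run dracut -f
    fi
}

# Review the commands first:
#   revert_tsx_workaround
# Then, as root, on a host where step 3 was applied:
#   DRY_RUN=0 MICROCODE_LOCKED=1 revert_tsx_workaround
```

A reboot of the hypervisor is still required afterwards for the kernel argument and microcode changes to take effect.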
Root Cause
- Starting from RHEL 8.3, the default is to disable TSX. This means the qemu CPU variants with TSX (these are the ones without 'noTSX' suffix) are not usable.
- When a 4.5 compatible host is used for the upgrade, the deployment script adds this host to a 4.3 cluster. The engine checks for model_Skylake-Server in the CPU for the 4.3 cluster, but the 8.3 host's libvirtd only returns Skylake-Server-noTSX, so the host goes into non operational status.
- On an 8.3 host, Skylake-Server is marked as usable=no, so the check in the engine fails.
- The CPU microcode provided in RHEL >= 8.3 disables TSX.
- The issue is reported in bug 1905158.
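To illustrate why a host that reports only the noTSX variant fails the check for model_Skylake-Server, here is a minimal sketch of an exact-match lookup in a flag list. It is not the engine's actual implementation. It assumes the host's CPU flags are available as one comma-separated string, and the function name `has_cpu_flag` and the sample flag string in the comments are made up for illustration.

```shell
# has_cpu_flag FLAGS WANTED
# Exit 0 if WANTED is an exact entry of the comma-separated FLAGS string,
# exit 1 otherwise. Wrapping both sides in commas prevents a substring hit,
# so model_Skylake-Server does not match model_Skylake-Server-noTSX.
has_cpu_flag() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical flag string for a RHEL 8.3 host with TSX disabled:
#   flags='fpu,vme,model_Skylake-Server-noTSX,model_Skylake-Server-noTSX-IBRS'
#   has_cpu_flag "$flags" model_Skylake-Server || echo "missing model_Skylake-Server"
```

The point of the exact match is that the noTSX model names contain the plain model name as a prefix, so a substring search would wrongly report the required model as present.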
Diagnostic Steps
- The virsh domcapabilities output for RHEL 8.3 with default options is:
# virsh domcapabilities
<feature policy='disable' name='hle'/>
<feature policy='disable' name='rtm'/>
...
<model usable='yes'>Skylake-Server-noTSX-IBRS</model>
<model usable='no'>Skylake-Server-IBRS</model>
...
- TSX is disabled on RHEL 8.3 and higher.
$ cat /sys/devices/system/cpu/vulnerabilities/tsx_async_abort
Mitigation: TSX disabled
- On RHEL 8.2 and lower, TSX is not disabled by default:
# virsh domcapabilities
<feature policy='require' name='hle'/>
<feature policy='require' name='rtm'/>
...
<model usable='yes'>Skylake-Server-noTSX-IBRS</model>
<model usable='yes'>Skylake-Server-IBRS</model>
...
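The comparison above can be scripted. The following is a minimal sketch, assuming the `<model usable='...'>` lines look like the excerpts shown in this article. The function name `model_usability` is hypothetical. It reads the domcapabilities XML on standard input and prints the usable value for one CPU model.

```shell
# model_usability MODEL
# Read `virsh domcapabilities` XML on stdin and print "yes" or "no" for the
# given CPU model. Exits non-zero if the model is not listed at all.
# [^>]* tolerates extra attributes after usable='...' on the <model> tag.
model_usability() {
    sed -n "s|.*<model usable='\([a-z]*\)'[^>]*>$1</model>.*|\1|p" | grep -m1 .
}

# Usage on a hypervisor:
#   virsh domcapabilities | model_usability Skylake-Server-IBRS
```

Because the match is anchored on the full element content, asking for Skylake-Server-IBRS does not pick up the Skylake-Server-noTSX-IBRS line.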