Host is Non-Operational because of missing CPU features in RHV cluster version 4.4 with Secure Intel Cascadelake-Server CPU type

Solution Verified - Updated

Environment

  • Red Hat Virtualization (RHV) 4.4
  • Red Hat Virtualization Manager (RHV-M) 4.4.4
  • Red Hat Virtualization Host (RHVH) upgraded from 4.4.2 to 4.4.4
  • Cluster compatibility version 4.4
  • Cluster CPU type: Secure Intel Cascadelake Server Family

Issue

After a host is updated to RHVH >= 4.4.3, it is moved to the Non-Operational state because the host does not meet the cluster's minimum CPU level. Missing CPU features: model_Cascadelake-Server

Resolution

The CPU type of the cluster is not being updated to use the noTSX version. To force the update of the cluster CPU version, follow these steps:

  • Verify that the cluster compatibility version is >= 4.4.

  • Edit the cluster and set the cluster CPU model to an older model of the 'Secure' type, for example 'Secure Intel Skylake server'.

  • Edit the cluster again and set the correct model for your hardware using the 'Secure' type, in this example 'Secure Intel Cascadelake server'.

  • In RHV-M, verify that the cluster is now using the '-noTSX' version.

    #  /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select name, cpu_name, cpu_flags, compatibility_version from cluster;"
         name      |                cpu_name                |             cpu_flags              | compatibility_version
    ---------------+----------------------------------------+------------------------------------+-----------------------
       Default     | Secure Intel Cascadelake Server Family | vmx,model_Cascadelake-Server-noTSX | 4.4
    (1 row)
    
  • Power off and power on all VMs running with the old CPU flags.

Root Cause

  • Intel released a new microcode version removing the TSX instructions from the CPU. See KCS 5659891 for more information.
  • The CPU flags of the cluster are not automatically upgraded if the cluster contains any hosts with a status other than "UP" or "Preparing for Maintenance". As a result, the cluster keeps the CPU model without noTSX while the host reports the CPU model with noTSX.

The host reports the CPU model with noTSX:

grep model sos_commands/virsh/virsh_-r_capabilities | head -1
      <model>Cascadelake-Server-noTSX</model>

The database still holds the CPU model without noTSX:

engine=> select name, cpu_name, cpu_flags, compatibility_version from cluster;
  name   |                cpu_name                |                  cpu_flags                   | compatibility_version
---------+----------------------------------------+----------------------------------------------+-----------------------
 Default | Secure Intel Cascadelake Server Family | vmx,mds-no,md-clear,model_Cascadelake-Server | 4.4
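The mismatch shown above can also be checked with a small script. The following is a minimal sketch, not part of RHV: the helper names `host_cpu_model` and `cluster_matches_host` are hypothetical, and it assumes the host model appears as a single-line `<model>...</model>` element, as in the virsh capabilities output above.

```shell
# Hypothetical helper: print the first <model> value found in
# virsh capabilities XML read from standard input.
host_cpu_model() {
    sed -n 's:.*<model>\([^<]*\)</model>.*:\1:p' | head -n 1
}

# Hypothetical helper: succeed when the comma-separated cluster
# cpu_flags contain the flag model_<host model>.
#   $1 = host CPU model, $2 = cluster cpu_flags
cluster_matches_host() {
    case ",$2," in
        *",model_$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `cluster_matches_host "$(host_cpu_model < sos_commands/virsh/virsh_-r_capabilities)" "vmx,mds-no,md-clear,model_Cascadelake-Server"` fails for the values shown above, because the host reports Cascadelake-Server-noTSX while the cluster flags name the model without noTSX.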

Diagnostic Steps

Even though RHV-M is updated to 4.4.4, the cluster compatibility version is 4.4, and the cluster CPU type is 'Secure Intel Cascadelake Server Family', the cluster's cpu_flags still contain a CPU model without '-noTSX':

# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select name, cpu_name, cpu_flags, compatibility_version from cluster;"
     name      |                cpu_name                |                  cpu_flags                   | compatibility_version 
---------------+----------------------------------------+----------------------------------------------+-----------------------
    Default    | Secure Intel Cascadelake Server Family | vmx,mds-no,md-clear,model_Cascadelake-Server | 4.4
(1 row)
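When several clusters need to be checked, the cpu_flags value from the query above can be tested with a short shell function. This is a sketch only: `has_notsx_model` is a hypothetical helper name, and it assumes the flags are passed as the comma-separated string shown in the cpu_flags column.

```shell
# Hypothetical helper: succeed when one of the comma-separated flags
# is a CPU model flag ending in -noTSX (for example
# model_Cascadelake-Server-noTSX).
#   $1 = cluster cpu_flags
has_notsx_model() {
    local flag
    local IFS=,
    # Unquoted on purpose so that $1 is split on commas.
    for flag in $1; do
        case "$flag" in
            model_*-noTSX) return 0 ;;
        esac
    done
    return 1
}
```

For example, `has_notsx_model "vmx,mds-no,md-clear,model_Cascadelake-Server" || echo "cluster still uses the model without noTSX"` prints the message for the affected cluster shown above.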
