Guidance on Intel TSX impact on OpenStack guests
Environment
The following Red Hat OpenStack Platform environments are affected:
- Red Hat OpenStack Platform 10
- Red Hat OpenStack Platform 13
- Red Hat OpenStack Platform 16.1
- Red Hat OpenStack Platform 16.2
Issue
With its June 2021 microcode update, Intel disabled and removed the "TSX" (Transactional Synchronization Extensions) feature. This change has been backported to all supported Red Hat Enterprise Linux releases. The June 2021 microcode update was introduced for the following reasons:
- It is a preemptive measure against potential future security flaws
- It alleviates the performance penalty of TAA (TSX Asynchronous Abort) mitigations on Intel Cascade Lake servers
However, disabling TSX can break live migration in certain scenarios. On RHOSP 10, 13, and 16.1, live migration breaks only when you explicitly disable TSX (that is, add tsx=off to the kernel command line) on some nodes, because TSX is enabled by default on RHEL 7 and on RHEL 8 up to and including 8.2. Make sure the TSX setting is consistent across nodes in these environments. On RHOSP 16.2 (running RHEL 8.4), live migration breaks only when you explicitly enable TSX (tsx=on) on the kernel command line. This article outlines how Red Hat OpenStack Platform addresses this issue.
This impact applies only to Intel hosts that support the TSX feature. For more information about the CPUs that are affected by this issue, see Affected Configurations.
Resolution
Impacted RHOSP scenarios
- Compute nodes that use one of the affected Intel CPUs
- TSX is not explicitly disabled on Compute nodes using a kernel boot parameter; that is, there is no explicit definition of tsx=off via the THT KernelArgs parameter
- Necessity to live migrate workloads between Compute nodes during updates and upgrades
- IMPORTANT. Downtime must be scheduled for Nova instances before or during 16.1 -> 16.2 upgrade or fast forward upgrade procedure.
Affected Intel CPUs
- Haswell and Broadwell → Not affected; libvirt/QEMU disabled TSX for these models back in 2015.
- Skylake-Client and Skylake-Server → Setting the tsx=on argument has no additional performance penalty compared to setting the tsx=off argument. This is due to other mitigations for Spectre-related side-channel bugs.
- Cascade Lake (server and client) → Setting the tsx=on argument slows down the host. NB: There is an inherent performance penalty regardless of whether the guest is using TSX or not.
For more information about the CPUs that are affected by this issue, see Affected Configurations.
Affected workloads
The most common workloads that might be affected are the SAP HANA-style workloads.
Impact on live migration
Instances that are booted on Compute nodes where the TSX kernel argument is enabled can only be live migrated to other nodes where the TSX kernel argument is enabled. Compute nodes booted on RHEL 8.2 and earlier (including RHEL 7.x) run a TSX-enabled kernel by default. Compute nodes running on RHEL 8.3 and later run with the TSX kernel argument disabled by default.
As noted in the "Issue" section, live migration is impacted only when the TSX setting is inconsistent between source and destination nodes.
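The defaults described above can be expressed as a small shell sketch. It is illustrative only: the function names tsx_from_cmdline and effective_tsx are hypothetical, it assumes that the last tsx= value on the command line is the one that takes effect, and it reports any value other than on or off as "unknown" rather than guessing.

```shell
# Print the explicit tsx= value from a kernel command line (empty if absent).
# Assumption: if tsx= appears more than once, the last value is used.
tsx_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^tsx=//p' | tail -n 1
}

# Print the effective TSX setting ("on", "off", or "unknown") for a host,
# given its kernel command line and its RHEL major and minor version.
effective_tsx() {
    explicit=$(tsx_from_cmdline "$1")
    case "$explicit" in
        on|off) echo "$explicit" ;;
        "")
            # No explicit argument: RHEL 8.2 and earlier (including 7.x)
            # default to TSX on, RHEL 8.3 and later default to TSX off.
            if [ "$2" -lt 8 ] || { [ "$2" -eq 8 ] && [ "$3" -le 2 ]; }; then
                echo on
            else
                echo off
            fi
            ;;
        *) echo unknown ;;
    esac
}

# Example: a RHEL 8.4 host with no explicit tsx= argument prints "off".
effective_tsx "BOOT_IMAGE=/vmlinuz ro quiet" 8 4
```

On a real Compute node, the first argument would typically come from /proc/cmdline, for example effective_tsx "$(cat /proc/cmdline)" 8 4. Live migration between two nodes is expected to work only when both report the same value.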
Upgrades and "hybrid state"
In RHOSP, "hybrid state" refers to the deployment stage where some nodes are still running on RHOSP release N, while other nodes have been upgraded to release N+1. This hybrid state allows operators to upgrade Compute nodes without any running instances to the latest RHOSP release without impacting their workloads. After the nodes have been upgraded, operators can live migrate instances to the freshly upgraded Compute nodes, and then repeat the upgrade on the remaining nodes.
As with any live migration, make sure the kernel TSX setting is consistent across the source and destination hosts before you live migrate instances.
Minor updates
While performing minor updates from RHOSP 16.1 (based on RHEL 8.2) to RHOSP 16.2 (based on RHEL 8.4), operators cannot live migrate instances between nodes whose TSX settings are inconsistent. For example, it is impossible to live migrate from a source node where TSX is on to a destination node where TSX is off. You can explicitly disable the TSX flag before the upgrade to ensure that live migration works during the minor update procedure:
- On the Compute nodes with TSX enabled, set the TSX kernel argument to tsx=off by appending tsx=off to the existing KernelArgs definition for the Compute role, or by adding the following THT configuration to your templates (if there is no existing definition), and then run the deployment command.

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "tsx=off"

  IMPORTANT: The latest RHOSP 16.1 and 16.2 minor releases are configured not to reboot when the TSX flag is set, but older minor releases could be affected, and some KernelArgs definitions could trigger a reboot during deployment.
- For the first Compute node that has not been rebooted yet (so TSX has not been disabled yet), you can live migrate your instances to a destination Compute node that still has TSX enabled (because none of the other Compute nodes have been rebooted).
- Reboot the first Compute node.
- For any subsequent Compute node, you can live migrate your instances to any other destination Compute node that has not yet been rebooted. If the destination Compute node has been rebooted, it is only possible to cold migrate instances; as a result, some downtime is expected for ALL instances during cold migration.
- Reboot the subsequent Compute node.
Repeat these steps for all the Compute nodes in your environment.
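The choice made at each step above can be sketched as follows. This is a hedged illustration, not part of the documented procedure: migration_mode is a hypothetical helper, and the host names and states in the example are invented. A node that has not yet been rebooted still runs with TSX on; a rebooted node runs with TSX off.

```shell
# Print "live" when source and destination have the same TSX state,
# otherwise "cold". States are "on" (not yet rebooted with tsx=off)
# or "off" (already rebooted).
migration_mode() {
    if [ "$1" = "$2" ]; then
        echo live
    else
        echo cold
    fi
}

# Hypothetical example: the source node still runs with TSX on.
# Each destination entry is "<host>:<tsx state>".
src_state=on
for dest in compute-1:on compute-2:off; do
    host=${dest%%:*}
    state=${dest##*:}
    echo "$host: $(migration_mode "$src_state" "$state") migration"
done
```

With the sample values, compute-1 qualifies for live migration and compute-2 requires cold migration, which matches the rule that instances can only be live migrated between nodes with the same TSX state.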
IMPORTANT There is no need to take the steps described above if it is acceptable to cold migrate workloads during the minor update from RHEL 8.2 with the TSX flag enabled to RHEL 8.4. In that case, take similar steps to migrate instances and reboot the Compute nodes at the point where the Compute nodes are rebooted.
For more information about cold migration, see Cold migrating an instance for OSP 16.1 or Cold migrating an instance for OSP 13 or Cold Migrate a Virtual Machine for OSP 10.
References
- QEMU documentation on CPU models. (Refer to the notes on the tsx-ctrl flag.)
Root Cause
Intel has disabled and removed the "TSX" (Transactional Synchronization Extensions) feature in their latest microcode update. This was done to address the performance penalty that was introduced by mitigations for TSX Asynchronous Abort (TAA).
For more information, refer to Red Hat guidance on Transactional Synchronization Extensions (TSX) Asynchronous Abort.
Diagnostic Steps
About the TSX performance impact
To avoid confusion, note that having TSX enabled on the host, or even exposed to the guest, does not mean that the guest is actively using it. Few applications use TSX. If TSX is enabled on RHEL 8.3, it imposes a performance penalty on every context switch, so you might want to disable it. Do not enable TSX unconditionally.
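To see whether TSX is exposed on a host or inside a guest, you can look for the rtm and hle CPU flags in /proc/cpuinfo. The following is a minimal sketch in which tsx_flags is a hypothetical helper name. It only reports whether the feature is exposed, not whether any application actually uses it.

```shell
# Print the TSX-related CPU flags (rtm, hle) found in a flags string,
# one per line in sorted order; print nothing if none are present.
tsx_flags() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E -x 'rtm|hle' | sort -u
}

# Example with a shortened, made-up flags string.
tsx_flags "fpu vme hle avx2 rtm"

# On a real system, the string could come from /proc/cpuinfo:
#   tsx_flags "$(grep -m1 '^flags' /proc/cpuinfo)"
```

Empty output means that neither flag is visible to that system.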
Performance penalty on Intel Cascade Lake CPUs
The performance penalty that results from setting the kernel argument to tsx=on impacts Intel Cascade Lake CPUs. Regardless of whether the guest is using TSX or not, if virtual machines are running on Intel Cascade Lake CPUs and the kernel argument is set to tsx=on, the host is slowed down.
Determining whether your environment is impacted
Red Hat has developed a non-intrusive Ansible playbook that can be called from the undercloud node to validate all the Compute nodes.
[stack@undercloud-0 ~]$ source stackrc
(undercloud) [stack@undercloud-0 ~]$ curl -O https://access.redhat.com/sites/default/files/attachments/tsx_validation.yml
(undercloud) [stack@undercloud-0 ~]$ tripleo-ansible-inventory --static-yaml latest-inventory.yaml
(undercloud) [stack@undercloud-0 ~]$ ansible-playbook -i latest-inventory.yaml tsx_validation.yml
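Independently of the playbook, a quick consistency check over per-host results might look like the sketch below. It assumes you have already collected one "<host> <tsx setting>" pair per line by some means of your own. The helper name tsx_consistency and the sample host names are hypothetical, and this is not the logic of tsx_validation.yml.

```shell
# Read "<host> <tsx setting>" pairs on stdin. Print "consistent" when all
# hosts report the same setting, otherwise "mixed".
tsx_consistency() {
    awk 'NF >= 2 { seen[$2] = 1 }
         END { n = 0; for (k in seen) n++
               print (n <= 1 ? "consistent" : "mixed") }'
}

# Example with invented hosts: one node differs, so this prints "mixed".
printf '%s\n' "compute-0 on" "compute-1 on" "compute-2 off" | tsx_consistency
```

A "mixed" result indicates that live migration between at least some pairs of Compute nodes is expected to fail until the settings are aligned.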