Affinity of managed interrupts cannot be changed even if they target isolated CPU
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 7 starting with kernel version 3.10.0-957.el7
The issue was not observed on Red Hat Enterprise Linux 6.
Issue
- Affinity of some interrupts cannot be changed via /proc/irq/X/... and such interrupts ignore the irqaffinity kernel parameter from the cmdline.
- If such interrupts are so-called "managed" interrupts, then requests from userspace to change their affinity will not work.
- Such "managed" interrupts can target isolated CPU cores, and it is impossible to re-target the IRQ to another CPU.
Resolution
- In RHEL 8.2 (kernel-4.18.0-193.el8) and later, the managed_irq parameter of isolcpus provides best-effort isolation from kernel-managed IRQs. The isolcpus=managed_irq,domain,<cpu-list> syntax isolates the listed CPUs from being targeted by managed interrupts and also excludes them from the general SMP balancing and scheduling algorithms.
- Even with managed_irq, the outcome depends on the IO load, the available CPUs, and driver-controlled interrupt affinity. It is not possible to guarantee that some cores never handle managed IRQs: there will be situations where isolated CPUs handle some managed IRQs, and this cannot be avoided.
- Otherwise, it is not possible to control kernel-managed IRQs.
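For illustration, a kernel cmdline using the syntax above to isolate CPUs 2-5 (an example CPU list, not one from this article) could look like:

```
isolcpus=managed_irq,domain,2-5
```

On RHEL 8 this can typically be applied with grubby --update-kernel=ALL --args="isolcpus=managed_irq,domain,2-5" followed by a reboot.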
Root Cause
- The special flags IRQD_AFFINITY_MANAGED and IRQD_MANAGED_SHUTDOWN were introduced in the kernel's IRQ subsystem:
...
* IRQD_AFFINITY_MANAGED - Affinity is auto-managed by the kernel
...
* IRQD_MANAGED_SHUTDOWN - Interrupt was shutdown due to empty affinity
* mask. Applies only to affinity managed irqs.
...
- A driver that wants to control IRQ affinity by itself can request and register a "managed" interrupt. The kernel (specifically, the device driver) controls the affinity property and ignores cmdline options and userspace requests; no source other than the driver can control the affinity of such IRQs. The irqbalance daemon cannot manage them either. This is done for multi-queue or per-CPU queue data structures used in drivers, for instance the nvme driver. Userspace should not care about such "managed" interrupts.
- The managed_irq parameter was added upstream with commit "genirq, sched/isolation: Isolate from handling managed interrupts" and was backported to RHEL 8.2 under Red Hat Private Bug 1783026.
Diagnostic Steps
- First of all, any attempt to change IRQ affinity via the /proc filesystem will not work.
For example:
[root@machine ~]# echo 5 > /proc/irq/31/smp_affinity_list
-bash: echo: write error: Input/output error
[root@machine ~]#
- If you pass, for example, irqaffinity=0 to the kernel via the cmdline, managed interrupts will ignore that parameter.
/proc/interrupts can continue to look like this:
[root@machine ~]# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4
0: 75 0 0 0 0 IO-APIC 2-edge timer
1: 4 0 0 0 0 IO-APIC 1-edge i8042
8: 1 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 6 0 0 0 0 IO-APIC 12-edge i8042
16: 0 0 0 0 0 IO-APIC 16-fasteoi uhci_hcd:usb3, hpilo
20: 881 0 0 0 0 IO-APIC 20-fasteoi ehci_hcd:usb2
21: 7255 0 0 0 0 IO-APIC 21-fasteoi ehci_hcd:usb1
27: 18 0 0 0 0 PCI-MSI 2097152-edge nvme0q0
28: 0 10582 0 0 0 PCI-MSI 2097153-edge nvme0q1
29: 0 0 59 0 0 PCI-MSI 2097154-edge nvme0q2
30: 0 0 0 117 0 PCI-MSI 2097155-edge nvme0q3
31: 0 0 0 0 11218 PCI-MSI 2097156-edge nvme0q4
32: 0 0 0 0 0 PCI-MSI 2097157-edge nvme0q5
33: 0 0 0 0 0 PCI-MSI 2097158-edge nvme0q6
34: 0 0 0 0 0 PCI-MSI 2097159-edge nvme0q7
...
In the example above, IRQs 27 to 31 are managed interrupts.
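As an illustration (not part of the original article), the /proc/interrupts output above can be parsed with a small script to collect the per-CPU counts of each numbered IRQ; here it runs on an embedded abridged sample rather than the live file:

```python
# Sketch: parse /proc/interrupts-style output (as shown above) into
# per-IRQ CPU counts so the per-queue nvme interrupts can be inspected.
def parse_interrupts(text):
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts or not parts[0].rstrip(":").isdigit():
            continue                      # skip NMI:/LOC:/ERR: style rows
        irq = parts[0].rstrip(":")
        counts = [int(c) for c in parts[1:1 + ncpu]]
        desc = " ".join(parts[1 + ncpu:])
        table[irq] = (counts, desc)
    return table

# Sample abridged from the output above.
sample = """\
      CPU0  CPU1  CPU2  CPU3  CPU4
 27:    18     0     0     0     0  PCI-MSI 2097152-edge nvme0q0
 31:     0     0     0     0 11218  PCI-MSI 2097156-edge nvme0q4
"""
irqs = parse_interrupts(sample)
print(irqs["31"])
```

On a live system the same function can be fed the contents of /proc/interrupts directly.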
- Off-lining a CPU and bringing it back online in order to let the IRQ migrate to another CPU will not help. When the CPU comes back online, the kernel restores the affinity of any managed interrupt that was targeting it.
How to tell whether an IRQ is managed?
There are two options:
- By reading the source code to check whether the driver requests and uses managed interrupts.
- RHEL 7.5 (3.10.0-862.el7) or later: if the driver sets the PCI_IRQ_AFFINITY flag, then interrupts are managed by the kernel.
- Earlier than the above: if the driver does not set the PCI_IRQ_NOAFFINITY flag, then interrupts are managed by the kernel.
See the upstream patch "PCI: Use positive flags in pci_alloc_irq_vectors()" for an explanation of the change in behaviour in this area.
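The version-dependent rule above can be modeled as a small predicate. This is a sketch for illustration only; the constants are stand-ins, not the kernel's actual PCI_IRQ_* values.

```python
# Illustrative stand-ins for the kernel flags, not the real values.
PCI_IRQ_AFFINITY = 1 << 3     # newer API: opt IN to managed affinity
PCI_IRQ_NOAFFINITY = 1 << 4   # older API: opt OUT of managed affinity

def irqs_are_managed(flags, opt_in_api):
    """opt_in_api=True models RHEL 7.5 (3.10.0-862.el7) and later,
    where PCI_IRQ_AFFINITY must be set for managed interrupts;
    False models earlier kernels, where interrupts are managed
    unless PCI_IRQ_NOAFFINITY is set."""
    if opt_in_api:
        return bool(flags & PCI_IRQ_AFFINITY)
    return not (flags & PCI_IRQ_NOAFFINITY)
```

Note how the default flips across the API change: on older kernels a driver gets managed interrupts unless it opts out, on newer kernels only if it opts in.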
- Only for RHEL 8: using debugfs. Boot the debug kernel or any other custom kernel built with CONFIG_GENERIC_IRQ_DEBUGFS. In that case the path /sys/kernel/debug/irq/ will exist.
Then check the files named after IRQ numbers under /sys/kernel/debug/irq/irqs.
For example, we are interested in IRQ 31.
[root@machine irqs]# cat /sys/kernel/debug/irq/irqs/31
handler: handle_edge_irq
device: 0000:04:00.0
status: 0x00000000
istate: 0x00000000
ddepth: 0
wdepth: 0
dstate: 0x01600200
IRQD_ACTIVATED
IRQD_IRQ_STARTED
IRQD_SINGLE_TARGET
IRQD_AFFINITY_MANAGED
node: 1
affinity: 4,9,11,13,15,17,19,21
effectiv: 4
pending:
domain: PCI-MSI-2
hwirq: 0x200004
chip: PCI-MSI
flags: 0x30
IRQCHIP_SKIP_SET_WAKE
IRQCHIP_ONESHOT_SAFE
parent:
domain: VECTOR
hwirq: 0x1f
chip: APIC
flags: 0x0
Vector: 33
Target: 4
move_in_progress: 0
is_managed: 1
can_reserve: 0
has_reserved: 0
cleanup_pending: 0
[root@machine irqs]#
- We can see that the flag IRQD_AFFINITY_MANAGED is listed and the is_managed field is set to 1.
Non-managed interrupts will not have these.
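This check can be scripted. The sketch below scans a debugfs IRQ dump (like the one above) for the managed markers; it assumes the dump uses single spaces after the colon, so a real dump may need looser matching.

```python
# Sketch: scan a /sys/kernel/debug/irq/irqs/<N> dump for the
# markers of a kernel-managed interrupt.
def is_managed_irq(debugfs_text):
    lines = [line.strip() for line in debugfs_text.splitlines()]
    return "IRQD_AFFINITY_MANAGED" in lines or "is_managed: 1" in lines

# Abridged from the IRQ 31 dump above.
dump = """\
dstate:   0x01600200
            IRQD_ACTIVATED
            IRQD_AFFINITY_MANAGED
is_managed: 1
"""
print(is_managed_irq(dump))  # True
```

On a live system, feeding each file under /sys/kernel/debug/irq/irqs/ to this function lists every managed IRQ at once.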
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.