Affinity of managed interrupts cannot be changed even if they target isolated CPU

Solution Unverified - Updated

Environment

  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7 starting with kernel version 3.10.0-957.el7

The issue was not observed on Red Hat Enterprise Linux 6.

Issue

  • The affinity of some interrupts cannot be changed via /proc/irq/X/..., and such interrupts ignore the irqaffinity kernel parameter passed on the command line.
  • If these are so-called "managed" interrupts, userspace requests to change their affinity will not work.
  • Managed interrupts can target isolated CPU cores, and it is impossible to re-target such an IRQ to another CPU.

Resolution

  • In RHEL 8.2 (kernel-4.18.0-193.el8) and later, the managed_irq flag of the isolcpus parameter provides best-effort isolation from kernel-managed IRQs. Use the isolcpus=managed_irq,domain,<cpu-list> syntax to shield the listed CPUs from being targeted by managed interrupts and to exclude them from the general SMP balancing and scheduling algorithms.

  • Even with managed_irq, the isolation remains best-effort: the effective targets depend on I/O activity, the set of available CPUs, and driver-controlled interrupt affinity.

  • It is not possible to guarantee that certain cores never handle managed IRQs: there will be situations where isolated CPUs handle some managed IRQs, and this cannot be avoided.

  • Apart from this parameter, it is not possible to control kernel-managed IRQs.
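As a quick check of the Resolution above, the sketch below tests whether the kernel was booted with the managed_irq flag of isolcpus. The command-line string and CPU list 2-5 are illustrative assumptions; on a live system you would read /proc/cmdline instead.

```shell
# Sample kernel command line (assumed values); on a live system use:
#   cmdline=$(cat /proc/cmdline)
cmdline="BOOT_IMAGE=/vmlinuz-4.18.0-193.el8.x86_64 ro isolcpus=managed_irq,domain,2-5"

# The managed_irq flag must appear in the isolcpus= option.
case "$cmdline" in
  *isolcpus=*managed_irq*) status="requested" ;;
  *)                       status="not requested" ;;
esac
echo "managed_irq isolation: $status"
```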

Root Cause

  • The special flags IRQD_AFFINITY_MANAGED and IRQD_MANAGED_SHUTDOWN were introduced in the kernel's IRQ subsystem:
...
 * IRQD_AFFINITY_MANAGED        - Affinity is auto-managed by the kernel
...
 * IRQD_MANAGED_SHUTDOWN        - Interrupt was shutdown due to empty affinity
 *                                mask. Applies only to affinity managed irqs.
...
  • A driver that wants to control IRQ affinity itself can request and register a "managed" interrupt. The kernel (specifically, the device driver) then owns the affinity property and ignores command-line options and userspace requests; there is no other way to control the affinity of such IRQs, and the irqbalance daemon cannot manage them either. This is done for the multi-queue or per-CPU queue data structures used in drivers, for instance the nvme driver. Userspace should not need to care about such "managed" interrupts.

  • The managed_irq parameter was added by the upstream commit "genirq, sched/isolation: Isolate from handling managed interrupts" and was backported to RHEL 8.2 under Red Hat Private Bug 1783026.
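The per-queue layout described above is visible in interrupt names: multi-queue drivers such as nvme register one interrupt per queue (nvme0q0, nvme0q1, ...). The sketch below extracts those per-queue IRQ numbers from a saved /proc/interrupts snapshot; the snapshot text is sample data, and on a live system you would filter /proc/interrupts directly.

```shell
# Sample /proc/interrupts lines (abridged); on a live system run:
#   awk '/nvme/ {gsub(":", "", $1); print $1 "=" $NF}' /proc/interrupts
snapshot='  27:  18      0   0  0  0   PCI-MSI 2097152-edge      nvme0q0
  28:   0  10582   0  0  0   PCI-MSI 2097153-edge      nvme0q1
  29:   0      0  59  0  0   PCI-MSI 2097154-edge      nvme0q2'

# First field is the IRQ number (with a trailing colon), last field
# is the per-queue interrupt name registered by the driver.
queues=$(printf '%s\n' "$snapshot" |
         awk '/nvme/ {gsub(":", "", $1); print $1 "=" $NF}')
echo "$queues"
```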

Diagnostic Steps

  • First of all, any attempt to change the IRQ affinity via the /proc filesystem will fail.

For example:

[root@machine ~]# echo 5 > /proc/irq/31/smp_affinity_list 
-bash: echo: write error: Input/output error
[root@machine ~]# 
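The failed write can be scripted. The sketch below attempts to retarget an IRQ and reports the outcome; the IRQ number 31 and target CPU 5 are assumptions to adjust for your system, and the write must normally be done as root (for a managed interrupt the kernel rejects it with EIO even then).

```shell
irq=31   # assumed IRQ number; pick one from /proc/interrupts
cpu=5    # assumed target CPU

# For a managed interrupt the kernel rejects the write with EIO,
# so the redirection below fails even when running as root.
if echo "$cpu" > "/proc/irq/$irq/smp_affinity_list" 2>/dev/null; then
  result="IRQ $irq retargeted to CPU $cpu"
else
  result="IRQ $irq refused the affinity change (managed or missing IRQ)"
fi
echo "$result"
```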
  • If you pass, for example, irqaffinity=0 to the kernel on the command line, managed interrupts will ignore it.

/proc/interrupts can still look like this:

[root@machine ~]# cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       CPU4       
   0:         75          0          0          0          0   IO-APIC    2-edge      timer
   1:          4          0          0          0          0   IO-APIC    1-edge      i8042
   8:          1          0          0          0          0   IO-APIC    8-edge      rtc0
   9:          0          0          0          0          0   IO-APIC    9-fasteoi   acpi
  12:          6          0          0          0          0   IO-APIC   12-edge      i8042
  16:          0          0          0          0          0   IO-APIC   16-fasteoi   uhci_hcd:usb3, hpilo
  20:        881          0          0          0          0   IO-APIC   20-fasteoi   ehci_hcd:usb2
  21:       7255          0          0          0          0   IO-APIC   21-fasteoi   ehci_hcd:usb1
  27:         18          0          0          0          0   PCI-MSI 2097152-edge      nvme0q0
  28:          0      10582          0          0          0   PCI-MSI 2097153-edge      nvme0q1
  29:          0          0         59          0          0   PCI-MSI 2097154-edge      nvme0q2
  30:          0          0          0        117          0   PCI-MSI 2097155-edge      nvme0q3
  31:          0          0          0          0      11218   PCI-MSI 2097156-edge      nvme0q4
  32:          0          0          0          0          0   PCI-MSI 2097157-edge      nvme0q5
  33:          0          0          0          0          0   PCI-MSI 2097158-edge      nvme0q6
  34:          0          0          0          0          0   PCI-MSI 2097159-edge      nvme0q7
...

In the example above, IRQs 27 to 31 are managed interrupts: despite irqaffinity=0, each nvme queue interrupt stays on its own CPU.

  • Off-lining a CPU and bringing it back online in order to let the IRQ migrate to another CPU will not help: when the CPU comes back online, the kernel restores the affinity of any managed interrupt that was targeting it.

How can you tell whether an IRQ is managed?

There are two options:

  • By reading the driver source code to check whether the driver requests and uses managed interrupts.

    • RHEL 7.5 (3.10.0-862.el7) or later - If the driver sets the PCI_IRQ_AFFINITY flag then interrupts are managed by the kernel.
    • Earlier than the above - If the driver does not set the PCI_IRQ_NOAFFINITY flag then interrupts are managed by the kernel.

See the upstream patch "PCI: Use positive flags in pci_alloc_irq_vectors()" for an explanation of the most recent behaviour change in this area.
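A quick way to apply this source-code check is to search the driver tree for the allocation flag. The sketch below greps a sample pci_alloc_irq_vectors() call; the fragment text (variable names, vector counts) is illustrative, and in a real kernel source tree you would run something like grep -rl PCI_IRQ_AFFINITY drivers/nvme/ instead.

```shell
# Sample driver fragment (illustrative); in a kernel tree run e.g.:
#   grep -rl PCI_IRQ_AFFINITY drivers/nvme/
fragment='ret = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
                                      PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);'

# Presence of PCI_IRQ_AFFINITY means the kernel will spread and
# manage the affinity of these interrupts itself.
if printf '%s\n' "$fragment" | grep -q 'PCI_IRQ_AFFINITY'; then
  verdict="driver requests kernel-managed interrupts"
else
  verdict="no PCI_IRQ_AFFINITY flag found"
fi
echo "$verdict"
```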

  • Only for RHEL 8: using debugfs. Boot the debug kernel, or any custom kernel built with CONFIG_GENERIC_IRQ_DEBUGFS; the path /sys/kernel/debug/irq/ will then exist.
    Then check the files named after IRQ numbers under /sys/kernel/debug/irq/irqs.
    For example, we are interested in IRQ 31.
[root@machine irqs]# cat /sys/kernel/debug/irq/irqs/31 
handler:  handle_edge_irq
device:   0000:04:00.0
status:   0x00000000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01600200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_AFFINITY_MANAGED
node:     1
affinity: 4,9,11,13,15,17,19,21
effectiv: 4
pending:  
domain:  PCI-MSI-2
 hwirq:   0x200004
 chip:    PCI-MSI
  flags:   0x30
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  VECTOR
     hwirq:   0x1f
     chip:    APIC
      flags:   0x0
     Vector:    33
     Target:     4
     move_in_progress: 0
     is_managed:       1
     can_reserve:      0
     has_reserved:     0
     cleanup_pending:  0
[root@machine irqs]# 
  • We can see that the IRQD_AFFINITY_MANAGED flag is listed and the is_managed field is set to 1.
    Non-managed interrupts will not show these.
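Both indicators can be checked in one pass. The sketch below filters a saved debugfs dump for them; the dump text is sample data taken from output like that shown above, and on a live RHEL 8 debug kernel you could instead run grep -l IRQD_AFFINITY_MANAGED /sys/kernel/debug/irq/irqs/* to list every managed IRQ at once.

```shell
# Sample lines from /sys/kernel/debug/irq/irqs/31; on a live system:
#   grep -l IRQD_AFFINITY_MANAGED /sys/kernel/debug/irq/irqs/*
dump='dstate:   0x01600200
            IRQD_ACTIVATED
            IRQD_AFFINITY_MANAGED
     is_managed:       1'

# A managed IRQ shows both the dstate flag and is_managed set to 1.
if printf '%s\n' "$dump" | grep -q 'IRQD_AFFINITY_MANAGED' &&
   printf '%s\n' "$dump" | grep -q 'is_managed: *1'; then
  managed=yes
else
  managed=no
fi
echo "managed: $managed"
```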

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.