Disabling RHEL online CPU on Azure cloud throws an error
Environment
- Azure cloud ARM mode
- RHEL 6.8+
- RHEL 7.2+
Issue
- Disabling an online CPU on a RHEL VM in Azure throws an error.
- Provision a Standard DS-series RHEL instance on Azure with RHEL 7.2 or RHEL 6.8. Attempting to disable active/online CPUs on the instance fails with the errors below.
[root@XXXX ~]# echo 0 > /sys/devices/system/cpu/cpu19/online
-bash: echo: write error: Function not implemented
[root@XXXX ~]# cat /sys/devices/system/cpu/online
0-19
- The following scenarios were run to confirm the issue, with the results shown below:
a) Using the maxcpus parameter: add the kernel parameter maxcpus=N at boot time.
b) Disabling an online CPU: take CPU cores offline at runtime from the command line.
Results:
RHEL version | maxcpus=1 at boot | CPU online/offline (echo 0 > /sys/devices/system/cpu/cpu1/online)
6.7          | no panic, 1 CPU   | yes (hangs)
6.8          | no panic, 1 CPU   | no (write error: Function not implemented)
7.2          | panic             | no (write error: Function not implemented)
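The runtime test from scenario (b) can be sketched as a small shell helper. This is only an illustration: the function name try_offline_cpu and the SYSFS_CPU_DIR override are hypothetical and not part of the original report. On an affected Azure VM the write is expected to be rejected, as in the error shown above.

```shell
#!/bin/bash
# Hypothetical helper (not from the article): try to take one CPU offline
# and report whether the kernel accepted the request.
# SYSFS_CPU_DIR is an assumed override so the function can be pointed at a
# scratch directory instead of the real sysfs tree.
try_offline_cpu() {
    local cpu="$1"
    local dir="${SYSFS_CPU_DIR:-/sys/devices/system/cpu}"
    local ctl="${dir}/cpu${cpu}/online"

    # Some CPUs (often cpu0) expose no "online" control file at all.
    if [ ! -e "$ctl" ]; then
        echo "cpu${cpu}: no online control file"
        return 2
    fi

    # Writing 0 asks the kernel to offline the CPU; a failed write means
    # the request was rejected (for example "Function not implemented").
    if { echo 0 > "$ctl"; } 2>/dev/null; then
        echo "cpu${cpu}: offline request accepted"
    else
        echo "cpu${cpu}: offline request rejected"
        return 1
    fi
}

# Example (as root): try_offline_cpu 19
```

The function only reports the result of the write; it does not try to bring the CPU back online afterwards.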
Resolution
The issue is tracked in the following bugs:
- Bug 1396335: Disabling online CPU (RHEL 7.2 VM) on Azure throws an error
- Bug 1396336: Disabling online CPU (RHEL 6.8 VM) on Azure throws an error
What is nr_cpus parameter?
nr_cpus= [SMP] Maximum number of processors that an SMP kernel could support. nr_cpus=n : n >= 1 limits the kernel to supporting 'n' processors. Later at runtime you cannot use the CPU hotplug feature to bring more CPUs online, just as if you had compiled the kernel with NR_CPUS=n. (Source: kernel documentation on www.kernel.org.)
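To see whether a running system was booted with a CPU limit, the kernel command line can be inspected. The following is a minimal sketch; the function name show_cpu_limit_params is hypothetical, and the optional file argument exists only so the function can be tried against a sample file.

```shell
#!/bin/bash
# Hypothetical helper: print any nr_cpus= or maxcpus= parameter found on
# the kernel command line. Prints nothing if neither is set.
show_cpu_limit_params() {
    local cmdline_file="${1:-/proc/cmdline}"
    # Put each boot parameter on its own line, then keep only the CPU limits.
    tr ' ' '\n' < "$cmdline_file" | grep -E '^(nr_cpus|maxcpus)=[0-9]+$'
}

# Example: show_cpu_limit_params
```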
For RHEL 6.8+
- Add the 'nr_cpus=xx' kernel parameter at boot time. Note that CPUs cannot be brought online or taken offline manually after boot.
- The online/offline mechanism will likely be reworked upstream and backported to RHEL 7.4. No upstream fix will be backported to RHEL 6.
- This issue does not appear to meet the inclusion criteria for Production Phase 3 and will be marked as CLOSED/WONTFIX.
For RHEL 7.2+
Work-around
- Disable the udev rule and boot with the 'maxcpus=xx' limitation. It will then be possible to online/offline the CPUs that are not online by default after boot.
- Set maxcpus=xx as a boot parameter and change the 40-redhat.rules udev rule:
In /usr/lib/udev/rules.d/40-redhat.rules, comment out the line below by adding a # to the beginning of the line:
# SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"
Then run "dracut --force" to update the initramfs image.
On Hyper-V there is no hot add/remove of vCPUs, so hot plugging them should not be attempted.
- A crash will likely occur if any of the disabled CPUs is brought online manually after boot.
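The udev rule change described above could be scripted roughly as follows. This is a sketch under assumptions: the function name comment_out_cpu_online_rule is hypothetical, and the match expects the rule to appear exactly as quoted in this article. Lines that are already commented out are left untouched, so the function can be run more than once.

```shell
#!/bin/bash
# Hypothetical helper: comment out the CPU auto-online rule in a udev rules
# file. The default path is the one named in this article; a different file
# can be passed as the first argument for a dry run on a copy.
comment_out_cpu_online_rule() {
    local rules_file="${1:-/usr/lib/udev/rules.d/40-redhat.rules}"
    local tmp rc
    tmp="$(mktemp)" || return 1

    # Prefix the still-active rule with "# "; already-commented lines do not
    # match the ^SUBSYSTEM anchor and are left as they are.
    sed '/^SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"/ s/^/# /' \
        "$rules_file" > "$tmp" && cat "$tmp" > "$rules_file"
    rc=$?

    rm -f "$tmp"
    return $rc
}

# Example (as root): comment_out_cpu_online_rule && dracut --force
```

Writing the result back with cat rather than replacing the file keeps the original file's ownership and permissions.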
Permanent changes
- Boot with the 'nr_cpus=xx' kernel parameter to permanently disable all unneeded CPUs. In this case no modifications to udev rules are required.
- An erratum was released as part of the RHEL 7.4 release.
For information on the advisory and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2017:1842
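This article does not say how to add the 'nr_cpus=xx' or 'maxcpus=xx' boot parameter. One common approach on RHEL is the grubby tool, and the sketch below assumes that approach. The function name cpu_limit_cmd is hypothetical; it only validates its input and prints the grubby command so the command can be reviewed before it is run as root.

```shell
#!/bin/bash
# Hypothetical helper: build (but do not run) a grubby command that adds a
# CPU limit to the boot parameters of all installed kernels.
# Assumption: grubby is used to manage kernel arguments on this system.
cpu_limit_cmd() {
    local param="$1" count="$2"

    # Only the two parameters discussed in this article are accepted.
    case "$param" in
        nr_cpus|maxcpus) ;;
        *) echo "unsupported parameter: $param" >&2; return 2 ;;
    esac

    # The CPU count must be a whole number of at least 1.
    case "$count" in
        ''|*[!0-9]*) echo "CPU count must be a positive integer" >&2; return 2 ;;
    esac
    if [ "$count" -lt 1 ]; then
        echo "CPU count must be a positive integer" >&2
        return 2
    fi

    echo "grubby --update-kernel=ALL --args=${param}=${count}"
}

# Example: cpu_limit_cmd nr_cpus 4
# Review the printed command, run it as root, then reboot.
```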
Root Cause
The behaviour is expected. On Hyper-V, VMBus channels are spread across all available CPUs at boot, and high-performance devices such as network and storage have sub-channels on all available CPUs (not exactly true in the NUMA case, but assume there is only one NUMA node). Currently, there is no Hyper-V API to rebind a VMBus channel to another CPU when the CPU it is bound to goes offline, so we added a commit upstream which prevents all CPUs from going offline. You can see it in dmesg:
[ 6.570360] hv_vmbus: CPU offlining is not supported by hypervisor
In RHEL 7, the above-mentioned commit has been present since RHEL 7.2. Earlier versions will likely hang on CPU offlining.
Note: Hyper-V is the underlying virtualization infrastructure for Microsoft Azure VMs.
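A quick way to confirm that a VM is running with this restriction is to look for the hv_vmbus message in the kernel log. The helper below is a minimal sketch with a hypothetical name; it reads log text on standard input and matches the message exactly as quoted above.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the kernel log on stdin contains
# the hv_vmbus message that indicates CPU offlining is blocked.
hv_offlining_blocked() {
    grep -q 'hv_vmbus: CPU offlining is not supported by hypervisor'
}

# Example:
#   dmesg | hv_offlining_blocked && echo "CPU offlining is blocked on this VM"
```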
Diagnostic Steps
- Error from Boot Diagnostic serial logs
] Started udev Coldplug all Devices.
[ 2.606698] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2.607004] IP: [<ffffffffa001a5e3>] hv_post_message+0x43/0xd0 [hv_vmbus]
[ 2.607004] PGD 0
[ 2.607004] Oops: 0002 [#1] SMP
[ 2.607004] Modules linked in: hv_netvsc(+) ata_piix(+) crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw hv_vmbus floppy
[ 2.607004] CPU: 1 PID: 332 Comm: systemd-udevd Not tainted 3.10.0-327.36.3.el7.x86_64 #1
[ 2.607004] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 2.607004] task: ffff880290d6d080 ti: ffff8802908f8000 task.ti: ffff8802908f8000
[ 2.607004] RIP: 0010:[<ffffffffa001a5e3>] [<ffffffffa001a5e3>] hv_post_message+0x43/0xd0 [hv_vmbus]
[ 2.607004] RSP: 0018:ffff8802908fb9e8 EFLAGS: 00010287
[ 2.607004] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000000000ec
[ 2.607004] RDX: 00000000000000ec RSI: 0000000000000001 RDI: 0000000000000001
[ 2.607004] RBP: ffff8802908fba00 R08: ffff880290bd0458 R09: ffff880000000000
[ 2.607004] R10: ffff880297003800 R11: ffffffffffffffe4 R12: 00000000000000ec
[ 2.607004] R13: ffff880290bd0458 R14: 0000000000001000 R15: ffff880290bd0400
[ 2.607004] FS: 00007fc5beac2880(0000) GS:ffff880297640000(0000) knlGS:0000000000000000
[ 2.607004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.607004] CR2: 0000000000000000 CR3: 00000002908e8000 CR4: 00000000001406e0
[ 2.607004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2.607004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2.607004] Stack:
[ 2.607004] ffff880297003600 000000009765188a 000000000000000a ffff8802908fba30
[ 2.607004] ffffffffa001aeb0 0000000000000246 ffffc90000f38000 ffff880290bf6238
[ 2.607004] 0000000000001000 ffff8802908fbaa8 ffffffffa001c009 ffff880290b9003c
[ 2.607004] Call Trace:
[ 2.607004] [<ffffffffa001aeb0>] vmbus_post_msg+0x30/0xa0 [hv_vmbus]
[ 2.607004] [<ffffffffa001c009>] vmbus_establish_gpadl+0x2b9/0x3f0 [hv_vmbus]
[ 2.607004] [<ffffffffa00b679f>] netvsc_device_add+0x47f/0x870 [hv_netvsc]
[ 2.607004] [<ffffffffa00b7cfa>] rndis_filter_device_add+0x7a/0x500 [hv_netvsc]
[ 2.607004] [<ffffffff8152ed6d>] ? alloc_netdev_mqs+0x21d/0x360
[ 2.607004] [<ffffffff8152eda8>] ? alloc_netdev_mqs+0x258/0x360
[ 2.607004] [<ffffffffa00b52c4>] netvsc_probe+0x164/0x250 [hv_netvsc]
[ 2.607004] [<ffffffffa001a05e>] vmbus_probe+0x3e/0xa0 [hv_vmbus]
[ 2.607004] [<ffffffff813f6907>] driver_probe_device+0x87/0x390
[ 2.607004] [<ffffffff813f6ce3>] __driver_attach+0x93/0xa0
[ 2.607004] [<ffffffff813f6c50>] ? __device_attach+0x40/0x40
[ 2.607004] [<ffffffff813f4673>] bus_for_each_dev+0x73/0xc0
[ 2.607004] [<ffffffff813f635e>] driver_attach+0x1e/0x20
[ 2.607004] [<ffffffff813f5eb0>] bus_add_driver+0x200/0x2d0
[ 2.607004] [<ffffffff813f7364>] driver_register+0x64/0xf0
[ 2.607004] [<ffffffffa001a017>] __vmbus_driver_register+0x57/0x60 [hv_vmbus]
[ 2.607004] [<ffffffffa00fb000>] ? 0xffffffffa00fafff
[ 2.607004] [<ffffffffa00fb044>] netvsc_drv_init+0x44/0x1000 [hv_netvsc]
[ 2.607004] [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
[ 2.607004] [<ffffffff810ed74e>] load_module+0x134e/0x1b50
[ 2.607004] [<ffffffff81316b50>] ? ddebug_proc_write+0xf0/0xf0
[ 2.607004] [<ffffffff810e99e3>] ? copy_module_from_fd.isra.42+0x53/0x150
[ 2.607004] [<ffffffff810ee106>] SyS_finit_module+0xa6/0xd0
[ 2.607004] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[ 2.607004] Code: 89 45 f0 31 c0 48 81 f9 f0 00 00 00 b8 a6 ff ff ff 77 7a 49 89 d0 65 8b 04 25 1c a0 00 00 48 98 48 89 ca 48 8b 1c c5 78 7b 05 a0 <89> 3b 48 8d 7b 10 89 73 08 c7 43 04 00 00 00 00 89 4b 0c 4c 89
[ 2.607004] RIP [<ffffffffa001a5e3>] hv_post_message+0x43/0xd0 [hv_vmbus]
[ 2.607004] RSP <ffff8802908fb9e8>
[ 2.607004] CR2: 0000000000000000
[ 3.170949] ---[ end trace cb40d90255897b28 ]---
[ 3.178012] Kernel panic - not syncing: Fatal exception