After starting a Guest with NVIDIA A16 GPU, the GPU stops working

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 8, 9
  • KVM Guest with NVIDIA GPU Pass-Through

Issue

  • After a Guest with the NVIDIA A16 GPU assigned is started, the GPU becomes unusable and all devices stop working.

Resolution

  • As a workaround, reboot the hypervisor, then set the reset_method of the device to bus and confirm the new value:
# echo bus > /sys/bus/pci/devices/<DEVICE PCI ADDRESS HERE>/reset_method
# cat /sys/bus/pci/devices/<DEVICE PCI ADDRESS HERE>/reset_method
bus
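
The two commands above can be wrapped in a small helper. This is a minimal sketch, not a supported tool: the function name set_bus_reset and the SYSFS_ROOT variable are hypothetical and exist only for illustration. Values written to sysfs do not survive a reboot, so the setting presumably has to be applied again after each hypervisor boot, before the Guest is started.

```shell
#!/bin/sh
# Sketch only: set_bus_reset and SYSFS_ROOT are hypothetical names,
# not part of any Red Hat or NVIDIA tool.
# SYSFS_ROOT defaults to /sys; it can be overridden so the function
# can be tried against a scratch directory first.
set_bus_reset() {
    addr="$1"    # full PCI address in domain:bus:device.function form
    attr="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/reset_method"

    # Stop early if the device or the reset_method attribute is absent.
    if [ ! -w "$attr" ]; then
        echo "reset_method is missing or not writable for ${addr}" >&2
        return 1
    fi

    # Same write as the manual workaround.
    echo bus > "$attr" || return 1

    # Read the value back to confirm that it was accepted.
    cat "$attr"
}
```

Run as root, calling set_bus_reset with the PCI address of the GPU should print bus when the write succeeds, matching the output of the manual check.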

Root Cause

Diagnostic Steps

  • On the hypervisor, the following errors appear in dmesg:
[16572.996687] Uhhuh. NMI received for unknown reason 2d on CPU 45.
[16572.996689] Uhhuh. NMI received for unknown reason 2d on CPU 14.
[16572.996691] Uhhuh. NMI received for unknown reason 2d on CPU 15.
[16572.996693] Do you have a strange power saving mode enabled?
[16572.996694] Do you have a strange power saving mode enabled?
[16572.996693] Uhhuh. NMI received for unknown reason 2d on CPU 12.
[16572.996694] Dazed and confused, but trying to continue
[16572.996694] Do you have a strange power saving mode enabled?
[16572.996695] Dazed and confused, but trying to continue
[16572.996696] Dazed and confused, but trying to continue
[16572.996696] Uhhuh. NMI received for unknown reason 2d on CPU 13.
[16572.996696] Do you have a strange power saving mode enabled?
[16572.996697] Dazed and confused, but trying to continue
[16572.996698] Do you have a strange power saving mode enabled?
[16572.996698] Uhhuh. NMI received for unknown reason 2d on CPU 46.
[16572.996698] Dazed and confused, but trying to continue
[16572.996699] Uhhuh. NMI received for unknown reason 2d on CPU 44.
[16572.996700] Do you have a strange power saving mode enabled?
[16572.996700] Dazed and confused, but trying to continue
[16572.996700] Do you have a strange power saving mode enabled?
[16572.996701] Dazed and confused, but trying to continue
[16572.996701] Uhhuh. NMI received for unknown reason 2d on CPU 47.
[16572.996703] Do you have a strange power saving mode enabled?
[16572.996704] Dazed and confused, but trying to continue
[16572.998195] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[16572.998198] {1}[Hardware Error]: event severity: recoverable
[16572.998200] {1}[Hardware Error]:  Error 0, type: fatal
[16572.998203] {1}[Hardware Error]:   section_type: PCIe error
[16572.998204] {1}[Hardware Error]:   port_type: 4, root port
[16572.998206] {1}[Hardware Error]:   version: 3.0
[16572.998207] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[16572.998208] {1}[Hardware Error]:   device_id: 0000:20:03.1
[16572.998209] {1}[Hardware Error]:   slot: 0
[16572.998209] {1}[Hardware Error]:   secondary_bus: 0x00
[16572.998210] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[16572.998210] {1}[Hardware Error]:   class_code: 000000
[16572.998211] {1}[Hardware Error]:   aer_uncor_status: 0x00004000, aer_uncor_mask: 0x07b10000
[16572.998212] {1}[Hardware Error]:   aer_uncor_severity: 0x004ef030
[16572.998212] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[16572.998250] pcieport 0000:20:03.1: AER: aer_status: 0x00004000, aer_mask: 0x07b10000
[16572.998253] pcieport 0000:20:03.1:    [14] CmpltTO                (First)
[16572.998255] pcieport 0000:20:03.1: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[16572.998256] pcieport 0000:20:03.1: AER: aer_uncor_severity: 0x004ef030
[16572.998262] pci 0000:25:00.0: AER: can't recover (no error_detected callback)
[16572.998310] pci 0000:29:00.0: AER: can't recover (no error_detected callback)
[16572.998311] pci 0000:2b:00.0: AER: can't recover (no error_detected callback)
[16573.217779] pcieport 0000:20:03.1: AER: Root Port link has been reset (0)
[16573.217850] pcieport 0000:20:03.1: AER: device recovery failed
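
To check a hypervisor for the same signatures, the kernel log can be filtered for the message types shown above. The following is a minimal sketch; the function names filter_gpu_reset_errors and aer_bit_set are hypothetical, and the patterns are taken only from the log excerpt in this article, so they may also match unrelated hardware errors.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Read kernel log text on stdin and keep only lines matching the
# signatures seen in this article (unknown NMIs, APEI hardware
# errors, AER messages, Completion Timeout).
filter_gpu_reset_errors() {
    grep -E 'NMI received for unknown reason|\[Hardware Error\]|AER:|CmpltTO'
}

# Succeed if bit $2 is set in the status word $1.
# In the excerpt, aer_uncor_status 0x00004000 has bit 14 set,
# which the kernel reports as "[14] CmpltTO".
aer_bit_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Typical use on the hypervisor (needs permission to read the kernel log):
#   dmesg | filter_gpu_reset_errors
#   aer_bit_set 0x00004000 14 && echo "Completion Timeout reported"
```

Matching lines alone do not prove this exact issue; they only indicate that the log resembles the excerpt above.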

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.