Seeing "Temperature above threshold" or "Core power limit notification" in /var/log/messages

Environment

  • Red Hat Enterprise Linux 8
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 4

Issue

  • Warnings similar to the following are logged in /var/log/messages:
kernel: CPU16: Temperature above threshold, cpu clock throttled

or

kernel: CPU9: Core power limit notification (total events = 1)
kernel: CPU0: Core power limit notification (total events = 1)
CPU6: Core power limit notification (total events = 1)
[...]
kernel: CPU0: Core power limit normal
kernel: CPU6: Core power limit normal
kernel: CPU9: Core power limit normal

or

mcelog: Processor 13 heated above trip temperature. Throttling enabled.
mcelog: Please check your system cooling. Performance will be impacted

or

server1 cmaeventd[xxxx]: Controller temperature limit reached, Controller=3 Sensor=0:1 Temp=104 Limit=10
server1 hp-ams[xxxx]: WARNING: System Overheating (Temperature Sensor 31, Location I/O Board, Temperature 101C)
server1 hp-ams[xxxx]: CRITICAL: Automatic Operating System Shutdown Initiated Due to Overheat Condition
server1 hpasmlited[xxxx]: WARNING: System Overheating (Temperature Sensor 31, Location I/O Board, Temperature 101C)
server1 hpasmlited[xxxx]: A System Reboot has been requested by the management processor in 60 seconds.
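
To see how often and on which CPUs these messages occur, the log can be scanned for them. The following is a minimal sketch, not a supported tool: the patterns are copied from the sample messages above and may not match the wording used by every kernel or mcelog version, and the function names are hypothetical.

```python
import re
from collections import Counter

# Patterns derived from the sample messages above (assumption: other
# kernel or mcelog versions may word these messages differently).
PATTERNS = {
    "thermal_throttle": re.compile(r"CPU(\d+): Temperature above threshold"),
    "power_limit": re.compile(r"CPU(\d+): Core power limit notification"),
    "mcelog_trip": re.compile(r"mcelog: Processor (\d+) heated above trip temperature"),
}


def count_thermal_events(lines):
    """Return a Counter keyed by (event kind, CPU number)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, int(match.group(1)))] += 1
                break  # a line is counted for at most one kind
    return counts


def count_thermal_events_in_file(path="/var/log/messages"):
    """Scan a log file; reading /var/log/messages usually requires root."""
    with open(path, errors="replace") as log:
        return count_thermal_events(log)
```

Events concentrated on a single CPU or socket may point to a different cause than events spread evenly across all CPUs, which is useful information to pass to the hardware vendor.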

Resolution

Root Cause

  • The underlying hardware detects what it considers an excessive temperature and throttles the CPU. It reports these events to the kernel through interrupts. For both the temperature and the power-limit messages, the kernel prints the message after receiving the associated hardware interrupt. The kernel takes no further action aside from printing these messages and incrementing the related sysfs counters. Any throttling or power limiting is done by the hardware.

  • The cause can range from actual overheating (failing fans, clogged air intakes, etc.) to faulty onboard sensors (false alarms), buggy firmware or BIOS, or thresholds that are configured unreasonably low.

  • It is essential to determine which of these is the cause before masking the problem (for instance, by disabling C-states), to avoid damage to the machine.

  • Previously, power-limit notification interrupts were enabled by default on affected kernels. A patch now disables these interrupts by default and adds a kernel command-line parameter, "int_pln_enable", which re-enables them for users who want to observe these events through the existing system counters. Power-limit notification messages are also no longer displayed on the console, and the affected platforms no longer suffer degraded system performance from this problem.

    This change was made in the upstream Linux kernel with the following commit.

    https://bugzilla.kernel.org/show_bug.cgi?id=36182
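
The sysfs counters and the "int_pln_enable" parameter mentioned above can be inspected directly. The following is a minimal sketch under stated assumptions: it assumes the counters live under /sys/devices/system/cpu/cpuN/thermal_throttle/ with the file names listed in COUNTER_FILES, which may not all be present on every kernel or CPU model, and the function names are hypothetical.

```python
from pathlib import Path

# Counter file names assumed for Intel systems; any that are missing on a
# given kernel or CPU are silently skipped.
COUNTER_FILES = (
    "core_throttle_count",
    "package_throttle_count",
    "core_power_limit_count",
    "package_power_limit_count",
)


def read_throttle_counters(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu name: {counter name: value}} for CPUs exposing counters."""
    results = {}
    for cpu_dir in Path(cpu_root).glob("cpu[0-9]*"):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue
        counters = {}
        for name in COUNTER_FILES:
            try:
                counters[name] = int((throttle_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter not available on this system
        results[cpu_dir.name] = counters
    return results


def pln_interrupts_requested(cmdline_path="/proc/cmdline"):
    """Return True if int_pln_enable was passed on the kernel command line."""
    return "int_pln_enable" in Path(cmdline_path).read_text().split()
```

Sampling read_throttle_counters() twice, some minutes apart, shows whether throttling is still occurring, even on kernels that no longer print the power-limit messages.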

Diagnostic Steps

  1. Check the state of the machine when this happens. Most server-grade hardware supports IPMI (see the OpenIPMI tools). If needed, ask the hardware vendor how to monitor sensor data on that particular machine under Linux.

  2. Based on the data above, determine whether the alarm is justified. Ask the hardware vendor whether the values are normal for the given workload.

  3. Continue depending on the findings above; in most cases the hardware vendor will assist.

  • In one particular case, the problem was solved by changing certain hardware components. The final resolution can range from "works as designed, nothing wrong" to "need to replace fan/mainboard/datacenter cooling/etc."
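
For step 1, temperature sensor data can be collected over IPMI. The following is a minimal sketch, assuming the ipmitool utility is installed and the IPMI kernel driver is loaded. It also assumes that "ipmitool sdr type Temperature" prints pipe-separated rows whose fifth column is a reading such as "20 degrees C"; the exact layout and sensor names vary by vendor, and the function names are hypothetical.

```python
import subprocess


def parse_sdr_temperatures(output):
    """Parse sensor rows into {sensor name: degrees Celsius}.

    Rows without a numeric Celsius reading (for example "No Reading")
    are skipped.
    """
    readings = {}
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row in the assumed layout
        name, reading = fields[0], fields[4]
        if not reading.endswith("degrees C"):
            continue
        try:
            readings[name] = float(reading.split(" ", 1)[0])
        except ValueError:
            continue
    return readings


def read_ipmi_temperatures():
    """Run ipmitool (usually requires root) and parse its output."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_sdr_temperatures(result.stdout)
```

Logging the result of read_ipmi_temperatures() at regular intervals gives a record that can be compared against the timestamps of the kernel messages and shared with the hardware vendor.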

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.