Seeing "Temperature above threshold" or "Core power limit notification" in /var/log/messages
Environment
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 5
- Red Hat Enterprise Linux 4
Issue
- Warnings similar to the following are being logged in /var/log/messages
kernel: CPU16: Temperature above threshold, cpu clock throttled
or
kernel: CPU9: Core power limit notification (total events = 1)
kernel: CPU0: Core power limit notification (total events = 1)
CPU6: Core power limit notification (total events = 1)
[...]
kernel: CPU0: Core power limit normal
kernel: CPU6: Core power limit normal
kernel: CPU9: Core power limit normal
or
mcelog: Processor 13 heated above trip temperature. Throttling enabled.
mcelog: Please check your system cooling. Performance will be impacted
or
server1 cmaeventd[xxxx]: Controller temperature limit reached, Controller=3 Sensor=0:1 Temp=104 Limit=10
server1 hp-ams[xxxx]: WARNING: System Overheating (Temperature Sensor 31, Location I/O Board, Temperature 101C)
server1 hp-ams[xxxx]: CRITICAL: Automatic Operating System Shutdown Initiated Due to Overheat Condition
server1 hpasmlited[xxxx]: WARNING: System Overheating (Temperature Sensor 31, Location I/O Board, Temperature 101C)
server1 hpasmlited[xxxx]: A System Reboot has been requested by the management processor in 60 seconds.
Resolution
- The message "Core power limit notification" is safe to ignore in most situations and has been completely removed starting with the following versions of Red Hat Enterprise Linux (tracked via private RHBZ#908990):
- RHEL6: kernel-2.6.32-407.el6 (RHBZ#908990) or later
- RHEL6.4.z: kernel-2.6.32-358.20.1.el6 (RHBZ#999328) or later
- RHEL6.3.z: kernel-2.6.32-279.39.1.el6 (RHBZ#1020527) or later
- RHEL6.2.z: kernel-2.6.32-220.44.1.el6 (RHBZ#1020519) or later
- The message "Temperature above threshold, cpu clock throttled" is also safe to ignore when it occurs only for short periods of time.
- Neither of these messages is necessarily associated with a performance issue. If you are facing a performance issue and seeing these messages, you should gather more information about the situation.
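As a rough aid for the version list above, the following Python sketch checks whether a RHEL 6 kernel release string is at or beyond one of the listed builds. Only the build numbers come from this article; the function names and the simplified comparison (numeric fields of the release only, not full RPM version ordering) are assumptions made for illustration.

```python
import re

# Fixed builds listed in this article, keyed by the first number of the
# release field (the z-stream base). 407 is the main RHEL 6 build.
ZSTREAM_FIXES = {
    358: (358, 20, 1),  # RHEL 6.4.z
    279: (279, 39, 1),  # RHEL 6.3.z
    220: (220, 44, 1),  # RHEL 6.2.z
}
MAINLINE_FIX = 407


def release_tuple(kernel_release):
    """Turn '2.6.32-358.20.1.el6.x86_64' into (358, 20, 1)."""
    match = re.match(r"2\.6\.32-([0-9.]+?)\.el6", kernel_release)
    if match is None:
        raise ValueError("not a RHEL 6 kernel release: %r" % kernel_release)
    return tuple(int(part) for part in match.group(1).split("."))


def power_limit_message_removed(kernel_release):
    """Return True if the release is at or past a build listed above.

    Simplified check (an assumption of this sketch): compare only the
    numeric release fields, within the same z-stream base.
    """
    rel = release_tuple(kernel_release)
    if rel[0] >= MAINLINE_FIX:
        return True
    fix = ZSTREAM_FIXES.get(rel[0])
    return fix is not None and rel >= fix
```

On a RHEL 6 system the running kernel could be checked with `power_limit_message_removed(platform.release())` after `import platform`; on any other release string the sketch raises `ValueError` rather than guessing.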
Root Cause
- The underlying hardware thinks the temperature is getting too high and throttles the CPU. It announces these events to the kernel through interrupts. In both cases, the kernel prints these messages after receiving the associated hardware interrupts. The kernel takes no further action aside from printing these messages and incrementing the related sysfs counters. Any throttling or power limiting taking place is done by the hardware.
- The cause could be anything from actual overheating (failing fans, clogged air intakes, etc.) to faulty onboard sensors (false alarms), buggy firmware/BIOS, or unreasonable (too low) configured thresholds.
- It is essential to determine which of these it is before masking the problem (by disabling C-states, for instance), to avoid damage to the machine.
- Previously, power-limit notification interrupts were enabled by default on the affected kernel. A patch has been applied to disable power-limit notification interrupts by default, and a new kernel command line parameter "int_pln_enable" has been added to allow users to observe these events using the existing system counters. Power-limit notification messages are also no longer displayed on the console. The affected platforms no longer suffer from degraded system performance due to this problem.
This change was made in the upstream Linux kernel; see the following bug report.
https://bugzilla.kernel.org/show_bug.cgi?id=36182
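The sysfs counters mentioned above can be read without waiting for log messages. The Python sketch below collects them per CPU. It assumes the counters live under /sys/devices/system/cpu/cpuN/thermal_throttle/ in files ending in "_count" (for example core_throttle_count); that location and naming are not stated in this article and should be verified on the system in question. The function name is hypothetical.

```python
import glob
import os


def read_throttle_counters(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: int}} for CPUs exposing counters.

    The directory layout is an assumption of this sketch; CPUs without a
    thermal_throttle directory are simply absent from the result.
    """
    counters = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "thermal_throttle", "*_count")
    for path in sorted(glob.glob(pattern)):
        cpu = os.path.basename(os.path.dirname(os.path.dirname(path)))
        name = os.path.basename(path)
        try:
            with open(path) as handle:
                value = int(handle.read().strip())
        except (IOError, OSError, ValueError):
            continue  # unreadable file or non-numeric content: skip it
        counters.setdefault(cpu, {})[name] = value
    return counters


if __name__ == "__main__":
    for cpu, values in sorted(read_throttle_counters().items()):
        print(cpu, values)
```

Sampling these values twice, some minutes apart, shows whether events are still occurring; a counter that is non-zero but no longer increasing points to a past, short-lived event.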
Diagnostic Steps
- Check the state of the machine when this happens. Most server-grade hardware supports IPMI (see the OpenIPMI tools); if needed, ask the hardware vendor how to monitor sensor data on that particular machine under Linux.
- Based on the data above, determine whether the alarm is justified. Ask the hardware vendor if the values are normal with the given workload.
- Continue depending on the findings above; in most cases the hardware vendor will assist.
- In one particular case, the problem was solved by changing certain hardware components. The final resolution can range from "works as designed, nothing wrong" to "need to replace fan/mainboard/datacenter cooling/etc."
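When discussing the findings with the hardware vendor, it can help to know which CPUs report the events and how often. The Python sketch below counts the two kernel messages shown in the Issue section per CPU. The regular expressions are derived only from the sample lines in this article, and the function and dictionary names are hypothetical; other message variants would need their own patterns.

```python
import re
from collections import Counter

# Patterns taken from the sample log lines in this article.
PATTERNS = {
    "thermal_throttle": re.compile(r"CPU(\d+): Temperature above threshold"),
    "power_limit": re.compile(r"CPU(\d+): Core power limit notification"),
}


def count_events(lines):
    """Count matching messages per (event kind, CPU number)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, int(match.group(1)))] += 1
    return counts


if __name__ == "__main__":
    # Reading /var/log/messages usually requires root privileges.
    with open("/var/log/messages") as log:
        for (kind, cpu), total in sorted(count_events(log).items()):
            print("%s CPU%d: %d" % (kind, cpu, total))
```

Events concentrated on a single CPU or socket suggest a different cause than events spread evenly across all CPUs, which is useful context alongside the sensor readings.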