What are CPU "C-states" and how to disable them if needed?
Environment
- Red Hat Enterprise Linux 9
- Red Hat Enterprise Linux 8
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 5
Issue
- What are C-states, cstates, or C-modes?
- How can I disable processor sleep states?
- How can I prevent the kernel from overriding the BIOS C-state option?
- Is intel_idle.max_cstate=0 required to disable CPU C-states?
Resolution
To limit a CPU to a certain C-state, pass the processor.max_cstate=X option on the kernel line of /boot/grub/grub.conf in RHEL 6 and below, or on the GRUB_CMDLINE_LINUX line of /etc/sysconfig/grub in RHEL 7, RHEL 8, and RHEL 9. In RHEL 7, RHEL 8, and RHEL 9, remember to propagate the change to the boot configuration with grub2-mkconfig, then reboot for the new limit to take effect.
Here we limit the system to only C-State 1:
kernel /vmlinuz-2.6.18-371.1.2.el5 ... processor.max_cstate=1
On some systems, the kernel can override the BIOS setting, and the parameter intel_idle.max_cstate=0 may be required to ensure sleep states are not entered:
kernel /vmlinuz-2.6.32-431.el6.x86_64 ... processor.max_cstate=1 intel_idle.max_cstate=0
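For RHEL 7 and later, the GRUB_CMDLINE_LINUX edit described above can be sketched as follows. This is a minimal sketch, not an official procedure: add_cmdline_args is a hypothetical helper name, and the grub.cfg output path shown in the comments assumes a BIOS-based system (UEFI systems use a different path).

```shell
#!/bin/bash
# add_cmdline_args is a hypothetical helper used for illustration only.
# It appends kernel arguments just before the closing quote of the
# GRUB_CMDLINE_LINUX line in the given file.
add_cmdline_args() {
    local file=$1 args=$2
    sed -i "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"\$/\1 ${args}\"/" "$file"
}

# Example usage (as root; the paths below are assumptions for a BIOS system):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_cmdline_args /etc/default/grub "processor.max_cstate=1 intel_idle.max_cstate=0"
#   grub2-mkconfig -o /boot/grub2/grub.cfg
#   reboot
#
# Alternatively, grubby can add the arguments to all installed kernels:
#   grubby --update-kernel=ALL --args="processor.max_cstate=1 intel_idle.max_cstate=0"
```

Keeping a backup copy of the file before editing it makes it easy to revert if the resulting line is not what was intended.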
The maximum allowed CPU C-State can be checked by verifying the presence of max_cstate boot parameters within the /proc/cmdline file:
# grep max_cstate /proc/cmdline
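To see each limit on its own line rather than the whole command line, the check above can be wrapped in a small function. This is only a sketch; show_max_cstate is a hypothetical helper name, and the optional file argument exists purely so that the function can be tried against a saved copy of a command line.

```shell
#!/bin/bash
# show_max_cstate is a hypothetical helper used for illustration only.
# It prints every *max_cstate= parameter found in a kernel command line,
# one per line, or a short message when none is present.
show_max_cstate() {
    local cmdline_file=${1:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline_file" | grep 'max_cstate=' \
        || echo "no max_cstate parameter on the kernel command line"
}

# Inspect the running kernel's command line.
show_max_cstate
```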
Why the OS might ignore BIOS settings
The OS might ignore BIOS settings depending on which idle driver is in use. With intel_idle (the default on Intel machines), the OS can ignore ACPI and BIOS settings, i.e. the driver can re-enable the C-states. If intel_idle is disabled and the older acpi_idle driver is used instead, the OS should follow the BIOS settings. The intel_idle driver can be disabled by:
- passing intel_idle.max_cstate=0 on the kernel boot command line, or
- passing idle=* (where * can be e.g. poll, i.e. idle=poll)
In that case, intel_idle will not load during boot and the machine should use the acpi_idle driver. The driver in use can be checked with:
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
Moreover, if acpi_idle is in use, it enforces C1 even if C0 is requested. See below.
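Besides the driver name, the idle states that the active driver actually exposes can be read from the per-CPU cpuidle directory in sysfs. The following is a minimal sketch; list_idle_states is a hypothetical helper name, and cpu0 is assumed to be representative of the other CPUs.

```shell
#!/bin/bash
# list_idle_states is a hypothetical helper used for illustration only.
# It prints the name and exit latency of each idle state exposed for one CPU.
list_idle_states() {
    local base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    local dir
    for dir in "$base"/state*; do
        # Skip silently when no cpuidle states are exposed.
        [ -r "$dir/name" ] || continue
        printf '%s %s latency=%sus\n' \
            "${dir##*/}" "$(cat "$dir/name")" "$(cat "$dir/latency")"
    done
}

# Show the idle states of cpu0 on the running system.
list_idle_states
```

If the list still contains states deeper than the configured limit after a reboot, that suggests the limit was not applied by the driver in use. Where the cpupower utility is installed, `cpupower idle-info` reports similar information.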
NOTE: Passing idle=poll or idle=halt will also disable acpi_idle. idle=poll uses a polling idle loop that just executes nop instructions and locks the CPUs in C0. idle=halt uses the hlt instruction to limit the CPUs to C1.
CAUTION: Do not lock the CPUs to the C0 (POLL) state, which is equivalent to idle=poll. Locking all CPUs to C0 has negative consequences, such as higher power consumption and increased heat generation, among others. Locking all CPUs to C0 might void the CPU warranty. Please check with the respective hardware vendor.
About the meaning of processor.max_cstate and intel_idle.max_cstate
To understand these two parameters better, refer to the kernel.org documentation and the kernel source code.
processor.max_cstate=1 intel_idle.max_cstate=0
kernel-debuginfo-common-x86_64
https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
(...)
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and fall back on acpi_idle.
1 to 9 specify maximum depth of C-state.
(...)
processor.max_cstate= [HW,ACPI]
Limit processor to maximum C-state
max_cstate=9 overrides any DMI blacklist limit.
Note that intel_idle.max_cstate=0 disables intel_idle and lets acpi_idle (processor.max_cstate) take over.
vim /usr/src/debug/kernel-3.10.0-693.19.1.el7/linux-3.10.0-693.19.1.el7.x86_64/drivers/idle/intel_idle.c
/*
 * intel_idle_probe()
 */
static int __init intel_idle_probe(void)
{
        unsigned int eax, ebx, ecx;
        const struct x86_cpu_id *id;

        if (max_cstate == 0) {
                pr_debug(PREFIX "disabled\n");
                return -EPERM;
        }
(...)
Next, looking at the acpi processor_idle code:
vim /usr/src/debug/kernel-3.10.0-693.19.1.el7/linux-3.10.0-693.19.1.el7.x86_64/drivers/acpi/processor_idle.c
        if (max_cstate == 0)
                max_cstate = 1;
(...)
Because acpi_idle raises a max_cstate of 0 to 1, processor.max_cstate=0 intel_idle.max_cstate=0 and processor.max_cstate=1 intel_idle.max_cstate=0 are exactly equivalent.
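The clamping done by acpi_idle can be modelled in a few lines of shell. This is only an illustrative model of the processor_idle.c excerpt, not kernel code; effective_acpi_limit is a hypothetical name, and it assumes that intel_idle has already been disabled with intel_idle.max_cstate=0.

```shell
#!/bin/bash
# effective_acpi_limit is a hypothetical helper used for illustration only.
# It mirrors the acpi_idle check: a requested max_cstate of 0 is raised to 1,
# any other value is kept as requested.
effective_acpi_limit() {
    local requested=$1
    if [ "$requested" -eq 0 ]; then
        requested=1
    fi
    echo "$requested"
}

effective_acpi_limit 0   # prints 1
effective_acpi_limit 1   # prints 1
```

Both calls print the same value, which is why the two parameter combinations behave identically.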
Note: When the intel_idle driver is in use, some C-states such as C1E can be enabled even if they are disabled in the BIOS.
Using tuned to set CPU C-States
Using tuned profiles gives improved flexibility and removes the need for a restart. In Red Hat Enterprise Linux, it is possible to set C-States by simply changing a tuned profile. For more details about this option, see:
- How do I create my own tuned profile on RHEL6?
- How to lock all CPUs to C-State 0 with the cpu-partitioning profile in Red Hat Enterprise Linux and Red Hat OpenStack Platform?
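As a rough illustration of the approach, a custom tuned profile can request a low wake-up latency through the force_latency option of the cpu plugin, which in turn keeps the CPUs out of deeper C-states. The sketch below is an assumption-laden example, not a recommended configuration: the profile name limit-cstates, the helper write_cstate_profile, the parent profile, and the latency value of 1 microsecond are all chosen for illustration. The linked articles describe the supported procedure.

```shell
#!/bin/bash
# write_cstate_profile is a hypothetical helper used for illustration only.
# It writes a tuned.conf for a custom profile into the given directory.
write_cstate_profile() {
    local profile_dir=$1
    mkdir -p "$profile_dir"
    cat > "$profile_dir/tuned.conf" <<'EOF'
[main]
summary=Hypothetical profile that limits CPU wake-up latency
include=throughput-performance

[cpu]
# Request a maximum wake-up latency of 1 microsecond
force_latency=1
EOF
}

# Example usage (as root):
#   write_cstate_profile /etc/tuned/limit-cstates
#   tuned-adm profile limit-cstates
#   tuned-adm active
```

Switching back to the previous profile with tuned-adm removes the latency request without a reboot.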
Root Cause
To save energy when the CPU is idle, the CPU can be commanded to enter a low-power mode. Each CPU has several power modes, which are collectively called “C-states” or “C-modes”.
The first low-power mode was introduced with the 486DX4 processor. Since then, more power modes have been introduced, and each mode has been enhanced so that the CPU consumes less power while in it. The idea behind these modes is to cut the clock signal and power from idle units inside the CPU. The more units are stopped (by cutting the clock, reducing the voltage, or shutting them down completely), the more energy is saved. On the other hand, the CPU then needs more time to “wake up” and become fully operational again.
These modes are known as C-states. They start at C0, which is the normal CPU operating mode, i.e., the CPU is 100% turned on. The higher the C number, the deeper the CPU sleep mode: more circuits and signals are turned off, and the CPU needs more time to return to C0, i.e., to wake up. Each mode is also known by a name, and several of them have sub-modes with different levels of power saving and therefore different wake-up times.
| Mode | Name | What it does | CPUs |
|---|---|---|---|
| C0 | Operating State | CPU fully turned on | All CPUs |
| C1 | Halt | Stops CPU main internal clocks via software; bus interface unit and APIC are kept running at full speed | 486DX4 and above |
| C1E | Enhanced Halt | Stops CPU main internal clocks via software and reduces CPU voltage; bus interface unit and APIC are kept running at full speed | All socket 775 CPUs |
| C1E | -- | Stops all CPU internal clocks | Turion 64, 65-nm Athlon X2 and Phenom CPUs |
| C2 | Stop Grant | Stops CPU main internal clocks via hardware; bus interface unit and APIC are kept running at full speed | 486DX4 and above |
| C2 | Stop Clock | Stops CPU internal and external clocks via hardware | Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III |
| C2E | Extended Stop Grant | Stops CPU main internal clocks via hardware and reduces CPU voltage; bus interface unit and APIC are kept running at full speed | Core 2 Duo and above (Intel only) |
| C3 | Sleep | Stops all CPU internal clocks | Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000 |
| C3 | Deep Sleep | Stops all CPU internal and external clocks | Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64 |
| C3 | AltVID | Stops all CPU internal clocks and reduces CPU voltage | AMD Turion 64 |
| C4 | Deeper Sleep | Reduces CPU voltage | Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64 |
| C4E/C5 | Enhanced Deeper Sleep | Reduces CPU voltage even more and turns off the memory cache | Core Solo, Core Duo and 45-nm mobile Core 2 Duo only |
| C6 | Deep Power Down | Reduces the CPU internal voltage to any value, including 0 V | 45-nm mobile Core 2 Duo only |
| C7 | Deep Energy Saving | The CPU tries to flush its L3 cache. If the L3 cache is able to be entirely cleared, the CPU cuts its power to save energy. The power from the system agent is removed too. | — |
| C7s | — | When an MWAIT(C7) command is issued with a C7s sub-state hint, the entire L3 cache is flushed in one step as opposed to flushing the L3 cache in multiple steps. This also allows the system to send I/O devices to low power mode to reduce unnecessary power consumption when the system idles down. | — |
| C8 | — | The L3 cache is flushed in a single step. The power to the PLL is cut. | — |
| C9 | — | The VCCIN (VCC Input Voltage) gets lowered to a minimum. | — |
| C10 | — | The single phase core management system, VR12.6, goes into a low-power state. The CPU is almost shut down. | — |
Diagnostic Steps
If a maintenance window for trying to disable C-states is not available, we advise asking your hardware vendor about the power-saving configuration in the BIOS. The following should be checked:
- Does the current BIOS have a "power savings" feature?
- Is there any known bug related to the "power savings" feature that affects your BIOS version?
- Is the "power savings" feature enabled? If yes, which mode is set?