"Kernel panic - not syncing: Fatal Machine check or Machine Check Exception (MCE)" in /var/log/messages
Environment
- Red Hat Enterprise Linux
Issue
- System hangs or kernel panics with an MCE (Machine Check Exception) in the /var/log/messages file.
- System is hung or not responding. The messages captured on the netdump server include "Kernel panic - not syncing: Machine check".
- "Kernel panic - not syncing: Uncorrected machine check"
- System reported hardware error like faulty DIMM or temperature warning before hanging
- System rebooted due to Machine Check Exception and a vmcore was collected.
- "Kernel panic - not syncing: Fatal machine check on current CPU"
Kernel panic - not syncing: Fatal Machine check
Pid: 0, comm: swapper Tainted: G M ---------------- 2.6.32-220.el6.x86_64 #1
Call Trace:
<#MC> [<ffffffff814ec341>] ? panic+0x78/0x143
[<ffffffff81021d7f>] ? mce_panic+0x21f/0x240
[<ffffffff81023638>] ? do_machine_check+0xa18/0xa60
[<ffffffff812c4a41>] ? intel_idle+0xb1/0x170
[<ffffffff814ef86c>] ? machine_check+0x1c/0x30
[<ffffffff812c4a41>] ? intel_idle+0xb1/0x170
<<EOE>> [<ffffffff81095d98>] ? hrtimer_start+0x18/0x20
[<ffffffff813f9f67>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5f43>] ? start_secondary+0x202/0x245
NODENAME: hostname
RELEASE: 2.6.32-573.3.1.el6.x86_64
VERSION: #1 SMP Mon Aug 10 09:44:54 EDT 2015
MACHINE: x86_64 (2266 Mhz)
MEMORY: 24 GB
PANIC: "Kernel panic - not syncing: Fatal machine check on current CPU" << panic message
PID: 0
COMMAND: "swapper"
TASK: ffff88037b318040 (1 of 12) [THREAD_INFO: ffff88037b324000]
CPU: 11
STATE: TASK_RUNNING (PANIC)
- /var/log/messages or /var/log/mcelog contain the following messages:
kernel: Machine check events logged
mcelog: MCE 0
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
mcelog: CPU 0 BANK 8 TSC a66b05434fcf4 [at 2668 Mhz 12 days 16:48:42 uptime (unreliable)]
mcelog: MISC 5522140800080282 ADDR 4f83b8dc0
mcelog: MCG status:
mcelog: MCi status:
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
mcelog: Transaction: Memory read error
mcelog: STATUS 8c0000400001009f MCGSTATUS 0
kernel: BUG: soft lockup - CPU#10 stuck for 10s! [mcelog:6356]
- Other similar errors:
Hardware event. This is not a software error.
Corrected error
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory DIMM ID of error: 1
Memory channel ID of error: 2
Hardware event. This is not a software error.
Sometimes there are traces in /var/log/messages:
Jan 8 08:30:27 Hostname kernel: Pid: 30350, comm: rgmanager Tainted: G W --------------- 2.6.32-358.el6.x86_64 #1 Dell Inc. PowerEdge R910/0NCWG9
Jan 8 08:30:27 Hostname kernel: RIP: 0010:[<ffffffff8150ffce>] [<ffffffff8150ffce>] _spin_lock+0x1e/0x30
Jan 8 08:30:27 Hostname kernel: RSP: 0018:ffff8820c05cdd10 EFLAGS: 00000283
Jan 8 08:30:27 Hostname kernel: RAX: 0000000000003964 RBX: ffff8820c05cdd10 RCX: 0000000000000000
Jan 8 08:30:27 Hostname kernel: RDX: 000000000000395f RSI: 000000000000001b RDI: ffffffff81e227e8
Jan 8 08:30:27 Hostname kernel: RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Jan 8 08:30:27 Hostname kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8810685602d8
Jan 8 08:30:27 Hostname kernel: R13: 0000000000000000 R14: ffff883080010e40 R15: 0000000000000000
Jan 8 08:30:27 Hostname kernel: FS: 00007f3e81a20700(0000) GS:ffff8830b8880000(0000) knlGS:0000000000000000
Jan 8 08:30:27 Hostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 8 08:30:27 Hostname kernel: CR2: 00000000027477b0 CR3: 00000010671a6000 CR4: 00000000000007e0
Jan 8 08:30:27 Hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 8 08:30:27 Hostname kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 8 08:30:27 Hostname kernel: Process rgmanager (pid: 30350, threadinfo ffff8820c05cc000, task ffff8820be965540)
Jan 8 08:30:27 Hostname kernel: Stack:
Jan 8 08:30:27 Hostname kernel: ffff8820c05cdd40 ffffffff8104b8d0 ffff8820c05cdd60 ffff884068122400
Jan 8 08:30:27 Hostname kernel: <d> ffff883a408fc040 ffff883a408fc040 ffff8820c05cdd60 ffffffff8106b179
Jan 8 08:30:27 Hostname kernel: <d> ffff884068122400 ffff881066d31440 ffff8820c05cdde0 ffffffff8106b879
Jan 8 08:30:27 Hostname kernel: Call Trace:
Jan 8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan 8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan 8 08:30:27 Hostname kernel: Code: 00 00 00 01 74 05 e8 b2 33 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jan 8 08:30:27 Hostname kernel: Call Trace:
Jan 8 08:30:27 Hostname kernel: [<ffffffff8104b8d0>] ? pgd_alloc+0x50/0x130
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b179>] ? mm_init+0x139/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106b879>] ? dup_mm+0xa9/0x520
Jan 8 08:30:27 Hostname kernel: [<ffffffff81061d03>] ? sched_autogroup_fork+0x63/0xa0
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106cb6f>] ? copy_process+0xd5f/0x1450
Jan 8 08:30:27 Hostname kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109bfb4>] ? hrtimer_nanosleep+0xc4/0x180
Jan 8 08:30:27 Hostname kernel: [<ffffffff8109ae00>] ? hrtimer_wakeup+0x0/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Jan 8 08:30:27 Hostname kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Jan 8 08:30:39 Hostname kernel: BUG: soft lockup - CPU#3 stuck for 67s! [sshd:4711]
......
There may also be error records in /var/log/mcelog, such as the following:
MCE 0
CPU 2 BANK 9
TIME 1388666356 Thu Jan 2 20:39:16 2014
MCG status:
MCi status:
Uncorrected error
Error enabled
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS b00000000800009f MCGSTATUS 0
MCGCAP 1000c18 APICID 80 SOCKETID 2
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
.....
- A cron job running /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog results in the following recurring errors (an example cron entry is sketched after the log below):
TIME 1320670862 Mon Nov 7 14:01:02 2011
MCG status:
MCi status:
Corrected error
Error enabled
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
<16:2> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS 9800004000020e0f MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 46
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
MISC 1
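For reference, a cron entry along the following lines produces the log above; the five-minute schedule is illustrative, not taken from the affected system:
# /etc/cron.d/mcelog (hypothetical): poll /dev/mcelog every five minutes
# and append decoded events to /var/log/mcelog.
*/5 * * * * root /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog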
- Why do we see a lot of MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR in mcelog?
/var/log/messages and /var/log/mcelog contain messages similar to:
TIME 1336064652 Fri May 4 01:04:12 2012
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00002c0000a008f MCGSTATUS 0
MCGCAP 1000c18 APICID 40 SOCKETID 1
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 9
Resolution
- This is most likely a hardware issue; contact your hardware vendor to resolve it. If one or more messages appeared in the logs around the event, they should be given to the vendor, as they can help identify the issue.
- A Machine Check Exception (MCE) is an error that occurs when a computer's CPU detects a hardware problem. Typically, the impending hardware failure will cause the kernel to panic in order to protect against data corruption.
- This could potentially be a BIOS issue; updating the BIOS to the latest revision can resolve false Machine Check Event (MCE) positives.
- The mcelog tool can be run to generate a human-readable summary of the error in /var/log/mcelog.
- We suggest engaging the hardware vendor for further troubleshooting and diagnosis.
- Run a memory test by following "How to check if system RAM is faulty in Red Hat Enterprise Linux?"
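Before engaging the vendor, the evidence above can be gathered with a few commands. A minimal sketch, assuming the default log locations used in this article:
# Kernel-logged machine check lines around the event:
grep -i "machine check" /var/log/messages
# Decoded events previously written by the mcelog cron job or daemon:
cat /var/log/mcelog
# Decode any events still pending in /dev/mcelog (same flags as shown
# in the Issue section above):
/usr/sbin/mcelog --ignorenodev --filter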
Root Cause
The following list of possible root causes is not exhaustive, but likely to cover most cases:
- Faulty memory DIMM.
- Faulty memory controller (usually onboard).
- Faulty memory lines on motherboard.
- Faulty BIOS.
- Overheating system.
- RAM latent junction failure (static discharge from a user).
- Power supply issues or short circuits.
Information printed by the kernel (the line printed immediately before the panic message) comes from the hardware and should be provided to the hardware support person for analysis. It might look like this:
CPU 10: Machine Check Exception: 4 Bank 0: b200000410000800
[Hardware Error]: CPU xx: Machine Check Exception: 7 Bank 2: bd800xxxx100134
[Hardware Error]: RIP 10:<ffffffff9080f76e> {copy_user_enhanced_fast_string+0xe/0x40}
[Hardware Error]: TSC 88xxx576be2 ADDR 2ffxxx40 MISC 86 PPIN a7axxxx9c05c32
[Hardware Error]: PROCESSOR 0:5xxx4 TIME 1770171058 SOCKET x APIC d4 microcode 2007006
[Hardware Error]: Run the above through 'mcelog --ascii'
[Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[851402.672386] Kernel panic - not syncing: Fatal machine check
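To extract these hardware-supplied lines for the vendor, a simple grep over the syslog is usually enough; a sketch, assuming the default /var/log/messages location:
# Capture the lines immediately preceding the panic message:
grep -B 5 "Kernel panic - not syncing" /var/log/messages
# Newer kernels prefix the same data with "[Hardware Error]":
grep -F "Hardware Error" /var/log/messages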
Diagnostic Steps
- Restart the machine and access the BIOS configuration software; you may see evidence of failed hardware.
- You may see the error slightly modified depending on the BIOS and chipset supplier, e.g. "Hardware event. This is not a software error."
- The errors may appear in the mcelog only.
- Are all mcelog messages referring to the same piece of hardware, i.e. CPU 8 BANK 9? Do other hardware reports show anything unusual; for example, is a single CPU core being reported with a different clock speed? One way to check is sketched below.
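A minimal sketch, assuming the decoded events live in /var/log/mcelog as in the examples above:
# Count how often each CPU/BANK pair appears; one repeating pair points
# at a single physical component, while scattered pairs suggest a
# systemic problem such as power or overheating.
grep -E "CPU [0-9]+ BANK [0-9]+" /var/log/mcelog | sort | uniq -c | sort -rn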
- Look for the phrase "Machine Check Exception" in the log just before the panic message. If this message occurs, the rest of the panic message is of no interest.
- Vmcore analysis:
$ crash /path/to/2.6.18-128.1.6.el5/vmlinux vmcore
..
PANIC: "Kernel panic - not syncing: Uncorrected machine check"
..
crash> log
...
CPU 0: Machine Check Exception: 7 Bank 4: b40000000005001b
RIP 10:<ffffffff8006b2b0> {default_idle+0x29/0x50}
TSC bc34c6f78de8f ADDR 17fe30000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check
- Process the error through mcelog --ascii; specify --k8 for events from AMD processors and --p4 for a Pentium 4 or Xeon. The resulting information might be helpful to your hardware vendor.
$ cat > mcelog.txt
CPU 0: Machine Check Exception: 7 Bank 4: b40000000005001b
RIP 10:<ffffffff8006b2b0> {default_idle+0x29/0x50}
TSC bc34c6f78de8f ADDR 17fe30000
[ctrl]+[d]
$ mcelog --ascii --k8 < mcelog.txt
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC bc34c6f78de8f
RIP 10:ffffffff8006b2b0
Northbridge GART error
bit61 = error uncorrected
TLB error 'generic transaction, level generic'
STATUS b40000000005001b MCGSTATUS 7
RIP: default_idle+0x29/0x50
If we view the hexadecimal value of the status, we can more easily evaluate the hardware error.
crash> px mcelog | head -n 10 | grep status
status = 0xb200000000000005,
crash> eval -b 0xb200000000000005
hexadecimal: b200000000000005
decimal: 12826251738751172613 (-5620492334958379003)
octal: 1310000000000000000005
binary: 1011001000000000000000000000000000000000000000000000000000000101
bits set: 63 61 60 57 2 0
The bits that are set provide an overview of the processor state and indicate which of the other registers are meaningful:
63: VAL (Valid error report)
61: UC (Error not corrected)
60: EN (Error reporting enabled)
57: PCC (Processor state corrupted by error)
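The same decoding can be scripted; a minimal bash sketch, using the status value and bit labels shown above:
#!/bin/bash
# Decode the common MCi_STATUS flag bits of an MCE status value.
status=0xb200000000000005
check_bit() {                     # print the label if the given bit is set
    local bit=$1 label=$2
    if (( (status >> bit) & 1 )); then
        printf 'bit %2d set: %s\n' "$bit" "$label"
    fi
}
check_bit 63 "VAL (valid error report)"
check_bit 61 "UC (error not corrected)"
check_bit 60 "EN (error reporting enabled)"
check_bit 57 "PCC (processor state corrupted by error)"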