MCE Status Bits: Memory Controller Error
while under a certain CPU- or memory-intensive load, or even at random - most of the time throwing a Purple Screen of Death, so you at least have some notion of what went wrong. There
is a VMware KB article (1005184) concerning this issue, and it has been updated significantly since I started taking an interest in these errors. UPDATE: I have published a new CPU Stress Test & Machine Check Error debugging article
- check it out if you'd like to learn more. If you are "lucky", you can see and decode for yourself what preceded the crash. This is because both AMD and Intel CPUs implement something by the name of the Machine Check Architecture (MCA). This architecture enables the CPU to detect a fault that happens anywhere on the data transfer path during processor operation: memory operation errors, CPU bus interconnect errors, cache errors, and much more.

How do you determine what has been causing your system to fail? Read on. You will need to browse to Intel's website hosting the Intel® 64 and IA-32 Architectures Software Developer Manuals. There, download the manual named "Intel® 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 3A, 3B, and 3C: System Programming Guide". I highly recommend printing it, because you will be doing some back-and-forth seeking.

Now, to get a list of possible Machine Check Errors captured by the VMkernel, run the following in your SSH session with superuser privileges:

cd /var/log; grep MCE vmkernel.log

This will output something similar to the excerpt below. Most of the time, the VMkernel decodes these messages for you - here you can see that there are plenty of Memory Controller Read Errors. Looking more closely at where the problem originates:

CMCI: Stands for Corrected Machine Check Interrupt - an error was captured, but it was corrected and the VMkernel can keep running. If this were an uncorrectable error, the ESXi host would crash.

Logical CPU number where the MCE was detected: This particular host had dual 8-core Intel Xeon processors with Hyper-Threading enabled, i.e. 32 logical CPUs. Across all occurrences of this MCE, the cpu# alternated only between 0 and 15, which means the fault was always detected on the first physical CPU (logical CPUs 0-15 belong to the first socket).
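The socket arithmetic above can be sketched in a few lines of shell. The contiguous enumeration (logical CPUs 0-15 on socket 0, 16-31 on socket 1) is an assumption - firmware can enumerate logical CPUs differently - so treat this as illustrative only:

```shell
# Map a logical CPU number from an MCE line to a physical socket,
# assuming contiguous enumeration for dual 8-core Hyper-Threaded Xeons.
cores_per_socket=8
threads_per_core=2
logical_per_socket=$(( cores_per_socket * threads_per_core ))  # 16

cpu=13   # example logical CPU taken from an MCE log line
socket=$(( cpu / logical_per_socket ))
echo "logical CPU $cpu is on socket $socket"
```

Any cpu# between 0 and 15 lands on socket 0 under this assumption, which is why the alternating values in the log all point at the same physical processor.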
Memory Controller Read/Write/Scrubbing error on Channel x: The error was captured on a certain channel of the physical processor's memory controller (its NUMA node). Since this particular CPU uses a quad-channel memory controller, the channels range from 0 to 3.
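For memory controller errors, the transaction type and channel are encoded directly in the low 16 bits of the status value (compound error code of the form 000F 0000 1MMM CCCC per the Intel SDM's machine-check chapter). A small shell sketch, using the 0x...009f status from the log excerpts in this post - note that a channel value of 15 means the channel could not be identified:

```shell
# Decode the MCA error code (low 16 bits) of an MCi_STATUS value for a
# memory-controller error: bit 7 set marks a memory controller error,
# bits 6:4 (MMM) give the transaction type, bits 3:0 (CCCC) the channel.
status=0x900000400800009f
code=$(( status & 0xFFFF ))
mmm=$(( (code >> 4) & 0x7 ))
channel=$(( code & 0xF ))

case $mmm in
  0) txn="generic" ;;
  1) txn="read" ;;
  2) txn="write" ;;
  3) txn="address/command" ;;
  4) txn="scrub" ;;
  *) txn="reserved" ;;
esac
echo "memory controller $txn error, channel $channel"
```

This prints "memory controller read error, channel 15", matching the VMkernel's own "Memory Controller Read Error." decode.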
and PF Exception 14 separately. I am not sure exactly what the root cause is; it could be caused by faulty hardware or
software bugs. This server has been running for more than six months and never had such issues. Plus, we have not made any changes recently, so I doubt it
was caused by faulty hardware. I ran a quick memory diagnostic and found nothing. Currently, I am leaving it running and will see what happens next. The purpose of posting it here is to take a note of this issue; I will review and update it when I have any clues. If you happen to have seen this before, or you have a suggestion, please let me know. Updates will be added to the bottom.

Part 1: ESXi version. This is ESXi 5.0.0 Update 2.
Part 2: Error messages.
Part 3: The values in the CPU registers at the time of the failure.
Part 4: The physical CPU that was running an operation at the time of the failure.
Part 5: VMkernel uptime.
Part 6: Stack trace, showing what the VMkernel was doing at the time of the failure.
Part 7: Core dump.

Updates

[12/11/2013] The purple screen came back with PF Exception 14. The stacks are different between the three purple screen failures, which should indicate the software is not hitting the same error each time. I still suspect it was caused by faulty hardware. A ticket has been opened with VMware. I found a few MCE messages saying "Memory Controller Error". An MCE (Machine Check Exception) is the output from the MCA (Machine Check Architecture) within the CPU, triggered for detecting and reporting hardware errors.

~ # zcat /var/run/log/vmkernel.0.gz | grep MCE
2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1278: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1282: Status bits: "Memory Controller Error."
TSC: 104284 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
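The "Valid.Err enabled." suffix the VMkernel prints comes from the top flag bits of the status value in that log line. A sketch decoding the common flags (bit positions per the Intel SDM; the status value is the one from the excerpt above):

```shell
# Pull the MCi_STATUS flag bits out of the logged value. VAL=1 and EN=1
# produce "Valid.Err enabled."; UC=0 is why this surfaced as a corrected
# error (CMCI) instead of crashing the host.
status=0x900000400009008f
val=$((  (status >> 63) & 1 ))   # VAL:  register contains a valid error
over=$(( (status >> 62) & 1 ))   # OVER: a previous error was overwritten
uc=$((   (status >> 61) & 1 ))   # UC:   error was uncorrected
en=$((   (status >> 60) & 1 ))   # EN:   error reporting was enabled
pcc=$((  (status >> 57) & 1 ))   # PCC:  processor context corrupt
echo "VAL=$val OVER=$over UC=$uc EN=$en PCC=$pcc"
```

For this status the output is VAL=1 OVER=0 UC=0 EN=1 PCC=0, consistent with the VMkernel's own decode.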
The Real (and Virtual) Adventures of Nathan the IT Guy

Dec 18 2012 2:22PM GMT

How to troubleshoot a Purple Screen of Death on an ESXi Host

Nathan Simon

So your ESXi host is stuck at a PSOD, the "Purple Screen of Death" - what do you do? One would figure it's hardware, but it could also be software related. I am going to tell you how to download and review the error logs. Mind you, I am going to explain it as if the host can boot up and be connected to either vCenter or the VI Client. I will also show you a command you can run from the service console if you just want the support logs to send to VMware. On to the information.

First you want to have the host back up and running; it could be unstable at the moment, but you should have enough time to pull the support logs. Highlight the host in question. Click on File (top left of the VI Client), then click on "Export", then "Export System Logs". The next screen allows you to select the system logs you would like to export - I just select them all. Once you click Next you can select where you want to export them to. Click Next to start the export. Use a program like 7-Zip to extract the newly created file to a temporary location; once it is extracted you need to extract it again. I know - they doubled up the compression, more so to keep the normal folk out!
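The doubled-up compression can also be handled entirely from a shell. A self-contained sketch - the file names here are made up for illustration, not what ESXi actually produces:

```shell
# Simulate and unpack a "doubly compressed" bundle: an archive that
# contains another archive, extracted in two passes.
workdir=$(mktemp -d)
cd "$workdir"

echo "vmkernel sample line" > vmkernel.log
tar czf inner.tgz vmkernel.log
tar czf bundle.tgz inner.tgz
rm vmkernel.log inner.tgz        # pretend we only received bundle.tgz

tar xzf bundle.tgz               # first pass: outer archive
tar xzf inner.tgz                # second pass: inner archive
cat vmkernel.log
```

The same two-pass idea applies whatever the outer format is (zip around tgz, tgz around tgz): one extraction gets you another archive, not the logs.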
🙂 Once everything is extracted you should see a set of folders. The most important one is the "Core" folder, which contains the kernel dump: the PSOD purges what was in memory to a file called vmkernel-zdump.1 or something to that effect and places it in that directory. You will have to use something like Notepad++ to open the vmkernel-zdump file; once you do, you can pretty much search for "error" or "fail" or "panic" and you should find your issue. In my example, there is a memory bank error, see below.

2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
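Searching with grep works just as well as scrolling through Notepad++. A sketch over a two-line stand-in file built from the excerpt above (the real vmkernel-zdump file is of course much larger):

```shell
# Write a miniature stand-in for a vmkernel-zdump file, then grep it
# case-insensitively for the usual panic keywords.
dump=$(mktemp)
cat > "$dump" <<'EOF'
2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1278: CMCI on cpu19 bank9: Status:0x900000400800009f Misc:0x0 Addr:0x0: Valid.Err enabled.
2012-12-17T13:07:25.816Z cpu19:8211)MCE: 1282: Status bits: "Memory Controller Read Error."
EOF
grep -iE 'error|fail|panic' "$dump"
```

Here only the "Memory Controller Read Error." line matches, which points you straight at the faulting component.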
ESXi 5.1.0 running on an HP DL380p Gen8 server with 2x Intel Xeon E5-2620 @ 2.00 GHz and 64 GB RAM. 4x on-board 1 Gb NICs are connected to an HP gigabit switch; 4x 1 Gb NICs on an expansion card (factory-fit) are also plugged into the HP gigabit switch. Seven of these ports are configured on a vSwitch with default settings, which serves VM traffic and the Management Network; one port is dedicated to vMotion traffic. We have a Veeam server running a nightly backup of all VMs. We have 10 VMs running plus the VMware vCenter Server Appliance: 8 of the VMs are Server 2008 R2 with SP1, 1 is a Windows 7 x64 SP1 workstation, and the other is a Linux-based security appliance. 4 of the Server 2008 R2 VMs will transfer files across the network at between 80-100 MegaBYTES per second with moderate CPU usage. However, the other 4 Server 2008 R2 VMs max out at around 28-36 MegaBYTES per second with 100% CPU usage.

Steps I have taken: I have checked for common factors between the "slow" networking and "fast" networking servers (herein referred to as "slow servers" and "fast servers" to save time) - there are none that I can see. Some of the slow servers have 2 vCores, some have 1 vCore; likewise for the fast servers. They all have a variety of RAM allocated. There are no common services/roles between the servers: some have file services installed, some don't, and one server has no roles installed at all. One of the slow servers has the E1000 network adapter installed; however, the 3 other slow servers have the VMXNET 3 network adapter installed. I have tried removing all the NICs on the expansion card from the vSwitch, which made no difference. I have tried dedicating a single port to a VM, which made no difference. Strangely, I noticed that if I made a hardware change to one of the fast servers, such as increasing the RAM, its network performance dropped on the next and subsequent boots.
If I returned it to the original amount of RAM, the network speed returned to normal. However, making the same change again (or different changes, such as vCores from 1 to 2 or 2 to 1, or increasing or decreasing the RAM) would make the network transfers faster or slower, but not predictably - i.e. making the same change twice would not necessarily change the speed reliably. I also found that changing a rand