MCE 1282 Status Bits: Memory Controller Error
... and PF Exception 14 separately. I am not sure exactly what the root cause is; it could be faulty hardware or a software bug. This server has been running for more than 6 months and never had such issues. We have not made any changes recently either, so I suspect faulty hardware. I ran a quick memory diagnostic and found nothing. For now I am leaving it running to see what happens next.

The purpose of posting it here is to take a note of this issue. I will review and update it when I have any clues. If you have seen this before or have a suggestion, please let me know. Updates will be added at the bottom.

Part 1: ESXi version. This is ESXi 5.0.0 Update 2.
Part 2: Error messages.
Part 3: The values in the CPU registers at the time of the failure.
Part 4: The physical CPU that was running an operation at the time of the failure.
Part 5: VMK uptime.
Part 6: Stack trace, showing what the VMkernel was doing at the time of the failure.
Part 7: Core dump.

Updates

[12/11/2013] The purple screen came back with PF Exception 14. The stack traces differ across the 3 purple screen failures, which suggests the software is not hitting the same error each time. I still suspect it was caused by faulty hardware. A ticket has been opened with VMware.

I found a few MCE messages saying "Memory Controller Error". An MCE (Machine Check Exception) is raised by the MCA (Machine Check Architecture) within the CPU, which detects and reports hardware errors.

    ~ # zcat /var/run/log/vmkernel.0.gz | grep MCE
    2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1278: CMCI on cpu32 bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.
    2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1282: Status bits: "Memory Controller Error."
    TSC: 104284 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
    TSC: 104284 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE
    0:00:00:05.582 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
    0:00:00:06.572 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
    0:00:00:06.573 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
    0:00:00:06.573 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
    ~ # zcat /var/run/log/vmkernel.1.gz | grep MCE
    ~ # zcat /var/run/log/vmkernel.2.gz | grep MCE
    0:00:00:05.583 cpu0:8192)MCE: 186: Detected 24 MCE banks. MCG_CAP MSR:0x1000c18
    0:00:00:06.574 cpu0:8192)MCE: 616: Fixed 12 MCE bank/CPU-package ownership settings
    0:00:00:06.575 cpu0:8192)MCEIntel: 1331: Enabled CMCI signaling of uncorrected patrol scrub errors
    0:00:00:06.575 cpu0:8192)MCEIntel: 1553: Registering Error recovery BH
    TSC: 10442
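The Status and MCG_CAP values in the log above are raw MSR contents, so the flags the VMkernel prints ("Valid", "Err enabled", "Detected 24 MCE banks") can be cross-checked by hand. The following is a minimal Python sketch of that decoding, assuming the bit layout of IA32_MCi_STATUS and IA32_MCG_CAP described in the Intel SDM (Volume 3B, Machine-Check Architecture). The function names and the regular expression are hypothetical; the pattern is modelled only on the single CMCI line quoted above and may not match other ESXi builds.

```python
import re

# Architectural flag bits of IA32_MCi_STATUS (per the Intel SDM, Vol. 3B).
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was uncorrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

# Assumption: line shape taken from the one CMCI log line quoted in this post.
CMCI_RE = re.compile(r"CMCI on cpu(\d+) bank(\d+): Status:(0x[0-9a-fA-F]+)")


def decode_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its main fields."""
    out = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    out["corrected_count"] = (status >> 38) & 0x7FFF      # bits 52:38
    out["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    out["mca_error_code"] = status & 0xFFFF               # bits 15:0
    return out


def decode_mcg_cap(cap):
    """Extract a few capability fields from IA32_MCG_CAP."""
    return {
        "banks": cap & 0xFF,                                 # bits 7:0
        "cmci_supported": bool((cap >> 10) & 1),             # MCG_CMCI_P
        "threshold_status": bool((cap >> 11) & 1),           # MCG_TES_P
        "software_error_recovery": bool((cap >> 24) & 1),    # MCG_SER_P
    }


def parse_cmci_line(line):
    """Return cpu, bank and decoded status for a CMCI log line, else None."""
    match = CMCI_RE.search(line)
    if match is None:
        return None
    cpu, bank, status = match.groups()
    return {"cpu": int(cpu), "bank": int(bank), **decode_status(int(status, 16))}


if __name__ == "__main__":
    sample = ("2013-11-10T23:54:07.718Z cpu32:8224)MCE: 1278: CMCI on cpu32 "
              "bank9: Status:0x900000400009008f Misc:0x0 Addr:0x0: Valid.Err enabled.")
    print(parse_cmci_line(sample))
    print(decode_mcg_cap(0x1000C18))
```

For the logged status 0x900000400009008f this gives VAL and EN set, UC clear, a corrected error count of 1 and an MCA error code of 0x008f, which agrees with the "Valid.Err enabled." text in the log. MCG_CAP 0x1000c18 decodes to 24 banks with CMCI supported, matching "Detected 24 MCE banks".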
... while under a certain CPU or memory intensive load, or even at random. Most of the time this happens without a Purple Screen of Death, which would at least give you a notion of what went wrong. VMware KB Article 1005184 covers this issue, and it has been updated significantly since I started taking an interest in these errors.

UPDATE: I have published a new CPU Stress Test & Machine Check Error debugging article. Check it out if you'd like to learn more.

If you are "lucky", you can see and decode for yourself what preceded the crash. This is because both AMD and Intel CPUs implement something called the Machine Check Architecture. This architecture enables the CPU to detect a fault that happens anywhere on the data transfer path during processor operation. It can capture memory operation errors, CPU bus interconnect errors, cache errors, and much more.

How do you determine what has been causing your system to fail? Read on.

You will need to browse to Intel's website hosting the Intel® 64 and IA-32 Architectures Software Developer Manuals. There, download the manual named "Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 3A, 3B, and 3C: System Programming Guide". I highly recommend printing it, because you will be doing some back-and-forth seeking.

Now, to get a list of possible Machine Check Errors captured by the VMkernel, run the following in your SSH session with superuser privileges:

    cd /var/log;grep MCE vmkernel.log

This lists the MCE entries in the log. Most of the time, the VMkernel decodes these messages for you; in this example there are plenty of Memory Controller Read Errors. You can see more closely where the problem originates:

CMCI: This stands for Corrected Machine Check Interrupt. An error was captured, but it was corrected and the VMkernel can keep on running. If this were an uncorrectable error, the ESXi host would crash.

Logical CPU number where the MCE was detected: This particular host had dual 8-core Intel Xeon processors with HyperThreading enabled. For all other occurrences of this MCE, the cpu# alternated between 0 and 15, which means the fault was always detected on the first CPU.

Memory Controller Read/Write/Scrubbing error on Channel x: Means that the error was captured on a certain channel o...

Source links:
https://jackiechen.org/2013/11/11/esxi-purple-screen-message-interpretation/
https://vmxp.wordpress.com/2014/10/27/debugging-machine-check-errors-mces/comment-page-1/
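The "Memory Controller Read/Write/Scrubbing error on Channel x" text comes from the low 16 bits of the bank's status register. As a rough illustration, the Python sketch below decodes that field, assuming the compound error code layout given in the Intel SDM for memory controller errors (binary 000F 0000 1MMM CCCC, where MMM is the transaction type and CCCC the channel). The function names are hypothetical, and the CPU-to-package helper assumes logical CPUs are numbered contiguously per package (16 per socket on the dual 8-core HyperThreading host described above), which is an assumption rather than something ESXi guarantees.

```python
# MMM sub-field encodings for memory controller errors (per the Intel SDM).
MMM_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(mca_code):
    """Decode bits 15:0 of IA32_MCi_STATUS if they match 000F 0000 1MMM CCCC.

    Returns (transaction type, channel) or None when the code is not a
    memory controller error. Channel is None when CCCC is 1111, which
    means "channel not specified".
    """
    # Mask out bit 12 (the F filter bit) before checking the fixed pattern.
    if (mca_code & 0xEF80) != 0x0080:
        return None
    mmm = (mca_code >> 4) & 0x7
    cccc = mca_code & 0xF
    kind = MMM_TYPES.get(mmm, "reserved")
    channel = None if cccc == 0xF else cccc
    return kind, channel


def package_of(logical_cpu, logical_cpus_per_package=16):
    """Assumption: logical CPUs are numbered contiguously per physical package."""
    return logical_cpu // logical_cpus_per_package


if __name__ == "__main__":
    for code in (0x008F, 0x0091, 0x00C2):
        print(hex(code), decode_memory_controller_code(code))
    print("cpu 15 is on package", package_of(15))
```

Under these assumptions, a code of 0x0091 reads as a memory read error on channel 1, while 0x008f (the value in the log from the first post) is a generic memory controller error with no channel specified.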
The Real (and Virtual) Adventures of Nathan the IT Guy
January 9, 2013 1:38 PM
Top 10 VMware Performance Tweaks
Nathan Simon
Source: http://itknowledgeexchange.techtarget.com/information-technology/page/17/

Petri.co.il has been a very helpful site over the years, no matter what platform or application. Today I bring you Petri's top 10 performance tweaks. All of them are valid, but not all can or would be used in your environment. It sure doesn't hurt to read this, though. Below you will find the list in reverse order. In fairness to the original author I can only post some details; the rest you can read in a link below.

10. Install VMware Tools in all virtual machines

9. Use the latest VM virtual hardware version

8. Run the latest version of vSphere

7. Utilize Distributed Resource Scheduler (DRS)

6. Consider SSD
Another way to utilize SSD with vSphere to improve performance is the vSphere host-caching feature: if a host is low on RAM, the host-swapping option is used to swap memory to low-latency disks that you specify (such as SSD). If you use swap to host cache, remember that it's not the same as placing regular swap files on SSD-backed datastores; the host still needs to create regular swap files. Even so, when you use swap to host cache, the speed of the storage where the host places regular swap files is less important.

5. Reduce VM snapshots
Too many admins think that snapshots are for periodic backup purposes and leave many GB of snapshots on disks. Besides being a waste of disk space, VM snapshots slow down many things such as svMotion of the virtual disk, backups, disaster recovery, and more.

4. Utilize Storage I/O Control (SIOC) and Storage DRS (SDRS)

3. Replace your hardware

2. Right-size virtual machines
Where underallocation can certainly cause performance issues, overallocation can also cause slowdowns for other VMs that can't gain access to the memory that they need. By knowing your application and monitoring its per...