Memory Controller Read Error
Memory Controller Read Error - Hardware or Software Problem?
3 replies. Latest reply: May 18, 2011 11:08 PM by hona700506

drhkocher, May 10, 2011 5:44 AM

Hi,

We are running a Dell R310 server (4 cores, 16 GB RAM) with ESXi 4.1. The system worked flawlessly for more than a year. Now we are getting the following error (see screenshot):

Hardware (Machine) Error: Memory Controller Read Error.

Later it says:

PCPU3: 1 hardware errors seen since boot (0 errors corrected)

Dell support replaced the main board with the memory controller, and all diagnostics show no error, but the problem is still there.

What could cause this problem? If it is a problem with the VMkernel, why didn't it show up before? Could this be related to only one of our VMs?

We are running 3 VMs: W2003 Server, W2008 R2, and W7 32 bit. The VM in the screenshot seems to be the 2003 Server.

I am no expert with VMware, so any help would be appreciated.

Hartmut

Attachment: 05-07-dell-server.jpg (273.7 K)

1. Re: Memory Controller Read Error - Hardware or Software Problem?
hona700506, May 18, 2011 10:09 PM (in response to drhkocher)

Hi everyone,

I have the same problem. I ran ESX 4.1 (8 VMs on it) for 6 months before the problem started to appear (about once a week). The error screen is always the same; the only difference is that my host CPU is a Xeon X5680. I don't know whether it is a hardware problem or a software problem.

My hardware is a Supermicro 6026-3RF (it is on the VMware compatibility list). I don't have any SAN or NAS; it runs as a standalone host.

My host is CPU X
Source: https://vmxp.wordpress.com/2014/10/27/debugging-machine-check-errors-mces/comment-page-1/
See also: https://communities.vmware.com/thread/313267?tstart=0

while under a certain CPU or memory intensive load - or even at random. Most of the time this happens without a Purple Screen of Death that would at least give you a notion of what went wrong. There is a VMware KB Article 1005184 concerning this issue, and it has been updated significantly since I started to take interest in these errors.

UPDATE: I have published a new CPU Stress Test & Machine Check Error debugging article - check it out if you'd like to learn more.

If you are "lucky", you can see and decode yourself what preceded the crash. This is because both AMD and Intel CPUs implement something called Machine Check Architecture (MCA). This architecture enables the CPU to detect a fault that happens anywhere on the data transfer path during processor operation. It can capture memory operation errors, CPU bus interconnect errors, cache errors, and much more.

How do you determine what has been causing your system to fail? Read on.

You will need to browse to Intel's website hosting the Intel® 64 and IA-32 Architectures Software Developer Manuals. There, download the manual named "Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 3A, 3B, and 3C: System Programming Guide". I highly recommend printing it, because you will be doing some back-and-forth seeking.

Now, to get a list of possible Machine Check Errors captured by the VMkernel, run the following in your SSH session with superuser privileges:

    cd /var/log;grep MCE vmkernel.log

This will output something similar to this:

[Screenshot: MCE entries from vmkernel.log]

Most of the time, the VMkernel decodes these messages for you - in the screenshot there are plenty of Memory Controller Read Errors. You can see more closely where the problem originates:

CMCI: This stands for Corrected Machine Check Interrupt - an error was captured, but it was corrected and the VMkernel can keep on running. If this were an uncorrectable error, the ESXi host would crash.

Logical CPU number where the MCE was detected: This particular host had dual 8-core Intel Xeon processors with HyperThre
Source: http://serverfault.com/questions/243047/memory-read-error-sever-hardware-error
See also: https://access.redhat.com/solutions/67599

'Memory read error', Server hardware error?

Hello, I got an error on my server, which is running CentOS 5.5:

    MCE 20
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 1 BANK 8 TSC 6ab9ff9745f62 [at 2394 Mhz 9 days 1:50:52 uptime (unreliable)]
    MISC cf36ad0100081186 ADDR 203376500
    MCG status:
    MCi status:
    MCi_MISC register valid
    MCi_ADDR register valid
    MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    Transaction: Memory read error
    STATUS 8c0000400001009f MCGSTATUS 0

What is the matter? Is it a memory card error or a memory controller error?

Tags: linux, hardware
asked Mar 4 '11 at 2:38 by wss8848; edited Mar 4 '11 at 3:01 by Zypher

Comment (TomTom, Mar 12 '12 at 9:23): Reality check: these days memory controllers are on the CPU, including the newest Intel Sandy Bridge. AMD has done that for some time. Second, it clearly says that CPU 1 BANK 8 is faulty. By all means this has a 95% chance of being a memory error, and it is trivial to validate (swap bank 8 with another bank).

1 Answer

If you can restart the machine and get into the BIOS, you may be able to see if there is a failed DIMM. Basically, your OS has detected a faulty piece of hardware. You need to figure out what exactly that means. Most likely you should try to back up your data to an
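
The STATUS value in the mcelog output above can also be unpacked by hand. What follows is a minimal Python sketch, assuming the IA32_MCi_STATUS bit layout described in the Machine-Check Architecture chapter of the Intel SDM (VAL in bit 63, UC in bit 61, MISCV in bit 59, ADDRV in bit 58, a corrected error count in bits 52:38 on processors that report one, and the MCA error code in bits 15:0). The function name decode_mci_status and the dictionary keys are hypothetical, chosen for illustration; this is not how mcelog or the VMkernel decode the register.

```python
# Hypothetical helper for illustration. Bit positions are taken from the
# IA32_MCi_STATUS description in the Intel SDM; check them against the
# manual for your processor generation before relying on them.

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_OPERATIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    decoded = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        # Only meaningful on processors that report a corrected error count.
        "corrected_error_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": code,
    }
    # Memory controller errors use the compound form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        operation = (code >> 4) & 0b111
        channel = code & 0b1111
        decoded["memory_operation"] = MEMORY_OPERATIONS.get(operation, "reserved")
        # CCCC = 1111 means the channel is not specified.
        decoded["channel"] = "not specified" if channel == 0b1111 else channel
    return decoded


if __name__ == "__main__":
    # STATUS value quoted in the question above.
    for field, value in decode_mci_status(0x8C0000400001009F).items():
        print(field, "=", value)
```

For the STATUS value from the question, the sketch reports a valid entry with the UC bit clear, both MISC and ADDR registers valid, and a memory read error with the channel not specified, which agrees with the "MCA:" and "Transaction:" lines that mcelog printed.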