Mcelog Hardware Error. This Is *not* A Software Problem
Mcelog: Cpu + Bank
Transaction: Memory Read Error
Mcelog Redhat
https://access.redhat.com/solutions/67599

Frequently asked questions (http://www.mcelog.org/faq.html)

How do I report bugs in mcelog?
Here is this machine check output. Please tell me what it means
I have this corrected error message. Is my system broken?
I inject errors, but nothing happens
How do I get an overview of what errors happened on the system?
How do I enable memory error reporting on SLES11-SP1?
How do I decode fatal machine checks?
How do I "run through mcelog --ascii"?
How do I log fatal machine checks to disk?
On what systems does DMI DIMM decoding work?
I get "Cannot open /dev/mem for DMI decoding"
I get "failed to prefill DIMM database from DMI data"
How do I enable corrected memory error reporting on Intel Xeon 7500,6500,E7 series systems?
How does mcelog compare to EDAC?
I get "machine check events logged"?
I get "no human readable mce decoding support on this cpu type"
Can you release mcelog?
I get a "only decoding architectural errors" message.
Does mcelog log all errors?
mcelog does not start on newer AMD systems anymore
Can I configure mcelog to send an email on each hardware error
On SUSE systems I see "mcelog: SMTP server problem" messages
mcelog on my old Linux distribution (RHEL 4 or similar vintage) reports wrong CPUs?

How do I report bugs in mcelog?

Please send them to the maintainer (see contact). There is currently no mcelog-specific mailing list. This is for bugs in mcelog itself, not for asking what is wrong with your hardware.

Here is this machine check output. Please tell me what it means

You have to ask your hardware vendor. Linux and mcelog developers cannot do hardware support for you. A machine check is a hardware problem and not a software problem. Such questions will be ignored. The exceptions are crashes or problems in the actual error reporting. Please report those. If you're doing overclocking or otherwise runni
ACT knowledge base: http://www.advancedclustering.com/act-kb/what-are-machine-check-exceptions-or-mce/
Server Fault: http://serverfault.com/questions/447912/how-do-i-interpret-mce-messages
What are Machine Check Exceptions (or MCE)?
Last update: August 18, 2014
Categories: Hardware / Troubleshooting

If you are seeing messages in your system logs that state "Machine Check Event logged", this could be an indication of a hardware problem or failure.

A machine check exception is an error detected by your system's processor. There are 2 major types of MCE errors: a notice or warning error, and a fatal exception. The warning will be logged by a "Machine Check Event logged" notice in your system logs, and can be later
How do I interpret MCE messages?

I've noticed a bunch of errors that just recently appeared in /var/log/messages on one of our servers (below). However, the mce client seems to be less certain of the error source than the decoded entries in syslog. Is there some sort of key to use in order to interpret the MCE output?

    Nov 12 04:19:19 areion kernel: [14698753.176035] Machine check events logged
    Nov 12 04:19:19 areion mcelog: HARDWARE ERROR. This is *NOT* a software problem!
    Nov 12 04:19:19 areion mcelog: Please contact your hardware vendor
    Nov 12 04:19:19 areion mcelog: MCE 0
    Nov 12 04:19:19 areion mcelog: CPU 0 BANK 8
    Nov 12 04:19:19 areion mcelog: MISC 640738dd0009159c ADDR 96236c6c0
    Nov 12 04:19:19 areion mcelog: TIME 1352711959 Mon Nov 12 04:19:19 2012
    Nov 12 04:19:19 areion mcelog: MCG status:
    Nov 12 04:19:19 areion mcelog: MCi status:
    Nov 12 04:19:19 areion mcelog: MCi_MISC register valid
    Nov 12 04:19:19 areion mcelog: MCi_ADDR register valid
    Nov 12 04:19:19 areion mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    Nov 12 04:19:19 areion mcelog: Transaction: Memory read error
    Nov 12 04:19:19 areion mcelog: STATUS 8c0000400001009f MCGSTATUS 0
    Nov 12 04:19:19 areion mcelog: MCGCAP 1c09 APICID 20 SOCKETID 1
    Nov 12 04:19:19 areion mcelog: CPUID Vendor Intel Family 6 Model 44

All errors seem to be connected with the same memory bank:

    areion:~# awk -F'mcelog:' '/mcelog:.*BANK/{ print $2; }' < /var/log/messages |uniq
     CPU 0 BANK 8

I have the mcelog daemon running, and when I check for error information, it doesn't seem to know where the errors are coming from. Only that they are associated with CPU0 (we only have one CPU in this box):

    Memory errors
    SOCKET 1 CHANNEL any DIMM any
    corrected memory errors:
        77 total
        77 in 24h
    uncorrected memory errors:
        0 total
        0 in 24h
    Per page corrected memory statistics:
        359ffc000: total 2 2 in 24h online
        3b93cc000: total 2 2 in 24h online
        3ce45c000: total 2 2 in 24h online
        96236c000: total 20 20 in 24h online triggered
        96545c000: total 9 9 in 24h online
        96a82c000: total 9 9 in 24h online
        96
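One way to read the STATUS and ADDR values in the log above is to split them into bit fields by hand. The sketch below is not mcelog's own decoder. It assumes the architectural IA32_MCi_STATUS layout documented by Intel (valid bit 63, uncorrected bit 61, MISC/ADDR valid bits 59 and 58, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error code form `0000 0000 1MMM CCCC`. The function names `decode_status` and `page_of` are hypothetical, and the 4 KiB page size is an assumption.

```python
"""Sketch: decode architectural fields of an IA32_MCi_STATUS value (assumed layout)."""

# Memory-controller error codes have the form 0000 0000 1MMM CCCC.
MEMORY_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}

PAGE_MASK = ~0xFFF  # assumption: 4 KiB pages


def decode_status(status):
    """Return a dict with the architectural fields of a 64-bit MCi_STATUS value."""

    def bit(n):
        return bool((status >> n) & 1)

    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        # Bits 52:38 are a corrected error count only if MCG_CAP bit 10 is set.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": status & 0xFFFF,
    }
    code = fields["mca_code"]
    if (code & 0xFF80) == 0x0080:  # memory controller error class
        mmm = (code >> 4) & 0x7
        cccc = code & 0xF
        fields["transaction"] = MEMORY_TRANSACTIONS.get(mmm, "reserved")
        # CCCC == 1111 means the channel is not specified.
        fields["channel"] = None if cccc == 0xF else cccc
    return fields


def page_of(addr):
    """Round a physical address down to the start of its page."""
    return addr & PAGE_MASK


if __name__ == "__main__":
    decoded = decode_status(0x8C0000400001009F)  # STATUS from the log above
    for name, value in decoded.items():
        print(f"{name}: {value}")
    print(f"page: {page_of(0x96236C6C0):x}")  # ADDR from the log above
```

Under these assumptions, the STATUS value decodes to a valid, corrected (not uncorrected) memory read error with no channel specified, which agrees with the "RD_CHANNELunspecified_ERR" and "Transaction: Memory read error" lines. The ADDR value rounds down to page 96236c000, the entry marked "triggered" in the per-page statistics.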