Correctable DIMM Memory Error
How seriously should I take ECC correctable error warnings?

I have a pile of Sun X2200-M2 servers. These servers have ECC memory. On some of these servers, I am getting warnings in the eLOM about "correctable ECC errors detected", e.g.:

    # ssh regress11 ipmitool sel elist
    1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
    2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted

...some more frequently than others.
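To judge which DIMMs log "more frequently than others", the SEL output can be tallied per sensor. The following is a minimal sketch, assuming the pipe-separated six-column layout shown in the question; the function name `count_correctable_ecc` is illustrative and not part of ipmitool.

```python
from collections import Counter


def count_correctable_ecc(sel_text):
    """Count asserted 'Correctable ECC' events per sensor in SEL text.

    Assumes the layout from the question:
    id | date | time | sensor | event | direction
    """
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        sensor, event, direction = fields[3], fields[4], fields[5]
        # Exact match, so "Uncorrectable ECC" is not counted here.
        if event == "Correctable ECC" and direction == "Asserted":
            counts[sensor] += 1
    return counts


sample = '''
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
'''

for sensor, n in count_correctable_ecc(sample).items():
    print(sensor, n)
```

Running the same tally at intervals gives a rough error rate per DIMM, which is more useful for deciding on replacement than a single event.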
The kernel on this particular system is throwing EDAC errors as well, although far more frequently than the eLOM is recording ECC events:

    EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
    MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error
    EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
    MC0: CE - no information available: k8_edac Error Overflow set
    EDAC k8 MC0: extended error code: ECC chipkill x4 error

If the server detects an uncorrectable ECC error, the system resets, so clearly that's bad, and removing or replacing the identified stick or pair corrects the issue. But I am thinking that if the error is correctable, there is no immediate issue: can I treat this as a warning and be prepared to pull the stick or pair if uncorrectable errors start occurring?

Tags: ecc
Asked May 21 '10 at 15:50 by David Mackintosh
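The EDAC "CE" lines above carry the row and channel of each corrected error, so they can be grouped to see whether the errors cluster on one location. This is a rough sketch, assuming the k8_edac message format exactly as quoted in the question; the regular expression and the name `tally_ce` are illustrative, and other EDAC drivers or kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
# MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, ...
CE_RE = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_ce(log_text):
    """Count corrected errors per (memory controller, row, channel)."""
    counts = Counter()
    for match in CE_RE.finditer(log_text):
        key = (int(match["mc"]), int(match["row"]), int(match["channel"]))
        counts[key] += 1
    return counts


sample = '''
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
'''

for (mc, row, channel), n in sorted(tally_ce(sample).items()):
    print(f"MC{mc} row {row} channel {channel}: {n} corrected error(s)")
```

Lines reading "CE - no information available" are deliberately not counted, because they carry no location to attribute the error to.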
(Source of the preceding Server Fault question: http://serverfault.com/questions/144151/how-seriously-should-i-take-ecc-correctable-error-warnings)

This article discusses PowerEdge memory errors in iDRAC, OpenManage Server Administrator (OMSA) and the LCD display.
Source: http://www.dell.com/support/article/us/en/04/SLN292634/en

Issue

Memory errors can show up in a number of ways on your system, and might vary depending on the age of your system (system generation). There might also be slight variations based on your system firmware levels. The error messages can appear in one or more of the following: the BIOS messages at POST, the iDRAC logs, the OpenManage Server Administrator (OMSA) logs, the system LCD display, or the operating system. Many of these errors can also be prevented by ensuring your firmware levels are up to date.

Note: If the system is new, or has recently been moved, some components, including the memory, could have become incorrectly seated due to vibration. Reseat all memory modules and other components (take them out and put them back in) before continuing troubleshooting.

For other errors, see the separate documents for memory errors on POST. Some systems without an LCD panel have status lights instead; see "PowerEdge system LED Status light indicator".

Solution

Jump straight to the messages for your system:
    12th Generation (12G) PowerEdge systems
    11th Generation (11G) PowerEdge systems
    10th Generation (10G) PowerEdge systems
    9th Generation (9G) PowerEdge systems

Note: A separate article explains how to determine the generation of your PowerEdge server.

12G PowerEdge memory errors

    LCD error code:    MEM0000
    Error message:     Persistent correctable memory errors detected on a memory device at location(s) .
    Details:           This is an early indicator of a possible future uncorrectable error.
    Action to resolve: Reseat the memory modules. If error remains, swap test the memory module by swapping the module with another identical module in
Sources: https://en.wikipedia.org/wiki/ECC_memory (this excerpt); https://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-56052 (the IBM tip further below)

ECC memory is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.

Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.

Problem background

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).[3] As a result, systems operating at high altitudes require special provision for reliability.
As an example, the spacecraft Cassini–Huygens, launched in 1997, contains two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Thanks to built-i
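The single-bit correction described in the ECC memory excerpt above can be illustrated with a toy Hamming(7,4) code. This is a sketch for intuition only: it is not taken from the excerpt, and real ECC DIMMs typically protect 64-bit words with 8 check bits rather than 4-bit words with 3. The function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, syndrome); syndrome is the 1-based position
    of a single flipped bit, or 0 if no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a soft error in position 5
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

A corrected read like this is what the hardware reports as a "correctable" error: the data returned is intact, but the event is logged because a stored bit was wrong.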
Correctable Memory Error messages found in the event log report an incorrect DIMM slot as having an error - IBM eServer xSeries 445

Source RETAIN tip: H182175

Symptom

Correctable Memory Error messages found in the event log report an incorrect DIMM slot as having an error (e.g. reports Socket J07 instead of Socket J06).

    1 ERR SERVPROC 06/10/04 17:43:21 PFA Alert, see preceding error in system error log.
    2 ERR SMI Hdlr 06/10/04 17:43:21 00151500 Excessive Correctable Memory Errors Detected Chassis=01 SMP Expansion Module=02 Socket=J07
    3 INFO SMI Hdlr 06/10/04 14:39:29 00151803 Memory ProteXion(C) Event Detected Chassis=01 SMP Expansion Module=02 Socket=J07 Symbol=05

If the socket actually in error is J05, J06, J07, or J08, then the socket number that is reported as defective is one above the correct socket number.

Affected configurations

The system may be any of the following IBM eServer systems:
    IBM eServer xSeries 445, type 8870, any model.

The system is configured with the following option(s):
    IBM Remote Supervisor Adapter, Option part number (p/n) 09N7585, Field Replacement Unit (FRU) p/n 36L9912.
    IBM Remote Supervisor Adapter II-EXA, Option p/n 13N0382, FRU p/n 73P9246.

The BIOS level(s) affected is: 42B or earlier.

Solution

This behavior is corrected in the latest release of BIOS.

Workaround

If Correctable Memory Error messages are found in the event log, check the socket in error. If the socket is J05, J06, J07, or J08, then subtract 1 from the socket in error and replace that DIMM.

Additional information

Correctable error messages are incorrect for DIMMs J05, J06, J07, and J08 on any SMP (Symmetric Multi-Processor) board in any chassis. The BIOS code misidentifies the DIMM socket in error, so this will be seen on J05 through J08 on any SMP Expansion Module in any position in any chassis.
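The workaround above amounts to a small lookup rule. A minimal sketch follows, assuming socket labels of the form "J07" as in the sample log, and assuming the system runs an affected BIOS level (42B or earlier); the helper name `actual_socket` is hypothetical and not part of any IBM tool.

```python
# Sockets for which the affected BIOS reports a number one too high.
AFFECTED_SOCKETS = {"J05", "J06", "J07", "J08"}


def actual_socket(reported):
    """Map a reported socket label to the DIMM socket to replace.

    Only valid on BIOS levels affected by the tip; on a fixed BIOS
    the reported socket should be used unchanged.
    """
    if reported in AFFECTED_SOCKETS:
        return "J%02d" % (int(reported[1:]) - 1)
    return reported


print(actual_socket("J07"))
```

For the sample log entry with Socket=J07, this yields J06, matching the example given in the Symptom section.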
Applicable countries and regions: Worldwide
Document id: MIGR-56052
Last modified: 2010-03-05