Extended Error Code: ECC Chipkill x4 Error
May 2007 12:22:39 -0600 (MDT)

Hi,

I have a 4-way Opteron 870 system, with 16 2GB DIMMs and a Tyan Thunder K8QS Pro (S4882) motherboard. It has been crashing, and there are entries like this in the messages log:

May 9 22:57:47 monolith kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
May 9 22:57:47 monolith kernel: MC1: CE page 0x240cb0, offset 0x700, grain 8, syndrome 0x11c1, row 2, channel 0, label "": k8_edac
May 9 22:57:47 monolith kernel: MC1: CE - no information available: k8_edac Error Overflow set
May 9 22:57:47 monolith kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

or (I guess equivalently) in dmesg:

EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC1: CE page 0x2557b8, offset 0x0, grain 8, syndrome 0x4c58, row 2, channel 0, label "": k8_edac
MC1: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error

Some Googling suggests that the problem is likely a flaky DIMM. But how to tell which one? Can any kernel or hardware gurus out there let me know if the error messages above allow me to locate the potentially bad memory stick? Note that every set of log entries includes "row 2, channel 0".

Both the mobo and OS have NUMA enabled. It's running the 2.6.9-55.ELsmp kernel.

Thanks for any suggestions,
Peter Ruprecht
U. of Colorado
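The observation that every entry reports "row 2, channel 0" can be checked mechanically over a whole log. Below is a minimal sketch, not an official EDAC tool: it parses the "CE page ..." lines quoted above and counts them per memory controller, row, and channel. The names CE_RE, PAGE_SHIFT, parse_ce_lines, and tally_by_location are made up for this illustration. It assumes that "page" is a page frame number and that pages are 4 KiB, so a physical address can be estimated as (page << 12) | offset. The "row" value is a chip-select row as seen by the memory controller, not a slot number; mapping it to a physical DIMM still requires the motherboard documentation.

```python
import re
from collections import Counter

# Matches corrected-error (CE) lines such as:
#   MC1: CE page 0x240cb0, offset 0x700, grain 8, syndrome 0x11c1, row 2, channel 0, ...
# The optional "EDAC " prefix covers the variant seen in newer kernels.
CE_RE = re.compile(
    r"(?:EDAC )?MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-fA-F]+), row (?P<row>\d+), "
    r"channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce_lines(lines):
    """Yield one dict per CE line; lines that do not match are skipped."""
    for line in lines:
        m = CE_RE.search(line)
        if m is None:
            continue
        page = int(m["page"], 16)
        offset = int(m["offset"], 16)
        yield {
            "mc": int(m["mc"]),
            "row": int(m["row"]),
            "channel": int(m["channel"]),
            "syndrome": int(m["syndrome"], 16),
            # Estimated physical address, under the 4 KiB page assumption.
            "phys_addr": (page << PAGE_SHIFT) | offset,
        }


def tally_by_location(lines):
    """Count CE events per (memory controller, row, channel) tuple."""
    return Counter(
        (e["mc"], e["row"], e["channel"]) for e in parse_ce_lines(lines)
    )
```

To use it, pass an open log file, for example tally_by_location(open("/var/log/messages")). If a single (mc, row, channel) tuple dominates the counts, that narrows the search to the DIMMs behind that chip-select row on that node.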
https://www.redhat.com/archives/nahant-list/2007-May/msg00131.html

OS errors : kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 [duplicate]
http://unix.stackexchange.com/questions/91714/os-errors-kernel-edac-k8-mc0-extended-error-code-ecc-chipkill-x4

This question already has an answer here: "Does kernel: EDAC MC0: UE page 0x0 point to bad memory, a driver, or something else?"

We noticed that the server crashes with the errors below. We are not sure whether this is related to a defective piece of hardware or totally unrelated.

Server detail: Red Hat Enterprise Linux ES release 4 (Nahant Update 6)

[root@athena log]# uname -a
Linux athena.nsdecatur.local 2.6.9-67.0.7.ELsmp #1 SMP Wed Feb 27 04:47:23 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

messages:

Sep 17 15:08:16 athena kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Sep 17 15:08:16 athena kernel: MC0: CE page 0x2c2766, offset 0xb10, grain 8, syndrome 0xac08, row 1, channel 0, label "": k8_edac
Sep 17 15:08:16 athena kernel: MC0: CE - no information available: k8_edac Error Overflow set
Sep 17 15:08:16 athena kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
Sep 17 15:08:17 athena su(pam_unix)[19579]: session opened for user oracle by (uid=0)
Sep 17 15:08:17 athena su(pam_unix)[19579]: session closed for user oracle
Sep 17 15:08:17 athena su(pam_unix)[19634]: session opened for user oracle by (uid=0)
Sep 17 15:08:1
EDAC: Is it possible to calculate which piece of memory is bad?
http://www.gossamer-threads.com/lists/linux/kernel/1207815

Mar 29, 2010, 6:40 AM
Post #1 of 5

Hello,

I see the following errors:

EDAC MC0: CE page 0x8abba, offset 0xa10, grain 8, syndrome 0x4758, row 0, channel 0, label "": k8_edac
EDAC MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)

Is it possible to use the page or offset to calculate which DIMM is having a problem?

Justin.

jkosin at intcomgrp
Mar 29, 2010, 7:08 AM
Post #2 of 5
Re: EDAC: Is it possible to calculate which piece of memory is bad? [In reply to]

On 3/29/2010 9:50 AM, Justin Piszcz wrote:
> Hello,
>
> I see the following errors:
>
> EDAC MC0: CE page 0x8abba, offset 0xa10, grain 8, syndrome 0x4758, row 0, channel 0, label "": k8_edac
> EDAC MC0: CE - no information available: k8_edac Error Overflow set
> EDAC k8 MC0: extended error code: ECC chipkill x4 error
> EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
>
> Is it possible to use the page or offset to calculate which DIMM is having a problem?
>
> Justin.
>

Theoretically, YES. However, you would have to have some important information:

1) The number and size of each memory stick in the machine.
2) The physical location