Hardware Error Cache Level L3/gen
[SOLVED] L3 Cache ECC Error

vladixx » 2013/05/20 08:10:00

Hi, I have built a server for virtualisation, it is a Supermicro chassis with
H8SGL-F motherboard, 64GB ECC Kingston RAM, AMD Opteron(tm) Processor 6320, installed minimal CentOS 6.4 Final on sw raid 1, made some modifications (selinux off, network settings, installation of
KVM, creation of LVM for virtuals...) and after a while, syslogd greeted me with this:

[Hardware Error]: CPU:4 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9d404013001c011b
[Hardware Error]: MC4_ADDR: 0x0000000001065d84
[Hardware Error]: Northbridge Error (node 1): L3 data cache ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: Machine check events logged
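
The bracketed flags on the MC4_STATUS line above can be recomputed from the hex value. The following is only a sketch: the bit positions (Over 62, UC 61, MiscV 59, AddrV 58, PCC 57, CECC 46, UECC 45) are assumed from the Linux kernel's AMD machine-check decoder and are not stated anywhere in this thread, and the function name is made up for illustration.

```python
# Assumed bit positions of the flags the kernel prints in MCx_STATUS[...].
# These come from the Linux AMD MCE decoder, not from the forum thread.
OVER, UC, MISCV, ADDRV, PCC = 62, 61, 59, 58, 57
CECC, UECC = 46, 45


def bit(status: int, n: int) -> bool:
    """Return True if bit n of the 64-bit status value is set."""
    return bool((status >> n) & 1)


def status_flags(status: int) -> list[str]:
    """List the flags that are set, in the order the log line shows them."""
    flags = []
    if bit(status, OVER):
        flags.append("Over")
    # "UE" as the label for the uncorrected case is an assumption;
    # the logs in this thread only ever show "CE".
    flags.append("UE" if bit(status, UC) else "CE")
    if bit(status, MISCV):
        flags.append("MiscV")
    if bit(status, PCC):
        flags.append("PCC")
    if bit(status, ADDRV):
        flags.append("AddrV")
    if bit(status, CECC):
        flags.append("CECC")
    elif bit(status, UECC):
        flags.append("UECC")
    return flags


if __name__ == "__main__":
    # Value taken from the MC4_STATUS line above.
    print(status_flags(0x9D404013001C011B))
    # ['CE', 'MiscV', 'AddrV', 'CECC']
```

The result agrees with the flags in the log: a corrected error (CE) with a valid address, caught by ECC (CECC), and no overflow of the status register.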
Shall I be bothered? This error occurred once while the server was idling... Maybe it could be a cosmic radiation issue or whatever, but this will become a production server soon, so I need to be absolutely sure it is OK... Before I installed CentOS, I had the Phoronix Test Suite running for 5 days, and memtest86 as well (one week, ECC on/off, test #9 for 3 days...), no errors at all... Thanks for any advice...

TrevorH (Forum Moderator) » 2013/05/20 10:31:54

I'd report it to HP as a hardware error.

vladixx » 2013/05/20 11:28:02

TrevorH wrote: I'd report it to HP as a hardware error.

Well, I would like to report it to anyone, but this is not an HP machine, I've built it myself from Supermicro parts...

TrevorH (Forum Moderator) » 2013/05/20 14:58:55

My e
System flooded by error messages on kernel 3.15

#1 2014-06-21 16:08:24 AG Caesar

I have an AMD Phenom(tm) II X4 20, which basically is an AMD Phenom II X2 550 with two unlocked cores to
become an AMD Phenom II X4 955. Everything worked perfectly on all 4 cores. The only problem was these error messages every 5 minutes. But as I noticed no problems, I ignored them:

Jun 21 15:31:23 localhost kernel: [Hardware Error]: MC2 Error: : EV error during data copyback.
Jun 21 15:31:23 localhost kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 21 15:31:23 localhost kernel: [Hardware Error]: CPU:0 (10:4:2) MC2_STATUS[Over|CE|-|-|AddrV]: 0xd40000000000017a
Jun 21 15:31:23 localhost kernel: [Hardware Error]: MC2_ADDR: 0x00000000011c2d80
Jun 21 15:31:23 localhost kernel: [Hardware Error]: cache level: L2, tx: GEN, mem-tx: EV
Jun 21 15:36:23 localhost kernel: [Hardware Error]: MC2 Error: : GEN parity/ECC error during data access from L2.
Jun 21 15:36:23 localhost kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 21 15:36:23 localhost kernel: [Hardware Error]: CPU:0 (10:4:2) MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40041000000010a
Jun 21 15:36:23 localhost kernel: [Hardware Error]: MC2_ADDR: 0x00000002fe700a40
Jun 21 15:36:23 localhost kernel: [Hardware Error]: cache level: L2, tx: GEN, mem-tx: GEN

With the update to kernel 3.15 these messages began to occur much more often. I got 80000 lines of error logs in 5 minutes, and journalctl said "couldn't log message, too many messages in too short time" (or something like that).

The computer kept working, but stuff like su or sudo did not work any more; I guess the kernel got flooded with error messages. The question is: how can I fix that? Is there a way to stop the reporting of those errors? Any other good idea?

Related threads: http://www.centos.org/forums/viewtopic.php?t=7473 and https://bbs.archlinux.org/viewtopic.php?id=183192
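
The "cache level / tx / mem-tx" lines in the logs above appear to be derived from the low 16 bits of the MC2_STATUS value. As a rough sketch only: the field layout (`0000 0001 RRRR TTLL` for memory errors) and the lookup tables below are assumed to match the Linux kernel's AMD MCE decoder; none of this is stated in the thread, and `decode_mem_error_code` is a hypothetical helper name.

```python
# Lookup tables assumed from the Linux AMD MCE decoder (not from the thread).
TT = ["INSN", "DATA", "GEN", "RESV"]        # transaction type, bits 3:2
LL = ["RESV", "L1", "L2", "L3/GEN"]         # cache level, bits 1:0
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # bits 7:4


def decode_mem_error_code(status: int):
    """Decode the low 16 bits of an MCx_STATUS value as a memory error.

    Returns None if the error code is not in the assumed memory-error
    format (upper byte of the 16-bit code equal to 0x01).
    """
    ec = status & 0xFFFF
    if (ec & 0xFF00) != 0x0100:
        return None
    r = (ec >> 4) & 0xF
    return {
        "cache level": LL[ec & 0x3],
        "tx": TT[(ec >> 2) & 0x3],
        "mem-tx": RRRR[r] if r < len(RRRR) else "?",
    }


if __name__ == "__main__":
    # The two MC2_STATUS values from the log above.
    print(decode_mem_error_code(0xD40000000000017A))
    # {'cache level': 'L2', 'tx': 'GEN', 'mem-tx': 'EV'}
    print(decode_mem_error_code(0xD40041000000010A))
    # {'cache level': 'L2', 'tx': 'GEN', 'mem-tx': 'GEN'}
```

Both results line up with the decoded lines the kernel printed, which suggests the two alternating messages are the same kind of L2 event seen during an eviction (EV) and during a generic access (GEN).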
#2 2014-06-21 16:30:54 x33a (Forum Moderator)

It really seems to be a hardware error: https://bugzilla.kernel.org/show_bug.cgi?id=43205

#3 2014-06-21 16:53:24 AG Caesar

Yes, it probably is. But I want to ignore it because my CPU works just fine. The problem began when the error was printed every second or even more often in kernel 3.15, instead of every 300 seconds in every version before that. I want to decrease the logging frequency or just disable it. I know it's not the best solution
Hardware error messages from syslogd
http://superuser.com/questions/502269/hardware-error-messages-from-syslogd

I have a 64-core AMD server running CentOS on which I was running a long job. In the midst of the output, I see these lines. It appears to be a memory error. How severe is this and what exactly does it indicate?

Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc10410040080a13
Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: Northbridge Error (node 4): DRAM ECC error detected on the NB.
Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

asked Nov 7 '12 at 16:09 by Farhat; edited Mar 3 at 12:55 by Hennes

Related thread: http://forums.debian.net/viewtopic.php?f=5&t=85887

Accepted answer:

on the NB

The NB is the North Bridge. Old computers used many chips. Eventually these got integrated into about 3 larger generic chips (386/486 era) and later into two. One of those dealt with the CPU, the RAM and other high-speed devices. The other (the 'South bridge') dealt with slow peripherals.
DRAM ECC error detected

Dynamic memory is just main memory (as opposed to cache, which is usually made from static memory). ECC memory is designed to detect and correct single-bit corruption.

The message you get is that the NB tried to read some memory, but detected that it was partially corrupt. In that case it can either shut down the machine (remember the old-fashioned 'Parity error: System halted'), or it can correct it, or it can ignore it. In this case it seems to have corrected it and it threw a warning.

A single error on memory is no reason to panic. These things happen. Rarely, but they do happen. And with ECC you get a pr
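
To make the "detect and correct single-bit corruption" idea from the answer concrete, here is a toy sketch using a Hamming(7,4) code. This is an illustration only: real ECC memory protects much wider words with a different code, and nothing here is taken from how the northbridge actually implements it. All function names are made up for the example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Fix at most one flipped bit.

    Returns (corrected codeword, 1-based position of the error or 0).
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # syndrome equals the position of the bad bit
    if pos:
        c[pos - 1] ^= 1
    return c, pos


def extract(codeword):
    """Pull the 4 data bits back out of a codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[4] ^= 1                      # flip one bit, as a stray particle might
    fixed, where = correct(damaged)
    print(where, extract(fixed))         # 5 [1, 0, 1, 1]
```

The read still returns the original data, and the non-zero syndrome is the kind of signal a memory controller can turn into a "corrected error" log entry like the ones quoted above.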