Kernel EDAC e752x Non-Fatal Error DRAM Controller
Date: Fri, 24 Feb 2006 11:27:12 -0800

I am testing the RHEL4 U3 Beta on an Intel EM64T based system. This is the x86-64/EM64T version of the distribution. The install completed successfully, but upon reboot, the system panics during rc.sysinit around "remounting root" or "No Software RAID found" (from dmraid -ay). The panic is:

    MC0: Uncorrected Error

That's clearly from the new EDAC feature which was added in the release. I've tried two different motherboard/CPU sets and two completely different sets of RAM. None of this hardware has exhibited any problems in the past. So I'm fairly certain this is a false positive.

I tried several different ways to disable the "panic_on_ue" behavior on the kernel command line, but "edac_mc.panic_on_ue=0" didn't work, nor did any of the others. Ultimately, I had to boot into rescue mode and kill the edac modules with the following in /etc/modprobe.conf:

    alias e752x_edac /dev/null
    alias edac_mc /dev/null

Then I was able to boot and the system appears to be running without problems. Further, I am able to "insmod edac_mc panic_on_ue=0" and load e752x_edac without problems. The e752x_edac module does *not* log any memory errors after I manually load the modules.

Now that I had the system up, I changed the /etc/modprobe.conf to read:

    options edac_mc panic_on_ue=0

and tried rebooting the system. Now the system boots and runs just fine except that the log is filling up with the attached error message. Clearly something isn't initialized or being read correctly. But after unloading and reloading the e752x_edac module, everything is fine:

    MC0: Removed device 0 for e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)
    tolm = 20000, remapbase = ffc000, remaplimit = 0
    MC0: Giving out device to e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)

And no further errors are reported. So it seems that the hotplug loading of e752x_edac in /etc/rc.sysinit (via kmodule) is causing things to be initialized badly. Perhaps there is a race condition of some kind between edac_mc and e752x_edac loading?

What additional information and tests can I run to track down the root of the problem? I've searched bugzilla, but I haven't found any bugs *at all* against the RHEL4U3 Beta. Perhaps
Jonas Meurer

Archive links: https://www.redhat.com/archives/nahant-beta-list/2006-February/msg00000.html (the RHEL4 U3 Beta report), https://lists.debian.org/debian-user/2011/08/msg00963.html (this message)

[...] server with 16GB RAM (4 x 4GB modules). The server has been running for about three years. The system is Debian Lenny with a 2.6.26 kernel, self-compiled from linux-source-2.6.26 2.6.26-26lenny3.

It's not the first time that these EDAC error messages have appeared. Actually, in the last three years, I got these errors every now and then. Sometimes only a few errors were logged, sometimes my logs were spammed with the errors for several days, but then it stopped again. Now, the messages have kept spamming my log and console for more than three weeks already. On some days I get more than 36000 errors a day. It's notable that every DRAM bank from 0 to 7 is affected.

Now I wonder whether these are false positives (searching for the errors on the web revealed that these are quite common), or whether my RAM might be damaged. Unfortunately, running memtest86+ is not an option, as the server in question is a production server, and I don't have a second server for redundancy.

Additionally, a slightly related question: How do I turn off the logging of these messages to console? It's impossible to work in an SSH session when the console is spammed with these logs. Neither setting kernel.printk, nor 'setterm -msg 0', 'dmesg -n1' or 'echo 1 > /proc/sysrq-trigger' stops the logging flood to console. Did I miss anything, or is it simply impossible to stop console logging for this kind of kernel error message? That would be very unfortunate. I already considered recompiling the kernel without the EDAC i5000 driver in order to stop this annoyance, but I would prefer to fix the cause instead of fighting the symptoms.

Here's an example error message:

    Aug 16 13:08:20 nibbler ke
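One way to look at error counts on a production machine without taking it down for memtest86+ is to read the counters the EDAC core keeps in sysfs. The following is a minimal sketch, not something from the thread: it assumes the layout described in the kernel's EDAC documentation (`/sys/devices/system/edac/mc/mcN/ce_count`, `ue_count`, and per-`csrowN` `ce_count` files), which may differ or be absent depending on kernel version and driver. The function names are hypothetical.

```python
from pathlib import Path


def _read_int(path):
    """Read one integer counter; return None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Collect corrected/uncorrected error counters per memory controller.

    Assumes the sysfs layout from the kernel EDAC documentation; returns an
    empty dict if the directory does not exist (EDAC not loaded, other layout).
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {
            "ce": _read_int(mc / "ce_count"),   # corrected errors
            "ue": _read_int(mc / "ue_count"),   # uncorrected errors
            "csrows": {},
        }
        for row in sorted(mc.glob("csrow[0-9]*")):
            entry["csrows"][row.name] = _read_int(row / "ce_count")
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(f"{mc}: corrected={entry['ce']} uncorrected={entry['ue']}")
        for row, ce in entry["csrows"].items():
            print(f"  {row}: corrected={ce}")
```

Comparing the per-csrow counters over time may show whether the errors concentrate on one row or are spread across all of them, which is the kind of evidence the question above is asking for.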
"Non-Fatal Error DRAM Controller" in log

http://ubuntu.5.x6.nabble.com/quot-Non-Fatal-Error-DRAM-Controller-quot-in-log-td5009564.html
https://lists.ubuntu.com/archives/ubuntu-users/2013-January/266966.html

On one of my machines running Ubuntu 12.04, logcheck keeps throwing out stuff like the following at seemingly random times:

    [135573.308051] EDAC e752x: Non-Fatal Error DRAM Controller
    [135573.308057] EDAC e752x: Non-Fatal Error DRAM Controller
    [135573.308065] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [135573.308070] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [137676.320044] EDAC e752x: Non-Fatal Error DRAM Controller
    [137676.320050] EDAC e752x: Non-Fatal Error DRAM Controller
    [137676.320057] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [137676.320061] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [137677.320042] EDAC e752x: Non-Fatal Error DRAM Controller
    [137677.320053] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [138132.320106] EDAC e752x: Non-Fatal Error DRAM Controller
    [138132.320117] EDAC e752x: Non-Fatal Error DRAM Controller
    [138132.320124] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE
    [138132.320129] EDAC MC0: CE page 0x48142, offset 0x980, grain 4096, syndrome 0x682, row 2, channel 0, label "": e752x CE

AFAICT, everything is working. Are these warnings significant, or should I set logcheck to ignore them?

Thanks,
Adam
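Before deciding whether to have logcheck ignore these messages, it can help to tally where the corrected errors land. The sketch below is an illustration, not part of the thread: it parses "EDAC MCn: CE page ..." lines in the format quoted above and counts them per (controller, row, channel, page). The regular expression is written against those sample lines only, so other drivers or kernel versions may need a different pattern; `count_ce_locations` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern matched against the e752x sample lines quoted in the post above.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def count_ce_locations(lines):
    """Count corrected-error reports per (mc, row, channel, page)."""
    counts = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:  # lines such as "Non-Fatal Error DRAM Controller" are skipped
            key = (int(m["mc"]), int(m["row"]), int(m["channel"]), m["page"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): dmesg | python3 this_script.py
    for (mc, row, channel, page), n in count_ce_locations(sys.stdin).most_common():
        print(f"MC{mc} row {row} channel {channel} page {page}: {n}")
```

In the excerpt quoted above, every CE line reports the same page (0x48142), row 2 and channel 0, so a tally of this kind would show a single location.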
On 2013-01-16, Steve Flynn wrote:
> On 16 January 2013 10:09, Adam Funk