Edac Error Overflow
Edac-util
Excerpt from the edac-util(1) man page (https://linux.die.net/man/1/edac-util):

[...] exported by these drivers in sysfs. With no options, edac-util will report any uncorrected error (UE) or corrected error (CE) information recorded by EDAC, along with any DIMM label information registered with EDAC.

Options
-h, --help
    Display a summary of the command-line options.
-q, --quiet
    Quiet mode. For some reports, edac-util will report corrected and uncorrected error counts for all MC, csrow, and channel combinations, even if the current count of errors is zero. The --quiet flag will suppress the display of any locations with zero errors, thus creating a more terse report. No output will be generated if there are zero total errors currently recorded by EDAC. Additionally, the use of --quiet will suppress all informational and debug messages, displaying only fatal errors.
-v, --verbose
    Increase verbosity. Multiple -v's may be used.
-s, --status
    Displays the current status of EDAC drivers. edac-util will report whether it detects that EDAC drivers are loaded, and the number of memory controllers (MCs) found in sysfs. In verbose mode, the MC id and name of each controller will also be printed.
-r, --report=report,...
    Specify the report to generate. Currently, the available reports are default, simple, full, ue, and ce. These reports are detailed in the EDAC REPORTS section below. More than one report may be specified in a comma-separated list.

Edac Reports
default
    The default edac-util report is generated when the program is run without any options. If there are no errors logged by EDAC, this report will display "No errors to report." to stdout. Otherwise, error counts for each MC, csrow, channel combination with attributed errors are displayed, along with corresponding DIMM labels, if these labels have been registered in sysfs. The default report will also display any errors that do not have any DIMM information. These errors occur when errors are reported in the memory controller overflow register, indicating that more than one error occurred during a given EDAC poll cycle. It is usually obvious from which DIMM locations these errors were generated.
simple
    The simple report reports total corrected and uncorrected errors for each MC detected on the system. It also displays a tally of total errors. With the --quiet option, only non-zero error counts are displayed.
full
    The full report generates a l
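The per-MC totals described for the simple report can be approximated directly from sysfs. The following is a minimal Python sketch, not edac-util's implementation: it assumes the per-channel counter files `mc*/csrow*/ch*_ce_count` under `/sys/devices/system/edac/mc` (the layout shown in the grep output elsewhere on this page), and the function names and output format are made up for illustration.

```python
import glob
import os
import re
from collections import defaultdict

# Assumed sysfs root; matches the paths quoted elsewhere on this page.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): count} from per-channel CE counter files."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        m = re.search(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$", path)
        if m is None:
            continue
        with open(path) as fh:
            counts[tuple(int(g) for g in m.groups())] = int(fh.read().strip())
    return counts


def simple_report(counts, quiet=False):
    """Sum corrected errors per memory controller.

    With quiet=True, controllers whose total is zero are left out,
    mirroring the behaviour the man page describes for --quiet.
    The line format here is hypothetical, not edac-util's own.
    """
    totals = defaultdict(int)
    for (mc, _csrow, _channel), n in counts.items():
        totals[mc] += n
    return [f"mc{mc}: {n} CE" for mc, n in sorted(totals.items()) if n or not quiet]
```

Calling `simple_report(read_ce_counts(), quiet=True)` on a machine with loaded EDAC drivers would list only the controllers that have logged corrected errors; on a machine without EDAC it returns an empty list.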
ECC chipkill errors: which DIMM?
http://serverfault.com/questions/5672/ecc-chipkill-errors-which-dimm

We often get DIMMs in our servers going bad with the following errors in syslog:

    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    May 7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
    May 7 09:15:31 nolcgi303 kernel: MC0: CE - no information available: k8_edac Error Overflow set
    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

We can use the HP SmartStart CD to determine which DIMM has the error, but that requires taking the server out of production. Is there a cunning way to work out which DIMM's bust while the server is up? All our servers are HP hardware running RHEL 5.

Tags: linux, hardware, memory, ecc
Asked May 7 '09 at 8:20 by markdrayton

Comments:

memtest86+, but I suppose you can't run it while RHEL is running. - Alex Bolotov, May 7 '09 at 9:32

Are you running the HP SIM homepage (or full SIM for that matter) on the box? If so, that'll offer a lot more info. Otherwise I'd need to know a bit more about the memory offset from a more detailed error. - Chopper3, May 7 '09 at 10:08

We're not running any of the HP SIM stuff on the box as we generally find it more trouble than it's worth. If we can't work out which DIMM is dead while online it's not a showstopper -- I'm just on the lookout for ways to save time :~) - markdrayton, May 7 '0
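The "CE page ..." line in the syslog excerpt above already names a row and channel, while the "no information available ... Error Overflow set" line does not. A small Python sketch of pulling those fields out of the located line follows. The regular expression is written against the k8_edac message format quoted in the question only, and the assumption that `row` corresponds to the sysfs `csrow` index is mine, not something the thread confirms.

```python
import re

# Matches the located-CE format quoted in the question, e.g.
#   MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, ...
CE_LINE = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain\s*(?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce(line):
    """Return a dict of integer fields for a located CE line, else None."""
    m = CE_LINE.search(line)
    if m is None:
        return None  # e.g. the "no information available" overflow line
    d = m.groupdict()
    return {
        "mc": int(d["mc"]),
        "page": int(d["page"], 16),
        "offset": int(d["offset"], 16),
        "grain": int(d["grain"]),
        "syndrome": int(d["syndrome"], 16),
        "row": int(d["row"]),
        "channel": int(d["channel"]),
    }


def counter_path(fields):
    """Relative sysfs counter path, ASSUMING log 'row' equals the csrow index."""
    return "mc{mc}/csrow{row}/ch{channel}_ce_count".format(**fields)
```

For the line in the question this yields MC 0, row 2, channel 0; mapping that location to a physical slot still depends on board-specific DIMM labels, which are empty (`label ""`) in these logs.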
How can I find which memory have CE error?
http://serverfault.com/questions/648240/how-can-i-find-which-memory-have-ce-error

In /var/log/kern.log:

    kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

This is the EDAC log; one of the memory modules has a CE error. I have read the EDAC doc:

    Dual channels allows for 128 bit data transfers to the CPU from memory. Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels:

                Channel 0   Channel 1
            ===================================
    csrow0  | DIMM_A0  | DIMM_B0  |
    csrow1  | DIMM_A0  | DIMM_B0  |
            ===================================

            ===================================
    csrow2  | DIMM_A1  | DIMM_B1  |
    csrow3  | DIMM_A1  | DIMM_B1  |
            ===================================

and looked for the channel with errors:

    $ grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:144648966
    /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0

So the errors should be at mc0/csrow0/ch2. Going by the doc, the DIMM should be DIMM_C0, which I expected to find with dmidecode. But I can't find this DIMM there, so I don't know which memory module has the problem:

    $ dmidecode -t memory | grep 'Locator: PROC'
    Locator: PROC 1 DIMM 2A
    Locator: PROC 1 DIMM 1D
    Locator: PROC 1 DIMM 4B
    Locator: PROC 1 DIMM 3E
    Locator: PROC 1 DIMM 6C
    Locator: PROC 1 DIMM 5F
    Locator: PROC 2 DIMM 2A
    Locator: PROC 2 DIMM 1D
    Locator: PROC 2 DIMM 4B
    Locator: PROC 2 DIMM 3E
    Locator: PROC 2 DIMM 6C
    Locator: PROC 2 D
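The kern.log line in this question uses a different message layout from the k8_edac logs: an error count, a driver-supplied label, and a parenthesised list of `key:value` fields. A hedged Python sketch for splitting such a line into its parts is given below. The pattern is derived only from the single line quoted in the question, so other drivers or kernel versions may well need a different expression, and `parse_edac_event` is a name invented here.

```python
import re

# Derived from the one kern.log line quoted in the question:
#   EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 ...)
EVENT = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>.+?) "
    r"\((?P<fields>[^)]*)\)"
)


def parse_edac_event(line):
    """Return mc, count, label and the key:value fields of a CE event line."""
    m = EVENT.search(line)
    if m is None:
        return None
    fields = {}
    for item in m.group("fields").split():
        key, sep, value = item.partition(":")
        if not sep:
            continue
        try:
            fields[key] = int(value, 0)  # base 0 accepts both "8" and "0x0"
        except ValueError:
            fields[key] = value  # keep anything non-numeric as text
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "label": m.group("label"),
        "fields": fields,
    }
```

On the quoted line this gives MC 0, a count of 48, the label `CPU#0Channel#2_DIMM#0`, and channel 2 / slot 0. How that label corresponds to the `PROC n DIMM xx` locators printed by dmidecode is vendor-specific and is not something this sketch can resolve.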
From http://k12osn.redhat.narkive.com/MNU42XfP/help-thin-kernel-edac-mc1-ce-no-information-available-k8-edac-error-overflow-set:

I've got a messages log file with a zillion of these errors. The thin client and the server are constantly crashing:

    Feb 18 04:05:03 thin kernel: EDAC MC1: CE - no information available: k8_edac Error Overflow set
    Feb 18 04:05:03 thin kernel: EDAC k8 MC1: extended error code: ECC error
    Feb 18 04:05:04 thin kernel: EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    Feb 18 04:05:04 thin kernel: EDAC MC1: CE page 0x12ee25, offset 0xd08, grain 8, syndrome 0x10, row 1, channel 1, label "": k8_edac
    Feb 18 04:05:04 thin kernel: EDAC MC1: CE - no information available: k8_edac Error Overflow set
    Feb 18 04:05:04 thin kernel: EDAC k8 MC1: extended error code: ECC error
    Feb 18 04:05:05 thin kernel: EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    Feb 18 04:05:05 thin kernel: EDAC MC1: CE page 0x12ef6c, offset 0x28, grain 8, syndrome 0x10, row 1, channel 1, label "": k8_edac

Jim Christiansen, 2007-02-21 20:02:15 UTC:

I've just finished a memory test with zero errors reported. Could these messages have something to do with the thin server processes and not actual server system memory?? I ask this because of the constant references in the reports including "thin kernel ..." in every line:

    ... thin kernel: EDAC MC1: CE - no inf...

The actual messages log is growing to 290 megs before rolling over to a new log file. I've got 2 gigs of errors now in two or three days' worth of logs. Also, the server seems to fail every twenty minutes or so when under a heavier load of 25+ students. It seems to run without failing for hours on a lighter load... Ideas anyone??
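With logs this large it can help to separate the CE lines that carry a location from the "Error Overflow set" lines that do not, and to turn each page/offset pair into one physical address. The Python sketch below does both. It assumes that `page` is a page frame number and that pages are 4 KiB (`PAGE_SHIFT = 12`), which is typical for x86 but is not stated in the thread; the function names are illustrative.

```python
import re

PAGE_SHIFT = 12  # ASSUMPTION: 4 KiB pages, typical for x86 kernels

CE_LOCATED = re.compile(r"CE page (0x[0-9a-fA-F]+), offset (0x[0-9a-fA-F]+)")


def physical_address(page, offset, page_shift=PAGE_SHIFT):
    """Combine a page frame number and an in-page offset into one address."""
    if not 0 <= offset < (1 << page_shift):
        raise ValueError("offset does not fit inside one page")
    return (page << page_shift) | offset


def tally(lines):
    """Return (addresses of located CEs, number of overflow messages)."""
    located, overflow = [], 0
    for line in lines:
        m = CE_LOCATED.search(line)
        if m:
            located.append(physical_address(int(m.group(1), 16), int(m.group(2), 16)))
        elif "Error Overflow set" in line:
            overflow += 1
    return located, overflow
```

Under the 4 KiB assumption, page 0x12ee25 with offset 0xd08 corresponds to address 0x12ee25d08. Both located lines in the excerpt report row 1, channel 1 on MC1, which is the kind of pattern the man page text has in mind when it says the source of overflow errors is usually evident from the located ones.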